Artemis Scalability Thoughts


Artemis Scalability Thoughts

nigro_franz
Hi folks,

I'm writing here to share some thoughts related to the Artemis threading model and how it affects broker scalability.

Currently (on 2.7.0) we rely on a shared thread pool, i.e. ActiveMQThreadPoolExecutor, backed by a LinkedBlockingQueue-ish queue to process tasks.
Thanks to the Actor abstraction we use a lock-free queue to serialize tasks (or items),
processing them in batches on the shared thread pool and waking a consumer thread only if needed (the logic is contained in ProcessorBase).
The wake-up operation (i.e. ProcessorBase::onAddedTaskIfNotRunning) will submit to the shared thread pool a specific task that drains and executes a batch of tasks only if necessary, not on every added task/item.
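
For readers not familiar with that part of the code, here is a minimal, hypothetical sketch of the pattern described above (simplified names such as MiniActor/act/drain, not the actual ProcessorBase/Actor implementation): tasks go into a per-actor lock-free queue, and the shared pool is asked to run a drain only when the actor transitions from idle to scheduled.

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.Executor;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.function.Consumer;

    // Illustrative only: shows the "wake the shared pool only when idle" idea.
    final class MiniActor<T> {
       private final Queue<T> tasks = new ConcurrentLinkedQueue<>(); // lock-free
       private final AtomicBoolean scheduled = new AtomicBoolean(false);
       private final Executor sharedPool;
       private final Consumer<T> onTask;

       MiniActor(Executor sharedPool, Consumer<T> onTask) {
          this.sharedPool = sharedPool;
          this.onTask = onTask;
       }

       void act(T item) {
          tasks.offer(item);
          // Submit a drain task only if none is already scheduled: under load
          // the CAS fails and no extra "awake" submission reaches the pool.
          if (scheduled.compareAndSet(false, true)) {
             sharedPool.execute(this::drain);
          }
       }

       private void drain() {
          T item;
          while ((item = tasks.poll()) != null) {
             onTask.accept(item);
          }
          scheduled.set(false);
          // Re-check for items offered between the last poll and clearing the flag.
          if (!tasks.isEmpty() && scheduled.compareAndSet(false, true)) {
             sharedPool.execute(this::drain);
          }
       }
    }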

Looking at the contention graphs of the broker (i.e. the bar width is the nanoseconds spent before entering a lock), the limitation of the current implementation is quite clear:



In violet are shown the offer and poll operations on the LinkedBlockingQueue of the shared thread pool, happening from any thread of the pool (the thread is the base of each bar, in red).
The LinkedBlockingQueue indeed uses a ReentrantLock to protect the operations on the linked queue, and it is clear that having a giant lock in front of a high-contention point won't scale.

The above graph was obtained with a single-producer/single-consumer/single-queue/non-persistent run, but I don't have enough resources to check what could happen with more and more producers/consumers/queues.
The critical part is the offering/polling of tasks on the shared thread pool: in theory a maxed-out broker shouldn't have many idle threads to be awakened, but given that more producers/consumers/queues means many different Actors, in order to guarantee that each actor's tasks are executed the shared thread pool will need to process many unnecessary "awake" tasks, creating a lot of contention on the blocking linked queue and slowing down the entire broker.

In the past I've tried to replace the current shared thread pool implementation with a ForkJoinPool or (the most recent attempt) to use a lock-free queue instead of the LinkedBlockingQueue, with no success (https://github.com/apache/activemq-artemis/pull/2582).
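
As a rough JDK-level analogue of that experiment (not the actual Artemis ActiveMQThreadPoolExecutor, whose internals differ; class name and pool sizes below are placeholders), swapping the executor's work queue for a mostly CAS-based one is essentially this:

    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.LinkedTransferQueue;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public final class PoolQueues {
       public static void main(String[] args) {
          // Current shape: work queue guarded by ReentrantLocks.
          ThreadPoolExecutor lockBased = new ThreadPoolExecutor(
                30, 30, 60, TimeUnit.SECONDS, new LinkedBlockingQueue<>());
          lockBased.allowCoreThreadTimeOut(true);

          // Experimental shape: a mostly lock-free (CAS-based) work queue.
          ThreadPoolExecutor lockFree = new ThreadPoolExecutor(
                30, 30, 60, TimeUnit.SECONDS, new LinkedTransferQueue<>());
          lockFree.allowCoreThreadTimeOut(true);

          lockBased.shutdown();
          lockFree.shutdown();
       }
    }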

Below is the contention graph using a lock-free queue in the shared thread pool:



In violet we now have QueueImpl::deliver and RefsOperation::afterCommit contending for the QueueImpl lock, but the numbers for each bar are very different: in the previous graph the contention on the shared thread pool lock is 600 ns, while here it is 20-80 ns and it can scale with the number of queues, while the previous version cannot.

All green, right? So why did I revert the lock-free thread pool?

Because with a low utilization of the broker (i.e. 1 producer/1 consumer/1 queue) the latencies and throughput were actually worse: CPU utilization graphs were showing that ProcessorBase::onAddedTaskIfNotRunning was spending most of its time waking up the shared thread pool. The same was happening with a ForkJoin pool, sadly.
It seems (and it is just a guess) that, given that tasks get consumed faster (there is no lock preventing them from being polled and executed), the thread pool goes idle sooner (the default thread pool size is 30 and I have a machine with just 8 real cores), forcing any new task submission to wake up one of the pool threads to process incoming tasks.
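
To make that guess concrete, a crude illustration (not JMH-quality, so take the numbers with a grain of salt; the class name WakeupCost is just for the sketch) could compare the cost of ExecutorService.execute() when workers are likely parked versus when a worker may still be awake from the previous task:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public final class WakeupCost {
       public static void main(String[] args) throws Exception {
          ExecutorService pool = Executors.newFixedThreadPool(30);
          int tasks = 10_000;

          long idleNs = timeSubmissions(pool, tasks, true);   // workers park between tasks
          long busyNs = timeSubmissions(pool, tasks, false);  // back-to-back submissions

          System.out.printf("avg execute() cost: idle=%d ns, busy=%d ns%n",
                idleNs / tasks, busyNs / tasks);
          pool.shutdown();
       }

       private static long timeSubmissions(ExecutorService pool, int tasks, boolean pause)
             throws InterruptedException {
          long total = 0;
          for (int i = 0; i < tasks; i++) {
             long start = System.nanoTime();
             pool.execute(() -> { });
             total += System.nanoTime() - start;
             if (pause) {
                Thread.sleep(1); // let the worker finish and park again before the next submission
             }
          }
          return total;
       }
    }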

What are your thoughts on this? 
I don't want to trade the "low utilization" performance so much for the scaling, TBH; that's why I've preferred to revert the change.
Note that other applications with scalability needs (e.g. Cassandra) have changed their SEDA-based shared pool approach to a thread-per-core architecture for this same reason.

Cheers,
Franz





 



Re: Artemis Scalability Thoughts

nigro_franz
I can see that the images are not visible in the post on the forum, but maybe via email it works.
Let me know if I need to attach them.




Re: Artemis Scalability Thoughts

christopher.l.shannon
In reply to this post by nigro_franz
I don't think sacrificing low utilization is a good idea. That being said, is there an actual real-world throughput issue here? In general, I don't know that I see much value in over-engineering and micro-managing this stuff unless there's a real-world, measurable benefit to be gained vs just theoretical benchmarks, as it's just going to make things harder to maintain and mistakes easier to make in the future.


Re: Artemis Scalability Thoughts

nigro_franz
> That being said is there an actual real world throughput issue here?

Yes and no: it's a chance to improve things, especially for cloud uses. It is a fact that, now that Spectre and Meltdown are there, we don't want to waste CPU time sitting idle/on contention if it isn't needed and, as I've said, it "is a giant lock on any task submitted".
IMO having a talk on how to improve it is not over-engineering, but just engineering, given that scaling non-persistent messages (or persistent ones with very fast disks) is something that we expect from a broker: from a commercial point of view it is nice that we can scale by adding brokers, but if you can save 2 machines to get the same throughput I think that is a nice improvement for (m)any users.

> In general, I don't know that I see much value in over engineering and micro managing this stuff unless there's a real world measurable benefit to be gained vs just theoretical benchmarks as it's just going to make things harder to maintain and mistakes easier to make in the future.

Cassandra from DataStax has gained about 2X throughput by solving this, but it could be said that that's a "different scenario" too: as an engineer I can say no, it is not.
I've "recently" addressed with the client team a similar "issue" on qpid-jms, getting near 2X throughput (nudge nudge Robbie Gemmel/Tim Bish).
And this "issue" (actually, a "chance to improve things") has been well hidden, although in front of everyone, for a long time:
https://issues.apache.org/jira/browse/QPIDJMS-396.

The reason why I've written on the dev list is to understand if anyone has had the chance to measure something like this in a real load scenario.



Re: Artemis Scalability Thoughts

michael.andre.pearce
So Franz.




If you're talking about the solution that you regressed just before we released: we did test it in our real testing env, and I didn't notice any negative impact for our use cases.






Regarding the code, I actually thought that what you had made the code cleaner.




I'm not sure what use case you're concerned about with your change; generally I expect brokers to have more than just one queue and just one producer/consumer in real-world use cases. I think having just one producer, one consumer and one queue on a whole broker is very academic and not a typical real-world case. I think we should engineer for multiple consumers/multiple producers and test with such a setup.





Re: Artemis Scalability Thoughts

nigro_franz
@michael

> If youre talking about the solution that you regressed just before we
released. Then we did test it in our real testing env. I didnt notice and
negative impact for our use cases.

That's nice to hear, Michael. Probably I could have avoided reverting it, but consider that the academic use case, i.e. 1 P/1 C/1 Q, is just a way to emulate a "low-utilization" case.
As Christopher has said, there are users that don't want to trade low-utilization performance...and I understand that.
The thing is, have you seen whether using LinkedTransferQueue has improved over the original thread pool in the many-producers/consumers/queues scenario?

> Regards to code, i actually thought what you had it made code cleaner.

That's another benefit, but the change was meant to provide only benefits; that's why I've reverted it...

> I think we engineer for multiple consumers/multiple producers and should test with such setup.

I agree with you, but it hurts my heart to deliver a patch that causes perf regressions without providing a huge benefit, i.e. one proven in other cases...

Thanks guys,
Franz





Re: Artemis Scalability Thoughts

michael.andre.pearce
Performance was similar. As such, agreed: if there is an impact for another use case that's genuine, maybe leave it as is.





Re: Artemis Scalability Thoughts

michael.andre.pearce
Another option could be that you provide your new solution and turn it on by default, but we make it possible to toggle back to the old one if that other use case becomes an issue?




I honestly don't see the other use case being common. But as long as you give the option for someone to toggle back, then no one loses out, and others with the more typical cases could benefit.





Re: Artemis Scalability Thoughts

nigro_franz
> Another option could be that you provide your new solution and turn it on
by default but we make it possible for that other use case if it becomes an
issue to toggle back to the old?

The point is that the expected solution is not working (indeed in your case it hasn't delivered any visible gain): there is a much more complex analysis to be done IMO, and that's why I wanted some feedback about this issue. I need to understand first if anyone has been hit by it, and/or whether I can help to spot it using the proper tools (the same I've used to show the contention issue in the first post).
The thing is that you don't know whether you don't scale, whether you're limited by the network, or whether it is simply not your use case (and that makes sense, it is business-dependent).
Luckily we have many users (that are skilled developers as well) and I'm happy to share these thoughts here, trusting to get some good insight :)
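
If anyone wants to check a real deployment, one possible way (just a hedged sketch, assuming JDK 11+, and not necessarily the tool used for the graphs in the first post; the class name ContentionRecorder is made up) is to let JDK Flight Recorder record jdk.ThreadPark events, which cover waits on j.u.c locks such as the ReentrantLock inside LinkedBlockingQueue:

    import java.nio.file.Path;
    import java.time.Duration;
    import jdk.jfr.Recording;

    public final class ContentionRecorder {
       public static void main(String[] args) throws Exception {
          try (Recording recording = new Recording()) {
             // ThreadPark events capture java.util.concurrent lock waits;
             // threshold 0 records everything (verbose, but fine for a short run).
             recording.enable("jdk.ThreadPark")
                      .withThreshold(Duration.ofNanos(0))
                      .withStackTrace();
             recording.start();
             Thread.sleep(60_000);                       // run the load meanwhile
             recording.dump(Path.of("contention.jfr"));  // inspect with JMC or 'jfr print'
          }
       }
    }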
