A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.


Kevin Burton
OK.  I think I have a handle on what’s happening during queue purges that
cause GC lockups.

Wanted to get your feedback.

I can create a bug for this if you guys think my assessment is accurate, as
I think the fix is somewhat reasonable / easy.

I have a unit test which duplicates this now but I need to do more cleanup
so I can put it into a public github repo for you guys to look at.

## Problem overview.

ActiveMQ supports a feature where it can GC a queue that is inactive, i.e.
no messages and no consumers.

However, there’s a bug: purgeInactiveDestinations in
org.apache.activemq.broker.region.RegionBroker takes a read/write lock
(inactiveDestinationsPurgeLock) whose write side is held for the entire
queue GC pass.

Each individual queue GC takes about 100 ms with a disk-backed queue and
10 ms with a memory-backed (non-persistent) queue. If you have thousands of
them to GC at once, the inactiveDestinationsPurgeLock is held the entire
time, which can last from 60 seconds to 5 minutes and is essentially
unbounded (3,000 queues at 100 ms each is already 5 minutes).

A read lock on the same lock is also taken in addConsumer and addProducer,
so when a new consumer or producer tries to connect it is blocked until
queue GC completes.

Existing producers/consumers work JUST fine.

The lock MUST be held for each queue because if it isn’t there’s a race: a
queue is flagged to be GC’d, then a producer comes in and writes a new
message, then the background thread deletes the queue it marked as GCable
even though it now holds the newly produced message.  This would result in
data loss.

## Confirmed

I have a unit test now that confirms this.  I create 7500 queues, produce
1 message in each, then consume it. I keep all consumers open, then I
release all 7500 queues at once.

I then have a consumer/producer pair I hold open and produce and consume
messages on it.  This works fine.

However, I have another which creates a new producer each time.  This will
block for 60,000 ms multiple times while queue GC is happening in the
background.

## Proposed solution.

Rework the read/write locks to be one lock per queue.

So instead of using one global lock per broker, we use one lock per queue
name.  This way the locks are FAR more granular and new producers/consumers
won’t block during this time period.

If a queue named ‘foo’ is being GC’d and a new producer is created on a
‘bar’ queue, the bar producer will work fine and won’t block on the foo
queue.

This can be accomplished by creating a concurrent hash map keyed on the
queue name (or on the ActiveMQDestination) which stores read/write locks as
the values.  We then use this map as the lock backing, and the purge thread
and the add/remove producer/consumer paths all reference the more granular
per-queue lock.
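
Something along these lines is what I have in mind.  This is only a
hypothetical sketch (the class and method names below aren’t actual
RegionBroker code), and we’d also need to evict entries once a destination
is fully removed so the map doesn’t grow without bound:

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Hypothetical sketch of a per-destination lock registry; not actual
    // RegionBroker code.
    public class DestinationLockRegistry {

        // One read/write lock per destination name instead of one global lock.
        private final ConcurrentMap<String, ReadWriteLock> locks =
                new ConcurrentHashMap<String, ReadWriteLock>();

        private ReadWriteLock lockFor(String destinationName) {
            ReadWriteLock lock = locks.get(destinationName);
            if (lock == null) {
                ReadWriteLock candidate = new ReentrantReadWriteLock();
                ReadWriteLock existing = locks.putIfAbsent(destinationName, candidate);
                lock = (existing != null) ? existing : candidate;
            }
            return lock;
        }

        // addProducer/addConsumer would take the read lock for just their destination.
        public void withReadLock(String destinationName, Runnable action) {
            ReadWriteLock lock = lockFor(destinationName);
            lock.readLock().lock();
            try {
                action.run();
            } finally {
                lock.readLock().unlock();
            }
        }

        // The purge thread would take the write lock per destination while GC'ing it.
        public void withWriteLock(String destinationName, Runnable action) {
            ReadWriteLock lock = lockFor(destinationName);
            lock.writeLock().lock();
            try {
                action.run();
            } finally {
                lock.writeLock().unlock();
            }
        }
    }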

….

Now, initially I was thinking I would just fix this myself; however, I
might have a workaround for our queue design which uses fewer queues, and I
think this will drop our queue requirement from a few thousand to a few
dozen.  So at that point this won’t be as much of a priority.

However, this is a significant scalability issue with ActiveMQ… one that
doesn’t need to exist.  In our situation I think our performance would be
fine even with 7500 queues once this bug is fixed.

Perhaps it should just exist as an open JIRA that could be fixed at some
time in the future?

I can also get time to clean up a project with a test which demonstrates
this problem.

Kevin

--

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>

Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Kevin Burton
Btw, another way to fix this is to set the purge interval low, say 15
seconds, and then set the max number of queues to delete each sweep to a
low value.

This wouldn’t be as pretty as using one lock per queue but would be easy to
implement without modifying much code.
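
I believe those two knobs correspond to schedulePeriodForDestinationPurge
and maxPurgedDestinationsPerSweep on BrokerService, so the tuning would be
roughly this (a sketch, and it assumes destination GC is already enabled
via the destination policy):

    import org.apache.activemq.broker.BrokerService;

    public class PurgeTuningSketch {
        public static void main(String[] args) throws Exception {
            BrokerService broker = new BrokerService();

            // Run the inactive-destination purge sweep frequently...
            broker.setSchedulePeriodForDestinationPurge(15000); // 15 seconds
            // ...but only GC a handful of destinations per sweep, so the
            // purge lock is never held for very long at a stretch.
            broker.setMaxPurgedDestinationsPerSweep(10);

            broker.start();
        }
    }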

Kevin


Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Kevin Burton
Here’s the project with a unit test to demonstrate the bug.

https://github.com/spinn3r/activemq-gc-purge-lockup

The one thing is that I’m missing an SLF4J dependency, so ActiveMQ can’t
log the queue GCs to the console… with that in place you could see all the
GC activity.

But it does successfully show that there is significant producer creation
latency: 60-80 seconds.

Kevin



Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Tim Bain
I like the more-granular lock idea; that's better than holding the lock
with brief unlocking periods as I proposed in the email I sent just now to
your other thread.  In practice you'll need to do the refactoring I
proposed in order to implement the algorithm you've described here, only
you'll add a lookup for figuring out the lock for this queue.

I think this is much better than relying on the low max number of queues
and trying to do just a few at a time.  Do you know whether what's slow is
the call to canGC() or the call to removeDestination()?  If it's canGC(),
the do-it-often-but-only-for-a-few approach will kill you performance-wise
because you'll spend large amounts of time holding the lock while calling
canGC on active destinations as you look for any that are inactive.  So for
multiple reasons, don't do that.

Definitely submit an enhancement request in JIRA, even if you're not sure
you'll need it.  Let's not lose track of the idea now that you've put all
this time into digging into it.

Tim


Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Kevin Burton
On Sun, Feb 22, 2015 at 2:43 PM, Tim Bain <[hidden email]> wrote:

> I like the more-granular lock idea; that's better than holding the lock
> with brief unlocking periods as I proposed in the email I sent just now to
> your other thread.  In practice you'll need to do the refactoring I
> proposed in order to implement the algorithm you've described here, only
> you'll add a lookup for figuring out the lock for this queue.
>
>
I just responded to that email but I agree with you.


> I think this is much better than relying on the low max number of queues
> and trying to do just a few at a time.  Do you know whether what's slow is
> the call to canGC() or the call to removeDestination()?  If it's canGC(),
> the do-it-often-but-only-for-a-few approach will kill you performance-wise
> because you'll spend large amounts of time holding the lock while calling
> canGC on active destinations as you look for any that are inactive.  So for
> multiple reasons, don't do that.
>
>
I don’t know yet.  I suspect removeDestination because I always see it
there in the stack trace, and I also suspect that 99% of this is LevelDB.
It’s just that this global lock blocking everything means that ActiveMQ is
dead in the water while GC is happening.

What I was going to do was look at installing a snapshot and then starting
to work with that snapshot for a while.  I just built it with a tar.gz so I
can use that to push changes into production easily.

I wanted to put a timer / stopwatch around the ENTIRE GC loop to measure
the total time it takes (we really need to trace the duration of that), and
I can also add one around canGC and removeDestination.
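
Something as simple as this is all I mean (a sketch; it assumes SLF4J is on
the classpath, and the wrapper method and Runnable are hypothetical):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    // Sketch only: wrap one pass of the purge loop in a stopwatch.
    public class PurgeTiming {

        private static final Logger LOG = LoggerFactory.getLogger(PurgeTiming.class);

        static void timedPurgePass(Runnable purgePass) {
            long start = System.nanoTime();
            purgePass.run(); // the whole canGC()/removeDestination() loop
            long elapsedMs = (System.nanoTime() - start) / 1000000L;
            LOG.info("purgeInactiveDestinations pass took {} ms", elapsedMs);
        }
    }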


> Definitely submit an enhancement request in JIRA, even if you're not sure
> you'll need it.  Let's not lose track of the idea now that you've put all
> this time into digging into it.
>

Sounds like a plan. I have a test project that I can use now to reference
and I’ll create a JIRA for this.  I have to upgrade to 5.11 anyway so if I
can just take the current branch, build it, then create my .deb from it
then I can easily push test code into production to verify that we’ve
resolved this bug.

Kevin


Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Tim Bain
So if LevelDB cleanup during removeDestination() is the presumed culprit,
can we spin off the LevelDB cleanup work into a separate thread (better: a
task object to be serviced by a ThreadPool so you can avoid a fork bomb if
we remove many destinations at once) so the call to removeDestination() can
return quickly and LevelDB can do its record-keeping in the background
without blocking message-processing?
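
Roughly the shape I have in mind, purely as a sketch (the helper below is
hypothetical, not actual ActiveMQ code):

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Do the in-memory removal inline, but hand the slow store cleanup to a
    // small bounded pool so removeDestination() returns quickly.
    public class StoreCleanupOffload {

        // A bounded pool means removing thousands of destinations at once just
        // queues tasks instead of spawning a thread per destination.
        private final ExecutorService cleanupPool = Executors.newFixedThreadPool(2);

        public void removeDestination(String destinationName, Runnable storeCleanup) {
            // 1. Drop the destination from the broker's in-memory maps, send
            //    advisories, etc. (fast, done while the lock is held).
            removeFromMemory(destinationName);

            // 2. Defer the persistence-store cleanup so message processing
            //    isn't blocked on it.
            cleanupPool.submit(storeCleanup);
        }

        private void removeFromMemory(String destinationName) {
            // placeholder for the in-memory bookkeeping
        }
    }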

I've spent precisely zero time looking at that code so I'm shooting from
the hip, but it feels like we should be able to separate the action of
removing the destination from the broker's in-memory list (along with the
various record-keeping actions that might be required such as advisory
messages) from the action of cleaning up the LevelDB message store as it
relates to that destination; I would expect LevelDB should be capable of
doing that cleanup even if it no longer existed in ActiveMQ's in-memory
records, as long as the thread to do the LevelDB cleanup kept a reference
to the destination object.  But since that's not code I've ever looked at
and I don't claim to know the details of what's being done there, maybe
there's some race condition or edge case where that wouldn't work (e.g. if
the destination is deleted but then re-created before the LevelDB task
runs), so hopefully someone who knows that code better can shed light on
any potential flaws.

Incidentally, why would LevelDB have to do any non-trivial amount of work
to delete a destination that contains no active messages?  If that's not
fast, I'm very curious why not.


Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Kevin Burton
On Sun, Feb 22, 2015 at 3:30 PM, Tim Bain <[hidden email]> wrote:

> So if LevelDB cleanup during removeDestination() is the presumed culprit,
> can we spin off the LevelDB cleanup work into a separate thread (better: a
> task object to be serviced by a ThreadPool so you can avoid a fork bomb if
> we remove many destinations at once) so the call to removeDestination() can
> return quickly and LevelDB can do its record-keeping in the background
> without blocking message-processing?
>

Would that be possible?  If the delete is pending on ActiveMQ there is a
race where a producer could re-create it unless the lock is held.

Though I guess if you dispatched to the GC thread WITH the lock still held
you would be OK, but I think if we use the existing purge thread then we’re
fine.

OK. I think I’m wrong about LevelDB being the issue.  To be fair I wasn’t
100% certain before but I should have specified.

Our current production broker is running with persistent=false… and I
just re-ran the tests without disk persistence and it has the same problem.

So the main issue now is why the heck ActiveMQ is taking SO LONG to GC a
queue.  It’s taking about 100 ms, which is an insane amount of time
considering this is all done in memory.

Kevin


Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Kevin Burton
OK.  Loaded up JProfiler and confirmed that it’s not LevelDB.

This is a non-persistent broker I’m testing on.

Looks like it’s spending all its time in CopyOnWriteArrayList.remove and
Timer.purge…

Which is hopeful because this is ALL due to ActiveMQ internals and in
theory LevelDB should perform well if we improve the performance of
ActiveMQ internals and fix this lock bug.

Which would rock!

It should ALSO make queue creation faster.



Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Kevin Burton
And after spending some more time in JProfiler, it looks like 20% of this
is due to schedulerSupport and the other 80% is due to advisorySupport.

If I set both to false the total runtime of my tests drops in half… and
the latencies fall from

    max create producer latency: 10,176 ms
    max create message on existing producer and consumer: 2 ms

… to

    max create producer latency: 1 ms
    max create message on existing producer and consumer: 1 ms

… and this is without fixing the purge background lock.

So the question now is what the heck is the advisory support doing that can
result in such massive performance overhead.

… and I think advisorySupport is enabled by default so that’s problematic
as well.
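
For reference, I believe this is essentially all my comparison run changes
on the broker (a sketch; setAdvisorySupport and setSchedulerSupport are the
BrokerService flags I mean, and note that some tooling, e.g. networks of
brokers, relies on advisory topics, so turning them off isn’t free):

    import org.apache.activemq.broker.BrokerService;

    public class NoAdvisoryBrokerSketch {
        public static void main(String[] args) throws Exception {
            BrokerService broker = new BrokerService();
            broker.setPersistent(false);       // matches the non-persistent test broker
            broker.setAdvisorySupport(false);  // the ~80% contributor in the profile
            broker.setSchedulerSupport(false); // the ~20% contributor in the profile
            broker.start();
        }
    }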



Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Kevin Burton
Here’s a JProfiler view with advisory support enabled, if you’re curious.

http://i.imgur.com/I1jesZz.jpg

I’m not familiar enough with the internals of ActiveMQ to have any obvious
optimization ideas.

One other idea I had (which would require a ton of refactoring I think)
would be to potentially bulk delete all the queues at once.



Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Tim Bain
A decent amount of the time is being spent calling remove() on various
array-backed collections.  Those data structures might be inappropriate for
the number of destinations you're running, since array-backed collections
tend to have add/remove operations that are O(N); some improvement might
come from something as simple as moving to a ConcurrentHashSet instead of a
CopyOnWriteArraySet, for example.  (Or it might make performance worse
because of other aspects of how those collections are used; people other
than me would be in a better position to evaluate the full range of
performance requirements for those collections.)
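
For example, something along these lines (the JDK doesn’t actually ship a
ConcurrentHashSet class, so concretely I mean a ConcurrentHashMap-backed
set; the trade-off is weakly consistent iteration rather than the snapshot
semantics CopyOnWriteArraySet gives you):

    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustration only: a ConcurrentHashMap-backed set gives roughly O(1)
    // add/remove, versus the full array copy CopyOnWriteArraySet pays on
    // every mutation.
    public class ConcurrentSetExample {
        public static void main(String[] args) {
            Set<String> destinations =
                    Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
            destinations.add("queue://foo");
            destinations.add("queue://bar");
            destinations.remove("queue://foo"); // no array copy here
            System.out.println(destinations);   // prints [queue://bar]
        }
    }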

Scheduler.cancel() also takes an alarming amount of time for what looks
like a really simple method (
http://grepcode.com/file/repo1.maven.org/maven2/org.apache.activemq/activemq-all/5.10.0/org/apache/activemq/thread/Scheduler.java#Scheduler.cancel%28java.lang.Runnable%29
).


Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Kevin Burton
Yes.  That was my thinking too: just replacing the CopyOnWriteArraySet
with something more performant would solve the issue.

This would also improve queue creation time as well as queue deletion time.


What I think I’m going to do in the meantime is:

- implement a granular lock based on queue name… I am going to use an
interface so we can replace the implementation later.

- implement timing for the purge thread so it tracks how long it takes to
remove a queue but also how long the entire loop takes.

I’ll do this on a branch so it should be easy to merge.


Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Kevin Burton
So I have some working/theoretical code that should resolve this.

It acquires a lock *per* ActiveMQDestination; this way there is no lock
contention across destinations.

But here’s where I’m stuck.

    @Override
    public void removeSubscription(ConnectionContext context,
            RemoveSubscriptionInfo info) throws Exception {
        inactiveDestinationsPurgeLock.readLock().lock();
        try {
            topicRegion.removeSubscription(context, info);
        } finally {
            inactiveDestinationsPurgeLock.readLock().unlock();
        }
    }


… this is in RegionBroker.

There is no ActiveMQDestination involved here, so I’m not sure of the best
way to resolve this.

Any advice?


On Sun, Feb 22, 2015 at 8:11 PM, Kevin Burton <[hidden email]> wrote:

> Yes.  That was my thinking too.. that just replacing the CopyOnWriteArraySet
> with something more performance would solve the issue.
>
> This would also improve queue creation time as well as queue deletion
> time.
>
> What I think I”m going to do in the mean time is:
>
> - implement a granular lock based on queue name… I am going to use an
> interface so we can replace the implementation later.
>
> - implement timing for the purge thread so it tracks how long it takes to
> remove a queue but also how long the entire loop takes.
>
> I’ll do this on a branch so it should be easy to merge.
>
> On Sun, Feb 22, 2015 at 7:40 PM, Tim Bain <[hidden email]> wrote:
>
>> A decent amount of the time is being spent calling remove() on various
>> array-backed collections.  Those data structures might be inappropriate
>> for
>> the number of destinations you're running, since array-backed collections
>> tend to have add/remove operations that are O(N); some improvement might
>> come from something as simple as moving to a ConcurrentHashSet instead of
>> a
>> CopyOnWriteArraySet, for example.  (Or it might make performance worse
>> because of other aspects of how those collections are used; people other
>> than me would be in a better position to evaluate the full range of
>> performance requirements for those collections.)
>>
>> Scheduler.cancel() also takes an alarming amount of time for what looks
>> like a really simple method (
>>
>> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.activemq/activemq-all/5.10.0/org/apache/activemq/thread/Scheduler.java#Scheduler.cancel%28java.lang.Runnable%29
>> ).
>>

Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Kevin Burton
Actually, is the lock even needed here? Why would it be? If we’re
*removing* a subscription, why should it matter whether a separate,
isolated queue happens to be removed just before or after the
subscription goes away?

I think this use of the lock is redundant and could be removed, though
maybe I’m wrong.

I looked at all the callers and none of them were associated with queues.
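
For context, the per-destination locking I described in my previous message
looks roughly like this. This is a sketch only, not the actual patch; the
class and method names are illustrative. removeSubscription is the awkward
case precisely because there is no destination to key the lock on.

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    import org.apache.activemq.command.ActiveMQDestination;

    // Sketch: one ReadWriteLock per destination, created lazily on first use.
    public class DestinationLockRegistry {

        private final ConcurrentMap<ActiveMQDestination, ReadWriteLock> locks =
            new ConcurrentHashMap<ActiveMQDestination, ReadWriteLock>();

        // Returns the lock for the given destination, creating it if absent.
        public ReadWriteLock lockFor(ActiveMQDestination destination) {
            ReadWriteLock lock = locks.get(destination);
            if (lock == null) {
                ReadWriteLock candidate = new ReentrantReadWriteLock();
                ReadWriteLock existing = locks.putIfAbsent(destination, candidate);
                lock = existing != null ? existing : candidate;
            }
            return lock;
        }
    }

The idea is that addProducer/addConsumer take lockFor(dest).readLock() while
the purge thread takes lockFor(dest).writeLock() around the inactivity check
and removal of that single destination, so purging one inactive queue never
blocks a new producer or consumer on a different queue.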


Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Kevin Burton
OK. This is ready to go and I have a patch branch:

https://issues.apache.org/jira/browse/AMQ-5609

I’m stuck at the moment, though, because the test suite doesn’t pass. But
the same tests were failing before my changes, so I don’t think the
failures have anything to do with this patch.




Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

gtully
If there are any test failures, try running them individually with
-Dtest=a,b etc. There may be an issue with a full test run, but all of the
tests that are enabled should pass on their own. I know there are some
issues with the JDBC tests that hang or fail because previous runs did not
clean up after themselves, but that should account for most of it. I got a
bunch of clean full test runs before the 5.11 release, if that is any help.
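
For example, to rerun a couple of failing tests on their own you can do
something like the following (the test class names here are placeholders,
and this assumes the tests in question live under the activemq-unit-tests
module):

    cd activemq-unit-tests
    mvn test -Dtest=SomeFailingTest,AnotherFailingTest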


Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Kevin Burton
How long do the tests usually take? I was about 45 minutes in before I gave
up… I think part of it is that the box I was testing on was virtualized and
didn’t have enough resources.

I tried to parallelize the tests (-T 8 with Maven) but got other errors, so
I assume the tests bind to fixed ports and can’t run concurrently.


Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

tabish121@gmail.com
On 02/24/2015 01:01 PM, Kevin Burton wrote:
> How long do the tests usually take?

An eternity + 1hr.

> I tried to parallelize the tests (-T 8 with Maven) but got other errors,
> so I assume the tests bind to fixed ports and can’t run concurrently.

They won't be happy if you run them in parallel.



--
Tim Bish
Sr Software Engineer | RedHat Inc.
[hidden email] | www.redhat.com
skype: tabish121 | twitter: @tabish121
blog: http://timbish.blogspot.com/


Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

Kevin Burton
Oh geez… seriously though, can you give me an estimate?

Assuming about 30 seconds per test, I’m estimating roughly 11 hours.

http://activemq.apache.org/junit-reports.html
https://hudson.apache.org/hudson/job/ActiveMQ/

The Hudson job for ActiveMQ is down.

Also, is there a disk space requirement for the tests? I have about 10GB
free and I’m getting TONS of these:


java.lang.RuntimeException: Failed to start provided job scheduler store: JobSchedulerStore:activemq-data
    at java.io.RandomAccessFile.seek(Native Method)
    at org.apache.activemq.util.RecoverableRandomAccessFile.seek(RecoverableRandomAccessFile.java:384)
    at org.apache.activemq.store.kahadb.disk.page.PageFile.readPage(PageFile.java:877)
    at org.apache.activemq.store.kahadb.disk.page.Transaction.load(Transaction.java:427)
    at org.apache.activemq.store.kahadb.disk.page.Transaction.load(Transaction.java:377)
    at org.apache.activemq.store.kahadb.disk.index.BTreeIndex.load(BTreeIndex.java:159)
    at org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl$MetaData.load(JobSchedulerStoreImpl.java:90)
    at org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl$3.execute(JobSchedulerStoreImpl.java:277)
    at org.apache.activemq.store.kahadb.disk.page.Transaction.execute(Transaction.java:779)
    at org.apache.activemq.store.kahadb.scheduler.JobSchedulerStoreImpl.doStart(JobSchedulerStoreImpl.java:261)
    at org.apache.activemq.util.ServiceSupport.start(ServiceSupport.java:55)
    at org.apache.activemq.broker.BrokerService.setJobSchedulerStore(BrokerService.java:1891)
    at org.apache.activemq.transport.stomp.StompTestSupport.createBroker(StompTestSupport.java:174)
    at org.apache.activemq.transport.stomp.StompTestSupport.startBroker(StompTestSupport.java:112)
    at org.apache.activemq.transport.stomp.StompTestSupport.setUp(StompTestSupport.java:94)
    at org.apache.activemq.transport.stomp.Stomp12Test.setUp(Stomp12Test.java:41)
    at org.apache.activemq.transport.stomp.Stomp12SslAuthTest.setUp(Stomp12SslAuthTest.java:38)

On Tue, Feb 24, 2015 at 10:09 AM, Timothy Bish <[hidden email]> wrote:

> On 02/24/2015 01:01 PM, Kevin Burton wrote:
>
>> How long do the tests usually take?
>>
>
> An eternity + 1hr
>
>  I’m looking at 45 minutes right now
>> before I gave up… I think part of it was that the box I was testing on was
>> virtualized and didn’t have enough resources.
>>
>> I tried to parallelize the tests (-T 8 with maven) but I got other errors
>> so I assume the ports are singletons.
>>
> They won't be happy if you run them parallel.
>
>
>
>> On Tue, Feb 24, 2015 at 8:03 AM, Gary Tully <[hidden email]> wrote:
>>
>>  if there are any test failures - try to run them individually
>>> -Dtest=a,b etc. There may be an issue with a full test run, but all of
>>> the tests that are enabled should work. I know there are some issues
>>> with jdbc tests that hang or fail due to previous runs not cleaning up,
>>> but that should be the most of it. I got a bunch of full test runs
>>> before the 5.11 release if that is any help.
>>>
>>> On 23 February 2015 at 20:38, Kevin Burton <[hidden email]> wrote:
>>>
>>>> OK.  This is ready to go and I have a patch branch:
>>>>
>>>> https://issues.apache.org/jira/browse/AMQ-5609
>>>>
>>>> I’m stuck at the moment though because tests don’t pass.  But it was
>>>> failing tests before so I don’t think it has anything to do with my
>>>> changes.
>>>>
>>>>
>>>>
>>>> On Sun, Feb 22, 2015 at 11:11 PM, Kevin Burton <[hidden email]>
>>>>
>>> wrote:
>>>
>>>> Actually, is the lock even needed here?  Why would it be?  if we’re
>>>>> *removing* a subscription, why does it care if we possibly ALSO remove
>>>>> a
>>>>> separate / isolated queue before/after the subscription is removed.
>>>>>
>>>>> I think this is redundant and can be removed.  Maybe I’m wrong though.
>>>>>
>>>>> I looked at all the callers and none were associated with queues.
>>>>>
>>>>> On Sun, Feb 22, 2015 at 11:07 PM, Kevin Burton <[hidden email]>
>>>>>
>>>> wrote:
>>>
>>>> So I have some working/theoretical code that should resolve this.
>>>>>>
>>>>>> It acquires a lock *per* ActiveMQDestination, this way there is no
>>>>>> lock
>>>>>> contention.
>>>>>>
>>>>>> But here’s where I’m stuck.
>>>>>>
>>>>>>      @Override
>>>>>>
>>>>>>>      public void removeSubscription(ConnectionContext context,
>>>>>>> RemoveSubscriptionInfo info) throws Exception {
>>>>>>>          inactiveDestinationsPurgeLock.readLock().lock();
>>>>>>>          try {
>>>>>>>              topicRegion.removeSubscription(context, info);
>>>>>>>          } finally {
>>>>>>>              inactiveDestinationsPurgeLock.readLock().unlock();
>>>>>>>          }
>>>>>>>      }
>>>>>>>
>>>>>>
>>>>>> .. this is in RegionBroker.
>>>>>>
>>>>>> There is no ActiveMQDestination involved here so I’m not sure the best
>>>>>> way to resolve this.
>>>>>>
>>>>>> Any advice?
>>>>>>
>>>>>>
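(Illustrative sketch only, not the AMQ-5609 patch: one way the per-destination lock registry described above could be shaped. The class and method names here are hypothetical and Java 8 syntax is assumed. Callers with no destination at hand, such as removeSubscription(), could either fall back to a broker-wide lock or skip the purge lock entirely if it really is redundant there.)

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.locks.ReadWriteLock;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Hypothetical sketch: one read/write lock per destination name, created lazily.
    public class DestinationPurgeLocks {

        private final ConcurrentMap<String, ReadWriteLock> locks = new ConcurrentHashMap<>();

        // Broker-wide fallback for operations (like removeSubscription) that are
        // not tied to a single destination.
        private final ReadWriteLock global = new ReentrantReadWriteLock();

        public ReadWriteLock lockFor(String destinationName) {
            return locks.computeIfAbsent(destinationName, k -> new ReentrantReadWriteLock());
        }

        public ReadWriteLock globalLock() {
            return global;
        }
    }

The intent would be for addProducer()/addConsumer() to take lockFor(dest).readLock() while the purge thread takes lockFor(dest).writeLock() around the GC of that single destination, so producers on unrelated queues never wait.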
>>>>>> On Sun, Feb 22, 2015 at 8:11 PM, Kevin Burton <[hidden email]>
>>>>>>
>>>>> wrote:
>>>
>>>>>>> Yes.  That was my thinking too.. that just replacing the
>>>>>>> CopyOnWriteArraySet with something more performant would solve the issue.
>>>>>>>
>>>>>>> This would also improve queue creation time as well as queue deletion
>>>>>>> time.
>>>>>>>
>>>>>>> What I think I'm going to do in the meantime is:
>>>>>>>
>>>>>>> - implement a granular lock based on queue name… I am going to use an
>>>>>>> interface so we can replace the implementation later.
>>>>>>>
>>>>>>> - implement timing for the purge thread so it tracks how long it
>>>>>>> takes
>>>>>>> to remove a queue but also how long the entire loop takes.
>>>>>>>
>>>>>>> I’ll do this on a branch so it should be easy to merge.
>>>>>>>
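(Purely illustrative: a minimal sketch of the per-queue and whole-pass timing Kevin proposes here. purgeOne() is a stand-in for the real removal work, not an ActiveMQ API.)

    import java.util.List;
    import java.util.concurrent.TimeUnit;

    public class PurgeTiming {

        // Times each individual queue GC as well as the whole purge pass.
        public static void timedPurge(List<String> inactiveQueues) {
            long passStart = System.nanoTime();
            for (String queue : inactiveQueues) {
                long start = System.nanoTime();
                purgeOne(queue);
                System.out.printf("GC of %s took %d ms%n",
                        queue, TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start));
            }
            System.out.printf("full purge pass took %d ms%n",
                    TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - passStart));
        }

        private static void purgeOne(String queue) {
            // placeholder for the real per-queue removeDestination() work
        }
    }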
>>>>>>> On Sun, Feb 22, 2015 at 7:40 PM, Tim Bain <[hidden email]>
>>>>>>>
>>>>>> wrote:
>>>
>>>> A decent amount of the time is being spent calling remove() on
>>>>>>>>
>>>>>>> various
>>>
>>>> array-backed collections.  Those data structures might be
>>>>>>>>
>>>>>>> inappropriate
>>>
>>>> for
>>>>>>>> the number of destinations you're running, since array-backed
>>>>>>>> collections
>>>>>>>> tend to have add/remove operations that are O(N); some improvement
>>>>>>>>
>>>>>>> might
>>>
>>>> come from something as simple as moving to a ConcurrentHashSet
>>>>>>>>
>>>>>>> instead
>>>
>>>> of a
>>>>>>>> CopyOnWriteArraySet, for example.  (Or it might make performance
>>>>>>>>
>>>>>>> worse
>>>
>>>> because of other aspects of how those collections are used; people
>>>>>>>>
>>>>>>> other
>>>
>>>> than me would be in a better position to evaluate the full range of
>>>>>>>> performance requirements for those collections.)
>>>>>>>>
>>>>>>>> Scheduler.cancel() also takes an alarming amount of time for what
>>>>>>>>
>>>>>>> looks
>>>
>>>> like a really simple method (
>>>>>>>>
>>>>>>>>
>>>>>>>>  http://grepcode.com/file/repo1.maven.org/maven2/org.
>>> apache.activemq/activemq-all/5.10.0/org/apache/activemq/
>>> thread/Scheduler.java#Scheduler.cancel%28java.lang.Runnable%29
>>>
>>>> ).
>>>>>>>>
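(A small illustration of the collection trade-off described above, not ActiveMQ code: CopyOnWriteArraySet copies its backing array on every add/remove, so remove() is O(n), while a ConcurrentHashMap-backed set removes in roughly O(1) but offers weakly consistent rather than snapshot iterators, which is exactly the behavioural difference that would need checking before swapping it in.)

    import java.util.Collections;
    import java.util.Set;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.CopyOnWriteArraySet;

    public class SubscriberSetChoices {
        // O(n) add/remove, but iterators see a stable snapshot.
        Set<String> cowSet = new CopyOnWriteArraySet<>();

        // O(1) add/remove, weakly consistent iterators (Java 8+).
        Set<String> hashBackedSet = ConcurrentHashMap.newKeySet();

        // Pre-Java-8 equivalent of the above.
        Set<String> hashBackedSetPre8 =
                Collections.newSetFromMap(new ConcurrentHashMap<String, Boolean>());
    }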
>>>>>>>> On Sun, Feb 22, 2015 at 7:56 PM, Kevin Burton <[hidden email]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>  Here’s a jprofiler view with the advisory support enabled if you’re
>>>>>>>>> curious.
>>>>>>>>>
>>>>>>>>> http://i.imgur.com/I1jesZz.jpg
>>>>>>>>>
>>>>>>>>> I’m not familiar with the internals of ActiveMQ enough to have any
>>>>>>>>>
>>>>>>>> obvious
>>>>>>>>
>>>>>>>>> optimization ideas.
>>>>>>>>>
>>>>>>>>> One other idea I had (which would require a ton of refactoring I
>>>>>>>>>
>>>>>>>> think)
>>>>>>>>
>>>>>>>>> would be to potentially bulk delete all the queues at once.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Feb 22, 2015 at 6:42 PM, Kevin Burton <[hidden email]>
>>>>>>>>>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>> And spending some more time in jprofiler, it looks like 20% of this is
>>>>>>>>>> due to schedulerSupport and the other 80% is due to advisorySupport.
>>>>>>>>>> If I set both to false the total runtime of my tests drops in half…
>>>>>>>>>> and the latencies fall from
>>>>>>>>>>
>>>>>>>>>> max create producer latency: 10,176 ms
>>>>>>>>>>
>>>>>>>>>>> max create message on existing producer and consumer: 2 ms
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> … to
>>>>>>>>>>
>>>>>>>>>> max create producer latency: 1 ms
>>>>>>>>>>
>>>>>>>>>>> max create message on existing producer and consumer: 1 ms
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> and this is without even fixing the purge background lock.
>>>>>>>>>>
>>>>>>>>>> So the question now is what the heck is the advisory support
>>>>>>>>>>
>>>>>>>>> doing
>>>
>>>> that
>>>>>>>>
>>>>>>>>> can result in such massive performance overhead.
>>>>>>>>>>
>>>>>>>>>> … and I think advisorySupport is enabled by default so that’s
>>>>>>>>>>
>>>>>>>>> problematic
>>>>>>>>
>>>>>>>>> as well.
>>>>>>>>>>
>>>>>>>>>>
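(For anyone wanting to reproduce the comparison above: both switches are ordinary BrokerService properties, so a non-persistent test broker with them disabled can be set up roughly as below. The connector URI is illustrative, not taken from Kevin's test, and note that advisories are relied on by some features such as networks of brokers, so turning them off is not free.)

    import org.apache.activemq.broker.BrokerService;

    public class LeanTestBroker {
        public static void main(String[] args) throws Exception {
            BrokerService broker = new BrokerService();
            broker.setPersistent(false);       // non-persistent broker, as in the test above
            broker.setAdvisorySupport(false);  // removes the advisory-topic overhead being measured
            broker.setSchedulerSupport(false); // removes the job-scheduler overhead being measured
            broker.addConnector("tcp://localhost:61616");
            broker.start();
            // ... run the producer/consumer workload, then:
            broker.stop();
        }
    }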
>>>>>>>>>> On Sun, Feb 22, 2015 at 4:45 PM, Kevin Burton <
>>>>>>>>>>
>>>>>>>>> [hidden email]>
>>>
>>>> wrote:
>>>>>>>>>
>>>>>>>>>> OK.  Loaded up JProfiler and confirmed that it’s not LevelDB.
>>>>>>>>>>>
>>>>>>>>>>> This is a non-persistent broker I’m testing on.
>>>>>>>>>>>
>>>>>>>>>>> Looks like it's spending all its time in
>>>>>>>>>>>
>>>>>>>>>> CopyOnWriteArrayList.remove
>>>>>>>>
>>>>>>>>> and
>>>>>>>>>
>>>>>>>>>> Timer.purge…
>>>>>>>>>>>
>>>>>>>>>>> Which is hopeful because this is ALL due to ActiveMQ internals
>>>>>>>>>>>
>>>>>>>>>> and
>>>
>>>> in
>>>>>>>>
>>>>>>>>> theory LevelDB should perform well if we improve the
>>>>>>>>>>>
>>>>>>>>>> performance of
>>>
>>>> ActiveMQ internals and fix this lock bug.
>>>>>>>>>>>
>>>>>>>>>>> Which would rock!
>>>>>>>>>>>
>>>>>>>>>>> It should ALSO make queue creation faster.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sun, Feb 22, 2015 at 4:10 PM, Kevin Burton <
>>>>>>>>>>>
>>>>>>>>>> [hidden email]>
>>>
>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>> On Sun, Feb 22, 2015 at 3:30 PM, Tim Bain <
>>>>>>>>>>>>
>>>>>>>>>>> [hidden email]>
>>>
>>>> wrote:
>>>>>>>>>
>>>>>>>>>> So if LevelDB cleanup during removeDestination() is the
>>>>>>>>>>>>>
>>>>>>>>>>>> presumed
>>>
>>>> culprit,
>>>>>>>>>>>>> can we spin off the LevelDB cleanup work into a separate
>>>>>>>>>>>>>
>>>>>>>>>>>> thread
>>>
>>>> (better: a
>>>>>>>>>>>>> task object to be serviced by a ThreadPool so you can avoid a
>>>>>>>>>>>>>
>>>>>>>>>>>> fork
>>>>>>>>
>>>>>>>>> bomb
>>>>>>>>>
>>>>>>>>>> if
>>>>>>>>>>>>> we remove many destinations at once) so the call to
>>>>>>>>>>>>>
>>>>>>>>>>>> removeDestination()
>>>>>>>>>
>>>>>>>>>> can
>>>>>>>>>>>>> return quickly and LevelDB can do its record-keeping in the
>>>>>>>>>>>>>
>>>>>>>>>>>> background
>>>>>>>>
>>>>>>>>> without blocking message-processing?
>>>>>>>>>>>>>
>>>>>>>>>>>>>  Would that be possible?  If the delete is pending on ActiveMQ
>>>>>>>>>>>>
>>>>>>>>>>> there is
>>>>>>>>
>>>>>>>>> a
>>>>>>>>>
>>>>>>>>>> race where a producer could re-create it unless the lock is
>>>>>>>>>>>>
>>>>>>>>>>> held.
>>>
>>>> Though I guess if you dispatched to the GC thread WITH the lock
>>>>>>>>>>>>
>>>>>>>>>>> still
>>>>>>>>
>>>>>>>>> held you would be ok but I think if we use the existing purge
>>>>>>>>>>>>
>>>>>>>>>>> thread
>>>>>>>>
>>>>>>>>> then
>>>>>>>>>
>>>>>>>>>> we’re fine.
>>>>>>>>>>>>
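(A hedged sketch of how Tim Bain's thread-pool suggestion and the re-creation race discussed above might be reconciled: a ReentrantReadWriteLock write lock has to be released by the thread that acquired it, so rather than handing off a held lock, each pooled task re-acquires the per-destination write lock and re-checks that the destination is still inactive before removing it. All names below are hypothetical, not ActiveMQ APIs; Java 8 syntax assumed.)

    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    public class BackgroundPurge {

        private final ExecutorService pool = Executors.newFixedThreadPool(4); // bounded, no fork bomb
        private final ConcurrentMap<String, ReentrantReadWriteLock> locks = new ConcurrentHashMap<>();

        public void schedulePurge(String destinationName) {
            pool.submit(() -> {
                ReentrantReadWriteLock lock =
                        locks.computeIfAbsent(destinationName, k -> new ReentrantReadWriteLock());
                lock.writeLock().lock();
                try {
                    // Re-check under the lock: a producer that arrived in the
                    // meantime wins and the destination is kept.
                    if (isStillInactive(destinationName)) {
                        doRemoveDestination(destinationName);
                    }
                } finally {
                    lock.writeLock().unlock();
                }
            });
        }

        private boolean isStillInactive(String name) { return true; }  // placeholder
        private void doRemoveDestination(String name) { }              // placeholder
    }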
>>>>>>>>>>>> OK. I think I’m wrong about LevelDB being the issue.  To be
>>>>>>>>>>>>
>>>>>>>>>>> fair I
>>>
>>>> wasn’t 100% certain before but I should have specified.
>>>>>>>>>>>>
>>>>>>>>>>>> Our current production broker is running with
>>>>>>>>>>>>
>>>>>>>>>>> persistent=false..
>>>
>>>> .and I
>>>>>>>>
>>>>>>>>> just re-ran the tests without disk persistence and it has the
>>>>>>>>>>>>
>>>>>>>>>>> same
>>>
>>>> problem.
>>>>>>>>>
>>>>>>>>>>>> So the main issue now is why the heck is ActiveMQ taking SO
>>>>>>>>>>>>
>>>>>>>>>>> LONG
>>>
>>>> to GC
>>>>>>>>
>>>>>>>>> a
>>>>>>>>>
>>>>>>>>>> queue.  It’s taking about 100ms which is an insane amount of
>>>>>>>>>>>>
>>>>>>>>>>> time
>>>
>>>> considering this is done all in memory.
>>>>>>>>>>>>
>>>>>>>>>>>> Kevin
>>>>>>>>>>>>


--

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>
Reply | Threaded
Open this post in threaded view
|

Re: A proposal to rewrite purgeInactiveDestinations locking to prevent queue GC lockups.

tabish121@gmail.com
On 02/24/2015 01:22 PM, Kevin Burton wrote:
> oh  geez… seriously though.  can you give me an estimate?
>
> Assuming 30 seconds each I’m assuming about 11 hours.

That's probably not too far off; usually I just let them run overnight for
the all profile.

>
> http://activemq.apache.org/junit-reports.html
> https://hudson.apache.org/hudson/job/ActiveMQ/
> Hudson is down for ActiveMQ.
>
> Also, is there a disk space requirement for the tests? I have 10GB free or
> so and I’m getting TONS of these:
I don't think anyone has looked into the minimum free disk space required for
the tests to run.



--
Tim Bish
Sr Software Engineer | RedHat Inc.
[hidden email] | www.redhat.com
skype: tabish121 | twitter: @tabish121
blog: http://timbish.blogspot.com/
