Diagnosing long term ActiveMQ memory leaks?


Diagnosing long term ActiveMQ memory leaks?

Kevin Burton
I’m trying to diagnose a long term memory leak with ActiveMQ.

Basically, my app runs fine for about a week or so, then goes to 100% CPU
doing continual back-to-back full GCs.

No work is done during that period.

I have a large number of sessions to the AMQ box, but things are fine on
startup.

It’s entirely possible that my app isn’t releasing resources, but I’m trying
to figure out the best way to track that down.

I’m using org.apache.activemq.UseDedicatedTaskRunner=false so that thread
pools are used, which apparently can cause a bit of wasted memory.

I have a heap snapshot.  I loaded that into the Eclipse Memory Analyzer and
didn’t see any obvious candidates but of course I’m not an expert on the
ActiveMQ code base.
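
In case it helps, this is roughly how a dump can be triggered from inside
the JVM so successive snapshots can be diffed in MAT later (untested sketch
using the HotSpot-specific diagnostic MXBean; jmap
-dump:live,format=b,file=heap.hprof <pid> does the same from outside the
process):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

public class HeapDumps {
    // Call from inside the JVM being investigated; HotSpot-only API.
    public static void dump(String path) throws IOException {
        HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        diag.dumpHeap(path, true);  // true = only live (reachable) objects
    }
}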

Are there any solid JMX counters I can track during this process?  Number
of sessions? etc.
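
Something along these lines is what I was picturing, if those attribute
names are solid (untested sketch; it assumes the broker has remote JMX
enabled on localhost:1099 and uses the 5.8+ MBean naming, so the names are
easy to double-check in JConsole first):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class BrokerCounters {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            // Broker-level view; brokerName must match the broker's configured name.
            ObjectName broker = new ObjectName(
                    "org.apache.activemq:type=Broker,brokerName=localhost");
            // Connection counter may not be present on older 5.x releases.
            System.out.println("Connections: "
                    + conn.getAttribute(broker, "TotalConnectionsCount"));
            System.out.println("Consumers:   "
                    + conn.getAttribute(broker, "TotalConsumerCount"));
            System.out.println("Messages:    "
                    + conn.getAttribute(broker, "TotalMessageCount"));
            System.out.println("Mem % used:  "
                    + conn.getAttribute(broker, "MemoryPercentUsage"));
        } finally {
            jmxc.close();
        }
    }
}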

--

Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>

Re: Diagnosing long term ActiveMQ memory leaks?

Tim Bain
What JVM are you using, and what GC strategy with which options?  And for
that matter, what broker version?

With Hotspot 7u21 and G1GC, while running a long-running performance stress
test, I've observed that Old Gen use increases over time (despite the fact
that G1GC is supposed to collect Old Gen during its normal collection
operations), and GCs against Old Gen happen semi-continually after Old Gen
hits a certain memory threshold.  However, unlike what you're observing, 1)
the GCs I saw were Old Gen GCs but not full GCs (G1 allows GCing Old Gen
during incremental GCs), 2) the broker remains responsive with reasonable
pause times close to my target, and 3) once Old Gen hits the 90% threshold
that forces a full GC, that full GC is able to successfully collect nearly
all of the Old Gen memory.  My conclusion from that was that although
objects were being promoted to Old Gen (and I tried unsuccessfully to
prevent that from occurring, see
http://activemq.2283324.n4.nabble.com/Potential-Bug-in-Master-Slave-with-Replicated-LevelDB-Store-td4686450.html),
nearly all of them were unreachable by the time a full GC actually occurred.

So if you're seeing continual full GCs (not just Old Gen GCs if you're
using G1) that don't actually free any Old Gen memory, then you're seeing
different behavior than I saw, and it means that the objects in Old
Gen are still reachable.  One possible reason for that would be messages
still being held in destinations waiting to be consumed; look for queues
without consumers (especially DLQs), as well as durable subscribers that
are offline.  If you're certain that's not the case, maybe you can post
some of the results of analyzing the heap snapshot so that people who know
the codebase better could see if anything jumps out?
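
A rough way to scan for that over JMX is something like the sketch below
(untested; the ObjectName pattern is the 5.8+ naming scheme, so adjust
brokerName, host, and port for your setup).  Anything that prints with a
non-zero depth and zero consumers is a place messages can pile up and keep
heap pinned:

import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class FindStuckQueues {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi");
        JMXConnector jmxc = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = jmxc.getMBeanServerConnection();
            // Matches every queue on the broker; DLQs show up here too.
            ObjectName pattern = new ObjectName(
                    "org.apache.activemq:type=Broker,brokerName=localhost,"
                    + "destinationType=Queue,destinationName=*");
            Set<ObjectName> queues = conn.queryNames(pattern, null);
            for (ObjectName q : queues) {
                long depth = (Long) conn.getAttribute(q, "QueueSize");
                long consumers = (Long) conn.getAttribute(q, "ConsumerCount");
                if (depth > 0 && consumers == 0) {
                    System.out.println(q.getKeyProperty("destinationName")
                            + ": " + depth + " messages, no consumers");
                }
            }
        } finally {
            jmxc.close();
        }
    }
}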


Re: Diagnosing long term ActiveMQ memory leaks?

Kevin Burton
Great feedback, thanks.  I’m working on getting better JMX monitors up so I
can track memory here more aggressively.  Bumping up memory by 1.5 GB
temporarily fixed the problem.  However, it seems correlated to the number
of connections, so I suspect I’ll just hit this again in the next few
weeks.

By that time I plan to have better JMX monitors in place to resolve this.
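
For the memory side I’m thinking the platform MXBeans are enough for a
first pass; roughly this (untested sketch), run inside the JVM being
watched, or the same attributes read remotely over JMX:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class HeapWatcher {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            System.out.printf("heap used=%dMB committed=%dMB max=%dMB%n",
                    heap.getUsed() >> 20, heap.getCommitted() >> 20, heap.getMax() >> 20);
            // Old-gen pool name varies by collector: "PS Old Gen", "CMS Old Gen", "G1 Old Gen".
            for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
                if (pool.getName().contains("Old")) {
                    MemoryUsage u = pool.getUsage();
                    System.out.printf("  %s used=%dMB max=%dMB%n",
                            pool.getName(), u.getUsed() >> 20, u.getMax() >> 20);
                }
            }
            Thread.sleep(60000);
        }
    }
}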


Re: Diagnosing long term ActiveMQ memory leaks?

Tim Bain
BTW, you never did answer the question about which GC strategy you're
using.  It occurred to me that if you're using CMS, lots of full GCs that
don't actually reclaim much memory after a long uptime is the classic
failure scenario for CMS.  It happens when Old Gen gets fragmented, which
in turn happens because CMS is a non-compacting GC strategy in Old Gen.  If
you're using CMS and seeing continual full GCs, you should look at whether
G1 would be better for your needs.


Re: Diagnosing long term ActiveMQ memory leaks?

Kevin Burton
I’m on the default GC, which I think is still the parallel collector, but I
should explicitly set it.  I’m working on getting JMX monitors up so that I
can track ActiveMQ counters but also GC state.
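
A quick way to confirm which collectors are actually running (and to get
counters worth graphing) seems to be the GC MXBeans; something like this
untested sketch, with the collector pinned explicitly via -XX:+UseParallelGC
or -XX:+UseG1GC once I decide:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcCheck {
    public static void main(String[] args) {
        // The bean names identify the strategy: "PS Scavenge"/"PS MarkSweep" = parallel,
        // "ParNew"/"ConcurrentMarkSweep" = CMS, "G1 Young Generation"/"G1 Old Generation" = G1.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collections=%d timeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}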

I can’t get a reliable, short-term reproduction of the failure, so right
now I’ve focused on mitigating the issue and on easily rebuilding my queue
when it crashes.


Re: Diagnosing long term ActiveMQ memory leaks?

Tim Bain
Yeah, parallel is still the default collector even in Java 8, as far as I
know, so the CMS concern sounds like a non-issue.
