** JMS Client HANGING - AMQ 5.9, AIX 6.1

** JMS Client HANGING - AMQ 5.9, AIX 6.1

cdelgado
Hi all,

We're facing an issue that is stopping us from going to production; this is a huge blocker for us.

The problem is that one of our consumers hangs (apparently at random) and stops consuming messages. From JMX we can see that it keeps consuming heap memory and triggering full GCs quite often.

I'm attaching a javacore dump generated by sending kill -3 to the process. There you can see all the details and thread statuses.

javacore.txt <http://activemq.2283324.n4.nabble.com/file/n4698204/javacore.txt>

Basically, we have 90.7% of the threads waiting on condition, 3.5% Parked and 5.7% Running.

The Parked threads have different stack traces, but they generally end in the same block:

at sun/misc/Unsafe.park(Native Method)
at java/util/concurrent/locks/LockSupport.parkNanos(LockSupport.java:222(Compiled Code))
at java/util/concurrent/SynchronousQueue$TransferStack.awaitFulfill(SynchronousQueue.java:435(Compiled Code))
at java/util/concurrent/SynchronousQueue$TransferStack.transfer(SynchronousQueue.java:334(Compiled Code))
at java/util/concurrent/SynchronousQueue.poll(SynchronousQueue.java:885(Compiled Code))
at java/util/concurrent/ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:966(Compiled Code))
at java/util/concurrent/ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:928)
at java/lang/Thread.run(Thread.java:761)

Any *quick* help would be much appreciated; I'm a bit lost here... :S

Carlos

Re: ** JMS Client HANGING - AMQ 5.9, AIX 6.1

Tim Bain
The stack trace you quoted is irrelevant; it's just executors waiting to be
given work to do.  There are also lots of threads trying to read messages
from sockets in
org/apache/activemq/transport/tcp/TcpBufferedInputStream.fill() or waiting
for a message to be available during a call to
org/apache/activemq/SimplePriorityMessageDispatchChannel.dequeue(); both of
those are also irrelevant, because they're just ActiveMQ waiting to be
given work.

There are two threads waiting for responses to synchronous sends in
org/apache/activemq/ActiveMQConnection.syncSendPacket().  Those might
simply be victims of the inability to read messages, or they might be
relevant to what's going on; it's hard to tell from what you've sent.  One
thing I'd check based on them (and one thing I'd always check in general,
so hopefully you've already done this) is whether there are any errors in
the ActiveMQ broker logs, and specifically whether there are any messages
about producer flow control kicking in.  Depending on how PFC is
configured, I believe I've seen at least one JIRA or wiki page describing
the potential for PFC to cause deadlock when synchronous sends are used by
preventing the acks from being read.  If you see PFC-related lines in the
broker logs, we'll go from there; if not, then don't worry about this.
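
For reference, PFC is configured per destination in the broker's
activemq.xml; a rough sketch of the relevant section (the ">" queue
pattern and the 64mb limit below are only placeholders) looks like this:

    <!-- inside <broker> in activemq.xml -->
    <destinationPolicy>
      <policyMap>
        <policyEntries>
          <policyEntry queue=">" producerFlowControl="true" memoryLimit="64mb"/>
        </policyEntries>
      </policyMap>
    </destinationPolicy>

Turning producerFlowControl off (or setting sendFailIfNoSpace="true" on
the systemUsage element so blocked sends fail fast instead of hanging)
changes that behavior, but only go down that road if the broker logs
actually point at PFC.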

My overall thought, however, is that ActiveMQ (and the Spring JMS library
you're using) on its own isn't likely to run your client out of memory
unless your messages are VERY large, because there are limits on how many
messages will be transferred to your client at any one time.  Plus this
code has been run by LOTS of people over the years; if it caused OOMs on
its own, the cause would almost certainly have already been found.  So it's
most likely that this behavior is caused by something your own code is
doing, and the most likely guess is that you may be wrongly holding a
reference to objects that could otherwise be GCed, increasing heap memory
over time till you eventually run out.  You'll probably want to use tools
such as JVisualVM to analyze your memory usage and figure out what objects
are the ones causing it to grow and what's holding a reference to them.
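
To make the "wrongly holding a reference" idea concrete, here is a
contrived sketch of the kind of pattern to look for; the LeakyListener
class and its cache are invented for illustration, not taken from your
code:

    import java.util.ArrayList;
    import java.util.List;
    import javax.jms.Message;
    import javax.jms.MessageListener;
    import javax.jms.TextMessage;

    public class LeakyListener implements MessageListener {
        // Static, ever-growing collection: every id stays reachable forever,
        // so no GC cycle (young or full) can ever reclaim it.
        private static final List<String> processedIds = new ArrayList<String>();

        public void onMessage(Message message) {
            try {
                String id = ((TextMessage) message).getText();
                processedIds.add(id); // "audit trail" that is never cleared
                // ... look up the DB record for id and process it ...
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

A heap analysis tool will show you exactly this sort of thing: which
objects dominate the heap and which references are keeping them alive.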

One other possibility is that your algorithm is correct, but processing
each message is memory-intensive (using over half the heap in total across
however many messages you're processing in parallel) and so lots of objects
are getting forced into Old Gen even though they're actually short-lived
objects, and they are only getting removed from Old Gen via full GCs.  I
think this is far less likely than the other things I've described, but if
it's the problem, you could 1) increase the JVM's heap size if possible, 2)
tweak the percentages allocated to Old Gen and Young Gen to give more to
Young Gen in the hopes that more things will stay in Young Gen for longer,
or 3) look into other GC strategies (I'd recommend G1GC, but you appear to
be on the IBM JVM and I've never used it or researched it so I don't know
what GC strategies it offers).  But I think you'd really want to prove to
yourself that this is your problem (i.e. that none of the other things I've
mentioned are) before you go down this path, because throwing more memory
at a memory leak doesn't fix it, it just delays it and makes it harder to
troubleshoot.
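
Purely as an illustration (these are the HotSpot spellings with made-up
sizes; your IBM JVM uses different names for some of them), options 1-3
map roughly onto startup flags like:

    # 1) a bigger heap
    -Xmx2g
    # 2) a larger share for Young Gen (HotSpot: NewRatio=2 means Old:Young = 2:1)
    -XX:NewRatio=2
    # 3) a different collector (HotSpot G1, Java 7 and later)
    -XX:+UseG1GC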

Tim


Re: ** JMS Client HANGING - AMQ 5.9, AIX 6.1

paulgale
Carlos,

You appear to be running ActiveMQ 5.9.0 on Java 6, both of which are quite old.
Try upgrading to the latest ActiveMQ (5.11.1) with Java 7 or 8 and, as
Tim pointed out, enable the G1GC garbage collector. Once you've done that,
remove the -Xms and -Xmn flags. See if that helps.
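
If the client builds with Maven, bumping just the client-side jar is a
small change; something along these lines (swap in activemq-all if that
is what you depend on today):

    <dependency>
      <groupId>org.apache.activemq</groupId>
      <artifactId>activemq-client</artifactId>
      <version>5.11.1</version>
    </dependency>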

On the off chance the broker is being asked to handle messages that are
larger than you're expecting, add the option
wireFormat.maxFrameSize=<some_byte_value> to the TCP/NIO transport
connector definition in activemq.xml. This will cause the sending of
messages larger than the configured threshold to fail.
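
For example, something along these lines in activemq.xml, where the
connector name and port are the shipped defaults and the 10 MB limit is
only a placeholder:

    <transportConnectors>
      <!-- sends larger than maxFrameSize (in bytes) are rejected -->
      <transportConnector name="openwire"
          uri="tcp://0.0.0.0:61616?wireFormat.maxFrameSize=10485760"/>
    </transportConnectors>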


Thanks,
Paul


Re: ** JMS Client HANGING - AMQ 5.9, AIX 6.1

Tim Bain
Carlos,

Although the suggestions Paul and I gave are good solutions to a number of
problems that you MIGHT have, I think it's important first to figure out
what problem(s) you DO have before you go changing a bunch of stuff that
might or might not apply.

The first question you need to answer is whether you have a memory leak
(i.e. a continual increase in memory usage that will eventually run you out
of memory no matter what JVM size or GC strategy you use) or simply a
non-increasing memory usage that happens to be hitting lots of full GCs.
In either of those situations, you also need to characterize and understand
your memory usage so you can figure out what uses memory and how much it
uses.
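
One low-tech way to answer that first question is to log heap usage from
inside the consumer itself; here's a throwaway sketch (the class name and
the five-second interval are arbitrary), started once at startup with
HeapSampler.start():

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryMXBean;
    import java.lang.management.MemoryUsage;

    public final class HeapSampler {

        /** Starts a daemon thread that prints heap usage every few seconds. */
        public static void start() {
            final MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
            Thread sampler = new Thread(new Runnable() {
                public void run() {
                    while (!Thread.currentThread().isInterrupted()) {
                        MemoryUsage heap = memory.getHeapMemoryUsage();
                        // A leak trends upward across full GCs; a healthy
                        // sawtooth keeps falling back to about the same floor.
                        System.out.printf("heap used: %d MB of %d MB%n",
                                heap.getUsed() >> 20, heap.getMax() >> 20);
                        try {
                            Thread.sleep(5000);
                        } catch (InterruptedException e) {
                            return;
                        }
                    }
                }
            }, "heap-sampler");
            sampler.setDaemon(true);
            sampler.start();
        }
    }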

The second question you need to answer is whether there is any indication
of misbehavior by the ActiveMQ broker or by the ActiveMQ client code that
runs in your client process.

Once you know those things, you can decide which of the suggestions
Paul and I gave are appropriate for you to apply.

Tim

Re: ** JMS Client HANGING - AMQ 5.9, AIX 6.1

cdelgadob
Hi Tim, Paul,

Thanks a lot for your suggestions, both very concise and thoughtful.

Re your questions:

- Message size: Actually they are quite small; each is just an identifier that points to a record in a DB. So the DB has to be accessed in order to process the message (I know it's odd and probably wrong, but I can't figure out why it's done like that :S)

- Message processing: Not too memory-intensive. There's something in the code called a top and bottom transaction, and I can't figure out what they mean. Maybe they are blocking something and I haven't seen it yet.

- Memory usage: When a full GC is performed, the memory level falls back to its previous level, so there's no memory leak; it's just something that gets allocated and then GCed. What I don't know is what it is. I'll try to use VisualVM to find out...

- GC algorithm: We are using gencon, and I think what you say about the old and new object heap areas *might* be relevant. I'll check this out (the GC-logging flags I plan to use are sketched after this list).

- Update Java and ActiveMQ versions: Yes, that was my first proposal; it should be done even without these problems. But this is an enterprise environment, and everything goes really slowly. I'll try to update the AMQ client libraries, assuming they can talk to a 5.9 broker without problems, right? :D The Java update will come afterwards.

- Broker logs: There's nothing that shows errors in the broker, not even PFC (unfortunately).
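
To watch what gencon is actually doing, I'll also enable verbose GC
logging at startup; something along these lines (the log path is just an
example of where I'd send it):

    -Xgcpolicy:gencon -verbose:gc -Xverbosegclog:/tmp/consumer-gc.log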

Paul, I fully agree with your bottom line: figure out what's going on before changing parameters and everything else; let's see if I'm lucky today.


Thanks again for your suggestions, I'll keep working on this.

Best

 Carlos




Re: ** JMS Client HANGING - AMQ 5.9, AIX 6.1

Tim Bain
When this is happening, how often are full GCs occurring, how long are
they taking on average, and are young gen GCs occurring between the old gen
GCs?

Also, do you believe that the compaction phase (
https://www-01.ibm.com/support/knowledgecenter/mobile/#!/SSYKE2_6.0.0/com.ibm.java.doc.diagnostics.60/diag/understanding/mm_gc_compact.html)
is occurring during the full GCs?  Old Gen fragmentation can cause
near-continuous GCs under HotSpot's CMS strategy because its Old Gen GCs
don't compact Old Gen, so I wonder if something similar could be happening
here.

Tim