Thread deadlock during message processing

Mario Mahovlić

Hello all,

We run Artemis embedded in our Spring service and occasionally hit an issue where all of our producer threads become blocked, so no messages can be produced to the queue (it has happened 3 times in 2 weeks). We produce both regular and large messages to the queue. All we get from Artemis during this time are timeout exceptions on the producer client side:

nested exception is javax.jms.JMSException: AMQ219014: Timed out after waiting 30,000 ms for response when sending packet 71
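For context, our producers are essentially plain Spring JMS senders along these lines (a minimal sketch, not our actual code; the in-VM URL, queue name and payload are placeholders):

import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;
import org.springframework.jms.connection.CachingConnectionFactory;
import org.springframework.jms.core.JmsTemplate;

public class ProducerSketch {
    public static void main(String[] args) {
        // In-VM connector, because the broker is embedded in the same JVM.
        ActiveMQConnectionFactory artemisCf = new ActiveMQConnectionFactory("vm://0");
        CachingConnectionFactory cachingCf = new CachingConnectionFactory(artemisCf);

        JmsTemplate jmsTemplate = new JmsTemplate(cachingCf);
        // The send blocks waiting for the broker's response; with the broker
        // deadlocked it times out after the (default) 30,000 ms call timeout
        // and surfaces as AMQ219014 on our side.
        jmsTemplate.convertAndSend("example.queue", "payload");
    }
}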

We took a thread dump of the service while the issue was in effect (attached).

From what we can see, Thread t@11854 and Thread t@202 are permanently deadlocked; all other BLOCKED threads are blocked on session creation:

org.apache.activemq.artemis.jms.client.ActiveMQConnection.createSession(ActiveMQConnection.java:234)

There seems to be a race condition in large message processing that, when triggered, causes a thread deadlock. Example:

  1. A large message is produced
  2. Thread-1: JournalImpl.appendAddRecord() -> the appendExecutor task (JournalImpl:946) is delayed for some reason
  3. Thread-2: JournalImpl.appendDeleteRecord() is triggered (because appendAdd is async), takes the lock on the LargeMessage object (PostOfficeImpl:1305) and gets stuck on the appendExecutor queue behind Thread-1 (JournalImpl:1058)
  4. Thread-1: the appendAddRecord() task on the appendExecutor reaches the point where it needs the lock on the same LargeMessage object (LargeServerMessageImpl:173), but it can't get it because Thread-2 is holding it
So the deadlock is: Thread-1 is waiting for the lock on the LargeMessage object, which will never be released by Thread-2, because Thread-2 is waiting for processing on the appendExecutor queue behind Thread-1; a minimal sketch of this lock/queue ordering follows below.
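To make the ordering concrete, here is a minimal, self-contained Java sketch of the same lock/queue shape (not Artemis code; the class and variable names are made up):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/**
 * Minimal sketch (not Artemis code) of the lock/queue ordering described above:
 * a single-threaded "appendExecutor" plus a monitor on the large message object.
 */
public class AppendExecutorDeadlockSketch {

    // Stands in for the LargeServerMessageImpl monitor.
    static final Object largeMessageLock = new Object();
    // Stands in for the journal's ordered appendExecutor.
    static final ExecutorService appendExecutor = Executors.newSingleThreadExecutor();

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch lockTaken = new CountDownLatch(1);

        // "Thread-2": takes the message monitor (~ PostOfficeImpl.confirmLargeMessageSend),
        // then blocks on a task queued *behind* the add task (~ checkKnownRecordID's SimpleFuture).
        Thread thread2 = new Thread(() -> {
            synchronized (largeMessageLock) {
                lockTaken.countDown();
                sleepQuietly(100); // give the add task time to be queued first ("delayed for some reason")
                CountDownLatch done = new CountDownLatch(1);
                appendExecutor.execute(done::countDown); // ~ the appendDeleteRecord check task
                awaitQuietly(done); // never completes: the add task is stuck ahead of it
            }
        });
        thread2.start();
        lockTaken.await();

        // "Thread-1": the appendAddRecord task, queued while the monitor is already held.
        // When it runs it needs the monitor (~ synchronized getEncodeSize()) and blocks forever.
        appendExecutor.execute(() -> {
            synchronized (largeMessageLock) {
                System.out.println("appendAddRecord encoded");
            }
        });

        thread2.join(); // with this interleaving, never returns
    }

    private static void sleepQuietly(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try { latch.await(); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}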

We are still having trouble actually reproducing this deadlock, because Thread-2, when processing JournalImpl.appendDeleteRecord(), checks records/pendingRecords for the recordId before reaching the appendExecutor, and in all the cases we managed to reproduce, the recordId is always already present at that point (JournalImpl:1051).

The service runs inside a Docker container and the folder containing the journal is mapped to the host machine. Metrics for the node the service was running on show no disk I/O issues at that time.

Artemis version: 2.6.4, Spring boot version: 2.1.5.RELEASE

Relevant Artemis settings (the rest of the settings are default):

durable: true
max-size-bytes: 1GB
address-full-policy: FAIL
journal-sync-non-transactional: false
journal-sync-transactional: false
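For completeness, on an embedded broker these settings correspond roughly to Java configuration like the following (a sketch, not our exact setup; the in-VM acceptor, the "#" address match and the mapping of "durable" to persistence/durable queues are assumptions):

import org.apache.activemq.artemis.core.config.Configuration;
import org.apache.activemq.artemis.core.config.impl.ConfigurationImpl;
import org.apache.activemq.artemis.core.server.embedded.EmbeddedActiveMQ;
import org.apache.activemq.artemis.core.settings.impl.AddressFullMessagePolicy;
import org.apache.activemq.artemis.core.settings.impl.AddressSettings;

public class EmbeddedBrokerConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration config = new ConfigurationImpl()
                .setPersistenceEnabled(true)                 // journal persistence on (our queues are durable)
                .setJournalSyncNonTransactional(false)       // journal-sync-non-transactional: false
                .setJournalSyncTransactional(false)          // journal-sync-transactional: false
                .setSecurityEnabled(false)
                .addAcceptorConfiguration("in-vm", "vm://0");

        AddressSettings addressSettings = new AddressSettings()
                .setMaxSizeBytes(1024L * 1024L * 1024L)      // max-size-bytes: 1GB
                .setAddressFullMessagePolicy(AddressFullMessagePolicy.FAIL);
        config.getAddressesSettings().put("#", addressSettings);

        EmbeddedActiveMQ broker = new EmbeddedActiveMQ().setConfiguration(config);
        broker.start();
    }
}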

We are in the process of analyzing the issue further but wanted to report this as soon as possible, so someone else can also take a look. If you need any additional info, we will provide it.

Kind regards,

Mario


Re: Thread deadlock during message processing

Mario Mahovlić
Hello all,

Has anyone managed to check this out? It seems to be a pretty major race condition in large message processing.
Unfortunately I just realized that attachments get stripped from e-mails, so here are the relevant thread stacks.

Stacks of the 2 *deadlocked* threads:

"Thread-1074
(ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@3738bd7a)"
- Thread t@11854
   java.lang.Thread.State: BLOCKED
        at
org.apache.activemq.artemis.core.persistence.impl.journal.LargeServerMessageImpl.getEncodeSize(LargeServerMessageImpl.java:174)
        - waiting to lock <1eb68b32> (a
org.apache.activemq.artemis.core.persistence.impl.journal.LargeServerMessageImpl)
owned by "Thread-25
(ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5@59ccdde4)"
t@202
        at
org.apache.activemq.artemis.core.persistence.impl.journal.codec.LargeMessagePersister.getEncodeSize(LargeMessagePersister.java:59)
        at
org.apache.activemq.artemis.core.persistence.impl.journal.codec.LargeMessagePersister.getEncodeSize(LargeMessagePersister.java:25)
        at
org.apache.activemq.artemis.core.journal.impl.dataformat.JournalAddRecord.getEncodeSize(JournalAddRecord.java:79)
        at
org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendRecord(JournalImpl.java:2785)
        at
org.apache.activemq.artemis.core.journal.impl.JournalImpl.access$100(JournalImpl.java:91)
        at
org.apache.activemq.artemis.core.journal.impl.JournalImpl$1.run(JournalImpl.java:843)
        at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
        at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
        at
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
        at
org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$1194/0x0000000801ecd440.run(Unknown
Source)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)

   Locked ownable synchronizers:
        - locked <4c6aa3d9> (a
java.util.concurrent.ThreadPoolExecutor$Worker)


"Thread-25
(ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5@59ccdde4)"
- Thread t@202
   java.lang.Thread.State: WAITING
        at jdk.internal.misc.Unsafe.park(Native Method)
        - parking to wait for <7807bddf> (a
java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
        at
java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
        at
org.apache.activemq.artemis.utils.SimpleFutureImpl.get(SimpleFutureImpl.java:62)
        at
org.apache.activemq.artemis.core.journal.impl.JournalImpl.checkKnownRecordID(JournalImpl.java:1073)
        at
org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:943)
        at
org.apache.activemq.artemis.core.persistence.impl.journal.AbstractJournalStorageManager.confirmPendingLargeMessage(AbstractJournalStorageManager.java:360)
        at
org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.confirmLargeMessageSend(PostOfficeImpl.java:1308)
        - locked <1eb68b32> (a
org.apache.activemq.artemis.core.persistence.impl.journal.LargeServerMessageImpl)
        at
org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.processRoute(PostOfficeImpl.java:1254)
        at
org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:904)
        at
org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:799)
        at
org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.doSend(ServerSessionImpl.java:1801)
        - locked <3396ca5d> (a
org.apache.activemq.artemis.core.server.impl.ServerSessionImpl)
        at
org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.sendContinuations(ServerSessionPacketHandler.java:993)
        - locked <67aa6e6> (a java.lang.Object)
        at
org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.slowPacketHandler(ServerSessionPacketHandler.java:314)
        at
org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.onMessagePacket(ServerSessionPacketHandler.java:287)
        at
org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler$$Lambda$1694/0x000000080219f840.onMessage(Unknown
Source)
        at
org.apache.activemq.artemis.utils.actors.Actor.doTask(Actor.java:33)
        at
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
        at
org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$1194/0x0000000801ecd440.run(Unknown
Source)
        at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
        at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
        at
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
        at
org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$1194/0x0000000801ecd440.run(Unknown
Source)
        at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)

   Locked ownable synchronizers:
        - locked <7fc2876c> (a
java.util.concurrent.ThreadPoolExecutor$Worker)

As described in the previous email, these two threads are deadlocked like this:

   1. *Thread t@11854*: JournalImpl.appendAddRecord() -> the appendExecutor task (JournalImpl:946) is delayed for some reason
   2. *Thread t@202*: JournalImpl.appendDeleteRecord() is triggered (because appendAdd is async), takes the lock on the LargeMessage object (PostOfficeImpl:1305) and gets stuck on the appendExecutor queue behind t@11854 (JournalImpl:1058)
   3. *Thread t@11854*: the appendAddRecord() task on the appendExecutor reaches the point where it needs the lock on the same LargeMessage object (LargeServerMessageImpl:173), but it can't get it because t@202 is holding it

So the deadlock is: *Thread t@11854* is waiting for the lock on the LargeMessage object, which will never be released by *Thread t@202*, because t@202 is waiting for processing on the appendExecutor queue behind t@11854.

This eventually blocks all other producer threads, so *no messages can be
produced to the queue*:

"flow-action-completed-355" - Thread t@12775
   java.lang.Thread.State: TIMED_WAITING
        at jdk.internal.misc.Unsafe.park(Native Method)
        - parking to wait for <4876219f> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:235)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2211)
        at
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:389)
        - locked <5582112e> (a java.lang.Object)
        at
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:319)
        at
org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.sessionClose(ActiveMQSessionContext.java:543)
        at
org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.close(ClientSessionImpl.java:1234)
        at
org.apache.activemq.artemis.jms.client.ActiveMQSession.close(ActiveMQSession.java:280)
        - locked <5fe5842b> (a
org.apache.activemq.artemis.jms.client.ActiveMQConnection)


"flow-action-completed-351" - Thread t@12771
   java.lang.Thread.State: BLOCKED
        at
org.apache.activemq.artemis.jms.client.ActiveMQConnection.createSession(ActiveMQConnection.java:234)
        - waiting to lock <5fe5842b> (a
org.apache.activemq.artemis.jms.client.ActiveMQConnection) owned by
"flow-action-completed-355" t@12775
        at
org.springframework.jms.connection.SingleConnectionFactory.createSession(SingleConnectionFactory.java:481)
        at
org.springframework.jms.connection.CachingConnectionFactory.getSession(CachingConnectionFactory.java:232)
        at
org.springframework.jms.connection.SingleConnectionFactory$SharedConnectionInvocationHandler.invoke(SingleConnectionFactory.java:648)
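For anyone checking their own service for the same pattern: the picture above (many producer threads BLOCKED on ActiveMQConnection.createSession behind a single owner) can also be confirmed at runtime via ThreadMXBean. A minimal sketch:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

/**
 * Sketch of how the blocked-producer picture can be confirmed at runtime,
 * roughly what the attached dump shows: many threads BLOCKED on
 * ActiveMQConnection.createSession behind a single stuck owner.
 */
public class BlockedThreadReport {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                System.out.printf("%s is BLOCKED on %s held by %s%n",
                        info.getThreadName(),
                        info.getLockName(),
                        info.getLockOwnerName());
            }
        }
    }
}

Note that ThreadMXBean#findDeadlockedThreads() will most likely not flag the broker-side cycle itself, because t@202 is parked on a CountDownLatch (which has no owner) rather than blocked on a monitor.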

For now we increased *minLargeMessageSize* so that all messages are processed as regular messages; since we run Artemis embedded we keep entire messages in memory anyway and never actually get any benefit from processing a message as a large message.
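Concretely, the workaround amounts to raising the threshold on the client connection factory, something like the following (a sketch; the in-VM URL and the 10 MB value are only examples, and our real setup configures this through Spring):

import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class LargeMessageWorkaroundSketch {
    public static void main(String[] args) {
        // In-VM connection factory for the embedded broker; raise the large-message
        // threshold so that messages below ~10 MB are sent as regular messages and
        // the large-message path (and this deadlock) is never taken.
        ActiveMQConnectionFactory cf = new ActiveMQConnectionFactory("vm://0");
        cf.setMinLargeMessageSize(10 * 1024 * 1024);
    }
}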
We would just like someone to confirm that there is a bug here. If you need more info, we will provide it.

Kind regards,
Mario
