Embedded Artemis message producer times out when producing message (no error/warn logged on server side)

Mario Mahovlić

We run Artemis embedded on our Spring service. It ran OK for over a year; however, at some point we started getting timeout exceptions when producing messages to the queue (stack traces for regular/large messages attached).

We produce both regular and large messages to the queue and we got timeouts for both types (large messages are ~130 kB on average). The message production rate to the queue at the time of the incident was ~100k messages per hour.

Artemis is running in persistent mode using a file journal on disk. As mentioned in the title, no error or warn level logs were logged on the Artemis server side, and the timeouts stopped after a service restart.


Service is running within a docker container and folder containing the journal is mapped to the host machine.

Metrics for the node on which the service was running showed no disk I/O issues at that time.

Artemis version: 2.6.4, Spring Boot version: 2.1.5.RELEASE

Relevant artemis settings (rest of the settings are default):

durable: true
max-size-bytes : 1GB
address-full-policy: FAIL
journal-sync-non-transactional : false
journal-sync-transactional: false
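
For illustration, this is roughly how settings like these get applied to the embedded broker from Spring Boot (a simplified sketch, not our exact configuration class; the class and bean names here are made up):

import org.apache.activemq.artemis.core.settings.impl.AddressFullMessagePolicy;
import org.apache.activemq.artemis.core.settings.impl.AddressSettings;
import org.springframework.boot.autoconfigure.jms.artemis.ArtemisConfigurationCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class EmbeddedArtemisConfig {

    @Bean
    public ArtemisConfigurationCustomizer artemisCustomizer() {
        return configuration -> {
            // persistent mode with the file journal
            configuration.setPersistenceEnabled(true);
            // journal-sync-non-transactional / journal-sync-transactional: false
            configuration.setJournalSyncNonTransactional(false);
            configuration.setJournalSyncTransactional(false);
            // max-size-bytes: 1GB, address-full-policy: FAIL (applied to all addresses here)
            AddressSettings addressSettings = new AddressSettings()
                    .setMaxSizeBytes(1024L * 1024L * 1024L)
                    .setAddressFullMessagePolicy(AddressFullMessagePolicy.FAIL);
            configuration.addAddressesSetting("#", addressSettings);
        };
    }
}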

If more info is needed we will try to provide it on request.

After some debugging we came to the conclusion that either threads writing to the journal were blocked for an extended period of time, or journal compact operation lasted a long time/was blocked for some reason and held write lock on journal during that time. 

Unfortunately we took no thread dumps during the incident to see where exactly the threads were stuck. We didn't manage to find any similar incidents reported on these boards so we would like to check out if anyone has any other idea what might cause this behavior? 

Re: Embedded Artemis message producer times out when producing message (no error/warn logged on server side)

jbertram
> stack traces for regular/large messages attached

Email attachments are scrubbed so none came through.

> Service is running within a docker container and folder containing the
> journal is mapped to the host machine.

Could you be more specific here? How exactly is the folder mapped? Is a
networked file-system involved here?

> After some debugging we came to the conclusion that either threads
> writing to the journal were blocked for an extended period of time, or
> journal compact operation lasted a long time/was blocked for some reason
> and held write lock on journal during that time.

What specifically led you to this conclusion?

It's hard to offer insight without additional details, especially thread
dumps from the time of the incident. Have you seen this issue just once in
over a year of service?


Justin

Re: Embedded Artemis message producer times out when producing message (no error/warn logged on server side)

Mario Mahovlić
Hello Justin,

We managed to get a thread dump of the service while the issue was in effect;
the relevant thread stacks are below.
These thread stacks were taken with artemis version *2.6.4*, and we also
managed to reproduce the issue locally on the latest version of artemis
(*2.10.1*).

Stacks of the 2 *deadlocked* threads:

"Thread-1074 (ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@3738bd7a)" - *Thread t@11854*
   java.lang.Thread.State: BLOCKED
        at org.apache.activemq.artemis.core.persistence.impl.journal.LargeServerMessageImpl.getEncodeSize(LargeServerMessageImpl.java:174)
        - waiting to lock <1eb68b32> (a org.apache.activemq.artemis.core.persistence.impl.journal.LargeServerMessageImpl) owned by "Thread-25 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5@59ccdde4)" t@202
        at org.apache.activemq.artemis.core.persistence.impl.journal.codec.LargeMessagePersister.getEncodeSize(LargeMessagePersister.java:59)
        at org.apache.activemq.artemis.core.persistence.impl.journal.codec.LargeMessagePersister.getEncodeSize(LargeMessagePersister.java:25)
        at org.apache.activemq.artemis.core.journal.impl.dataformat.JournalAddRecord.getEncodeSize(JournalAddRecord.java:79)
        at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendRecord(JournalImpl.java:2785)
        at org.apache.activemq.artemis.core.journal.impl.JournalImpl.access$100(JournalImpl.java:91)
        at org.apache.activemq.artemis.core.journal.impl.JournalImpl$1.run(JournalImpl.java:843)
        at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
        at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
        at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
        at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$1194/0x0000000801ecd440.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)

   Locked ownable synchronizers:
        - locked <4c6aa3d9> (a java.util.concurrent.ThreadPoolExecutor$Worker)


"Thread-25 (ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$5@59ccdde4)" - *Thread t@202*
   java.lang.Thread.State: WAITING
        at jdk.internal.misc.Unsafe.park(Native Method)
        - parking to wait for <7807bddf> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:885)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1039)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1345)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:232)
        at org.apache.activemq.artemis.utils.SimpleFutureImpl.get(SimpleFutureImpl.java:62)
        at org.apache.activemq.artemis.core.journal.impl.JournalImpl.checkKnownRecordID(JournalImpl.java:1073)
        at org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:943)
        at org.apache.activemq.artemis.core.persistence.impl.journal.AbstractJournalStorageManager.confirmPendingLargeMessage(AbstractJournalStorageManager.java:360)
        at org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.confirmLargeMessageSend(PostOfficeImpl.java:1308)
        - locked <1eb68b32> (a org.apache.activemq.artemis.core.persistence.impl.journal.LargeServerMessageImpl)
        at org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.processRoute(PostOfficeImpl.java:1254)
        at org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:904)
        at org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:799)
        at org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.doSend(ServerSessionImpl.java:1801)
        - locked <3396ca5d> (a org.apache.activemq.artemis.core.server.impl.ServerSessionImpl)
        at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.sendContinuations(ServerSessionPacketHandler.java:993)
        - locked <67aa6e6> (a java.lang.Object)
        at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.slowPacketHandler(ServerSessionPacketHandler.java:314)
        at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.onMessagePacket(ServerSessionPacketHandler.java:287)
        at org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler$$Lambda$1694/0x000000080219f840.onMessage(Unknown Source)
        at org.apache.activemq.artemis.utils.actors.Actor.doTask(Actor.java:33)
        at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
        at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$1194/0x0000000801ecd440.run(Unknown Source)
        at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
        at org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
        at org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
        at org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$1194/0x0000000801ecd440.run(Unknown Source)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
        at org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)

   Locked ownable synchronizers:
        - locked <7fc2876c> (a java.util.concurrent.ThreadPoolExecutor$Worker)

From what we can see, this issue occurs only on large message processing;
these two threads are locked like this:

   1. *Thread t@11854*: JournalImpl.appendAddRecord() -> the task handed to the
   appendExecutor (JournalImpl:946) is delayed for some reason.
   2. *Thread t@202*: JournalImpl.appendDeleteRecord() is executed because
   appendAdd is async -> it takes the lock on the LargeMessage object
   (PostOfficeImpl:1305) and gets stuck on the appendExecutor queue behind
   *Thread t@11854* (JournalImpl:1058).
   3. *Thread t@11854*: the appendExecutor task for JournalImpl.appendAddRecord()
   gets to the part where it needs the lock on the same LargeMessage object
   (LargeServerMessageImpl:173), but it can't get it because *Thread t@202*
   keeps it.

So the deadlock is: *Thread t@11854* is waiting for the lock on the LargeMessage
object, which will never be released by *Thread t@202*, because *Thread t@202*
is itself waiting for processing on the appendExecutor queue behind *Thread t@11854*.
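
To make the ordering easier to see, here is a simplified, self-contained sketch of the same pattern in plain Java (this is not Artemis code, just an illustration of the lock ordering): a single-threaded "append executor" plus a synchronized message object, where the thread that holds the message monitor waits for a task queued behind the task that needs that monitor. Running it hangs forever, which is the point; a thread dump of it shows one thread BLOCKED on the message monitor and the other WAITING on a latch, just like the stacks above.

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OrderedExecutorDeadlockSketch {

    // Stand-in for LargeServerMessageImpl: the size computation is synchronized.
    static class Message {
        synchronized int encodeSize() {
            return 42;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService appendExecutor = Executors.newSingleThreadExecutor(); // one ordered lane
        Message message = new Message();
        CountDownLatch monitorHeld = new CountDownLatch(1);

        // Task A (the queued "append add record" work): it needs the message monitor.
        appendExecutor.submit(() -> {
            monitorHeld.await();          // only to make the interleaving deterministic
            return message.encodeSize();  // blocks: the monitor is held by the main thread
        });

        // Main thread plays the "append delete record" path: it already holds the
        // message monitor and then waits for its own task on the same executor.
        synchronized (message) {
            monitorHeld.countDown();
            CountDownLatch taskB = new CountDownLatch(1);
            appendExecutor.submit(taskB::countDown); // queued behind task A
            taskB.await();                           // never returns -> deadlock
        }
    }
}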

This eventually blocks all other producer threads, so *no messages can be
produced to the queue*:

"flow-action-completed-355" - Thread t@12775
   java.lang.Thread.State: TIMED_WAITING
        at jdk.internal.misc.Unsafe.park(Native Method)
        - parking to wait for <4876219f> (a
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:235)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2211)
        at
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:389)
        - locked <5582112e> (a java.lang.Object)
        at
org.apache.activemq.artemis.core.protocol.core.impl.ChannelImpl.sendBlocking(ChannelImpl.java:319)
        at
org.apache.activemq.artemis.core.protocol.core.impl.ActiveMQSessionContext.sessionClose(ActiveMQSessionContext.java:543)
        at
org.apache.activemq.artemis.core.client.impl.ClientSessionImpl.close(ClientSessionImpl.java:1234)
        at
org.apache.activemq.artemis.jms.client.ActiveMQSession.close(ActiveMQSession.java:280)
        - locked <5fe5842b> (a
org.apache.activemq.artemis.jms.client.ActiveMQConnection)


"flow-action-completed-351" - Thread t@12771
   java.lang.Thread.State: BLOCKED
        at
org.apache.activemq.artemis.jms.client.ActiveMQConnection.createSession(ActiveMQConnection.java:234)
        - waiting to lock <5fe5842b> (a
org.apache.activemq.artemis.jms.client.ActiveMQConnection) owned by
"flow-action-completed-355" t@12775
        at
org.springframework.jms.connection.SingleConnectionFactory.createSession(SingleConnectionFactory.java:481)
        at
org.springframework.jms.connection.CachingConnectionFactory.getSession(CachingConnectionFactory.java:232)
        at
org.springframework.jms.connection.SingleConnectionFactory$SharedConnectionInvocationHandler.invoke(SingleConnectionFactory.java:648)

This issue occurred 3 times in the course of 2 weeks.
In the meantime we updated the artemis version to the latest one (2.10.1),
and for now we increased *minLargeMessageSize* so we process all messages as
regular messages: because we run artemis embedded, we keep entire messages in
memory anyway and never actually get the benefit of processing a message as a
large message. This way we avoid the part of the code that causes the
deadlock to occur.
We would just like for someone to confirm that there is a bug present here.
If more info is needed we will provide it.
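
For reference, the workaround itself is just a client-side setting. A minimal sketch of the change, assuming a plain ActiveMQConnectionFactory (our real Spring wiring is not shown, and the 10 MB threshold is only an example value):

import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class LargeMessageWorkaround {

    public static ActiveMQConnectionFactory connectionFactory() {
        // Raise minLargeMessageSize above the biggest message we ever send, so the
        // large-message path (and the code involved in the deadlock) is never used.
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory("vm://0"); // in-VM connector to the embedded broker
        factory.setMinLargeMessageSize(10 * 1024 * 1024); // client default is 100 KiB
        return factory;
    }
}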

Kind regards,
Mario

Re: Embedded Artemis message producer times out when producing message (no error/warn logged on server side)

jbertram
> We would just like for someone to confirm that there is a bug present
> here.

Yes, I believe there is a bug here. It's impossible to see directly from
the thread dump due to the involvement of the ordered executor, but
following the flow of the source code it's relatively apparent.

Your analysis was a bit difficult to follow. In step #1 you reference
"JournalImpl:946" as it relates to JournalImpl.appendAddRecord(). However,
line 946 of JournalImpl in version 2.6.4 [1] is actually in
JournalImpl.appendDeleteRecord() instead. I believe the reference here
should be to line 837 [2] as that is what actually corresponds to the
stack-trace for thread 11854.

In any event, the next step is to reliably reproduce the deadlock in order
to validate any fix and mitigate against regressions in the future.
Reproducing deadlocks can be tricky, but it will help to know as much
detail about your use-case as possible. Can you provide your broker.xml,
client configuration (e.g. minLargeMessageSize), a general description of
your architecture, and a run-down of the steps required to trigger this
deadlock in your environment?


Justin

[1]
https://github.com/apache/activemq-artemis/blob/2.6.4/artemis-journal/src/main/java/org/apache/activemq/artemis/core/journal/impl/JournalImpl.java#L946
[2]
https://github.com/apache/activemq-artemis/blob/2.6.4/artemis-journal/src/main/java/org/apache/activemq/artemis/core/journal/impl/JournalImpl.java#L837

Re: Embedded Artemis message producer times out when producing message (no error/warn logged on server side)

Mario Mahovlić
Hello Justin,

I apologize for the error in my analysis which made it difficult to follow;
I found some time now to get back to this issue.
Below are the thread stacks of the deadlock I reproduced locally with the
latest version of artemis (2.10.1), so the line indexes are up to date.

"Thread-0
(ActiveMQ-IO-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$7@2624dd2c)"
#81 prio=5 os_prio=0 cpu=46.88ms elapsed=756.49s tid=0x0000020d4d744800
nid=0x554 waiting for monitor entry  [0x0000009598bfe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
at
org.apache.activemq.artemis.core.persistence.impl.journal.LargeServerMessageImpl.getEncodeSize(LargeServerMessageImpl.java:178)
- waiting to lock <0x000000071798bb60> (a
org.apache.activemq.artemis.core.persistence.impl.journal.LargeServerMessageImpl)
at
org.apache.activemq.artemis.core.persistence.impl.journal.codec.LargeMessagePersister.getEncodeSize(LargeMessagePersister.java:59)
at
org.apache.activemq.artemis.core.persistence.impl.journal.codec.LargeMessagePersister.getEncodeSize(LargeMessagePersister.java:25)
at
org.apache.activemq.artemis.core.journal.impl.dataformat.JournalAddRecord.getEncodeSize(JournalAddRecord.java:79)
at
org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendRecord(JournalImpl.java:2792)
at
org.apache.activemq.artemis.core.journal.impl.JournalImpl.access$100(JournalImpl.java:93)
at
org.apache.activemq.artemis.core.journal.impl.JournalImpl$1.run(JournalImpl.java:850)
at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
at
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
at
org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$915/0x0000000801bb9440.run(Unknown
Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@12.0.1
/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@12.0.1
/ThreadPoolExecutor.java:628)
at
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)

   Locked ownable synchronizers:
- <0x0000000704e5d0b8> (a java.util.concurrent.ThreadPoolExecutor$Worker)

"Thread-5
(ActiveMQ-server-org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$6@1dbf21c5)"
#163 prio=5 os_prio=0 cpu=46.88ms elapsed=725.74s tid=0x0000020d4e508000
nid=0x2c94 waiting on condition  [0x000000959b8fe000]
   java.lang.Thread.State: WAITING (parking)
at jdk.internal.misc.Unsafe.park(java.base@12.0.1/Native Method)
- parking to wait for  <0x00000007179d1ae0> (a
java.util.concurrent.CountDownLatch$Sync)
at java.util.concurrent.locks.LockSupport.park(java.base@12.0.1
/LockSupport.java:194)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(java.base@12.0.1
/AbstractQueuedSynchronizer.java:885)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(java.base@12.0.1
/AbstractQueuedSynchronizer.java:1039)
at
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(java.base@12.0.1
/AbstractQueuedSynchronizer.java:1345)
at java.util.concurrent.CountDownLatch.await(java.base@12.0.1
/CountDownLatch.java:232)
at
org.apache.activemq.artemis.utils.SimpleFutureImpl.get(SimpleFutureImpl.java:62)
at
org.apache.activemq.artemis.core.journal.impl.JournalImpl.checkKnownRecordID(JournalImpl.java:1080)
at
org.apache.activemq.artemis.core.journal.impl.JournalImpl.appendDeleteRecord(JournalImpl.java:950)
at
org.apache.activemq.artemis.core.persistence.impl.journal.AbstractJournalStorageManager.confirmPendingLargeMessage(AbstractJournalStorageManager.java:362)
at
org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.confirmLargeMessageSend(PostOfficeImpl.java:1413)
- locked <0x000000071798bb60> (a
org.apache.activemq.artemis.core.persistence.impl.journal.LargeServerMessageImpl)
at
org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.processRoute(PostOfficeImpl.java:1359)
at
org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:1003)
at
org.apache.activemq.artemis.core.postoffice.impl.PostOfficeImpl.route(PostOfficeImpl.java:894)
at
org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.doSend(ServerSessionImpl.java:2073)
- locked <0x00000007179aede8> (a
org.apache.activemq.artemis.core.server.impl.ServerSessionImpl)
at
org.apache.activemq.artemis.core.server.impl.ServerSessionImpl.doSend(ServerSessionImpl.java:2017)
- locked <0x00000007179aede8> (a
org.apache.activemq.artemis.core.server.impl.ServerSessionImpl)
at
org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.sendContinuations(ServerSessionPacketHandler.java:1034)
- locked <0x00000007179aea10> (a java.lang.Object)
at
org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.slowPacketHandler(ServerSessionPacketHandler.java:312)
at
org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler.onMessagePacket(ServerSessionPacketHandler.java:285)
at
org.apache.activemq.artemis.core.protocol.core.ServerSessionPacketHandler$$Lambda$1381/0x0000000801f3d040.onMessage(Unknown
Source)
at org.apache.activemq.artemis.utils.actors.Actor.doTask(Actor.java:33)
at
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
at
org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$915/0x0000000801bb9440.run(Unknown
Source)
at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:42)
at
org.apache.activemq.artemis.utils.actors.OrderedExecutor.doTask(OrderedExecutor.java:31)
at
org.apache.activemq.artemis.utils.actors.ProcessorBase.executePendingTasks(ProcessorBase.java:66)
at
org.apache.activemq.artemis.utils.actors.ProcessorBase$$Lambda$915/0x0000000801bb9440.run(Unknown
Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(java.base@12.0.1
/ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(java.base@12.0.1
/ThreadPoolExecutor.java:628)
at
org.apache.activemq.artemis.utils.ActiveMQThreadFactory$1.run(ActiveMQThreadFactory.java:118)

   Locked ownable synchronizers:
- <0x00000007179d9d58> (a java.util.concurrent.ThreadPoolExecutor$Worker)

As I said, we run artemis embedded on our Spring service; here is the
relevant server configuration (rest is default):

durable: true
max-size-bytes : 1GB
address-full-policy: FAIL
journal-sync-non-transactional : false
journal-sync-transactional: false


Client configuration:

min-large-message-size : 100kB
max-delivery-attempts: 4
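
In code terms the two settings map roughly onto the API like this (a simplified sketch, not our actual Spring wiring; note that max-delivery-attempts is really applied as an address setting on the broker side even though we list it with the client config):

import org.apache.activemq.artemis.core.settings.impl.AddressSettings;
import org.apache.activemq.artemis.jms.client.ActiveMQConnectionFactory;

public class ClientConfigSketch {

    static ActiveMQConnectionFactory clientFactory() {
        ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory("vm://0");
        factory.setMinLargeMessageSize(100 * 1024); // min-large-message-size: 100kB
        return factory;
    }

    static AddressSettings redeliverySettings() {
        return new AddressSettings().setMaxDeliveryAttempts(4); // max-delivery-attempts: 4
    }
}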


Unfortunately, as you said, reproducing deadlocks is tricky, and I was not
able to write a test which would reliably reproduce this issue, as I was
unable to find a way to control thread execution at that point in the code.

The only way I managed to reproduce it is by debugging the service locally,
producing a single large message to artemis, and using breakpoints to pause
the threads at specific times to trigger the deadlock, so I will describe
that here (line numbers correspond to artemis version 2.10.1):

   1. Place a breakpoint at JournalImpl:848 within the appendAddRecord() method
   (use suspend-thread breakpoints).
   2. Place a breakpoint at JournalImpl:1058 within the checkKnownRecordID()
   method, which is called by appendDeleteRecord().
   3. Produce a large message to the queue using JmsTemplate (see the sketch
   after this list).
   4. *Thread 1* should remain paused at JournalImpl:848 while *Thread 2*
   reaches the if statement at JournalImpl:1058 (note that *Thread 2* takes the
   lock on LargeServerMessageImpl at PostOfficeImpl:1410).
   5. This is where it gets tricky, as I was not able to reproduce this part:
   from our production thread stacks we can see that this if statement was not
   satisfied (which would cause checkKnownRecordID() to return); instead,
   method execution proceeded and got stuck on the appendExecutor behind
   *Thread 1*. I never managed to produce a state where this if is not
   satisfied (records/pendingRecords not containing the recordId), so to
   reproduce this part I altered the state of the records/pendingRecords
   map/set manually to let the checkKnownRecordID() method proceed to the
   appendExecutor section.
   6. *Thread 2* now waits behind *Thread 1* on the appendExecutor at
   JournalImpl:1080 and keeps the lock on the LargeServerMessageImpl object.
   7. Let *Thread 1* run: it will reach the LargeServerMessageImpl.getEncodeSize()
   method, which is synchronized, and it will remain deadlocked with *Thread 2*.
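
For step 3, any payload above min-large-message-size works. A minimal JmsTemplate sketch (the queue name and the 200 kB payload size are just examples):

import javax.jms.BytesMessage;

import org.springframework.jms.core.JmsTemplate;

public class LargeMessageProducer {

    private final JmsTemplate jmsTemplate;

    public LargeMessageProducer(JmsTemplate jmsTemplate) {
        this.jmsTemplate = jmsTemplate;
    }

    public void sendOneLargeMessage() {
        byte[] payload = new byte[200 * 1024]; // > min-large-message-size (100 kB)
        jmsTemplate.send("example.queue", session -> {
            BytesMessage message = session.createBytesMessage();
            message.writeBytes(payload);
            return message;
        });
    }
}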

Kind regards,
Mario

