Artemis 2.4.0 message loss in durability tests upon system power-off

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Artemis 2.4.0 message loss in durability tests upon system power-off

Anindya Haldar
We are in the process of qualifying Artemis 2.4.0 for our stack. We ran some message durability related tests in the face of a power failure. The broker is running in a VirtualBox VM, and is set up in a system where disk caching is disabled. The VM runs OEL Linux 7, and the VirtualBox Manger itself is running under Windows 7 Enterprise.

 

We use JMS API and persistent messaging. The transaction batch size in the producers is 1, and the message size for the tests in 1024 bytes. No consumers are running at this time, and we let the queues build up. Then the VirtualBox VM running the broker is 'powered off' (using VirtualBox facilities) 5 minutes along the way. The producers detect the broker's absence and stop.

 

Then we resume the VM and the broker. The broker starts up and we get the queue stats from it before anything else:

 

|NAME                     |ADDRESS                  |CONSUMER_COUNT |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
|testQueue1               |testQueue1               |0              |106988        |106988         |0                |0              |
|testQueue2               |testQueue2               |0              |107077        |107077         |0                |0              |
|testQueue3               |testQueue3               |0              |106996        |106996         |0                |0              |
|testQueue4               |testQueue4               |0              |107076        |107076         |0                |0              |

 

The total message count across the queues is 428137.

Now we start the consumers (no producers this time). Finally when the consumers finish, we get the stats again. The consumers are claiming that they received and acknowledged 428126 messages, which is corroborated by the broker in the MESSAGES_ACKED column.

 

|NAME                     |ADDRESS                  |CONSUMER_COUNT |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |

|testQueue1               |testQueue1               |0              |0             |106988         |0                |106984         |

|testQueue2               |testQueue2               |0              |0             |107077         |0                |107074         |

|testQueue3               |testQueue3               |0              |0             |106996         |0                |106992         |

|testQueue4               |testQueue4               |0              |0             |107076         |0                |107076         |

 

You can clearly see some apparent anomalies:

1)      Post failure, and upon resumption, the broker said it had 428,137 messages in the test queues, all combined (column MESSAGES_ADDED).

2)      When the consumers consumed it got 428,126 messages and acknowledged all of them. That is 11 short of 428,137.

3)      The broker, upon the consumers' completion reported 0 queue depth, but also said it got acknowledgements on 428,126 messages (column MESSAGES_ACKED).

 

Questions:

1)      If we assume the 'MESSAGES_ADDED' column is accurate, then what happed to additional 11 messages that the consumers never received, and, as a result never acknowledged?

2)      If, according to the broker, the number of acknowledged messages is 11 less than the number of messages added to the queue, why did it declare the queues to be empty when 11 of the messages were not acknowledged?

3)      If we trust the 'MESSAGES_ADDED' stats as a baseline number then the system lost messages. And if we do not trust that statistic then what do we trust, and how do we know if it lost messages?

 

The system ran into this issue 3 out of 4 times I ran the VM power failure test (with slightly different statistics, of course). We are very concerned that it is symptom of message loss in the system, and are also concerned about how to explain the anomalies. Will greatly appreciate any pointer that can help us understand and address the underlying issue here.

 

Thanks,

Anindya Haldar

Oracle Marketing Cloud

 
Reply | Threaded
Open this post in threaded view
|

Re: Artemis 2.4.0 message loss in durability tests upon system power-off

Martyn Taylor
Hi Anindya,

I would be very surprised if messages were lost due to an ActiveMQ Artemis
broker issue.  We have a lot of very extensive testing around these exact
use cases.  Could you please check a couple of things:

1. Can you confirm that the same guarantees for disk sync apply to a host
VM as they do for an OS on bare metal.
2. In case of a temporary discrepancy in the queue metrics.  Could you
measure the number of messages sent by the producers and compare that to
the the number consumed.

Thanks


On Tue, Feb 6, 2018 at 1:11 AM, Anindya Haldar <[hidden email]>
wrote:

> We are in the process of qualifying Artemis 2.4.0 for our stack. We ran
> some message durability related tests in the face of a power failure. The
> broker is running in a VirtualBox VM, and is set up in a system where disk
> caching is disabled. The VM runs OEL Linux 7, and the VirtualBox Manger
> itself is running under Windows 7 Enterprise.
>
>
>
> We use JMS API and persistent messaging. The transaction batch size in the
> producers is 1, and the message size for the tests in 1024 bytes. No
> consumers are running at this time, and we let the queues build up. Then
> the VirtualBox VM running the broker is 'powered off' (using VirtualBox
> facilities) 5 minutes along the way. The producers detect the broker's
> absence and stop.
>
>
>
> Then we resume the VM and the broker. The broker starts up and we get the
> queue stats from it before anything else:
>
>
>
> |NAME                     |ADDRESS                  |CONSUMER_COUNT
> |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
> |testQueue1               |testQueue1               |0
> |106988        |106988         |0                |0              |
> |testQueue2               |testQueue2               |0
> |107077        |107077         |0                |0              |
> |testQueue3               |testQueue3               |0
> |106996        |106996         |0                |0              |
> |testQueue4               |testQueue4               |0
> |107076        |107076         |0                |0              |
>
>
>
> The total message count across the queues is 428137.
>
> Now we start the consumers (no producers this time). Finally when the
> consumers finish, we get the stats again. The consumers are claiming that
> they received and acknowledged 428126 messages, which is corroborated by
> the broker in the MESSAGES_ACKED column.
>
>
>
> |NAME                     |ADDRESS                  |CONSUMER_COUNT
> |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
>
> |testQueue1               |testQueue1               |0              |0
>          |106988         |0                |106984         |
>
> |testQueue2               |testQueue2               |0              |0
>          |107077         |0                |107074         |
>
> |testQueue3               |testQueue3               |0              |0
>          |106996         |0                |106992         |
>
> |testQueue4               |testQueue4               |0              |0
>          |107076         |0                |107076         |
>
>
>
> You can clearly see some apparent anomalies:
>
> 1)      Post failure, and upon resumption, the broker said it had 428,137
> messages in the test queues, all combined (column MESSAGES_ADDED).
>
> 2)      When the consumers consumed it got 428,126 messages and
> acknowledged all of them. That is 11 short of 428,137.
>
> 3)      The broker, upon the consumers' completion reported 0 queue depth,
> but also said it got acknowledgements on 428,126 messages (column
> MESSAGES_ACKED).
>
>
>
> Questions:
>
> 1)      If we assume the 'MESSAGES_ADDED' column is accurate, then what
> happed to additional 11 messages that the consumers never received, and, as
> a result never acknowledged?
>
> 2)      If, according to the broker, the number of acknowledged messages
> is 11 less than the number of messages added to the queue, why did it
> declare the queues to be empty when 11 of the messages were not
> acknowledged?
>
> 3)      If we trust the 'MESSAGES_ADDED' stats as a baseline number then
> the system lost messages. And if we do not trust that statistic then what
> do we trust, and how do we know if it lost messages?
>
>
>
> The system ran into this issue 3 out of 4 times I ran the VM power failure
> test (with slightly different statistics, of course). We are very concerned
> that it is symptom of message loss in the system, and are also concerned
> about how to explain the anomalies. Will greatly appreciate any pointer
> that can help us understand and address the underlying issue here.
>
>
>
> Thanks,
>
> Anindya Haldar
>
> Oracle Marketing Cloud
>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Artemis 2.4.0 message loss in durability tests upon system power-off

jbertram
In reply to this post by Anindya Haldar
The "messages added" metric for a queue is volatile so when the broker is
stopped it will be reset.  When the broker is started again the "messages
added" will be 0.  In your test you say the broker is "powered off" and
then you "resume" the broker.  What exactly does this mean?  It seems clear
that you aren't actually shutting down the broker otherwise the "messages
added" would be 0 when you started your consumers.  Please clarify.

Also, how do the broker's metrics compare with the producer's and
consumer's metrics?  I assume here that the producer and consumer are both
tracking the number of messages they produce/consume.

Also, do you have a way to reproduce this without a VM?


Justin

On Mon, Feb 5, 2018 at 7:11 PM, Anindya Haldar <[hidden email]>
wrote:

> We are in the process of qualifying Artemis 2.4.0 for our stack. We ran
> some message durability related tests in the face of a power failure. The
> broker is running in a VirtualBox VM, and is set up in a system where disk
> caching is disabled. The VM runs OEL Linux 7, and the VirtualBox Manger
> itself is running under Windows 7 Enterprise.
>
>
>
> We use JMS API and persistent messaging. The transaction batch size in the
> producers is 1, and the message size for the tests in 1024 bytes. No
> consumers are running at this time, and we let the queues build up. Then
> the VirtualBox VM running the broker is 'powered off' (using VirtualBox
> facilities) 5 minutes along the way. The producers detect the broker's
> absence and stop.
>
>
>
> Then we resume the VM and the broker. The broker starts up and we get the
> queue stats from it before anything else:
>
>
>
> |NAME                     |ADDRESS                  |CONSUMER_COUNT
> |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
> |testQueue1               |testQueue1               |0
> |106988        |106988         |0                |0              |
> |testQueue2               |testQueue2               |0
> |107077        |107077         |0                |0              |
> |testQueue3               |testQueue3               |0
> |106996        |106996         |0                |0              |
> |testQueue4               |testQueue4               |0
> |107076        |107076         |0                |0              |
>
>
>
> The total message count across the queues is 428137.
>
> Now we start the consumers (no producers this time). Finally when the
> consumers finish, we get the stats again. The consumers are claiming that
> they received and acknowledged 428126 messages, which is corroborated by
> the broker in the MESSAGES_ACKED column.
>
>
>
> |NAME                     |ADDRESS                  |CONSUMER_COUNT
> |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
>
> |testQueue1               |testQueue1               |0              |0
>          |106988         |0                |106984         |
>
> |testQueue2               |testQueue2               |0              |0
>          |107077         |0                |107074         |
>
> |testQueue3               |testQueue3               |0              |0
>          |106996         |0                |106992         |
>
> |testQueue4               |testQueue4               |0              |0
>          |107076         |0                |107076         |
>
>
>
> You can clearly see some apparent anomalies:
>
> 1)      Post failure, and upon resumption, the broker said it had 428,137
> messages in the test queues, all combined (column MESSAGES_ADDED).
>
> 2)      When the consumers consumed it got 428,126 messages and
> acknowledged all of them. That is 11 short of 428,137.
>
> 3)      The broker, upon the consumers' completion reported 0 queue depth,
> but also said it got acknowledgements on 428,126 messages (column
> MESSAGES_ACKED).
>
>
>
> Questions:
>
> 1)      If we assume the 'MESSAGES_ADDED' column is accurate, then what
> happed to additional 11 messages that the consumers never received, and, as
> a result never acknowledged?
>
> 2)      If, according to the broker, the number of acknowledged messages
> is 11 less than the number of messages added to the queue, why did it
> declare the queues to be empty when 11 of the messages were not
> acknowledged?
>
> 3)      If we trust the 'MESSAGES_ADDED' stats as a baseline number then
> the system lost messages. And if we do not trust that statistic then what
> do we trust, and how do we know if it lost messages?
>
>
>
> The system ran into this issue 3 out of 4 times I ran the VM power failure
> test (with slightly different statistics, of course). We are very concerned
> that it is symptom of message loss in the system, and are also concerned
> about how to explain the anomalies. Will greatly appreciate any pointer
> that can help us understand and address the underlying issue here.
>
>
>
> Thanks,
>
> Anindya Haldar
>
> Oracle Marketing Cloud
>
>
>
Reply | Threaded
Open this post in threaded view
|

RE: Artemis 2.4.0 message loss in durability tests upon system power-off

Anindya Haldar
In reply to this post by Martyn Taylor
Hi Martyn,

Thanks for your reply.

1) On the disk sync guarantee for the VirtualBox VM I don't know the precise answer, and did not find a good answer when I looked around. So we ran alternate experiments where we mounted a Windows host folder (instead of a virtual disk image) into the VM, and have Artemis use that. Here's what I found when I powered off the VM, and restarted it again.

    a) Artemis failed to even start up successfully if it is told to use that mounted external directory also for paging, on top of journaling. The broker would fail early, and won’t even bind successfully to the listener port.

    b) Artemis starts up successfully if we let it use its default paging directory, and have it use the mounted external directory for journaling only. Then the broker can resume after we restart the system, although the apparent message loss symptom does not change when we account for the number of committed messages by the producers before the power failure test, and the number we found in the queues upon recovery.

2) My biggest concern is that when we browse the queues (or consume the messages from them) after the system recovers from power failure is triggered, the total number of messages detected (or delivered to the consumer) is less than the total number of messages committed successfully by the producer before the power failure was triggered. To us that is ominous.

BTW, the Windows host system hard disk caching is tuned off. And we never power off the host system, but only the VirtualBox VM during the tests.

Thanks,
Anindya Haldar

Oracle Marketing Cloud


-----Original Message-----
From: Martyn Taylor [mailto:[hidden email]]
Sent: Wednesday, February 14, 2018 2:42 AM
To: [hidden email]
Subject: Re: Artemis 2.4.0 message loss in durability tests upon system power-off

Hi Anindya,

I would be very surprised if messages were lost due to an ActiveMQ Artemis broker issue.  We have a lot of very extensive testing around these exact use cases.  Could you please check a couple of things:

1. Can you confirm that the same guarantees for disk sync apply to a host VM as they do for an OS on bare metal.
2. In case of a temporary discrepancy in the queue metrics.  Could you measure the number of messages sent by the producers and compare that to the the number consumed.

Thanks


On Tue, Feb 6, 2018 at 1:11 AM, Anindya Haldar <[hidden email]>
wrote:

> We are in the process of qualifying Artemis 2.4.0 for our stack. We
> ran some message durability related tests in the face of a power
> failure. The broker is running in a VirtualBox VM, and is set up in a
> system where disk caching is disabled. The VM runs OEL Linux 7, and
> the VirtualBox Manger itself is running under Windows 7 Enterprise.
>
>
>
> We use JMS API and persistent messaging. The transaction batch size in
> the producers is 1, and the message size for the tests in 1024 bytes.
> No consumers are running at this time, and we let the queues build up.
> Then the VirtualBox VM running the broker is 'powered off' (using
> VirtualBox
> facilities) 5 minutes along the way. The producers detect the broker's
> absence and stop.
>
>
>
> Then we resume the VM and the broker. The broker starts up and we get
> the queue stats from it before anything else:
>
>
>
> |NAME                     |ADDRESS                  |CONSUMER_COUNT
> |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
> |testQueue1               |testQueue1               |0
> |106988        |106988         |0                |0              |
> |testQueue2               |testQueue2               |0
> |107077        |107077         |0                |0              |
> |testQueue3               |testQueue3               |0
> |106996        |106996         |0                |0              |
> |testQueue4               |testQueue4               |0
> |107076        |107076         |0                |0              |
>
>
>
> The total message count across the queues is 428137.
>
> Now we start the consumers (no producers this time). Finally when the
> consumers finish, we get the stats again. The consumers are claiming
> that they received and acknowledged 428126 messages, which is
> corroborated by the broker in the MESSAGES_ACKED column.
>
>
>
> |NAME                     |ADDRESS                  |CONSUMER_COUNT
> |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
>
> |testQueue1               |testQueue1               |0              |0
>          |106988         |0                |106984         |
>
> |testQueue2               |testQueue2               |0              |0
>          |107077         |0                |107074         |
>
> |testQueue3               |testQueue3               |0              |0
>          |106996         |0                |106992         |
>
> |testQueue4               |testQueue4               |0              |0
>          |107076         |0                |107076         |
>
>
>
> You can clearly see some apparent anomalies:
>
> 1)      Post failure, and upon resumption, the broker said it had 428,137
> messages in the test queues, all combined (column MESSAGES_ADDED).
>
> 2)      When the consumers consumed it got 428,126 messages and
> acknowledged all of them. That is 11 short of 428,137.
>
> 3)      The broker, upon the consumers' completion reported 0 queue depth,
> but also said it got acknowledgements on 428,126 messages (column
> MESSAGES_ACKED).
>
>
>
> Questions:
>
> 1)      If we assume the 'MESSAGES_ADDED' column is accurate, then what
> happed to additional 11 messages that the consumers never received,
> and, as a result never acknowledged?
>
> 2)      If, according to the broker, the number of acknowledged messages
> is 11 less than the number of messages added to the queue, why did it
> declare the queues to be empty when 11 of the messages were not
> acknowledged?
>
> 3)      If we trust the 'MESSAGES_ADDED' stats as a baseline number then
> the system lost messages. And if we do not trust that statistic then
> what do we trust, and how do we know if it lost messages?
>
>
>
> The system ran into this issue 3 out of 4 times I ran the VM power
> failure test (with slightly different statistics, of course). We are
> very concerned that it is symptom of message loss in the system, and
> are also concerned about how to explain the anomalies. Will greatly
> appreciate any pointer that can help us understand and address the underlying issue here.
>
>
>
> Thanks,
>
> Anindya Haldar
>
> Oracle Marketing Cloud
>
>
>
Reply | Threaded
Open this post in threaded view
|

RE: Artemis 2.4.0 message loss in durability tests upon system power-off

Anindya Haldar
In reply to this post by jbertram
We powered off the VM while the producers were kicking and alive, and no one was consuming. Then we tallied for the number of committed messages by the producers. After that we restart the VM, and then restart the broker, and take the queue stats. Then we use the JMS QueueBrowser API to count the number of actual messages in the queues. Finally we consumer all messages from the queues and tally them against the number of messages committed by the producers at the time the failure was triggered.

We are looking forward to run the tests using a bare metal system in order to eliminate VirtualBox VM from the picture.

Thanks,
Anindya Haldar

Oracle Marking Cloud

-----Original Message-----
From: Justin Bertram [mailto:[hidden email]]
Sent: Wednesday, February 14, 2018 6:58 AM
To: [hidden email]
Subject: Re: Artemis 2.4.0 message loss in durability tests upon system power-off

The "messages added" metric for a queue is volatile so when the broker is stopped it will be reset.  When the broker is started again the "messages added" will be 0.  In your test you say the broker is "powered off" and then you "resume" the broker.  What exactly does this mean?  It seems clear that you aren't actually shutting down the broker otherwise the "messages added" would be 0 when you started your consumers.  Please clarify.

Also, how do the broker's metrics compare with the producer's and consumer's metrics?  I assume here that the producer and consumer are both tracking the number of messages they produce/consume.

Also, do you have a way to reproduce this without a VM?


Justin

On Mon, Feb 5, 2018 at 7:11 PM, Anindya Haldar <[hidden email]>
wrote:

> We are in the process of qualifying Artemis 2.4.0 for our stack. We
> ran some message durability related tests in the face of a power
> failure. The broker is running in a VirtualBox VM, and is set up in a
> system where disk caching is disabled. The VM runs OEL Linux 7, and
> the VirtualBox Manger itself is running under Windows 7 Enterprise.
>
>
>
> We use JMS API and persistent messaging. The transaction batch size in
> the producers is 1, and the message size for the tests in 1024 bytes.
> No consumers are running at this time, and we let the queues build up.
> Then the VirtualBox VM running the broker is 'powered off' (using
> VirtualBox
> facilities) 5 minutes along the way. The producers detect the broker's
> absence and stop.
>
>
>
> Then we resume the VM and the broker. The broker starts up and we get
> the queue stats from it before anything else:
>
>
>
> |NAME                     |ADDRESS                  |CONSUMER_COUNT
> |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
> |testQueue1               |testQueue1               |0
> |106988        |106988         |0                |0              |
> |testQueue2               |testQueue2               |0
> |107077        |107077         |0                |0              |
> |testQueue3               |testQueue3               |0
> |106996        |106996         |0                |0              |
> |testQueue4               |testQueue4               |0
> |107076        |107076         |0                |0              |
>
>
>
> The total message count across the queues is 428137.
>
> Now we start the consumers (no producers this time). Finally when the
> consumers finish, we get the stats again. The consumers are claiming
> that they received and acknowledged 428126 messages, which is
> corroborated by the broker in the MESSAGES_ACKED column.
>
>
>
> |NAME                     |ADDRESS                  |CONSUMER_COUNT
> |MESSAGE_COUNT |MESSAGES_ADDED |DELIVERING_COUNT |MESSAGES_ACKED |
>
> |testQueue1               |testQueue1               |0              |0
>          |106988         |0                |106984         |
>
> |testQueue2               |testQueue2               |0              |0
>          |107077         |0                |107074         |
>
> |testQueue3               |testQueue3               |0              |0
>          |106996         |0                |106992         |
>
> |testQueue4               |testQueue4               |0              |0
>          |107076         |0                |107076         |
>
>
>
> You can clearly see some apparent anomalies:
>
> 1)      Post failure, and upon resumption, the broker said it had 428,137
> messages in the test queues, all combined (column MESSAGES_ADDED).
>
> 2)      When the consumers consumed it got 428,126 messages and
> acknowledged all of them. That is 11 short of 428,137.
>
> 3)      The broker, upon the consumers' completion reported 0 queue depth,
> but also said it got acknowledgements on 428,126 messages (column
> MESSAGES_ACKED).
>
>
>
> Questions:
>
> 1)      If we assume the 'MESSAGES_ADDED' column is accurate, then what
> happed to additional 11 messages that the consumers never received,
> and, as a result never acknowledged?
>
> 2)      If, according to the broker, the number of acknowledged messages
> is 11 less than the number of messages added to the queue, why did it
> declare the queues to be empty when 11 of the messages were not
> acknowledged?
>
> 3)      If we trust the 'MESSAGES_ADDED' stats as a baseline number then
> the system lost messages. And if we do not trust that statistic then
> what do we trust, and how do we know if it lost messages?
>
>
>
> The system ran into this issue 3 out of 4 times I ran the VM power
> failure test (with slightly different statistics, of course). We are
> very concerned that it is symptom of message loss in the system, and
> are also concerned about how to explain the anomalies. Will greatly
> appreciate any pointer that can help us understand and address the underlying issue here.
>
>
>
> Thanks,
>
> Anindya Haldar
>
> Oracle Marketing Cloud
>
>
>