ActiveMQ and Artemis reliability - Messages lost

ActiveMQ and Artemis reliability - Messages lost

veaceslavdoina
Hello everyone,

As was recently mentioned [1], we started to analyze the currently used
queue system. During the first iteration, we started to test reliability.

Testing environment:
AWS

Hardware:
| Producer   | m5.large   | 2vCPU / 8GiB |
| Consumer | t3.medium | 2vCPU / 2GiB |

Software:
OS: CentOS Linux release 7.5.1804
Java: Oracle JDK SE 8 Update 191
ActiveMQ: ActiveMQ 5.15.6
Artemis: ActiveMQ Artemis 2.6.3
Client: JmsTools [2]

Configuration:
ActiveMQ stand-alone:
1. activemq.xml: added 'individualDeadLetterStrategy'
2. ACTIVEMQ_OPTS=-Xms512M -Xmx7G
3. Started as systemd service

Artemis stand-alone:
1. artemis.profile: modified JAVA_ARGS to -Xms512M -Xmx7G
2. jolokia-access.xml: modified to <allow-origin>*</allow-origin>
3. Started as systemd service

Methodology:
1. Run producer for 5 minutes on the queue.
2. Run consumer to consume all messages from the same queue.
3. Iterate 5 times.
4. Restart activemq or artemis service during producer/consumer work,
depending on the test.

Client:
Producer:
java -Xms512m -Xmx3G -jar AmqJmsProducer-1.8-jar-with-dependencies.jar \
  -url failover:\(tcp://10.0.20.28:61616,tcp://10.0.20.28:61616\) \
  -notran \
  -log Logs \
  -t 10 \
  -duration 5 \
  -id \
  -type TEXT \
  -queue Test

Consumer:
java -Xms512m -Xmx3G -jar AmqJmsConsumer-1.8-jar-with-dependencies.jar \
-url failover:\(tcp://10.0.20.28:61616,tcp://10.0.20.28:61616\) \
-log Logs \
-timeout 60000 \
-t 10 \
-verify \
-drain \
-queue Test

Restarts:
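service=activemq
#service=artemis
interval=30   # seconds between restarts (120 was used for Artemis)
while :; do
  sudo systemctl restart "$service"
  # Print the "Active:" line to confirm the service came back up
  sudo systemctl status "$service" | grep "Active:.*;"
  sleep "$interval"
done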

> Note: The Artemis service was restarted every 120 seconds, and the restarts
> sometimes had to be stopped because Artemis takes a long time to start
> while it checks its journal.

Results:
<http://activemq.2283324.n4.nabble.com/file/t379260/Apache-ActiveMQ-v5.png>
<http://activemq.2283324.n4.nabble.com/file/t379260/Apache-Artemis-v2.png>

Questions:
1. Is there any configuration, other than the default one, that would let us
achieve zero message loss with ActiveMQ and Artemis?
2. If we assume that the tools used may have issues/bugs, which tool could we
use for this type of test?
3. Any other advice is welcome.


Thank you!

Slava.


Links:
[1] -
http://activemq.2283324.n4.nabble.com/ActiveMQ-5-x-reliability-and-ActiveMQ-Artemis-questions-tt4744108.html
[2] - https://github.com/erik-wramner/JmsTools



--
Sent from: http://activemq.2283324.n4.nabble.com/ActiveMQ-User-f2341805.html

Re: ActiveMQ and Artemis reliability - Messages lost

veaceslavdoina
Found a typo:

Hardware:
| ActiveMQ/Artemis    | m5.large   | 2vCPU / 8GiB |
| Producer/Consumer | t3.medium | 2vCPU / 2GiB |


Thank you!

Slava.




Re: ActiveMQ and Artemis reliability - Messages lost

Tim Bain
When you end the test, how do you ensure that your consumer gets to finish
consuming the remaining messages on the broker, without the producer
producing new messages? Put another way: you assert that messages are lost,
but I don't see anything in your results that proves that they're actually
lost vs. still on the broker (which is not lost). Can you demonstrate that
the problem here isn't actually with your test setup?

Tim

On Mon, Oct 29, 2018, 12:19 AM veaceslavdoina <[hidden email]> wrote:

> Found a typo:
>
> Hardware:
> | ActiveMQ/Artemis    | m5.large   | 2vCPU / 8GiB |
> | Producer/Consumer | t3.medium | 2vCPU / 2GiB |
>
>
> Thank you!
>
> Slava.

Re: ActiveMQ and Artemis reliability - Messages lost

veaceslavdoina
Hello Tim,

Yes, you are right, thank you for the observation!
I've added one more row to the table: 'Messages remained'.

During the test, each iteration uses its own queue name. The consumer is
configured to consume all messages from the queue with a 30-second timeout.

At the end of the test, I checked whether any messages remained in the
queue or in the corresponding DLQ, if one was created.

Please find an example of the finished test and updated tables with the
results:
<http://activemq.2283324.n4.nabble.com/file/t379260/Screen_Shot_2018-10-28_at_11.png>
<http://activemq.2283324.n4.nabble.com/file/t379260/Screen_Shot_2018-10-29_at_16.png>
<http://activemq.2283324.n4.nabble.com/file/t379260/Screen_Shot_2018-10-29_at_16.png>



Thank you!

Slava.




Re: ActiveMQ and Artemis reliability - Messages lost

jbertram
Is there a project or anything set up where users can easily reproduce the
results you're seeing (e.g. using their own machine)?


Justin


On Mon, Oct 29, 2018 at 9:56 AM veaceslavdoina <[hidden email]> wrote:

> Hello Tim,
>
> Yes, you are right, thank you for observation!
> I've added one more line in the table - 'Messages remained'.
>
> During the test, each iteration uses its own queue name. The consumer is
> configured to consume all messages from the queue with a 30 seconds
> timeout.
>
> At the end of the test, I just looked if there any messages remained in the
> queue or in an appropriate DLQ if it was created.
>
> Please find an example of the finished test and updated tables with the
> results:
> <
> http://activemq.2283324.n4.nabble.com/file/t379260/Screen_Shot_2018-10-28_at_11.png>
>
> <
> http://activemq.2283324.n4.nabble.com/file/t379260/Screen_Shot_2018-10-29_at_16.png>
>
> <
> http://activemq.2283324.n4.nabble.com/file/t379260/Screen_Shot_2018-10-29_at_16.png>
>
>
>
>
> Thank you!
>
> Slava.


Re: ActiveMQ and Artemis reliability - Messages lost

veaceslavdoina
jbertram,

Almost everything was done using Ansible and shell scripts:
1. Ansible - create the AWS environment.
2. Ansible - install AMQ/Artemis.
3. Shell - a script that runs JmsTools to execute the tests and generate the
reports.
4. A line of code to restart the systemd service.

If you are interested, I can share everything or just the component you are
interested in.


Thank you!

Slava.




Re: ActiveMQ and Artemis reliability - Messages lost

Tim Bain
Can you post the code for your producer? Is it sending persistent messages?
Does it set a TTL on the messages when it sends them?

Your ActiveMQ 5.x console screenshot shows that those queues were never
used. What is the screenshot expected to show?

Tim

On Mon, Oct 29, 2018, 9:26 AM veaceslavdoina <[hidden email]> wrote:

> jbertram,
>
> Almost all things were done using Ansible and shell script.
> 1. Ansible - Create AWS environment.
> 2. Ansible - Install AMQ/Artemis.
> 3. Shell - script for Jms Tools to produce the tests and generate the
> reports.
> 4. A line of code to restart systemd service.
>
> If you are interested, I may share all or just a component you are
> interested in.
>
>
> Thank you!
>
> Slava.

Re: ActiveMQ and Artemis reliability - Messages lost

veaceslavdoina
Tim,

I'm not a developer at the moment... This is why I tried to find the right
tool for such tests. I have tried more than 10 of the publicly available
ones. I also asked in the initial post which tool could be used for this
type of test; maybe someone knows a good one.

Currently, https://github.com/erik-wramner/JmsTools is used for the tests,
as it is the most suitable publicly available tool that I have found so far.

I just tried to send messages to AMQ:
java -jar AmqJmsProducer-1.7-jar-with-dependencies.jar \
  -url tcp://localhost:61616 -count 100 -id -type TEXT -queue Test

Resulting message properties:
Expiration: 0
Persistence: Persistent

Yes, the screenshots only show that 0 messages remained in the queues used
for the tests. This is because the AMQ service was restarted every 30
seconds during the tests, and only some statistics survive a restart:
Retained - Number Of Pending Messages
Not retained - Messages Enqueued
Not retained - Messages Dequeued

Currently, AMQ doesn't retain the enqueue and dequeue counters between
restarts.

Probably I should follow Justin's idea to create a full test description
with all required data for easy reproduction.


Thank you!

Slava.




Re: ActiveMQ and Artemis reliability - Messages lost

veaceslavdoina
Hello,

A project with a description and all the data required to reproduce the
tests has been posted on GitHub:
https://github.com/veaceslavdoina/messages-brokers-testing

An issue was raised in Jira: https://issues.apache.org/jira/browse/AMQ-7096


Slava.




Re: ActiveMQ and Artemis reliability - Messages lost

veaceslavdoina
The issue for Artemis was raised:
https://issues.apache.org/jira/browse/ARTEMIS-2173

Slava.

Re: ActiveMQ and Artemis reliability - Messages lost

jbertram
I strongly recommend you simplify your test for this.  Whatever the issue
is (assuming an issue actually exists), it will need to be replicated with
a test in the Artemis test-suite (i.e. using one or more embedded brokers
on a single machine).  I'd start with peeling back infrastructure layers
until you can reproduce the problem on a single machine.  Once that is done
you'll have a much better chance of getting someone to investigate.

For what it's worth, this is the standard operating procedure for reporting
issues.  See, for example, this recommendation from Stack Overflow [1].


Justin

[1] https://stackoverflow.com/help/mcve

On Mon, Nov 12, 2018 at 8:08 AM veaceslavdoina <[hidden email]> wrote:

> The issue for Artemis was raised:
> https://issues.apache.org/jira/browse/ARTEMIS-2173
>
> Slava.

Re: ActiveMQ and Artemis reliability - Messages lost

veaceslavdoina
Justin, thank you for the advice!

I just ran a test using a single instance for both the broker and the
producer/consumer. All communication went over localhost.

I got similar results:
https://github.com/veaceslavdoina/messages-brokers-testing/blob/master/RESULTS.md#test-artemis-263-standalone-local-20181112-172958

The provided playbooks can create a single instance with one Artemis broker:

1. Create an environment with one broker.
2. Install Artemis on the created instance.
3. Run the test on the broker using localhost for the producer/consumer.

The goal of the provided project is to make the test results easy to
reproduce.

As for the Artemis test-suite, that is probably a separate task.

Slava.




Re: ActiveMQ and Artemis reliability - Messages lost

jgenender
I looked at the jmstool, and I am not convinced of any bugs in Artemis and
AMQ.  It uses an external logging mechanism to consider the state of
messages, which is likely not what will be in ActiveMQ or Artemis, nor may
it be truly representative of messages that have been passed (or not).  What
does the tool do, from a log analysis standpoint, if consumption dies due to
a disconnect and then reconnects?

If you want to test this, then you need consumers/producers with counts and
at the end, verify the enqueue/dequeue counts and message states WITHIN the
brokers.
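
A minimal sketch of such a reconciliation, assuming the tester has already collected four numbers by hand or from the broker console: the producer's sent count, the consumer's received count, the pending count on the queue, and the count in its DLQ. The function name `unaccounted` and the sample figures are made up for illustration. A count-based check like this cannot tell a lost message from one that is offset by a duplicate.

```shell
#!/bin/sh
# Assumed inputs, all supplied by the tester (hypothetical, not tool output):
#   $1 sent     - messages the producer reports as successfully sent
#   $2 received - messages the consumer reports as received
#   $3 pending  - messages still on the queue after the consumer stops
#   $4 dlq      - messages sitting in the queue's dead-letter queue
# Prints how many sent messages cannot be accounted for.
# A negative result means more messages were seen than were sent (duplicates).
unaccounted() {
  echo $(( $1 - $2 - $3 - $4 ))
}

# Example with made-up numbers: 1000 sent, 990 received, 7 pending, 3 in the DLQ.
unaccounted 1000 990 7 3
```

A result of zero means every sent message was either consumed or is still held by the broker; a positive result is the number of messages that need further investigation.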

I further believe that the transactional logging mechanisms within the
brokers themselves mostly guarantee little loss.  Especially if XA is being
used.  In fact, if you restart the broker (at least for AMQ), and you hit
the PREPARE phase of the XA, those messages will be in a PREPARED state until
you decide to do something with them, which is not a "lost message" in any
sense.  By spec you need to either RECOVER that message, heuristically
decide on its disposition, or it remains in PREPARED until you do something
with it.   With regard to jmstool, it would likely flag those as
gone/missing.

I'm in full agreement with jbertram on this.  You need to simplify your
tests and verify the location of the messages.   If you are killing the
broker in the manner that you are, you are more likely to get partial writes
(assuming the broker's shutdown hook doesn't fire).  Your broker is only as
good as the writes to the disk are verified and completed.

For AMQ and Artemis, my bets are on very little message loss if using full
persistence and you will need a more robust way of testing if messages have
been produced/consumed/lost.  The tool that you are using doesn't quite cut
it due to the LogAnalyzer which likely wasn't made to handle
disconnects/failover/re-reads.   It lacks serious error handling within its
logging to be more accurate on whether messages are truly produced and
consumed.  I'm sure it works mostly well on a stable broker to measure
throughput, etc, but if you are going to create broken connections, it needs
a lot more code to handle that for accuracy of where the messages are and
whether there really are dups or other negative events.

Just my .02.




Re: ActiveMQ and Artemis reliability - Messages lost

erik-wramner
I'm the author of JmsTools. I don't have the time to deep-dive into this
issue now, I just want to point out a few things.

The tool simulates an application, so it ignores whatever is in
ActiveMQ/Artemis. So will a real application. If a message is lost to the
tool it is also lost to the real application and then it is lost.

All sent messages are stamped with unique ids. The log analyzer verifies
that all sent messages are received. That is better than using counters. One
duplicate and one lost message will even out with a counter, but will be
detected with unique ids.
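
The difference can be illustrated with a minimal sketch. It assumes two plain-text files with one message ID per line; the file names, the ID values and the helper functions `lost_ids` and `duplicate_ids` are hypothetical and do not reflect the actual JmsTools log format or its log analyzer.

```shell
#!/bin/sh
# Hypothetical input: one message ID per line. This is NOT the JmsTools log
# format, only an illustration of ID-based versus count-based verification.

# Print IDs that appear in the sent file ($1) but not in the received file ($2).
lost_ids() {
  sent_sorted=$(mktemp)
  recv_sorted=$(mktemp)
  sort -u "$1" > "$sent_sorted"
  sort -u "$2" > "$recv_sorted"
  comm -23 "$sent_sorted" "$recv_sorted"
  rm -f "$sent_sorted" "$recv_sorted"
}

# Print IDs that appear more than once in the received file ($1).
duplicate_ids() {
  sort "$1" | uniq -d
}

workdir=$(mktemp -d)
printf '%s\n' id-1 id-2 id-3 > "$workdir/sent.txt"
printf '%s\n' id-1 id-1 id-2 > "$workdir/received.txt"   # id-3 lost, id-1 duplicated

# A counter comparison sees the same number of lines on both sides.
echo "sent=$(grep -c '' "$workdir/sent.txt") received=$(grep -c '' "$workdir/received.txt")"

# The ID comparison reports both problems.
echo "lost: $(lost_ids "$workdir/sent.txt" "$workdir/received.txt")"
echo "duplicated: $(duplicate_ids "$workdir/received.txt")"
```

Both files contain three lines, so a pure counter check passes, while the ID comparison reports one lost and one duplicated message.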

The tool does handle disconnects and reconnects in the same way as a real
application and the errors are logged. There are some places where there are
race conditions that are not detected by the log analyzer. One would need to
read the tool's detailed logs and investigate, possibly also in the broker
logs. However, with very few exceptions it is just as stable (or more) as a
real JMS application would be. If it detects an issue a real application
would also have an issue.

The tool will recover XA transactions if allowed to run long enough and it
will log heuristic transactions as such (? instead of C or R). If the tool
is stopped before recovery has completed (which is likely in a short test)
then the prepared transactions may remain in the broker, of course. That
only applies when using XA, I don't think this test did that? With normal
transactions the messages should remain in the queue, in a DLQ or they
should have been delivered.

I've worked with and detected issues with standard ActiveMQ before. Some of
them can be worked around with configuration (both on client and broker),
others are harder and can also be difficult to reproduce as they happen due
to race conditions. If they only happen in corner cases under high load it
is very hard to write an isolated unit test.

I don't mean to say that the tool is perfect and this is not "fire and
forget", a significant amount of analysis work is needed to find out where
the problems are. However, if the tool gets in trouble most if not all
real-world JMS applications would get in trouble too and then something
should be done. Usually tweaking the configuration or investigating and
filing a detailed bug report.

-Erik





Re: ActiveMQ and Artemis reliability - Messages lost

jgenender
Hi Erik,

Thank you for responding.  Your tool certainly is useful from a
client/application perspective.  However to be clear, I wasn’t knocking your
tool, but it was what was used to state that AMQ and Artemis (and Rabbit?)
are losing/duplicating messages, then bugs were opened on this, which I
believe was putting the cart way before the horse.  Your tool is a great
exerciser of the JMS brokers, but I did identify several areas where it
would break down… and as you alluded, this is not the forum for that
discussion.  If it's going to measure true message loss, etc, then there is
certainly a lot more room to make the tool more robust.  The user utilized
the tool, made strong claims and opened bugs on the report findings, which I
believe are not as accurate as they could be.  

What I want to be very clear about is that killing a broker mid-write is
going to lose something.  There is no getting around that.
That's not a bug.  That will happen with brokers, databases, and just about
any other process that writes to disk.  The brokers do a fine job of
attempting to keep that as much to a minimum as possible... and if you want
to minimize that even more, tuning of the OS and the NFS/EFS platform can go
a long way to ensuring buffer writes/flushes are completed.  But at the end
of the day, preventing that 100% is not possible.




Re: ActiveMQ and Artemis reliability - Messages lost

veaceslavdoina