Need help investigating committed messages resurfacing after restart.

hakanj
We have a very strange problem with our ActiveMQ broker that happens when we restart it after running it for several months. We have not been able to reproduce it in-house; it only happens in production at our customer's site. At this time we have seen this issue at least three times.

What happens is that some very old (~1 month) messages come back from the dead when we restart the broker. The last time this happened we saw ~1600 resurrected messages. We can see in our logs that these messages had already been processed. More messages have since been written to the same queue, but those did not come back to life.

At the time of broker shutdown the problematic queue was empty, with no consumers or producers connected (they were stopped before the broker was stopped).

We tried deleting the KahaDB directory the last time this happened, but that didn't help.

The log does not contain any errors during startup. We get a couple of rows similar to "Recovery replayed 170629 operations" in the log, but as far as I know this is normal. If I am wrong about this being normal, please let me know.

---

We have a mix of applications that communicate using an ActiveMQ broker. The Java applications, like Wildfly, use the OpenWire protocol. The C++ applications use the AMQP protocol, with the "transport.transformer=jms" setting in "activemq.xml".

The C++ applications use transactional sessions to consume messages from the broker. The problematic queues are consumed by one of these C++ applications. The messages are produced by one of the Java/Wildfly applications.

Most, but not all, messages sent through the broker are non-persistent. The messages on the problem queues are persistent.

We use queues for everything; topics are not used.

There is only a single ActiveMQ broker instance.

ActiveMQ version: 5.14.5.

---

We have tried to trigger this error in-house, but without any success. We have only seen it in our customer's production environment.

Our own hypothesis is that the KahaDB data file containing the acks is deleted, but not the file containing the actual messages. After restart it then looks as if the messages were never consumed. Even if this is true, we have no idea how it can happen. All our manual tests show that ActiveMQ does the right thing. There must be some special case that we cannot find.

Does anyone have an idea what could be the cause of this issue?
Has anyone seen anything like this before?
Any ideas on how to get a reproducible case?




--
Sent from: http://activemq.2283324.n4.nabble.com/ActiveMQ-User-f2341805.html

Re: Need help investigating committed messages resurfacing after restart.

Tim Bain
I've never seen anything like this on the mailing list, and the one project where I've been directly involved in managing brokers used non-persistent messaging exclusively, so I've never seen it personally either.

Here are a few crazy ideas of what might possibly cause the behavior you
described:

   - As you said, maybe a data file containing acks for earlier (old)
     messages was deleted.
      - Maybe a bug in the ack compaction feature that was added recently
        to prevent a few old messages from keeping chains of data files
        alive?
      - Maybe something within the environment and completely unrelated
        to ActiveMQ (a cron job that archives files that haven't been
        touched in a certain amount of time, for example)?
   - Maybe the acks were never actually put into KahaDB to begin with,
     and the client was working off of state that existed only in memory.
      - How did you determine that all messages on the queue were
        consumed before the restart? I'd recommend confirming that
        EnqueueCount and DequeueCount match for the queue when viewed via
        a JMX viewer such as JConsole.
   - Maybe the consumption of the messages happened within a transaction
     that was rolled back for some reason after a long period of time? I
     don't think transactions actually allow that, but I don't know them
     well enough to be sure.

All of those seem pretty crazy, and the least-crazy one does sound like the
deletion of a data file, but I'd try to rule the other two out if you can.
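As a side note, the EnqueueCount/DequeueCount comparison is easy to script once the counters have been dumped as text. A minimal sketch in Python — the `Name = value` line format is an assumption about how the JMX attributes were exported, not a documented ActiveMQ output format:

```python
import re

def counts_match(query_output: str) -> bool:
    """Parse 'EnqueueCount = N' / 'DequeueCount = N' lines from a queue
    statistics dump and report whether the two counters agree."""
    counts = {}
    for line in query_output.splitlines():
        m = re.match(r"\s*(EnqueueCount|DequeueCount)\s*=\s*(\d+)", line)
        if m:
            counts[m.group(1)] = int(m.group(2))
    # Both counters must be present and equal for the queue to count as drained.
    return ("EnqueueCount" in counts and "DequeueCount" in counts
            and counts["EnqueueCount"] == counts["DequeueCount"])
```

Run against a dump taken just before shutdown, this gives a yes/no record of whether the queue was really empty at that moment.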

Assuming it really is a problem of KahaDB losing the log files, I'd suggest
getting the customer to do the following:

   1. Archive all KahaDB data files to ensure that you've got the full
      set to work with.
   2. Enable TRACE-level logging as described in
      http://activemq.apache.org/why-do-kahadb-log-files-remain-after-cleanup.html.
   3. When the problem occurs, pull the full set of data files from the
      customer's production environment. Walk backwards through the
      TRACE-level logs, recreating the set of data files that existed
      after each checkpoint operation and restarting the broker. You're
      looking for the point where, upon restart of the broker, the
      "resurrected" messages are no longer considered live. Then you know
      which deletion caused the problem, and you can scan the surrounding
      logs for any indication of what might have gone wrong.
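The TRACE logging from step 2 boils down to a log4j.properties fragment along these lines (the `MessageDatabase` logger name is the one given on the linked page; the appender name and file path are illustrative):

```properties
# Route KahaDB journal/cleanup housekeeping to its own TRACE log
log4j.appender.kahadb=org.apache.log4j.RollingFileAppender
log4j.appender.kahadb.file=${activemq.base}/data/kahadb.log
log4j.appender.kahadb.maxFileSize=32mb
log4j.appender.kahadb.append=true
log4j.appender.kahadb.layout=org.apache.log4j.PatternLayout
log4j.appender.kahadb.layout.ConversionPattern=%d [%-15.15t] %-5p %-30.30c{1} - %m%n
log4j.logger.org.apache.activemq.store.kahadb.MessageDatabase=TRACE, kahadb
```

Note that the page also suggests a maxBackupIndex; for this investigation you'd want old files archived rather than discarded, as discussed below.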

It's not a very strong debugging plan, but I'm struggling to think of
another way to troubleshoot this, so maybe it'll be useful to you.

Tim

On Mon, Oct 23, 2017 at 6:00 AM, hakanj <[hidden email]> wrote:

Re: Need help investigating committed messages resurfacing after restart.

hakanj
It is possible that there is a bug in the ack compaction feature, but if so it only shows itself in some very special cases. I have tried stressing it, but to no avail. In all my tests the broker does the right thing.

There are no cron jobs deleting any files in this system. The broker thinks that the KahaDB database is structurally sound; there are no complaints about missing files during startup.

When we looked in the logs of our other applications we could see that the resurrected messages were processed as expected, when expected, including a transactional commit to the broker. Before the broker was taken down the queue was empty, with matching EnqueueCount and DequeueCount.

The TRACE-level idea is great and will probably show what is happening, but I am afraid of running out of disk space. The logs would be huge, as we need to run the system for several months before the issue happens. The broker is running on a virtual machine with limited available disk. There is also the performance impact to take into consideration; we would need to do some performance measurements before trying this in production. If the performance impact is not too bad, then we could ask for a proper disk to be added to the machine and move the logging there with TRACE enabled.

My current plan is to take a copy of the KahaDB directory once a week and take a look at it. I have found some utilities that show the contents of those files. Right now the error has not yet occurred, but the broker has only been running for a week since the last restart.





Re: Need help investigating committed messages resurfacing after restart.

Tim Bain
Bumping the logging level is key to this; even if you succeed in finding
data files that demonstrate what happened, without the more detailed
logging it may be impossible to tell *why* it happened. If you get a large
enough disk attached to the machine to hold months of logs, that's great;
just make sure that you turn off Log4J's feature that keeps only a limited
number of rolled files (I think it's set to 10 by default). Or you could run
a script in a tight loop that checks whether a log file has rolled over and
copies the newly-rolled-over file to a network file location or to cloud
storage such as S3.

I'd suggest you keep at least the last two sets of saved data files (that
is, don't delete your current set when you analyze it and it doesn't show
the problem; wait till the next one checks out before you delete it),
because once you find the first bad set, you'll want to compare it to the
last good set to see what changed. There's a good chance that you already
thought of this (so please ignore the suggestion if I'm saying something
you already figured out), but since this takes months I'd hate to see you
get to the end and then realize that you don't have everything you
want/need.

I'd also suggest you run the analysis more frequently (hourly, maybe, or
even more often if possible), to try to limit the size of the delta between
the last good and first bad snapshots. The closer they are, the better the
odds that you'll be able to identify which difference between the files is
the relevant difference (because most of the differences will be valid and
unrelated). You're looking for a needle in a haystack, and there's a strong
possibility that the needle is invisible and its presence can only be
deduced from context, so anything you can do to shrink the size of the
haystack is in your best interest, especially since the problem takes so
long to manifest itself.

Sorry I don't have any better ideas for how to analyze this.

Tim

On Oct 27, 2017 1:10 AM, "hakanj" <[hidden email]> wrote:
