Artemis failed but HA secondary did not take over

classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Artemis failed but HA secondary did not take over

horseatingweeds
I had an incident last night. I can share the actual log files with anyone
interested in seeing the full picture. Below is a summary. I also have OS
resource tracking data.

I have a pair of Artemis servers using a shared store -- an NFS mounted
folder. My use case is enterprise integration. I use artemis to pass
messages between interfaces. For example, the website publishes an order to
an address which has several queues. All the interfaces that need that
message subscribe to their queue.

It looks to me like an issue with the page file. Is the page file in this
case mean virtual memory, or is it talking about the paging folder inside
the data folder?

This morning I could not bring up the console, but I was able to send and
receive messages using JMS. I tried restarting the primary -- which did not
help the console, but after that I could no longer get a JMS connection.
After shutting down the primary, the secondary did not take over. From the
logs it looked like it did, the objects were deployed, but the whole
Artemis tab disappeared from the console and I was not able to make a JMS
connection.

So then I shut both down. Waited a few minutes, and started the primary. It
came up fine. So then I started the secondary, waited a little while, and
shut down the primary, just as a test. This time the secondary took over
quickly. I restarted the primary and it happily took control back.

Any advice would be much appreciated.

Thanks!

Ben



*** Log Summary ***

2021-01-19 14:58:51 UTC this appears in the log:

ERROR [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx with
id 33193877: java.lang.IllegalStateException: Cannot find tx with id
33193877

Along with some Java dump, directly followed by a few of these:

WARN  [org.apache.activemq.artemis.core.server] AMQ222104: IO Error
completing the transaction, code = 6, message = Cannot find tx with id
33193877

2021-01-19 15:41:10  The configuration reloads and the objects all
redeploy. But then:

2021-01-19 15:52:51  ERROR
[org.apache.activemq.artemis.core.journal.impl.JournalImpl]
appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx with
id 33318276: java.lang.IllegalStateException: Cannot find tx with id
33318276

Then several

2021-01-19 15:52:51,666 WARN  [org.apache.activemq.artemis.core.server]
AMQ222104: IO Error completing the transaction, code = 6, message = Cannot
find tx with id 33318276

And a few minutes later I see about 1 billion of these:

2021-01-19 16:01:21,164 WARN  [org.apache.activemq.artemis.core.server]
AMQ222033: Page file 000000008.page had incomplete records at position
8,081,827 at record number 6,504
2021-01-19 16:01:21,165 WARN  [org.apache.activemq.artemis.core.server]
AMQ222288: Page 8, message 6,504 could not be found on offset 8,081,827,
with starting message 6,504. This represents a logic error or inconsistency
on the data, and the system will try once again from the beggining of the
page file.

These end the following day with several of these:

2021-01-20 03:10:22,764 WARN  [org.apache.activemq.artemis.core.client]
AMQ212037: Connection failure to /13.68.117.113:42767 has been detected:
AMQ229014: Did not receive data from /{IP}:{port} within the 60,000ms
connection TTL. The connection will now be closed.
[code=CONNECTION_TIMEDOUT]

For several different IP addresses and ports. Then I see both of these two
errors happening together, until I get a nice

Java heap space: java.lang.OutOfMemoryError: Java heap space

Followed by a lot of interesting looking Java dump. A bunch of these:

ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
seconds) on acceptor "artemis" during protocol handshake with /
52.179.166.52:61538 has occurred.

Then a bunch of handshake errors

ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
seconds) on acceptor "artemis" during protocol handshake with /
52.179.166.52:5972 has occurred.

A bunch more of these:

ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
seconds) on acceptor "artemis" during protocol handshake with
/{IP}.52:22481 has occurred.

This one appears and looks useful maybe:

2021-01-20 03:30:48,627 ERROR
[io.netty.util.concurrent.DefaultPromise.rejectedExecution] Failed to
submit a listener notification task. Event loop shut down?:
java.util.concurrent.RejectedExecutionException: event executor terminated

More heap space complaints and java dump... Then numerous of these:

WARN  [io.netty.channel.AbstractChannel] Force-closing a channel whose
registration task was not accepted by an event loop: [id: 0x6be932f5, L:/
10.40.50.17:61616 - R:/52.179.166.52:45604]:
java.util.concurrent.RejectedExecutionException: event executor terminated

And that's the state where I found it this morning.
Reply | Threaded
Open this post in threaded view
|

Re: Artemis failed but HA secondary did not take over

jbertram
> java.lang.OutOfMemoryError: Java heap space

My guess is that you were approaching your heap limit which caused the
preceding issues. Once you hit an OutOfMemoryError then in my experience
all bets are off so to speak and you need to restart the broker.

It's not clear why you couldn't connect to the slave.

I recommend you take a look at your global-max-size as well as your
max-size-bytes configuration as these are designed to help prevent OOME
when receiving lots of messages.


Justin

On Wed, Jan 20, 2021 at 9:12 AM Ben Warrick <[hidden email]>
wrote:

> I had an incident last night. I can share the actual log files with anyone
> interested in seeing the full picture. Below is a summary. I also have OS
> resource tracking data.
>
> I have a pair of Artemis servers using a shared store -- an NFS mounted
> folder. My use case is enterprise integration. I use artemis to pass
> messages between interfaces. For example, the website publishes an order to
> an address which has several queues. All the interfaces that need that
> message subscribe to their queue.
>
> It looks to me like an issue with the page file. Is the page file in this
> case mean virtual memory, or is it talking about the paging folder inside
> the data folder?
>
> This morning I could not bring up the console, but I was able to send and
> receive messages using JMS. I tried restarting the primary -- which did not
> help the console, but after that I could no longer get a JMS connection.
> After shutting down the primary, the secondary did not take over. From the
> logs it looked like it did, the objects were deployed, but the whole
> Artemis tab disappeared from the console and I was not able to make a JMS
> connection.
>
> So then I shut both down. Waited a few minutes, and started the primary. It
> came up fine. So then I started the secondary, waited a little while, and
> shut down the primary, just as a test. This time the secondary took over
> quickly. I restarted the primary and it happily took control back.
>
> Any advice would be much appreciated.
>
> Thanks!
>
> Ben
>
>
>
> *** Log Summary ***
>
> 2021-01-19 14:58:51 UTC this appears in the log:
>
> ERROR [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
> appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx with
> id 33193877: java.lang.IllegalStateException: Cannot find tx with id
> 33193877
>
> Along with some Java dump, directly followed by a few of these:
>
> WARN  [org.apache.activemq.artemis.core.server] AMQ222104: IO Error
> completing the transaction, code = 6, message = Cannot find tx with id
> 33193877
>
> 2021-01-19 15:41:10  The configuration reloads and the objects all
> redeploy. But then:
>
> 2021-01-19 15:52:51  ERROR
> [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
> appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx with
> id 33318276: java.lang.IllegalStateException: Cannot find tx with id
> 33318276
>
> Then several
>
> 2021-01-19 15:52:51,666 WARN  [org.apache.activemq.artemis.core.server]
> AMQ222104: IO Error completing the transaction, code = 6, message = Cannot
> find tx with id 33318276
>
> And a few minutes later I see about 1 billion of these:
>
> 2021-01-19 16:01:21,164 WARN  [org.apache.activemq.artemis.core.server]
> AMQ222033: Page file 000000008.page had incomplete records at position
> 8,081,827 at record number 6,504
> 2021-01-19 16:01:21,165 WARN  [org.apache.activemq.artemis.core.server]
> AMQ222288: Page 8, message 6,504 could not be found on offset 8,081,827,
> with starting message 6,504. This represents a logic error or inconsistency
> on the data, and the system will try once again from the beggining of the
> page file.
>
> These end the following day with several of these:
>
> 2021-01-20 03:10:22,764 WARN  [org.apache.activemq.artemis.core.client]
> AMQ212037: Connection failure to /13.68.117.113:42767 has been detected:
> AMQ229014: Did not receive data from /{IP}:{port} within the 60,000ms
> connection TTL. The connection will now be closed.
> [code=CONNECTION_TIMEDOUT]
>
> For several different IP addresses and ports. Then I see both of these two
> errors happening together, until I get a nice
>
> Java heap space: java.lang.OutOfMemoryError: Java heap space
>
> Followed by a lot of interesting looking Java dump. A bunch of these:
>
> ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
> seconds) on acceptor "artemis" during protocol handshake with /
> 52.179.166.52:61538 has occurred.
>
> Then a bunch of handshake errors
>
> ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
> seconds) on acceptor "artemis" during protocol handshake with /
> 52.179.166.52:5972 has occurred.
>
> A bunch more of these:
>
> ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
> seconds) on acceptor "artemis" during protocol handshake with
> /{IP}.52:22481 has occurred.
>
> This one appears and looks useful maybe:
>
> 2021-01-20 03:30:48,627 ERROR
> [io.netty.util.concurrent.DefaultPromise.rejectedExecution] Failed to
> submit a listener notification task. Event loop shut down?:
> java.util.concurrent.RejectedExecutionException: event executor terminated
>
> More heap space complaints and java dump... Then numerous of these:
>
> WARN  [io.netty.channel.AbstractChannel] Force-closing a channel whose
> registration task was not accepted by an event loop: [id: 0x6be932f5, L:/
> 10.40.50.17:61616 - R:/52.179.166.52:45604]:
> java.util.concurrent.RejectedExecutionException: event executor terminated
>
> And that's the state where I found it this morning.
>
Reply | Threaded
Open this post in threaded view
|

Re: Artemis failed but HA secondary did not take over

horseatingweeds
Hi Justin,

There is 16GB of memory on the VM. And I have JVM memory set to -Xms512M
-Xmx2G in the artemis.profile file, so I can increase those.

I don't have max-size-bytes or global-max-size set. However, according to
the comments in the broker.xml file, global-max-size defaults to half of
-Xmx, so shouldn't Artemis start using the page file after it uses 1G?

I've been tracking heap usage, and it did spike a few times when the
problem started, so you're probably right about this.

Ben

On Wed, Jan 20, 2021 at 12:03 PM Justin Bertram <[hidden email]> wrote:

> > java.lang.OutOfMemoryError: Java heap space
>
> My guess is that you were approaching your heap limit which caused the
> preceding issues. Once you hit an OutOfMemoryError then in my experience
> all bets are off so to speak and you need to restart the broker.
>
> It's not clear why you couldn't connect to the slave.
>
> I recommend you take a look at your global-max-size as well as your
> max-size-bytes configuration as these are designed to help prevent OOME
> when receiving lots of messages.
>
>
> Justin
>
> On Wed, Jan 20, 2021 at 9:12 AM Ben Warrick <[hidden email]>
> wrote:
>
> > I had an incident last night. I can share the actual log files with
> anyone
> > interested in seeing the full picture. Below is a summary. I also have OS
> > resource tracking data.
> >
> > I have a pair of Artemis servers using a shared store -- an NFS mounted
> > folder. My use case is enterprise integration. I use artemis to pass
> > messages between interfaces. For example, the website publishes an order
> to
> > an address which has several queues. All the interfaces that need that
> > message subscribe to their queue.
> >
> > It looks to me like an issue with the page file. Is the page file in this
> > case mean virtual memory, or is it talking about the paging folder inside
> > the data folder?
> >
> > This morning I could not bring up the console, but I was able to send and
> > receive messages using JMS. I tried restarting the primary -- which did
> not
> > help the console, but after that I could no longer get a JMS connection.
> > After shutting down the primary, the secondary did not take over. From
> the
> > logs it looked like it did, the objects were deployed, but the whole
> > Artemis tab disappeared from the console and I was not able to make a JMS
> > connection.
> >
> > So then I shut both down. Waited a few minutes, and started the primary.
> It
> > came up fine. So then I started the secondary, waited a little while, and
> > shut down the primary, just as a test. This time the secondary took over
> > quickly. I restarted the primary and it happily took control back.
> >
> > Any advice would be much appreciated.
> >
> > Thanks!
> >
> > Ben
> >
> >
> >
> > *** Log Summary ***
> >
> > 2021-01-19 14:58:51 UTC this appears in the log:
> >
> > ERROR [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
> > appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx with
> > id 33193877: java.lang.IllegalStateException: Cannot find tx with id
> > 33193877
> >
> > Along with some Java dump, directly followed by a few of these:
> >
> > WARN  [org.apache.activemq.artemis.core.server] AMQ222104: IO Error
> > completing the transaction, code = 6, message = Cannot find tx with id
> > 33193877
> >
> > 2021-01-19 15:41:10  The configuration reloads and the objects all
> > redeploy. But then:
> >
> > 2021-01-19 15:52:51  ERROR
> > [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
> > appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx with
> > id 33318276: java.lang.IllegalStateException: Cannot find tx with id
> > 33318276
> >
> > Then several
> >
> > 2021-01-19 15:52:51,666 WARN  [org.apache.activemq.artemis.core.server]
> > AMQ222104: IO Error completing the transaction, code = 6, message =
> Cannot
> > find tx with id 33318276
> >
> > And a few minutes later I see about 1 billion of these:
> >
> > 2021-01-19 16:01:21,164 WARN  [org.apache.activemq.artemis.core.server]
> > AMQ222033: Page file 000000008.page had incomplete records at position
> > 8,081,827 at record number 6,504
> > 2021-01-19 16:01:21,165 WARN  [org.apache.activemq.artemis.core.server]
> > AMQ222288: Page 8, message 6,504 could not be found on offset 8,081,827,
> > with starting message 6,504. This represents a logic error or
> inconsistency
> > on the data, and the system will try once again from the beggining of the
> > page file.
> >
> > These end the following day with several of these:
> >
> > 2021-01-20 03:10:22,764 WARN  [org.apache.activemq.artemis.core.client]
> > AMQ212037: Connection failure to /13.68.117.113:42767 has been detected:
> > AMQ229014: Did not receive data from /{IP}:{port} within the 60,000ms
> > connection TTL. The connection will now be closed.
> > [code=CONNECTION_TIMEDOUT]
> >
> > For several different IP addresses and ports. Then I see both of these
> two
> > errors happening together, until I get a nice
> >
> > Java heap space: java.lang.OutOfMemoryError: Java heap space
> >
> > Followed by a lot of interesting looking Java dump. A bunch of these:
> >
> > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
> > seconds) on acceptor "artemis" during protocol handshake with /
> > 52.179.166.52:61538 has occurred.
> >
> > Then a bunch of handshake errors
> >
> > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
> > seconds) on acceptor "artemis" during protocol handshake with /
> > 52.179.166.52:5972 has occurred.
> >
> > A bunch more of these:
> >
> > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
> > seconds) on acceptor "artemis" during protocol handshake with
> > /{IP}.52:22481 has occurred.
> >
> > This one appears and looks useful maybe:
> >
> > 2021-01-20 03:30:48,627 ERROR
> > [io.netty.util.concurrent.DefaultPromise.rejectedExecution] Failed to
> > submit a listener notification task. Event loop shut down?:
> > java.util.concurrent.RejectedExecutionException: event executor
> terminated
> >
> > More heap space complaints and java dump... Then numerous of these:
> >
> > WARN  [io.netty.channel.AbstractChannel] Force-closing a channel whose
> > registration task was not accepted by an event loop: [id: 0x6be932f5, L:/
> > 10.40.50.17:61616 - R:/52.179.166.52:45604]:
> > java.util.concurrent.RejectedExecutionException: event executor
> terminated
> >
> > And that's the state where I found it this morning.
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Artemis failed but HA secondary did not take over

jbertram
> I don't have max-size-bytes or global-max-size set. However, according to
the comments in the broker.xml file, global-max-size defaults to half of
-Xmx, so shouldn't Artemis start using the page file after it uses 1G?

Yes. However, given a few recent issues like ARTEMIS-3021 [1] and
ARTEMIS-3061 [2] using half of Xmx may not be sufficiently conservative to
mitigate OOME.

I recommend monitoring your broker(s) and explicitly setting max-size-bytes
and/or global-max-size for your use-case.


Justin

[1] https://issues.apache.org/jira/browse/ARTEMIS-3021
[2] https://issues.apache.org/jira/browse/ARTEMIS-3061

On Wed, Jan 20, 2021 at 12:15 PM Ben Warrick <[hidden email]>
wrote:

> Hi Justin,
>
> There is 16GB of memory on the VM. And I have JVM memory set to -Xms512M
> -Xmx2G in the artemis.profile file, so I can increase those.
>
> I don't have max-size-bytes or global-max-size set. However, according to
> the comments in the broker.xml file, global-max-size defaults to half of
> -Xmx, so shouldn't Artemis start using the page file after it uses 1G?
>
> I've been tracking heap usage, and it did spike a few times when the
> problem started, so you're probably right about this.
>
> Ben
>
> On Wed, Jan 20, 2021 at 12:03 PM Justin Bertram <[hidden email]>
> wrote:
>
> > > java.lang.OutOfMemoryError: Java heap space
> >
> > My guess is that you were approaching your heap limit which caused the
> > preceding issues. Once you hit an OutOfMemoryError then in my experience
> > all bets are off so to speak and you need to restart the broker.
> >
> > It's not clear why you couldn't connect to the slave.
> >
> > I recommend you take a look at your global-max-size as well as your
> > max-size-bytes configuration as these are designed to help prevent OOME
> > when receiving lots of messages.
> >
> >
> > Justin
> >
> > On Wed, Jan 20, 2021 at 9:12 AM Ben Warrick <[hidden email]>
> > wrote:
> >
> > > I had an incident last night. I can share the actual log files with
> > anyone
> > > interested in seeing the full picture. Below is a summary. I also have
> OS
> > > resource tracking data.
> > >
> > > I have a pair of Artemis servers using a shared store -- an NFS mounted
> > > folder. My use case is enterprise integration. I use artemis to pass
> > > messages between interfaces. For example, the website publishes an
> order
> > to
> > > an address which has several queues. All the interfaces that need that
> > > message subscribe to their queue.
> > >
> > > It looks to me like an issue with the page file. Is the page file in
> this
> > > case mean virtual memory, or is it talking about the paging folder
> inside
> > > the data folder?
> > >
> > > This morning I could not bring up the console, but I was able to send
> and
> > > receive messages using JMS. I tried restarting the primary -- which did
> > not
> > > help the console, but after that I could no longer get a JMS
> connection.
> > > After shutting down the primary, the secondary did not take over. From
> > the
> > > logs it looked like it did, the objects were deployed, but the whole
> > > Artemis tab disappeared from the console and I was not able to make a
> JMS
> > > connection.
> > >
> > > So then I shut both down. Waited a few minutes, and started the
> primary.
> > It
> > > came up fine. So then I started the secondary, waited a little while,
> and
> > > shut down the primary, just as a test. This time the secondary took
> over
> > > quickly. I restarted the primary and it happily took control back.
> > >
> > > Any advice would be much appreciated.
> > >
> > > Thanks!
> > >
> > > Ben
> > >
> > >
> > >
> > > *** Log Summary ***
> > >
> > > 2021-01-19 14:58:51 UTC this appears in the log:
> > >
> > > ERROR [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
> > > appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx
> with
> > > id 33193877: java.lang.IllegalStateException: Cannot find tx with id
> > > 33193877
> > >
> > > Along with some Java dump, directly followed by a few of these:
> > >
> > > WARN  [org.apache.activemq.artemis.core.server] AMQ222104: IO Error
> > > completing the transaction, code = 6, message = Cannot find tx with id
> > > 33193877
> > >
> > > 2021-01-19 15:41:10  The configuration reloads and the objects all
> > > redeploy. But then:
> > >
> > > 2021-01-19 15:52:51  ERROR
> > > [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
> > > appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx
> with
> > > id 33318276: java.lang.IllegalStateException: Cannot find tx with id
> > > 33318276
> > >
> > > Then several
> > >
> > > 2021-01-19 15:52:51,666 WARN  [org.apache.activemq.artemis.core.server]
> > > AMQ222104: IO Error completing the transaction, code = 6, message =
> > Cannot
> > > find tx with id 33318276
> > >
> > > And a few minutes later I see about 1 billion of these:
> > >
> > > 2021-01-19 16:01:21,164 WARN  [org.apache.activemq.artemis.core.server]
> > > AMQ222033: Page file 000000008.page had incomplete records at position
> > > 8,081,827 at record number 6,504
> > > 2021-01-19 16:01:21,165 WARN  [org.apache.activemq.artemis.core.server]
> > > AMQ222288: Page 8, message 6,504 could not be found on offset
> 8,081,827,
> > > with starting message 6,504. This represents a logic error or
> > inconsistency
> > > on the data, and the system will try once again from the beggining of
> the
> > > page file.
> > >
> > > These end the following day with several of these:
> > >
> > > 2021-01-20 03:10:22,764 WARN  [org.apache.activemq.artemis.core.client]
> > > AMQ212037: Connection failure to /13.68.117.113:42767 has been
> detected:
> > > AMQ229014: Did not receive data from /{IP}:{port} within the 60,000ms
> > > connection TTL. The connection will now be closed.
> > > [code=CONNECTION_TIMEDOUT]
> > >
> > > For several different IP addresses and ports. Then I see both of these
> > two
> > > errors happening together, until I get a nice
> > >
> > > Java heap space: java.lang.OutOfMemoryError: Java heap space
> > >
> > > Followed by a lot of interesting looking Java dump. A bunch of these:
> > >
> > > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
> > > seconds) on acceptor "artemis" during protocol handshake with /
> > > 52.179.166.52:61538 has occurred.
> > >
> > > Then a bunch of handshake errors
> > >
> > > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
> > > seconds) on acceptor "artemis" during protocol handshake with /
> > > 52.179.166.52:5972 has occurred.
> > >
> > > A bunch more of these:
> > >
> > > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout (10
> > > seconds) on acceptor "artemis" during protocol handshake with
> > > /{IP}.52:22481 has occurred.
> > >
> > > This one appears and looks useful maybe:
> > >
> > > 2021-01-20 03:30:48,627 ERROR
> > > [io.netty.util.concurrent.DefaultPromise.rejectedExecution] Failed to
> > > submit a listener notification task. Event loop shut down?:
> > > java.util.concurrent.RejectedExecutionException: event executor
> > terminated
> > >
> > > More heap space complaints and java dump... Then numerous of these:
> > >
> > > WARN  [io.netty.channel.AbstractChannel] Force-closing a channel whose
> > > registration task was not accepted by an event loop: [id: 0x6be932f5,
> L:/
> > > 10.40.50.17:61616 - R:/52.179.166.52:45604]:
> > > java.util.concurrent.RejectedExecutionException: event executor
> > terminated
> > >
> > > And that's the state where I found it this morning.
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Artemis failed but HA secondary did not take over

horseatingweeds
I see. With 16G on the machine, is this a wise setup?

-Xms1024M -Xmx8G

<global-max-size>4096</global-max-size>

On Wed, Jan 20, 2021 at 1:28 PM Justin Bertram <[hidden email]> wrote:

> > I don't have max-size-bytes or global-max-size set. However, according to
> the comments in the broker.xml file, global-max-size defaults to half of
> -Xmx, so shouldn't Artemis start using the page file after it uses 1G?
>
> Yes. However, given a few recent issues like ARTEMIS-3021 [1] and
> ARTEMIS-3061 [2] using half of Xmx may not be sufficiently conservative to
> mitigate OOME.
>
> I recommend monitoring your broker(s) and explicitly setting max-size-bytes
> and/or global-max-size for your use-case.
>
>
> Justin
>
> [1] https://issues.apache.org/jira/browse/ARTEMIS-3021
> [2] https://issues.apache.org/jira/browse/ARTEMIS-3061
>
> On Wed, Jan 20, 2021 at 12:15 PM Ben Warrick <[hidden email]>
> wrote:
>
> > Hi Justin,
> >
> > There is 16GB of memory on the VM. And I have JVM memory set to -Xms512M
> > -Xmx2G in the artemis.profile file, so I can increase those.
> >
> > I don't have max-size-bytes or global-max-size set. However, according to
> > the comments in the broker.xml file, global-max-size defaults to half of
> > -Xmx, so shouldn't Artemis start using the page file after it uses 1G?
> >
> > I've been tracking heap usage, and it did spike a few times when the
> > problem started, so you're probably right about this.
> >
> > Ben
> >
> > On Wed, Jan 20, 2021 at 12:03 PM Justin Bertram <[hidden email]>
> > wrote:
> >
> > > > java.lang.OutOfMemoryError: Java heap space
> > >
> > > My guess is that you were approaching your heap limit which caused the
> > > preceding issues. Once you hit an OutOfMemoryError then in my
> experience
> > > all bets are off so to speak and you need to restart the broker.
> > >
> > > It's not clear why you couldn't connect to the slave.
> > >
> > > I recommend you take a look at your global-max-size as well as your
> > > max-size-bytes configuration as these are designed to help prevent OOME
> > > when receiving lots of messages.
> > >
> > >
> > > Justin
> > >
> > > On Wed, Jan 20, 2021 at 9:12 AM Ben Warrick <[hidden email]
> >
> > > wrote:
> > >
> > > > I had an incident last night. I can share the actual log files with
> > > anyone
> > > > interested in seeing the full picture. Below is a summary. I also
> have
> > OS
> > > > resource tracking data.
> > > >
> > > > I have a pair of Artemis servers using a shared store -- an NFS
> mounted
> > > > folder. My use case is enterprise integration. I use artemis to pass
> > > > messages between interfaces. For example, the website publishes an
> > order
> > > to
> > > > an address which has several queues. All the interfaces that need
> that
> > > > message subscribe to their queue.
> > > >
> > > > It looks to me like an issue with the page file. Is the page file in
> > this
> > > > case mean virtual memory, or is it talking about the paging folder
> > inside
> > > > the data folder?
> > > >
> > > > This morning I could not bring up the console, but I was able to send
> > and
> > > > receive messages using JMS. I tried restarting the primary -- which
> did
> > > not
> > > > help the console, but after that I could no longer get a JMS
> > connection.
> > > > After shutting down the primary, the secondary did not take over.
> From
> > > the
> > > > logs it looked like it did, the objects were deployed, but the whole
> > > > Artemis tab disappeared from the console and I was not able to make a
> > JMS
> > > > connection.
> > > >
> > > > So then I shut both down. Waited a few minutes, and started the
> > primary.
> > > It
> > > > came up fine. So then I started the secondary, waited a little while,
> > and
> > > > shut down the primary, just as a test. This time the secondary took
> > over
> > > > quickly. I restarted the primary and it happily took control back.
> > > >
> > > > Any advice would be much appreciated.
> > > >
> > > > Thanks!
> > > >
> > > > Ben
> > > >
> > > >
> > > >
> > > > *** Log Summary ***
> > > >
> > > > 2021-01-19 14:58:51 UTC this appears in the log:
> > > >
> > > > ERROR [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
> > > > appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx
> > with
> > > > id 33193877: java.lang.IllegalStateException: Cannot find tx with id
> > > > 33193877
> > > >
> > > > Along with some Java dump, directly followed by a few of these:
> > > >
> > > > WARN  [org.apache.activemq.artemis.core.server] AMQ222104: IO Error
> > > > completing the transaction, code = 6, message = Cannot find tx with
> id
> > > > 33193877
> > > >
> > > > 2021-01-19 15:41:10  The configuration reloads and the objects all
> > > > redeploy. But then:
> > > >
> > > > 2021-01-19 15:52:51  ERROR
> > > > [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
> > > > appendRollbackRecord:java.lang.IllegalStateException: Cannot find tx
> > with
> > > > id 33318276: java.lang.IllegalStateException: Cannot find tx with id
> > > > 33318276
> > > >
> > > > Then several
> > > >
> > > > 2021-01-19 15:52:51,666 WARN
> [org.apache.activemq.artemis.core.server]
> > > > AMQ222104: IO Error completing the transaction, code = 6, message =
> > > Cannot
> > > > find tx with id 33318276
> > > >
> > > > And a few minutes later I see about 1 billion of these:
> > > >
> > > > 2021-01-19 16:01:21,164 WARN
> [org.apache.activemq.artemis.core.server]
> > > > AMQ222033: Page file 000000008.page had incomplete records at
> position
> > > > 8,081,827 at record number 6,504
> > > > 2021-01-19 16:01:21,165 WARN
> [org.apache.activemq.artemis.core.server]
> > > > AMQ222288: Page 8, message 6,504 could not be found on offset
> > 8,081,827,
> > > > with starting message 6,504. This represents a logic error or
> > > inconsistency
> > > > on the data, and the system will try once again from the beggining of
> > the
> > > > page file.
> > > >
> > > > These end the following day with several of these:
> > > >
> > > > 2021-01-20 03:10:22,764 WARN
> [org.apache.activemq.artemis.core.client]
> > > > AMQ212037: Connection failure to /13.68.117.113:42767 has been
> > detected:
> > > > AMQ229014: Did not receive data from /{IP}:{port} within the 60,000ms
> > > > connection TTL. The connection will now be closed.
> > > > [code=CONNECTION_TIMEDOUT]
> > > >
> > > > For several different IP addresses and ports. Then I see both of
> these
> > > two
> > > > errors happening together, until I get a nice
> > > >
> > > > Java heap space: java.lang.OutOfMemoryError: Java heap space
> > > >
> > > > Followed by a lot of interesting looking Java dump. A bunch of these:
> > > >
> > > > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout
> (10
> > > > seconds) on acceptor "artemis" during protocol handshake with /
> > > > 52.179.166.52:61538 has occurred.
> > > >
> > > > Then a bunch of handshake errors
> > > >
> > > > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout
> (10
> > > > seconds) on acceptor "artemis" during protocol handshake with /
> > > > 52.179.166.52:5972 has occurred.
> > > >
> > > > A bunch more of these:
> > > >
> > > > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout
> (10
> > > > seconds) on acceptor "artemis" during protocol handshake with
> > > > /{IP}.52:22481 has occurred.
> > > >
> > > > This one appears and looks useful maybe:
> > > >
> > > > 2021-01-20 03:30:48,627 ERROR
> > > > [io.netty.util.concurrent.DefaultPromise.rejectedExecution] Failed to
> > > > submit a listener notification task. Event loop shut down?:
> > > > java.util.concurrent.RejectedExecutionException: event executor
> > > terminated
> > > >
> > > > More heap space complaints and java dump... Then numerous of these:
> > > >
> > > > WARN  [io.netty.channel.AbstractChannel] Force-closing a channel
> whose
> > > > registration task was not accepted by an event loop: [id: 0x6be932f5,
> > L:/
> > > > 10.40.50.17:61616 - R:/52.179.166.52:45604]:
> > > > java.util.concurrent.RejectedExecutionException: event executor
> > > terminated
> > > >
> > > > And that's the state where I found it this morning.
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Artemis failed but HA secondary did not take over

jbertram
I can't really say if your proposed configuration is "wise" or not since I
don't know all the details of your use-case. However, I can say a few
things conclusively...

First, the global-max-size is measured in bytes so using 4096 will only
give you 4KB of space. If you actually want 4 gigabytes then you can use
"4G", "4096MB", etc.

Second, since the default global-max-size is calculated as 1/2 of Xmx using
an Xmx of 8G would give you a global-max-size of 4G by default so
explicitly configuring global-max-size to the same value would be
functionally redundant. Since this default calculation wasn't sufficient to
mitigate OOME previously I suspect it probably wouldn't be sufficient to
prevent OOME in the future if your load gets large enough.

As I indicated before, you'll probably need to monitor your broker under
load and potentially adjust the global-max-size if you want to find an
optimal value. Typically this kind of tuning would be done in a
pre-production environment with a work-load equivalent to what you'd
experience in production.

If you just want a safe, "ballpark" type of estimation then I'd recommend
starting with 1/4 of Xmx. Therefore, if your Xmx was 8G then use
<global-max-size>2G</global-max-size>.


Justin

On Wed, Jan 20, 2021 at 6:04 PM Ben Warrick <[hidden email]>
wrote:

> I see. With 16G on the machine, is this a wise setup?
>
> -Xms1024M -Xmx8G
>
> <global-max-size>4096</global-max-size>
>
> On Wed, Jan 20, 2021 at 1:28 PM Justin Bertram <[hidden email]>
> wrote:
>
> > > I don't have max-size-bytes or global-max-size set. However, according
> to
> > the comments in the broker.xml file, global-max-size defaults to half of
> > -Xmx, so shouldn't Artemis start using the page file after it uses 1G?
> >
> > Yes. However, given a few recent issues like ARTEMIS-3021 [1] and
> > ARTEMIS-3061 [2] using half of Xmx may not be sufficiently conservative
> to
> > mitigate OOME.
> >
> > I recommend monitoring your broker(s) and explicitly setting
> max-size-bytes
> > and/or global-max-size for your use-case.
> >
> >
> > Justin
> >
> > [1] https://issues.apache.org/jira/browse/ARTEMIS-3021
> > [2] https://issues.apache.org/jira/browse/ARTEMIS-3061
> >
> > On Wed, Jan 20, 2021 at 12:15 PM Ben Warrick <[hidden email]>
> > wrote:
> >
> > > Hi Justin,
> > >
> > > There is 16GB of memory on the VM. And I have JVM memory set to
> -Xms512M
> > > -Xmx2G in the artemis.profile file, so I can increase those.
> > >
> > > I don't have max-size-bytes or global-max-size set. However, according
> to
> > > the comments in the broker.xml file, global-max-size defaults to half
> of
> > > -Xmx, so shouldn't Artemis start using the page file after it uses 1G?
> > >
> > > I've been tracking heap usage, and it did spike a few times when the
> > > problem started, so you're probably right about this.
> > >
> > > Ben
> > >
> > > On Wed, Jan 20, 2021 at 12:03 PM Justin Bertram <[hidden email]>
> > > wrote:
> > >
> > > > > java.lang.OutOfMemoryError: Java heap space
> > > >
> > > > My guess is that you were approaching your heap limit which caused
> the
> > > > preceding issues. Once you hit an OutOfMemoryError then in my
> > experience
> > > > all bets are off so to speak and you need to restart the broker.
> > > >
> > > > It's not clear why you couldn't connect to the slave.
> > > >
> > > > I recommend you take a look at your global-max-size as well as your
> > > > max-size-bytes configuration as these are designed to help prevent
> OOME
> > > > when receiving lots of messages.
> > > >
> > > >
> > > > Justin
> > > >
> > > > On Wed, Jan 20, 2021 at 9:12 AM Ben Warrick <
> [hidden email]
> > >
> > > > wrote:
> > > >
> > > > > I had an incident last night. I can share the actual log files with
> > > > anyone
> > > > > interested in seeing the full picture. Below is a summary. I also
> > have
> > > OS
> > > > > resource tracking data.
> > > > >
> > > > > I have a pair of Artemis servers using a shared store -- an NFS
> > mounted
> > > > > folder. My use case is enterprise integration. I use artemis to
> pass
> > > > > messages between interfaces. For example, the website publishes an
> > > order
> > > > to
> > > > > an address which has several queues. All the interfaces that need
> > that
> > > > > message subscribe to their queue.
> > > > >
> > > > > It looks to me like an issue with the page file. Is the page file
> in
> > > this
> > > > > case mean virtual memory, or is it talking about the paging folder
> > > inside
> > > > > the data folder?
> > > > >
> > > > > This morning I could not bring up the console, but I was able to
> send
> > > and
> > > > > receive messages using JMS. I tried restarting the primary -- which
> > did
> > > > not
> > > > > help the console, but after that I could no longer get a JMS
> > > connection.
> > > > > After shutting down the primary, the secondary did not take over.
> > From
> > > > the
> > > > > logs it looked like it did, the objects were deployed, but the
> whole
> > > > > Artemis tab disappeared from the console and I was not able to
> make a
> > > JMS
> > > > > connection.
> > > > >
> > > > > So then I shut both down. Waited a few minutes, and started the
> > > primary.
> > > > It
> > > > > came up fine. So then I started the secondary, waited a little
> while,
> > > and
> > > > > shut down the primary, just as a test. This time the secondary took
> > > over
> > > > > quickly. I restarted the primary and it happily took control back.
> > > > >
> > > > > Any advice would be much appreciated.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > Ben
> > > > >
> > > > >
> > > > >
> > > > > *** Log Summary ***
> > > > >
> > > > > 2021-01-19 14:58:51 UTC this appears in the log:
> > > > >
> > > > > ERROR [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
> > > > > appendRollbackRecord:java.lang.IllegalStateException: Cannot find
> tx
> > > with
> > > > > id 33193877: java.lang.IllegalStateException: Cannot find tx with
> id
> > > > > 33193877
> > > > >
> > > > > Along with some Java dump, directly followed by a few of these:
> > > > >
> > > > > WARN  [org.apache.activemq.artemis.core.server] AMQ222104: IO Error
> > > > > completing the transaction, code = 6, message = Cannot find tx with
> > id
> > > > > 33193877
> > > > >
> > > > > 2021-01-19 15:41:10  The configuration reloads and the objects all
> > > > > redeploy. But then:
> > > > >
> > > > > 2021-01-19 15:52:51  ERROR
> > > > > [org.apache.activemq.artemis.core.journal.impl.JournalImpl]
> > > > > appendRollbackRecord:java.lang.IllegalStateException: Cannot find
> tx
> > > with
> > > > > id 33318276: java.lang.IllegalStateException: Cannot find tx with
> id
> > > > > 33318276
> > > > >
> > > > > Then several
> > > > >
> > > > > 2021-01-19 15:52:51,666 WARN
> > [org.apache.activemq.artemis.core.server]
> > > > > AMQ222104: IO Error completing the transaction, code = 6, message =
> > > > Cannot
> > > > > find tx with id 33318276
> > > > >
> > > > > And a few minutes later I see about 1 billion of these:
> > > > >
> > > > > 2021-01-19 16:01:21,164 WARN
> > [org.apache.activemq.artemis.core.server]
> > > > > AMQ222033: Page file 000000008.page had incomplete records at
> > position
> > > > > 8,081,827 at record number 6,504
> > > > > 2021-01-19 16:01:21,165 WARN
> > [org.apache.activemq.artemis.core.server]
> > > > > AMQ222288: Page 8, message 6,504 could not be found on offset
> > > 8,081,827,
> > > > > with starting message 6,504. This represents a logic error or
> > > > inconsistency
> > > > > on the data, and the system will try once again from the beggining
> of
> > > the
> > > > > page file.
> > > > >
> > > > > These end the following day with several of these:
> > > > >
> > > > > 2021-01-20 03:10:22,764 WARN
> > [org.apache.activemq.artemis.core.client]
> > > > > AMQ212037: Connection failure to /13.68.117.113:42767 has been
> > > detected:
> > > > > AMQ229014: Did not receive data from /{IP}:{port} within the
> 60,000ms
> > > > > connection TTL. The connection will now be closed.
> > > > > [code=CONNECTION_TIMEDOUT]
> > > > >
> > > > > For several different IP addresses and ports. Then I see both of
> > these
> > > > two
> > > > > errors happening together, until I get a nice
> > > > >
> > > > > Java heap space: java.lang.OutOfMemoryError: Java heap space
> > > > >
> > > > > Followed by a lot of interesting looking Java dump. A bunch of
> these:
> > > > >
> > > > > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout
> > (10
> > > > > seconds) on acceptor "artemis" during protocol handshake with /
> > > > > 52.179.166.52:61538 has occurred.
> > > > >
> > > > > Then a bunch of handshake errors
> > > > >
> > > > > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout
> > (10
> > > > > seconds) on acceptor "artemis" during protocol handshake with /
> > > > > 52.179.166.52:5972 has occurred.
> > > > >
> > > > > A bunch more of these:
> > > > >
> > > > > ERROR [org.apache.activemq.artemis.core.server] AMQ224088: Timeout
> > (10
> > > > > seconds) on acceptor "artemis" during protocol handshake with
> > > > > /{IP}.52:22481 has occurred.
> > > > >
> > > > > This one appears and looks useful maybe:
> > > > >
> > > > > 2021-01-20 03:30:48,627 ERROR
> > > > > [io.netty.util.concurrent.DefaultPromise.rejectedExecution] Failed
> to
> > > > > submit a listener notification task. Event loop shut down?:
> > > > > java.util.concurrent.RejectedExecutionException: event executor
> > > > terminated
> > > > >
> > > > > More heap space complaints and java dump... Then numerous of these:
> > > > >
> > > > > WARN  [io.netty.channel.AbstractChannel] Force-closing a channel
> > whose
> > > > > registration task was not accepted by an event loop: [id:
> 0x6be932f5,
> > > L:/
> > > > > 10.40.50.17:61616 - R:/52.179.166.52:45604]:
> > > > > java.util.concurrent.RejectedExecutionException: event executor
> > > > terminated
> > > > >
> > > > > And that's the state where I found it this morning.
> > > > >
> > > >
> > >
> >
>