Colocated fail-back not working correctly


wazburrows
Hi,

I have a 4-node Artemis 2.10 cluster on Linux configured for replication and
colocated HA servers. I have been testing failover and fail-back, but it is
not working as I would expect. When I shut down one server (A) in an HA pair,
the colocated backup on the second server (B) activates and processes
messages for the original server A. (It doesn't process all of the messages
sent, but that's a separate problem.) The trouble starts when I bring the
original server A back up. Server A starts, becomes live, and joins the
cluster, but the console no longer shows a colocated_backup_1 entry
indicating that it is providing a colocated backup for server B.

Restarting A also seems to cause server B, the server that was failed over
to, to go offline and stop being live. Server B no longer shows
colocated_backup_1 in its console either. Server B still appears to be part
of the cluster, but the UI no longer shows a green master node for it, just a
red slave node circle. Server B doesn't list any addresses or acceptors in
the UI, and connections to it fail. It seems as though it has shut down its
live server and is running as a backup only. If I shut server B down and
bring it back up, the roles swap: server B becomes live again and is shown as
a master node (still with no colocated_backup_1), and server A goes offline
and appears only as a slave node in the UI.

Whichever of server A or B is in this "offline" backup-only state, the Node
property in the cluster attributes shown in the UI has the same value on both
servers. Prior to the failover test they have different node IDs, which makes
sense.

So there is a problem with fail-back: after a failover event, only one of the
nodes in the HA pair can ever be live.

The only way I have found to fix the pair is to stop both servers and remove
everything under the broker "data" directory on both boxes before starting
them again. At that point they come up correctly, both are live, and they
pair up as HA backups for each other again.
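For context, a colocated replication ha-policy in broker.xml typically looks something like this (a sketch using the standard Artemis elements; the values are illustrative, not my exact config):

```xml
<ha-policy>
  <replication>
    <colocated>
      <!-- keep asking other nodes to host our backup until one accepts -->
      <backup-request-retries>-1</backup-request-retries>
      <backup-request-retry-interval>5000</backup-request-retry-interval>
      <!-- each node hosts at most one colocated backup and requests one for itself -->
      <max-backups>1</max-backups>
      <request-backup>true</request-backup>
      <master>
        <!-- on restart, check whether another node took over before going live -->
        <check-for-live-server>true</check-for-live-server>
      </master>
      <slave>
        <!-- allow the original live server to take back over when it returns -->
        <allow-failback>true</allow-failback>
      </slave>
    </colocated>
  </replication>
</ha-policy>
```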


--
Sent from: http://activemq.2283324.n4.nabble.com/ActiveMQ-User-f2341805.html
Re: Colocated fail-back not working correctly

wazburrows

Can anyone from the dev team help with this? I'm not finding much on the web
about Artemis colocated HA configurations, although I did find this bug
report for JBoss WildFly from back in 2016:

https://issues.jboss.org/browse/WFLY-5979

It hasn't been closed or resolved, so has this been a known issue since 2016?





Re: Colocated fail-back not working correctly

wazburrows
I've posted this to Stack Overflow in the hope of getting more eyes on the
issue.





Re: Colocated fail-back not working correctly

plgarcia
I have exactly the same problem, and I have found no help on the Internet.
None of the people reporting this problem seems to have solved it, or at
least nobody posted a working answer once the problem was solved. Did you
find a solution?

I use Artemis 2.11.0 (the latest at the time) with 3 servers. The expected
behavior is to be able to consume a message from any server that has a
consumer on the queue, regardless of which server the message was submitted
to. This works absolutely fine.

When a server fails (kill -9, for example), I want the messages stored on
that server to be taken over by another server and consumed. This does not
work at all. However, when the failed server comes back, the messages are not
lost and are distributed.
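In case it helps anyone comparing configurations: whether messages on one node are forwarded or redistributed to consumers on another node is governed by the cluster connection's message-load-balancing setting and by redistribution-delay in the address settings. A sketch of the relevant broker.xml fragments (the element names are the real Artemis ones; connector names and values are illustrative):

```xml
<cluster-connections>
  <cluster-connection name="my-cluster">
    <connector-ref>netty-connector</connector-ref>
    <!-- ON_DEMAND only routes messages to nodes that have matching consumers -->
    <message-load-balancing>ON_DEMAND</message-load-balancing>
    <static-connectors>
      <connector-ref>other-node-connector</connector-ref>
    </static-connectors>
  </cluster-connection>
</cluster-connections>

<address-settings>
  <address-setting match="#">
    <!-- 0 = redistribute immediately once the last local consumer goes away;
         the default of -1 disables redistribution entirely -->
    <redistribution-delay>0</redistribution-delay>
  </address-setting>
</address-settings>
```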

I would also like that, when the failed server comes back, the messages it
was handling when it failed are not distributed a second time. I have not
been able to test that, since I have not managed to set up a configuration in
which the previous step works.

Is there a defect recorded on this subject? I did not find one.


I created an issue:
https://issues.apache.org/jira/browse/ARTEMIS-2609

Regards


