Artemis failback doesn't work in our scenario


Artemis failback doesn't work in our scenario

Jo Stenberg
Hi,

we set up an Artemis cluster with three broker instances on AWS,
each in a different availability zone (A, B, C).
Only one of the broker instances is configured as master, because Artemis
message redistribution does not take message filters into account
(https://activemq.apache.org/components/artemis/documentation/latest/clusters.html#redistribution-and-filters-selectors)
and we use/need message filters.

Last week we experienced a network outage in one AWS availability zone. When
network connectivity was restored we ended up with two of the broker
instances claiming to be active/live.

I tried to replicate this by manually blocking all TCP connections between
the servers; this is the behavior I see:

1) I started all three broker instances.
  => The broker instance in AZ A reports to be "live", the instance in AZ B
reports to be "backup server", and the instance in AZ C reports to be
"stopped".
2) I then cut the network connection between the server in AZ A and the
other servers but left the broker process on the server in AZ A running.
  => Now broker B upgrades from backup to live server and broker C starts as
a backup server.
  => Broker A still thinks it is live, which is not a problem for us, as no
client can reach the broker.
3) I re-enabled TCP connections between the server in AZ A and the servers
in the other AZs.
  => Now broker A and broker B permanently stay "live".

How can we ensure that after step 3 either broker A or broker B shuts down?


Here is the cluster configuration we are currently using:

Broker Cluster Config in AZ A:
------------------------------
<connectors>
  <connector name="local-node-connector">tcp://broker-a:61617</connector>
  <connector name="remote-node-connector-0">tcp://broker-b:61617</connector>
  <connector name="remote-node-connector-1">tcp://broker-c:61617</connector>
</connectors>
<cluster-connections><cluster-connection name="cluster1">
  <message-load-balancing>ON_DEMAND</message-load-balancing>
  <connector-ref>local-node-connector</connector-ref>
  <static-connectors allow-direct-connections-only="true">
    <connector-ref>remote-node-connector-0</connector-ref>
    <connector-ref>remote-node-connector-1</connector-ref>
  </static-connectors>
</cluster-connection></cluster-connections>
<ha-policy><replication>
  <master>
    <cluster-name>cluster1</cluster-name>
    <check-for-live-server>true</check-for-live-server>
    <vote-on-replication-failure>true</vote-on-replication-failure>
  </master>
</replication></ha-policy>


Broker Cluster Config in AZ B:
------------------------------
<connectors>
  <connector name="local-node-connector">tcp://broker-b:61617</connector>
  <connector name="remote-node-connector-0">tcp://broker-a:61617</connector>
  <connector name="remote-node-connector-1">tcp://broker-c:61617</connector>
</connectors>
<cluster-connections><cluster-connection name="cluster1">
  <message-load-balancing>ON_DEMAND</message-load-balancing>
  <connector-ref>local-node-connector</connector-ref>
  <static-connectors allow-direct-connections-only="true">
    <connector-ref>remote-node-connector-0</connector-ref>
    <connector-ref>remote-node-connector-1</connector-ref>
  </static-connectors>
</cluster-connection></cluster-connections>
<ha-policy><replication>
  <slave>
    <cluster-name>cluster1</cluster-name>
    <allow-failback>true</allow-failback>
    <restart-backup>true</restart-backup>
    <quorum-vote-wait>15</quorum-vote-wait>
    <vote-retries>12</vote-retries>
    <vote-retry-wait>5000</vote-retry-wait>
  </slave>
</replication></ha-policy>


Broker Cluster Config in AZ C:
------------------------------
<connectors>
  <connector name="local-node-connector">tcp://broker-c:61617</connector>
  <connector name="remote-node-connector-0">tcp://broker-a:61617</connector>
  <connector name="remote-node-connector-1">tcp://broker-b:61617</connector>
</connectors>
<cluster-connections>
  <cluster-connection name="cluster1">
  <message-load-balancing>ON_DEMAND</message-load-balancing>
  <connector-ref>local-node-connector</connector-ref>
  <static-connectors allow-direct-connections-only="true">
    <connector-ref>remote-node-connector-0</connector-ref>
    <connector-ref>remote-node-connector-1</connector-ref>
  </static-connectors>
  </cluster-connection>
</cluster-connections>
<ha-policy><replication>
  <slave>
    <cluster-name>cluster1</cluster-name>
    <allow-failback>true</allow-failback>
    <restart-backup>true</restart-backup>
    <quorum-vote-wait>15</quorum-vote-wait>
    <vote-retries>12</vote-retries>
    <vote-retry-wait>5000</vote-retry-wait>
  </slave>
</replication></ha-policy>


Thanks for any help,
Jo




--
Sent from: http://activemq.2283324.n4.nabble.com/ActiveMQ-User-f2341805.html
Re: Artemis failback doesn't work in our scenario

jbertram
The results of your test are not surprising. You've essentially reproduced
the situation discussed in the network isolation documentation [1]. Since
you only have 1 live and 2 backups you have no real quorum that can be
used to make decisions when situations like this arise. You also
haven't configured the network pinger, which is intended to mitigate the
risk of split-brain when a valid quorum is not available.

Since you can't actually use a cluster (due to your selector/filter
requirements) I recommend you either move to a shared-store configuration
or configure the network pinger to mitigate the risk of split-brain.
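
For reference, a minimal network-pinger sketch for broker.xml; the address and
URL below are placeholders you'd replace with something reliably reachable from
each broker (e.g. the AZ's gateway), and the full element list is in the
network isolation documentation [1]:

```xml
<core xmlns="urn:activemq:core">
  <!-- Network check ("pinger"): the broker periodically pings these
       targets. If none respond, a live broker shuts itself down and a
       backup refrains from activating, avoiding split-brain. -->
  <network-check-list>10.0.0.1</network-check-list>          <!-- placeholder address -->
  <network-check-period>10000</network-check-period>         <!-- check every 10 s -->
  <network-check-timeout>1000</network-check-timeout>        <!-- per-ping timeout in ms -->
  <!-- Optionally probe an HTTP endpoint instead of (or in addition to) ICMP: -->
  <network-check-URL-list>http://example.com</network-check-URL-list>
</core>
```

The shared-store alternative would instead replace the `<replication>`
element of the ha-policy with `<shared-store>`, with all brokers pointing
at the same journal directory on shared storage.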


Justin

[1]
http://activemq.apache.org/components/artemis/documentation/latest/network-isolation.html

On Tue, Nov 19, 2019 at 10:23 AM Jo Stenberg <[hidden email]>
wrote:

Re: Artemis failback doesn't work in our scenario

Jo Stenberg
Network pinger solves it. Thank you!



