Re: Cluster, both brokers are "live"

Justin Bertram
When using network replication between a live and a backup, it is extremely
important that the network connection between the two brokers is reliable.
If that connection dies and those 2 nodes are the only ones operating in the
cluster, there will be a "split brain" where both the live and the backup
are active simultaneously.

To mitigate this risk you should configure multiple live-backup pairs to
participate in the cluster so that the backup can perform a legitimate
quorum vote when the live dies (or the network connection between the two
dies).  You can also use the network monitor [1] to mitigate this, as
mentioned on another thread regarding this issue.  In general, I don't
recommend running a single live/backup pair, as the risk of split brain
is typically just too high.
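
To illustrate why the extra pairs matter, here is a minimal Python sketch of
a majority vote. It is only a simplified model and not Artemis's actual
quorum code: the function name backup_may_activate and the voting rule are
assumptions made up for illustration. The "no peers" branch mirrors what the
logs in the quoted message show, where the backup went live on its own.

```python
# Simplified, hypothetical model of a backup's failover decision.
# NOT the Artemis implementation; the name and the rule are illustrative.

def backup_may_activate(peer_votes):
    """Decide whether a backup should become live.

    peer_votes: list of booleans collected from the OTHER live brokers in
    the cluster, where True means "I cannot reach your live broker either".
    """
    if not peer_votes:
        # Single live/backup pair: there is nobody to ask, so the backup
        # cannot tell a dead live broker from a broken network link.
        # In this model it activates anyway, which is the split-brain case.
        return True
    agree = sum(peer_votes)
    # Require a strict majority of peers to confirm the live broker is gone.
    return agree > len(peer_votes) / 2


if __name__ == "__main__":
    print(backup_may_activate([]))              # single pair, no one to consult
    print(backup_may_activate([False, False]))  # peers still see the live broker
    print(backup_may_activate([True, True]))    # peers agree the live broker is gone
```

With other live brokers available to vote, a backup that has merely lost its
own link to the live gets outvoted and stays passive instead of activating.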


Justin

[1] https://activemq.apache.org/artemis/docs/latest/network-isolation.html

On Fri, Sep 22, 2017 at 10:03 AM, boris_snp <[hidden email]>
wrote:

> I have to restart my 2-broker cluster on a daily basis due to the following
> sequence of events:
> ------------------------------------------------------------
> ----------------------
> master
> 04:51:14,501    AMQ212037: Connection failure has been detected:
> AMQ119014: Did
> not receive data from /10.202.147.99:58739 within the 60,000ms connection
> TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> 04:51:14,510    AMQ222092: Connection to the backup node failed, removing
> replication now:
> ActiveMQConnectionTimedOutException[errorType=CONNECTION_TIMEDOUT
> message=AMQ119014: Did not receive data from /10.202.147.99:58739 within
> the
> 60,000ms connection TTL. The connection will now be closed.]
> 04:51:24,517    AMQ212041: Timed out waiting for netty channel to close
> 04:51:24,517    AMQ212037: Connection failure has been detected:
> AMQ119014: Did
> not receive data from /10.202.147.99:58738 within the 60,000ms connection
> TTL. The connection will now be closed. [code=CONNECTION_TIMEDOUT]
> ------------------------------------------------------------
> ----------------------
> slave
> 04:51:42,306
> AMQ212037: Connection failure has been detected: AMQ119011: Did not receive
> data from server for
> org.apache.activemq.artemis.core.remoting.impl.netty.
> NettyConnection@1c54a4bc[local=
> /10.202.147.99:58738, remote=nj09mhf0681/10.202.147.99:41410]
> [code=CONNECTION_TIMEDOUT]
> 04:51:42,316
> AMQ212037: Connection failure has been detected: AMQ119011: Did not receive
> data from server for
> org.apache.activemq.artemis.core.remoting.impl.netty.
> NettyConnection@65ace922[local=
> /10.202.147.99:58739, remote=nj09mhf0681/10.202.147.99:41410]
> [code=CONNECTION_TIMEDOUT]
> 04:51:46,955    AMQ221037:
> ActiveMQServerImpl::serverUUID=7ffa29a0-7c48-11e7-9784-e83935127b09 to
> become 'live'
> 04:51:59,360    AMQ221014: 40% loaded
> 04:52:01,854    AMQ221014: 81% loaded
> 04:52:03,037    AMQ222028: Could not find page cache for page
> PagePositionImpl
> [pageNr=8, messageNr=-1, recordID=8662153341] removing it from the journal
> 04:52:03,051    AMQ222028: Could not find page cache for page
> PagePositionImpl
> [pageNr=13, messageNr=-1, recordID=8662204094] removing it from the journal
> 04:52:03,208    AMQ221003: Deploying queue jms.queue.DLQ
> 04:52:03,281    AMQ221003: Deploying queue jms.queue.ExpiryQueue
> 04:52:03,827    AMQ212034: There are more than one servers on the network
> broadcasting the same node id.
> ------------------------------------------------------------
> ----------------------
> master
> 04:52:03,827    AMQ212034: There are more than one servers on the network
> broadcasting the same node id.
> ------------------------------------------------------------
> ----------------------
> slave
> 04:52:03,910    AMQ221007: Server is now live
> 04:52:04,003    AMQ221020: Started Acceptor at nj09mhf0681:41411 for
> protocols
> [CORE,MQTT,AMQP,STOMP,HORNETQ,OPENWIRE]
> 04:52:11,949    AMQ212034: There are more than one servers on the network
> broadcasting the same node id.
> ------------------------------------------------------------
> ----------------------
> I understand that at some point the master (live now) loses the slave and
> closes the connection to it. The slave (backup now) in turn detects that no
> master is available and becomes live itself. Now both brokers are live and
> never recover from that state.
> How can I avoid restarts and have the brokers recover to a usable state by
> themselves?
> Thank you.