Artemis: Recommended topology for reliability?

Bummer
Greetings.

I'm implementing an EDI solution based on Artemis: EDI payloads travel
between various endpoints until all message transformations are done.
Because the data in the system is very valuable, I need to be sure that
nothing is lost if a server crashes. Our systems are all set up via
Ansible, and the Artemis servers are restarted automatically whenever any of
their configuration changes.

However, these automatic restarts have caused data loss and an
inconsistent cluster state, so I'd like to know your opinions on how to build
the cluster topology properly.

Initially I thought that everything would be fine with a single live
server and two backups, each instance on a separate server. However, this
ended up with two live servers and one backup. Some of the data was on
the first live server, some on the other one. Later that day I lost the data
completely while trying to get back to a single live server.

What's the right approach and topology to gain the highest reliability
possible?

One thing I learned so far is that I MUST NOT start the live server before
the backup if both went down previously, or I lose the data that the backup
server might have received while the live has been down.

Thank you for your responses.



--
Sent from: http://activemq.2283324.n4.nabble.com/ActiveMQ-User-f2341805.html
Re: Artemis: Recommended topology for reliability?

Bummer
/One issue to be aware of is: in case of a successful fail-over, the backup's
data will be newer than the one at the live's storage. If you configure your
live server to perform a failback to live server when restarted, it will
synchronize its data with the backup's. If both servers are shutdown, the
administrator will have to determine which one has the latest data./
Source
<https://activemq.apache.org/components/artemis/documentation/latest/ha.html>  

How do I overcome this in an automated environment without having full
control over server restarts?



Re: Artemis: Recommended topology for reliability?

jbertram
I assume you were using replication with your master/slave/slave setup. If
that assumption is correct, then this isn't a recommended option due to the
risk of split-brain which apparently you ran into. Split-brain is a
scenario where two brokers are live with the same data. This can occur when
using replication which is why we recommend using at least 3 master/slave
pairs in a cluster to achieve a viable quorum to mitigate split-brain.
Additional configuration options are discussed in the documentation [1].

That said, the best mitigation against split-brain is shared storage,
since the shared store itself guards against split-brain. Of course, the
shared-storage device can be a single point of failure, so redundancy here
is recommended.

> One thing I learned so far is that I MUST NOT start the live server before
> the backup if both went down previously, or I lose the data that the backup
> server might have received while the live has been down.

That's not entirely true. When a backup starts it will make a copy of its
existing data on the filesystem before synchronizing with the live and
receiving a new set of data. Therefore, any data you appear to have lost
should be in one of the backup journals. The number of backups the broker
will keep is configured by the max-saved-replicated-journals-size setting.
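
As a rough aid for finding those saved copies, here is a minimal Python
sketch that lists the sub-directories of a given directory, newest first.
The directory to scan and the optional name prefix are assumptions supplied
by the caller; check your broker's data directory to see where the copies
actually end up and how they are named.

```python
import os
from typing import List, Tuple


def saved_copies(parent: str, name_prefix: str = "") -> List[Tuple[str, float]]:
    """Return (path, mtime) for each sub-directory of `parent`, newest first.

    `name_prefix` is a hypothetical filter: pass whatever prefix the saved
    journal copies carry on your system, or leave it empty to list everything.
    """
    found = []
    with os.scandir(parent) as entries:
        for entry in entries:
            # Only directories are of interest; plain files are skipped.
            if entry.is_dir() and entry.name.startswith(name_prefix):
                found.append((entry.path, entry.stat().st_mtime))
    # Sort by modification time so the most recent copy comes first.
    return sorted(found, key=lambda item: item[1], reverse=True)
```

The modification time is only a hint about which copy is the most recent;
inspect the copies before restoring anything from them.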


Justin

[1]
http://activemq.apache.org/components/artemis/documentation/latest/network-isolation.html

On Wed, May 29, 2019 at 5:02 AM Bummer <[hidden email]> wrote:

> Greetings.
> [...]

Re: Artemis: Recommended topology for reliability?

jbertram
In a non-automated use-case I'd recommend an administrator take a look at
each of the brokers' log files to see which one had been active most
recently and then restart that broker first (and if that server happened to
be a slave its configuration would need to be changed to be a master so it
would fully start rather than just wait for its corresponding master). If
your environment is automated then you'll have to develop some kind of
process to approximate what an administrator would do.
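
One way to approximate that log check is sketched below in Python. It is
only an illustration under stated assumptions: each log line is assumed to
start with a "YYYY-MM-DD HH:MM:SS,mmm" timestamp, and `marker` is whatever
text your brokers log when they are active, which you would need to look up
in your own log files. The function names are hypothetical.

```python
import re
from datetime import datetime
from typing import Dict, Iterable, Optional

# Assumed layout: every relevant line begins with "YYYY-MM-DD HH:MM:SS,mmm".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})")


def last_seen(lines: Iterable[str], marker: str) -> Optional[datetime]:
    """Return the newest timestamp among the lines that contain `marker`."""
    newest = None
    for line in lines:
        match = TIMESTAMP.match(line)
        if match is None or marker not in line:
            continue  # no timestamp, or not a line we care about
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
        if newest is None or stamp > newest:
            newest = stamp
    return newest


def broker_to_start_first(logs: Dict[str, Iterable[str]], marker: str) -> Optional[str]:
    """Pick the broker whose log shows the most recent `marker` line."""
    seen = {name: last_seen(lines, marker) for name, lines in logs.items()}
    seen = {name: stamp for name, stamp in seen.items() if stamp is not None}
    return max(seen, key=seen.get) if seen else None
```

The result is only as good as the marker: a broker that went live early and
stayed live until the end would need a marker that is logged continuously,
or a shutdown message, to be ranked correctly.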

That said, in a fully automated environment a master/slave pair may not
make much sense. Presumably the automation could simply restart the master
broker when it dies and clients can simply reconnect. Combined with a
redundant, robust file store this is a viable solution for high
availability.


Justin

On Thu, May 30, 2019 at 5:41 AM Bummer <[hidden email]> wrote:

> [...]
> How do I overcome this in an automated environment without having full
> control over server restarts?

Re: Artemis: Recommended topology for reliability?

Bummer
Thank you for your responses, Justin. Also thank you for noting the journal
backup feature. That's great news. :)

Meanwhile, I've developed a simple scripted solution that works for
statically configured clusters such as mine. It extracts the broker
configuration and state (the last time the broker went live) and publishes
them on a certain port. Each broker instance then has its own preloader,
which decides whether it's safe to start the broker just by looking at the
state of each of the other brokers within the cluster. Once it learns that
it's safe to start the instance, it returns RC 0, so it can be used as
ExecStartPre in *.service files. In case of RC != 0, systemd waits for a
while and then tries again.
It's not stable yet, but I'll happily publish the draft if anyone happens to
be interested.
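
For readers who want a starting point, the decision step of such a preloader
could look roughly like the Python sketch below. This is not the script
described above, only an illustration of the idea. `BrokerState`, its
fields, and the tie-break by name are assumptions, and fetching the peers'
state from their ports (including what to do when a peer is unreachable) is
left out.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class BrokerState:
    name: str
    running: bool      # is this broker currently up and live?
    last_live: float   # epoch seconds of the last time it went live, 0 if never


def safe_to_start(me: BrokerState, peers: Iterable[BrokerState]) -> bool:
    """Decide whether this broker may start, given the state of its peers."""
    peers = list(peers)
    if not peers:
        return True  # single-broker setup: nothing to wait for
    # A peer is already live, so this broker can start and sync from it.
    if any(peer.running for peer in peers):
        return True
    # Nobody is live: only the broker that was live most recently goes first.
    # The name breaks ties so that exactly one broker wins.
    newest_peer = max((peer.last_live, peer.name) for peer in peers)
    return (me.last_live, me.name) > newest_peer


def exit_code(me: BrokerState, peers: Iterable[BrokerState]) -> int:
    """Map the decision to a return code: 0 means start, 1 means retry later."""
    return 0 if safe_to_start(me, peers) else 1
```

A wrapper script would collect the states, call `sys.exit(exit_code(me,
peers))`, and be referenced from ExecStartPre, with systemd retrying on a
non-zero return code.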


