Artemis Disaster Recovery options

Artemis Disaster Recovery options

John Bundred
I am looking at options for handling local failover and disaster recovery.
Our existing primary and secondary data centre hosted services run mainly
active-passive and have SAN storage but do not support SAN replication.
There is a dedicated network between the two DCs, so we have fast, reliable
connectivity.

For a local single-node failure in the primary we would fail over to a local
backup server configured to use shared storage; however, I'm not sure what
the best options are for handling a complete primary DC failure so that we
can fail over to the secondary DC.  As I said, we don't have SAN replication.
Is there anything similar to SQL log shipping that can be performed?  We
currently use BizTalk, which uses SQL log shipping, so we already accept a
certain amount of message loss in the case of a DR.  Even better, is there an
option which would remove all data loss from Artemis in the case of a full
DR scenario?
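
For reference, the local shared-store failover pair I have in mind would be
configured roughly like this in each broker's broker.xml (a minimal sketch;
both brokers point their journal directories at the same shared mount, and
the cluster-connection settings are omitted here):

<!-- live broker in the primary DC -->
<ha-policy>
   <shared-store>
      <master>
         <failover-on-shutdown>true</failover-on-shutdown>
      </master>
   </shared-store>
</ha-policy>

<!-- local backup broker, using the same shared journal directories -->
<ha-policy>
   <shared-store>
      <slave>
         <allow-failback>true</allow-failback>
      </slave>
   </shared-store>
</ha-policy>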

Thanks

John



Re: Artemis Disaster Recovery options

jbertram
There's nothing built in to Artemis at this point specifically for the DR
use case.  However, I believe the "data" directory (where persistent data
is stored by default) can be replicated (e.g. via a block-level storage
replication solution) or "shipped" via an external process (e.g. rsync) to
a DR backup.
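
For reference, the directories that make up that persistent data are all
configurable in broker.xml, so they can be kept under a single root that the
replication or shipping process covers. A rough sketch (the root path is
illustrative):

<!-- everything that needs to be shipped/replicated, kept under one root -->
<paging-directory>/var/lib/artemis/data/paging</paging-directory>
<bindings-directory>/var/lib/artemis/data/bindings</bindings-directory>
<journal-directory>/var/lib/artemis/data/journal</journal-directory>
<large-messages-directory>/var/lib/artemis/data/large-messages</large-messages-directory>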


Justin

Re: Artemis Disaster Recovery options

John Bundred
Thanks, Justin.  Do you know what sort of considerations I would need to make
if doing shipping? E.g.:

1) What config, if any, would the node in the secondary DC need to have in
common with the nodes in the primary DC? E.g. see the warning box regarding
copying data directories and unique node IDs here:
http://activemq.apache.org/artemis/docs/latest/clusters.html.  I assume
my secondary DC node would have to have the same node ID so that it can
treat the journals as its own?
2) I assume the node in the secondary DC would not need to be/must not be in
a cluster with the primary DC's nodes?
3) I assume there would need to be a node per data directory being copied,
e.g. if I have two active nodes in the primary DC then I will need two
standby nodes in the secondary DC.
4) What, if any, file/directory management would need to be done on the
secondary DC file structure if doing a straight file copy?  I wouldn't want
to copy the whole directory every 15 minutes, for example, but rather just
reflect the changes.
5) I assume the secondary DC nodes should be cold until required and then be
manually activated?

If anyone else has experience of doing this (DR requirements aren't rare :))
then I'd be very keen to understand your experiences.

Another option I totally forgot about would be to use a core bridge to push
messages across to a cluster in the secondary DC.  The problem with this, I
guess, is that as messages are removed from the primary DC queues they will
remain in the secondary DC queues.  I'm not sure if there is a TTL that can be
configured to provide some pruning of messages from the secondary DC queues.
However, if this TTL is set too large then a large number of duplicate
messages would be delivered, or if it is set too low and the failover
doesn't occur in time then there will be message loss.
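
To make that concrete, I imagine the bridge would look something like this
on a primary DC broker, with pruning handled on the secondary DC broker via
address-settings (the queue name, host and one-hour expiry are purely
illustrative):

<!-- primary DC broker: connector to the secondary DC plus a bridge per queue -->
<connectors>
   <connector name="dr-connector">tcp://dr-broker.secondary.example:61616</connector>
</connectors>

<bridges>
   <bridge name="orders-dr-bridge">
      <queue-name>orders</queue-name>
      <forwarding-address>orders</forwarding-address>
      <static-connectors>
         <connector-ref>dr-connector</connector-ref>
      </static-connectors>
   </bridge>
</bridges>

<!-- secondary DC broker: expire bridged copies that arrive with no expiration
     set, so the standby queues get pruned over time -->
<address-settings>
   <address-setting match="orders">
      <expiry-delay>3600000</expiry-delay>
      <expiry-address>ExpiryQueue</expiry-address>
   </address-setting>
</address-settings>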

What would be ideal, I think, would be the option to use shared storage for
local HA, but combined with the replication option for remote DR to a cold
backup node.  So rather than live-backup being a one-to-one relationship, a
live node could have one shared-storage backup and one replication backup,
only one of which could be failed over to automatically.



Re: Artemis Disaster Recovery options

Justin Bertram
> What config, if any, would the node in the secondary DC need to have in
common with the nodes in the  primary DC?

It would need to have the same essential configuration (e.g. same
addresses, queues, etc.), but it wouldn't necessarily need to be clustered
or have its own HA config, etc.
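
In other words, the DR broker's broker.xml would carry the same address and
queue definitions as the primary DC brokers, e.g. (names illustrative):

<addresses>
   <address name="orders">
      <anycast>
         <queue name="orders"/>
      </anycast>
   </address>
</addresses>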


> I assume my secondary DC node would have to have the same node id so that
it can treat the journals as its own?

That warning is really for clusters where all the nodes will be active
concurrently.  You can't have multiple nodes running concurrently with the
same node ID.  In the case of backups it is expected that they will have
the same node ID.  Therefore, there's no worry with copying the full
journal.


> I assume the node in the secondary DC would not need to be/must not be in
a cluster with the primary DC's nodes?

Correct.  The node in the secondary DC wouldn't even be active/started.
Starting the node would happen manually after the "disaster" happened using
the latest data backup from the primary DC.


> I assume there would need to be a node per data directory being copied,
e.g. if I have two active nodes in the primary DC then I will need two
standby nodes in the secondary DC.

Correct.  Each node in the cluster owns messages independently of the other
cluster nodes so each node in the cluster should be backed up in order not
to lose data.


> What, if any, file/directory management would need to be done on the
secondary DC file structure if doing a straight file copy?  I wouldn't want
to copy the whole directory every 15 minutes, for example, but rather just
reflect the changes.

There is a challenge with just copying changes because the journal files
can and will be re-used.  This is where a replicated block device would be
nice.


> I assume the secondary DC nodes should be cold until required and then be
manually activated?

Correct.


> What would be ideal I think would be...

There are lots of possible, graceful solutions here.  Almost anything would
be better than what we have right now (i.e. nothing).  Contributions are
always welcome.


Justin

Re: Artemis Disaster Recovery options

gtully
> > What, if any, file/directory management would need to be done on the
> > secondary DC file structure if doing a straight file copy?  I wouldn't want
> > to copy the whole directory every 15 minutes, for example, but rather just
> > reflect the changes.
>
> There is a challenge with just copying changes because the journal files
> can and will be re-used.  This is where a replicated block device would be
> nice.

rsync will do a good job of just copying changes, even when files are reused,
because it does a full scan of the file to determine which blocks have changed.

This full scan (i.e. a full read to compute checksums) can be expensive on
large journal files, so one mitigation is to increase the number of journal
files in the pool and decrease the journal max file size.
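
That mitigation corresponds to a couple of broker.xml settings; a sketch
with illustrative values (the defaults are, I believe, 2 journal files of
10MiB each):

<!-- more, smaller journal files so each rsync checksum pass covers less data -->
<journal-min-files>20</journal-min-files>
<journal-file-size>1M</journal-file-size>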

An ideal rsync would be aware of the current journal file append pointer,
through some sort of journal emitter, and hence know exactly what regions can
have changed.  That would require a modified/customised rsync and a modified
journal or some sort of write hook.

If rsync on small journals does not work, this could be investigated further.