Problems with ActiveMQ with LevelDB and shared filesystem over NFS4


Problems with ActiveMQ with LevelDB and shared filesystem over NFS4

scheifra
Hi

As I didn't succeed in finding answers to my questions in the mailing list/forum, I have to ask here: we are stuck with a master/slave setup of ActiveMQ using a shared filesystem over an NFS4 share.

We have a simple setup of two brokers using LevelDB as the persistence adapter, sharing an NFS4 location. The configuration in both brokers looks like this:

<persistenceAdapter>
    <levelDB directory="/nfs/activemq/data/leveldb" lockKeepAlivePeriod="5000">
        <locker>
            <shared-file-locker lockAcquireSleepInterval="10000"/>
        </locker>
    </levelDB>
</persistenceAdapter>

From my understanding, the master broker tries to renew the lock every 5 seconds and fails if it cannot. The slave broker tries every 10 seconds to acquire the lock and sleeps another 10 seconds if it cannot; otherwise the slave becomes master.

This works perfectly if I shut down or kill the master: the slave becomes master, and when I restart the previous master it becomes the slave.

Now I simulate a network outage on the master and prevent it from accessing the NFS4 share. What I see is the master staying alive and continuing to act as master. Additionally, the slave acquires the lock on the NFS4 share and also becomes master. The result is two master brokers and a completely corrupted system.

When the former master can access the NFS4 share again, nothing changes; both brokers stay master. I have to stop and restart both brokers to fix the system.

As I cannot find any configuration settings for a shared-filesystem master/slave system, what are the prerequisites?
Does the NFS4 share require special settings? We tried hard mounting and soft mounting with no effect.
Is there anything I am totally missing in configuring this system?

Any help from someone who has experience with this setup is welcome!

Re: Problems with ActiveMQ with LevelDB and shared filesystem over NFS4

artnaseef
What method is used to simulate the network outage to the NFS server?

If the original master loses its network connection to the NFS server, writes should fail, although there could be NFS settings that affect exactly how it behaves (e.g. sync vs. async and timeouts).  It's been a long time since I tweaked NFS settings and I don't remember well how it operates under the hood.
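For reference, this is the kind of mount where those settings live. A soft mount with short timeouts makes I/O errors surface to the application instead of blocking indefinitely like a hard mount; a purely illustrative mount line (server name, export path, and option values are placeholders, not recommendations) might look like:

```shell
# Illustrative only: with a soft mount, NFS operations return an I/O error
# after roughly retrans * timeo instead of hanging forever as on a hard mount.
# timeo is in tenths of a second (here 5 s); all values are hypothetical.
mount -t nfs4 -o soft,timeo=50,retrans=3,sync nfsserver:/activemq /nfs/activemq
```

Whether the broker actually notices the error then depends on it issuing real I/O against the share, which is the crux of the question below.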

One thing to keep in mind - lock timeouts have this problem; if the original holder of the lock returns after the timeout, it tends to lead to a bad state because the original holder doesn't know it lost the lock.  ActiveMQ's locking is based on a single lock file to take ownership of the entire database of files; individual data files (i.e. the files actually holding the leveldb content) are not locked.
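To illustrate that last point, here is a minimal sketch (class name, method name, and temp-file path are mine, not ActiveMQ's API) of shared-file locking via `java.nio`. The three checks mirror the shape of ActiveMQ's keep-alive, and none of them is guaranteed to notice a dead NFS server: `FileLock.isValid()` is a purely local JVM check, and `File.exists()` may be answered from the NFS client's attribute cache.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

// Hypothetical sketch: one lock file guards the whole data directory,
// in the style of ActiveMQ's shared-file-locker.
public class SharedFileLockDemo {

    // Runs the same three checks a keep-alive would make on the lock.
    static boolean acquireAndCheck(File lockFile) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
             FileChannel channel = raf.getChannel();
             FileLock lock = channel.tryLock()) {   // null if another process holds it
            // isValid() only reflects local JVM state, and exists() can be
            // served from the NFS client's cache -- neither reliably detects
            // that the NFS server has become unreachable.
            return lock != null && lock.isValid() && lockFile.exists();
        }
    }

    public static void main(String[] args) throws Exception {
        File lockFile = File.createTempFile("demo", ".lock");
        System.out.println("keepAlive = " + acquireAndCheck(lockFile));
        lockFile.delete();
    }
}
```

On a local filesystem with no contention this reports true; the point is that it can keep reporting true even when the underlying NFS mount is gone.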

Another thought - this seems like an odd scenario.  The master server is still on the network, so it can serve clients, and the NFS server is still on the network and talking to the slave, but the master cannot talk to the NFS server.

Also, is there redundancy in the master's network setup?  In other words, are there two network interfaces and two wires from the server to the network?  Surviving any single-point-of-failure means considering all points in the system and making sure they are redundant.  A typical H/A setup will survive any single-point-of-failure without down time (or with minimal interruption), but does not survive multiple-points-of-failure.

Hope this helps.

Re: Problems with ActiveMQ with LevelDB and shared filesystem over NFS4

scheifra
Thanks for your answer!

We used a simple firewall rule on the ActiveMQ server to interrupt the connection to the NFS share.
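A rule of roughly this shape (the server address is a placeholder from the documentation range, not the real one) blocks the NFS traffic while leaving broker/client traffic untouched:

```shell
# Hypothetical example: drop outbound traffic to the NFS server (NFSv4
# uses TCP port 2049) so the broker loses the share but stays reachable.
iptables -A OUTPUT -d 192.0.2.10 -p tcp --dport 2049 -j DROP

# Delete the same rule again to restore access to the share:
iptables -D OUTPUT -d 192.0.2.10 -p tcp --dport 2049 -j DROP
```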

The scenario with the master still on the network while it cannot reach the share should not occur if the "lockKeepAlivePeriod" setting took effect: after this period the master should fail if it cannot reach the share. But since I cannot see this behaviour, I wonder whether I understand the setting correctly or whether it really works.
Btw, there should be a soft-mount setting on the NFS share that makes the master fail, after the set timeout, if it cannot reach the share.

The network setup indeed has redundancy, and this error will hopefully never occur again, but we had a problem with the redundancy and, of course, a short network outage in the same period - shit happens. ;-)

I will investigate further; perhaps someone else has already spotted this problem and has a solution or some hints.

Re: Problems with ActiveMQ with LevelDB and shared filesystem over NFS4

artnaseef
OK, digging into the code, here's how the keep-alive works with the shared-filesystem-locker:

        public boolean keepAlive() {
            return lock != null && lock.isValid() && file.exists();
        }

I suspect this code is either (a) ignoring the lost access to the server, or (b) blocking.  It should be possible to detect (b) by looking at the JVM stack dump (e.g. using jstack).
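To check for case (b), one option is a thread dump of the broker JVM; the PID and the grep pattern below are illustrative guesses at what the relevant thread would contain, not exact ActiveMQ thread names:

```shell
# Hypothetical: dump all JVM threads (12345 is a placeholder PID) and look
# for a thread blocked in filesystem/lock calls around the keep-alive.
jstack 12345 | grep -E -i -A 20 'keepalive|filelock'
```

A thread parked in a native file operation for longer than the keep-alive period would point to blocking rather than a silently passing check.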

Re: Problems with ActiveMQ with LevelDB and shared filesystem over NFS4

scheifra
Thanks for your investigation! I will take a deeper look into it and try to find out what exactly the behaviour is.