Broker version: 5.10.0
Using a master-slave topology with shared KahaDB.
Today we faced a very critical production issue due to KahaDB. We got the error below in the broker logs; after that, the broker stopped its transport connectors and shut down its services, but it still didn't release the lock on KahaDB, so even the failover broker was not able to acquire the lock and serve clients.
The broker stayed in this state for a long time until we manually restarted it. The major concern here is that the master broker didn't release the lock on KahaDB, so the failover broker could not get the lock and become master.
Can you please let us know what caused this and why the master didn't release the lock?
[20150124 10:36:58.665 EST (ActiveMQ Data File Writer) org.apache.activemq.store.kahadb.disk.journal.DataFileAppender#processQueue 382 INFO] - Journal failed while writing at: 1677639
[20150124 10:36:58.706 EST (ActiveMQ Journal Checkpoint Worker) org.apache.activemq.store.kahadb.MessageDatabase$3#run 364 ERROR] - Checkpoint failed
java.io.IOException: Input/output error
at java.io.RandomAccessFile.write0(Native Method)
Submit this as a JIRA, stating that KahaDB can't fail over to the slave if
the master is unable to write to disk when it shuts down (and therefore
can't release the lock). I'm not sure how feasible it'll be for the slave
to detect this (maybe there's a file modification timestamp that can be
used, or maybe something could be added to have the master write
periodically to a file so the slave can detect that the master is no longer
writing), but ideally KahaDB should handle this situation.
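To illustrate the timestamp idea: a slave-side staleness check could look something like the sketch below. This is only a rough illustration of the suggestion, not anything KahaDB actually does in 5.10.0; the class name and the 30-second threshold are assumptions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.time.Duration;
import java.time.Instant;

public class LockStalenessCheck {
    // Hypothetical threshold: if the master hasn't touched its
    // keep-alive file within this window, assume it is dead.
    static final Duration STALE_AFTER = Duration.ofSeconds(30);

    /**
     * Returns true if the file's last-modified time is older than
     * STALE_AFTER, i.e. the master appears to have stopped writing.
     */
    static boolean lockLooksStale(Path keepAliveFile) throws IOException {
        FileTime lastModified = Files.getLastModifiedTime(keepAliveFile);
        Instant cutoff = Instant.now().minus(STALE_AFter_or(STALE_AFTER));
        return lastModified.toInstant().isBefore(cutoff);
    }

    // Helper kept trivial so the cutoff computation reads clearly.
    private static Duration STALE_AFter_or(Duration d) {
        return d;
    }
}
```

On a shared filesystem the master would periodically rewrite the keep-alive file, and the slave would poll this check before deciding to break the lock; whether mtime is reliable enough over NFS is exactly the feasibility question raised above.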
With that being said, have your sysadmins figured out why KahaDB was unable
to write to disk in a live production system, and made sure it never happens
again? KahaDB only had this problem because your infrastructure
had a terrible failure, and I really hope that the sysadmins' office, not
the ActiveMQ mailing list, was the first stop you made after you discovered
it.
On Jan 26, 2015 11:24 PM, "khandelwalanuj" <[hidden email]>
I thought the master was killed completely, and the problem was solely that
the slave didn't take over. Can you please describe how the master wasn't
killed completely? You haven't mentioned that before, in either your
emails here or the JIRA you submitted.
On Tue, Jan 27, 2015 at 7:46 AM, khandelwalanuj <[hidden email]> wrote:
You said the broker still held the file lock, but I assumed that the broker
process exited without releasing the lock (since it couldn't write to
disk). Can you confirm that the master broker process really was still
running (as seen by ps, not just the state of the file lock)?
If the broker process really was still running, the problem is actually
that the broker tries to shut down but fails to complete the shutdown when
it can't write to the disk that hosts KahaDB. If the process exited but the
filesystem still shows the master broker holding the lock because the
broker couldn't write to disk to release it, then the problem is that the
slave broker isn't able to detect that the master exited without releasing
the lock. Your JIRA should be updated in either case, but you need to know
whether the process exited to know which update to make.
On Jan 27, 2015 11:20 PM, "khandelwalanuj" <[hidden email]>