kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"


khandelwalanuj
Hi,

Broker version: 5.10.0
Using a master-slave topology with a shared KahaDB.

Today we faced a critical production issue caused by KahaDB. We got the error below in the broker logs. After that, the broker stopped its transport connectors and its services, but it did not release the lock on KahaDB, so the failover broker could not acquire the lock and could not serve clients.

The broker stayed in this state for a long time, until we manually restarted it. The major concern is that the master broker did not release the KahaDB lock, which prevented the failover broker from getting the lock and becoming master.

Can you please let us know what caused this and why the master didn't release the lock?


[20150124 10:36:58.665 EST (ActiveMQ Data File Writer) org.apache.activemq.store.kahadb.disk.journal.DataFileAppender#processQueue 382 INFO] - Journal failed while writing at: 1677639
[20150124 10:36:58.706 EST (ActiveMQ Journal Checkpoint Worker) org.apache.activemq.store.kahadb.MessageDatabase$3#run 364 ERROR] - Checkpoint failed
java.io.IOException: Input/output error
        at java.io.RandomAccessFile.write0(Native Method)
        at java.io.RandomAccessFile.write(RandomAccessFile.java:472)
        at java.io.RandomAccessFile.writeLong(RandomAccessFile.java:1028)
        at org.apache.activemq.util.RecoverableRandomAccessFile.writeLong(RecoverableRandomAccessFile.java:305)
        at org.apache.activemq.store.kahadb.disk.page.PageFile.writeBatch(PageFile.java:1062)
        at org.apache.activemq.store.kahadb.disk.page.PageFile.flush(PageFile.java:516)
        at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1512)
        at org.apache.activemq.store.kahadb.MessageDatabase$17.execute(MessageDatabase.java:1484)
        at org.apache.activemq.store.kahadb.disk.page.Transaction.execute(Transaction.java:779)
        at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1481)
        at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:929)
        at org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:353)


Thanks,
Anuj

Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

khandelwalanuj
Adding more details:
I am using KahaDB on NFS.

Attaching the complete stack trace:


ActiveMQ_prod_25_Jan.txt

Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

khandelwalanuj
Did anyone get a chance to look at this?

Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

Tim Bain
Submit this as a JIRA, stating that KahaDB can't fail over to the slave if
the master is unable to write to disk when it shuts down (because it
couldn't write to disk).  I'm not sure how feasible it'll be for the slave
to detect this (maybe there's a file modification timestamp that can be
used, or maybe something could be added to have the master write
periodically to a file so the slave can detect that the master is no longer
writing), but ideally KahaDB should handle this situation.
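
The timestamp idea above could look roughly like the following. This is only a sketch under stated assumptions: the class `HeartbeatCheck`, its methods, and the notion of a dedicated heartbeat file are hypothetical and are not part of ActiveMQ or KahaDB. It assumes the master rewrites a small file on the shared store at a regular interval and the slave treats a long silence as a sign that the master is gone.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper, not an ActiveMQ API: decides whether the master's
// heartbeat file has been silent for too long.
final class HeartbeatCheck {

    // Pure decision: stale when the last write is older than maxSilence.
    static boolean isStale(Instant lastWrite, Instant now, Duration maxSilence) {
        return Duration.between(lastWrite, now).compareTo(maxSilence) > 0;
    }

    // Reads the modification time of the (assumed) heartbeat file on the
    // shared store and compares it with the local clock.
    static boolean masterLooksDead(Path heartbeatFile, Duration maxSilence) throws IOException {
        Instant lastWrite = Files.getLastModifiedTime(heartbeatFile).toInstant();
        return isStale(lastWrite, Instant.now(), maxSilence);
    }
}
```

A slave could call `masterLooksDead` with a threshold of a few heartbeat intervals. Clock skew between hosts and NFS attribute caching may both distort the observed timestamp, so the threshold would need a generous margin, and a filer outage can of course prevent the slave from reading the file at all.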

With that being said, have your sysadmins figured out why KahaDB was unable
to write to disk in a live production system and made sure it never happens
again?  Because KahaDB only had this problem because your infrastructure
had a terrible failure, and I really hope that the sysadmin's office, not
the ActiveMQ mailing list, was the first stop you made after you discovered
this problem.
On Jan 26, 2015 11:24 PM, "khandelwalanuj" <[hidden email]>
wrote:

> Did anyone get a chance to look at this ?
>
>
>
> --
> View this message in context:
> http://activemq.2283324.n4.nabble.com/kahadb-corruption-Checkpoint-failed-java-io-IOException-Input-output-error-tp4690378p4690442.html
> Sent from the ActiveMQ - User mailing list archive at Nabble.com.
>

Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

khandelwalanuj
Hi,

There were some failures on the filer, because of which applications (ActiveMQ) were not able to read from or write to KahaDB.

As you mentioned, KahaDB should handle this: if the master broker is not writing, then the failover broker should take over. I have logged a request: https://issues.apache.org/jira/browse/AMQ-5540

To handle this, can ActiveMQ provide a configuration option in http://activemq.apache.org/configurable-ioexception-handling.html that kills the master completely and lets the failover broker take over?
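
To make the request concrete, here is a minimal sketch of a "terminate on store failure" policy. Everything in it is an assumption for illustration: `FailFastOnIoError`, `haltingJvm`, and `onStoreFailure` are hypothetical names, not ActiveMQ classes or options from the page linked above. The only real API it relies on is `Runtime.halt`, which ends the JVM without running shutdown hooks.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical fail-fast policy; class and method names are illustrative
// and are not ActiveMQ APIs.
final class FailFastOnIoError {
    private final Runnable terminator;
    private final AtomicBoolean fired = new AtomicBoolean(false);

    FailFastOnIoError(Runnable terminator) {
        this.terminator = terminator;
    }

    // Production wiring: halt() skips shutdown hooks, so no hook can
    // block while trying to write to the broken disk.
    static FailFastOnIoError haltingJvm() {
        return new FailFastOnIoError(() -> Runtime.getRuntime().halt(1));
    }

    // Intended to be called from wherever the store reports a write failure.
    // The terminator runs at most once, even if several threads report errors.
    void onStoreFailure(IOException cause) {
        if (fired.compareAndSet(false, true)) {
            System.err.println("Store I/O failure, terminating: " + cause.getMessage());
            terminator.run();
        }
    }
}
```

The terminator is injectable so the policy can be exercised without killing the JVM. Whether ending the process is enough for the slave to obtain the lock depends on how the NFS server handles locks of a vanished client, which this sketch does not address.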

Thanks,
Anuj

Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

Tim Bain
I thought the master was killed completely, and the problem was solely that
the slave didn't take over.  Can you please describe how the master wasn't
killed completely, since you've never before mentioned that in either your
emails here or the JIRA you submitted?

On Tue, Jan 27, 2015 at 7:46 AM, khandelwalanuj <[hidden email]
> wrote:

> Hi,
>
> There was some failures on filer because of which applications (ActiveMQ)
> was not able to read/write on kahadb.
>
> As you mentioned that kahadb should handle this if master broker is not
> writing than failover should take over; I have logged a request
> https://issues.apache.org/jira/browse/AMQ-5540
>
> To handle this can ActiveMQ provide a configuration in
> http://activemq.apache.org/configurable-ioexception-handling.html which
> can
> kill the master completely and let the failover take over ?
>
> Thanks,
> Anuj

Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

khandelwalanuj
Hi,

The master was not completely killed. It stopped its transport connectors and plugins, but it didn't release its lock on KahaDB.

Thanks,
Anuj

Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

Tim Bain
You said the broker still held the file lock, but I assumed that the broker
process exited without releasing the lock (since it couldn't write to
disk).  Can you confirm that the master broker process really was still
running (as seen by ps, not just the state of the file lock)?

If the broker process really was still running, the problem is actually
that the broker tries to shut down but fails to do so when it
can't write to the disk that hosts KahaDB.  If the process exited but the
disk still thinks the master broker holds the lock because it couldn't
write to disk to release the lock, then the problem is that the slave
broker isn't able to detect that the master exited without releasing the
lock.  Your JIRA should be updated in either case, but you need to know
whether the process exited to know which update to make.
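
The two cases described above can be written down as a small decision helper. This is a sketch only: `BrokerDiagnosis` and its members are hypothetical names, the broker PID is assumed to be known (for example from a pid file), and `ProcessHandle` requires Java 9 or later. It only checks processes on the local host, the same view that `ps` gives.

```java
// Hypothetical diagnostic helper, not part of ActiveMQ.
final class BrokerDiagnosis {

    enum Finding {
        LOCK_RELEASED,        // nothing blocks the slave
        SHUTDOWN_HUNG,        // process alive and still holding the lock
        STALE_LOCK_AFTER_EXIT // process gone, but the lock is still reported as held
    }

    // Equivalent of looking for the PID in ps output on the local host.
    static boolean processAlive(long pid) {
        return ProcessHandle.of(pid).map(ProcessHandle::isAlive).orElse(false);
    }

    // Maps the two observations onto the two problem statements in the post above.
    static Finding classify(boolean processAlive, boolean lockStillHeld) {
        if (!lockStillHeld) {
            return Finding.LOCK_RELEASED;
        }
        return processAlive ? Finding.SHUTDOWN_HUNG : Finding.STALE_LOCK_AFTER_EXIT;
    }
}
```

`SHUTDOWN_HUNG` corresponds to the broker failing to finish its shutdown, and `STALE_LOCK_AFTER_EXIT` to the slave being unable to detect that the master is gone.
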
On Jan 27, 2015 11:20 PM, "khandelwalanuj" <[hidden email]>
wrote:

> Hi,
>
> Master was not completely killed. Master has stopped it's transport
> connectors and plugins but it didn't release it's lock from the kahadb.
>
> Thanks,
> Anuj

Re: kahadb corruption: "Checkpoint failed java.io.IOException: Input/output error"

khandelwalanuj
Hi,

Yes, the broker process was still running. I verified it with the "ps" command.
I have updated the JIRA with the details you mentioned in your last update.

Thanks,
Anuj