Lease based locker fails as slave starts when master is still up

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

Lease based locker fails as slave starts when master is still up

mandar.wanpal
Hi Team,

We recently switched to AMQ5.7.0 after reading posts in AMQ-3654 and reading about lease locker mechanism. We have implemented this mechanism in our AMQ setup. We are facing some strange issue with this mechanism.

The issue is, at some point after master and slave running for quite some time properly, meaning master serving requests and slave waiting and checking lease, all of a sudden at some point, even if master is running, slave starts and takes lock. I can see brockername changed in DB, though strange thing is, the queues, which were registered with master, still remain with master. So though slave started and took lock, i cant find queues registered in slave. So now both of them keep running simultaneously.

Below is configuration we have done.
------------------------------
master (just putting whats changed from default activemq.xml)
------------------------------
        <ioExceptionHandler>
            <jDBCIOExceptionHandler/>
        </ioExceptionHandler>
               
               
                       <persistenceAdapter>
           
            <jdbcPersistenceAdapter lockKeepAlivePeriod="5000" lockAcquireSleepInterval="10000" dataDirectory="${activemq.base}/data" dataSource="#oracle-ds">
                <databaseLocker>
                    <lease-database-locker/>
                </databaseLocker>
            </jdbcPersistenceAdapter>
        </persistenceAdapter>
               
               
                      <systemUsage>
            <systemUsage>
                <memoryUsage>
                    <memoryUsage limit="20 mb"/>
                </memoryUsage>
                <storeUsage>
                    <storeUsage limit="1 gb"/>
                </storeUsage>
                <tempUsage>
                    <tempUsage limit="100 mb"/>
                </tempUsage>
            </systemUsage>
        </systemUsage>
               
                        <transportConnectors>
           
            <transportConnector name="openwire" uri="tcp://0.0.0.0:61616?maximumConnections=1000&amp;wireformat.maxFrameSize=104857600"/>
        </transportConnectors>
               
               
                 <bean id="oracle-ds" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
         <property name="driverClassName" value="oracle.jdbc.driver.OracleDriver"/>
         <property name="url" value="jdbc:oracle:thin:@
                            (DESCRIPTION =
                            (ADDRESS_LIST = (ADDRESS = (PROTOCOL = TCP)(HOST = db_hostname)(PORT = 1521)))
                            (CONNECT_DATA = (SERVER = DEDICATED)(SERVICE_NAME = db_hostname))   )" />
         <property name="username" value="db_cred"/>
         <property name="password" value="db_cred"/>
         <property name="maxActive" value="200"/>
         <property name="poolPreparedStatements" value="true"/>
     </bean>
         
----------------------------------
slave
----------------------------------

        <ioExceptionHandler>
            <jDBCIOExceptionHandler/>
        </ioExceptionHandler>
               
                     <persistenceAdapter>
           
            <jdbcPersistenceAdapter lockKeepAlivePeriod="5000" lockAcquireSleepInterval="16000" dataDirectory="${activemq.base}/data" dataSource="#oracle-ds">
                <databaseLocker>
                    <lease-database-locker/>
                </databaseLocker>
            </jdbcPersistenceAdapter>
        </persistenceAdapter>
               
               
                      <systemUsage>
            <systemUsage>
                <memoryUsage>
                    <memoryUsage limit="20 mb"/>
                </memoryUsage>
                <storeUsage>
                    <storeUsage limit="1 gb"/>
                </storeUsage>
                <tempUsage>
                    <tempUsage limit="100 mb"/>
                </tempUsage>
            </systemUsage>
        </systemUsage>
               
                        <transportConnectors>
           
            <transportConnector name="openwire" uri="tcp://0.0.0.0:61616?maximumConnections=1000&amp;wireformat.maxFrameSize=104857600"/>
        </transportConnectors>
               
               
               
                 <bean id="oracle-ds" class="org.apache.commons.dbcp.BasicDataSource" destroy-method="close">
         <property name="driverClassName" value="oracle.jdbc.driver.OracleDriver"/>
         
         <property name="url" value="jdbc:oracle:thin:@
                            (DESCRIPTION =
                            (ADDRESS_LIST = (ADDRESS = (PROTOCOL = TCP)(HOST = db_hostname)(PORT = 1521)))
                            (CONNECT_DATA = (SERVER = DEDICATED)(SERVICE_NAME = db_hostname))   )" />
         <property name="username" value="db_cred"/>
         <property name="password" value="db_cred"/>
         <property name="maxActive" value="200"/>
         <property name="poolPreparedStatements" value="true"/>
     </bean>

--------------------------------------------

See in below logs, at 03:17:41, slave suddenly became master when at same time, in master logs, I cant find anything strange that could have triggered slave to become master.
---------------------------------------------
Logs:
master

2012-12-13 03:17:40,241 | DEBUG | Running WriteCheck[tcp://<master_ip>:42695] | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor Worker
2012-12-13 03:17:40,241 | DEBUG | Running WriteCheck[tcp://<master_ip>:42691] | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor Worker
2012-12-13 03:17:40,241 | DEBUG | 30001 ms elapsed since last read check. | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor ReadCheckTimer
2012-12-13 03:17:40,243 | DEBUG | WriteChecker 10000 ms elapsed since last write check. | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor WriteCheckTimer
2012-12-13 03:17:40,243 | DEBUG | Running WriteCheck[tcp://<master_ip>:42692] | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor Worker
2012-12-13 03:17:41,467 | DEBUG | WriteChecker 10000 ms elapsed since last write check. | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor WriteCheckTimer
2012-12-13 03:17:41,655 | DEBUG | <master_broker_name>, lease keepAlive Query is UPDATE ACTIVEMQ_LOCK SET BROKER_NAME=?, TIME=? WHERE BROKER_NAME=? AND ID = 1 | org.apache.activemq.store.jdbc.LeaseDatabaseLocker | ActiveMQ Cleanup Timer
2012-12-13 03:17:42,327 | DEBUG | Running WriteCheck[tcp://10.217.192.142:56709] | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor Worker
2012-12-13 03:17:41,652 | DEBUG | queue://<queue1>-TEMP expiring messages .. | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[<master_broker_name>] Scheduler
2012-12-13 03:17:42,327 | DEBUG | <queue1>-TEMP toPageIn: 0, Inflight: 0, pagedInMessages.size 0, enqueueCount: 0, dequeueCount: 0 | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[<master_broker_name>] Scheduler
2012-12-13 03:17:42,327 | DEBUG | queue://<queue1>-TEMP expiring messages done. | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[<master_broker_name>] Scheduler
2012-12-13 03:17:42,327 | DEBUG | queue://<queue1> expiring messages .. | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[<master_broker_name>] Scheduler
2012-12-13 03:17:42,327 | DEBUG | <queue1> toPageIn: 0, Inflight: 0, pagedInMessages.size 0, enqueueCount: 0, dequeueCount: 0 | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[<master_broker_name>] Scheduler
2012-12-13 03:17:42,327 | DEBUG | queue://<queue1> expiring messages done. | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[<master_broker_name>] Scheduler
2012-12-13 03:17:42,327 | DEBUG | queue://<queue2> expiring messages .. | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[<master_broker_name>] Scheduler
2012-12-13 03:17:42,327 | DEBUG | <queue2> toPageIn: 0, Inflight: 0, pagedInMessages.size 0, enqueueCount: 0, dequeueCount: 0 | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[<master_broker_name>] Scheduler
2012-12-13 03:17:42,327 | DEBUG | queue://<queue2> expiring messages done. | org.apache.activemq.broker.region.Queue | ActiveMQ Broker[<master_broker_name>] Scheduler
2012-12-13 03:17:42,327 | DEBUG | WriteChecker 10860 ms elapsed since last write check. | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor WriteCheckTimer
2012-12-13 03:17:42,328 | DEBUG | WriteChecker 10859 ms elapsed since last write check. | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor WriteCheckTimer
2012-12-13 03:17:42,328 | DEBUG | WriteChecker 10860 ms elapsed since last write check. | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor WriteCheckTimer
2012-12-13 03:17:42,328 | DEBUG | WriteChecker 10860 ms elapsed since last write check. | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor WriteCheckTimer
2012-12-13 03:17:42,328 | DEBUG | WriteChecker 10860 ms elapsed since last write check. | org.apache.activemq.transport.AbstractInactivityMonitor | ActiveMQ InactivityMonitor WriteCheckTimer


slave

2012-12-13 03:17:26,793 | INFO  | <slave_broker_name> failed to acquire lease.  Sleeping for 5000 milli(s) before trying again... | org.apache.activemq.store.jdbc.LeaseDatabaseLocker | main
2012-12-13 03:17:31,843 | INFO  | <slave_broker_name> Lease held by <master_broker_name> till Thu Dec 13 03:17:36 CET 2012 | org.apache.activemq.store.jdbc.LeaseDatabaseLocker | main
2012-12-13 03:17:31,843 | INFO  | <slave_broker_name> failed to acquire lease.  Sleeping for 5000 milli(s) before trying again... | org.apache.activemq.store.jdbc.LeaseDatabaseLocker | main
2012-12-13 03:17:36,846 | INFO  | <slave_broker_name> Lease held by <master_broker_name> till Thu Dec 13 03:17:41 CET 2012 | org.apache.activemq.store.jdbc.LeaseDatabaseLocker | main
2012-12-13 03:17:36,846 | INFO  | <slave_broker_name> failed to acquire lease.  Sleeping for 5000 milli(s) before trying again... | org.apache.activemq.store.jdbc.LeaseDatabaseLocker | main
2012-12-13 03:17:41,996 | DEBUG | <slave_broker_name>, lease keepAlive Query is UPDATE ACTIVEMQ_LOCK SET BROKER_NAME=?, TIME=? WHERE BROKER_NAME=? AND ID = 1 | org.apache.activemq.store.jdbc.LeaseDatabaseLocker | main
2012-12-13 03:17:42,699 | INFO  | <slave_broker_name>, becoming the master on dataSource: org.apache.commons.dbcp.BasicDataSource@5e36d047 | org.apache.activemq.store.jdbc.LeaseDatabaseLocker | main
2012-12-13 03:17:42,707 | DEBUG | Cleaning up old messages. | org.apache.activemq.store.jdbc.JDBCPersistenceAdapter | ActiveMQ Cleanup Timer
2012-12-13 03:17:42,707 | DEBUG | Executing SQL: DELETE FROM ACTIVEMQ_MSGS WHERE (PRIORITY=? AND ID <=      ( SELECT min(ACTIVEMQ_ACKS.LAST_ACKED_ID)       FROM ACTIVEMQ_ACKS WHERE ACTIVEMQ_ACKS.CONTAINER=ACTIVEMQ_MSGS.CONTAINER        AND ACTIVEMQ_ACKS.PRIORITY=?)   ) | org.apache.activemq.store.jdbc.adapter.DefaultJDBCAdapter | ActiveMQ Cleanup Timer
2012-12-13 03:17:42,732 | DEBUG | Deleted 0 old message(s) at priority: 0 | org.apache.activemq.store.jdbc.adapter.DefaultJDBCAdapter | ActiveMQ Cleanup Timer
2012-12-13 03:17:42,732 | DEBUG | Cleanup done. | org.apache.activemq.store.jdbc.JDBCPersistenceAdapter | ActiveMQ Cleanup Timer
2012-12-13 03:17:42,999 | INFO  | Apache ActiveMQ 5.7.0 (<slave_broker_name>, ID:<slave_broker_name>-38421-1355365062728-0:1) is starting | org.apache.activemq.broker.BrokerService | main
2012-12-13 03:17:43,084 | DEBUG | Publishing: tcp://<slave_broker_ip>:61616 for broker transport URI: tcp://<slave_broker_ip>:61616?maximumConnections=1000&wireformat.maxFrameSize=104857600 | org.apache.activemq.broker.TransportConnector | main
2012-12-13 03:17:43,086 | INFO  | Listening for connections at: tcp://<slave_broker_ip>:61616?maximumConnections=1000&wireformat.maxFrameSize=104857600 | org.apache.activemq.transport.TransportServerThreadSupport | main
2012-12-13 03:17:43,086 | INFO  | Connector openwire Started | org.apache.activemq.broker.TransportConnector | main
2012-12-13 03:17:43,087 | INFO  | Apache ActiveMQ 5.7.0 (<slave_broker_name>, ID:<slave_broker_name>-38421-1355365062728-0:1) started | org.apache.activemq.broker.BrokerService | main
2012-12-13 03:17:43,088 | INFO  | For help or more information please see: http://activemq.apache.org | org.apache.activemq.broker.BrokerService | main
2012-12-13 03:17:43,093 | DEBUG | Finished creating instance of bean 'org.apache.activemq.xbean.XBeanBrokerService#0' | org.springframework.beans.factory.support.DefaultListableBeanFactory | main
2012-12-13 03:17:43,093 | DEBUG | Returning cached instance of singleton bean 'oracle-ds' | org.springframework.beans.factory.support.DefaultListableBeanFactory | main
2012-12-13 03:17:43,093 | DEBUG | Creating shared instance of singleton bean 'securityLoginService' | org.springframework.beans.factory.support.DefaultListableBeanFactory | main
2012-12-13 03:17:43,093 | DEBUG | Creating instance of bean 'securityLoginService' | org.springframework.beans.factory.support.DefaultListableBeanFactory | main
2012-12-13 03:17:43,103 | DEBUG | Logging to org.slf4j.impl.Log4jLoggerAdapter(org.eclipse.jetty.util.log) via org.eclipse.jetty.util.log.Slf4jLog | org.eclipse.jetty.util.log | main
2012-12-13 03:17:43,104 | DEBUG | Eagerly caching bean 'securityLoginService' to allow for resolving potential circular references | org.springframework.beans.factory.support.DefaultListableBeanFactory | main
2012-12-13 03:17:43,106 | DEBUG | Finished creating instance of bean 'securityLoginService' | org.springframework.beans.factory.support.DefaultListableBeanFactory | main
2012-12-13 03:17:43,106 | DEBUG | Creating shared instance of singleton bean 'securityConstraint' | org.springframework.beans.factory.support.DefaultListableBeanFactory | main
2012-12-13 03:17:43,106 | DEBUG | Creating instance of bean 'securityConstraint' | org.springframework.beans.factory.support.DefaultListableBeanFactory | main
2012-12-13 03:17:43,106 | DEBUG | Eagerly caching bean 'securityConstraint' to allow for resolving potential circular references | org.springframework.beans.factory.support.DefaultListableBeanFactory | main
2012-12-13 03:17:43,106 | DEBUG | Finished creating instance of bean 'securityConstraint' | org.springframework.beans.factory.support.DefaultListableBeanFactory | main
2012-12-13 03:17:43,107 | DEBUG | Creating shared instance of singleton bean 'securityConstraintMapping' | org.springframework.beans.factory.support.DefaultListableBeanFactory | main
2012-12-13 03:17:43,107 | DEBUG | Creating instance of bean 'securityConstraintMapping' | org.springframework.beans.factory.support.DefaultListableBeanFactory | main
2012-12-13 03:17:43,107 | DEBUG | Eagerly caching bean 'securityConstraintMapping' to allow for resolving potential circular references | org.springframework.beans.factory.support.DefaultListableBeanFactory | main

-------------------------------------------------------


Regards, Mandar Wanpal Email: mandar.wanpal@gmail.com
Reply | Threaded
Open this post in threaded view
|

Re: Lease based locker fails as slave starts when master is still up

gtully
that does seem odd.
Are the system clocks in sync on both machines?

It would be interesting to get db logs to see the exact sequence of
statements with values.
The idea is that a lease update would fail if the slave takes over, but
that does not seem to be the case on your master.
It appears that the slave is checking for expiry about the same time that
the master should be  updating, so it becomes a master, but in this case
the slave should give up.

Consider using the same timeout values on both the master and slave but it
does look like there is a bug here.


On 13 December 2012 07:10, mandar.wanpal <[hidden email]> wrote:

> lockAcquireSleepInterval




--
http://redhat.com
http://blog.garytully.com
Reply | Threaded
Open this post in threaded view
|

Re: Lease based locker fails as slave starts when master is still up

mandar.wanpal
Thanks for response.

I had asked our sys admin team to recheck NTP clock synch settings and get those repaired. So they are ok now.

A question here. we only need both master and slave server's clocks in synch or DB server clock also shall be in synch with both of them. From where does the TIME field in ACTIVEMQ_LOCK table takes value. from db server clock or from master/slave clocks.

I will try and get the AWR report or any logs if available for that time frame from our DBAs and will post the same here.

One observation now. I tried running similar setup on ActiveMQ 5.8.0 SNAPSHOT. I kept it running for more than 24 hrs and both nodes run in master slave properly unless both nodes stopped after some network glich and IO error accessing DB. Is there any way we can compell AMQ process to keep running for at least few secs, if DB connectivity is gone. so that there isnt any outage.

I am still checking ActiveMQ 5.8.0 SNAPSHOT, to see if the issue with master slave is resolved in it, and if it is, do you have any idea by when we will have release version of ActiveMQ 5.8.0 available. I dont think putting SNAPSHOT in production is a good idea.

or is there any way this AMQ 5.7 can be patched to resolve master slave issue with code from AMQ 5.8
Regards, Mandar Wanpal Email: mandar.wanpal@gmail.com