NFS v4 locks "given up" w/o any logging?

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

NFS v4 locks "given up" w/o any logging?

James A. Robinson
Is it common that an activemq broker might give up its NFS v4 lock w/o
logging any sort of message?  I've got two brokers that logged this:

broker-a which held the log:
2016-03-17 15:01:51,113 [yMonitor Worker] WARN  Transport
   - Transport Connection to: tcp://104.232.16.4:62269 failed:
org.apache.activemq.transport.InactivityIOException: Channel was inactive
for too (>30000) long: tcp://xxx.xxx.xxx.xxx:62269
2016-03-18 00:05:22,751 [KeepAlive Timer] INFO  LockFile
    - Lock file /var/log/activemq/activemq-data/amq-dev-1/lock, locked at
Thu Mar 17 13:38:33 PDT 2016, has been modified at Fri Mar 18 00:02:15 PDT
2016
2016-03-18 00:05:22,758 [KeepAlive Timer] ERROR LockableServiceSupport
    - amq-dev-1, no longer able to keep the exclusive lock so giving up
being a master
2016-03-18 00:05:22,761 [KeepAlive Timer] INFO  BrokerService
   - Apache ActiveMQ 5.13.2 (amq-dev-1, ID:cluster-51079-1458247119790-1:1)
is shutting down

broker-b which appeared to steal the lock:
2016-03-17 13:38:52,680 [JMX connector  ] INFO  ManagementContext
   - JMX consoles can connect to
service:jmx:rmi://localhost:2020/jndi/rmi://localhost:2020/jmxrmi
2016-03-18 00:02:23,593 [erSimpleAppMain] INFO  MessageDatabase
   - KahaDB is version 6
2016-03-18 00:02:23,762 [erSimpleAppMain] INFO  MessageDatabase
   - Recovering from the journal @1:63912
2016-03-18 00:02:24,043 [erSimpleAppMain] INFO  MessageDatabase
   - Recovery replayed 7130 operations from the journal in 0.296 seconds.
2016-03-18 00:02:24,058 [erSimpleAppMain] INFO  PListStoreImpl
    - PListStore:[/var/log/activemq/activemq-data/amq-dev-1/tmp_storage]
started
Reply | Threaded
Open this post in threaded view
|

Re: NFS v4 locks "given up" w/o any logging?

Tim Bain
I'd say it's more likely that either 1) NFS gave away the lock when it
shouldn't have, or 2) network conditions were such that your master lost
connectivity and NFS rightly allowed the slave to take it.  In either case,
useful logging could only come from your NFS server.

Separately from the question of why this happened, I'm concerned that it
took 3 minutes for the master to recognize it had lost the lock (during
which time you'd have had a dual-master situation).  Can that be explained
by your specific NFS settings?

Tim
On Mar 18, 2016 7:04 AM, "James A. Robinson" <[hidden email]> wrote:

> Is it common that an activemq broker might give up its NFS v4 lock w/o
> logging any sort of message?  I've got two brokers that logged this:
>
> broker-a which held the log:
> 2016-03-17 15:01:51,113 [yMonitor Worker] WARN  Transport
>    - Transport Connection to: tcp://104.232.16.4:62269 failed:
> org.apache.activemq.transport.InactivityIOException: Channel was inactive
> for too (>30000) long: tcp://xxx.xxx.xxx.xxx:62269
> 2016-03-18 00:05:22,751 [KeepAlive Timer] INFO  LockFile
>     - Lock file /var/log/activemq/activemq-data/amq-dev-1/lock, locked at
> Thu Mar 17 13:38:33 PDT 2016, has been modified at Fri Mar 18 00:02:15 PDT
> 2016
> 2016-03-18 00:05:22,758 [KeepAlive Timer] ERROR LockableServiceSupport
>     - amq-dev-1, no longer able to keep the exclusive lock so giving up
> being a master
> 2016-03-18 00:05:22,761 [KeepAlive Timer] INFO  BrokerService
>    - Apache ActiveMQ 5.13.2 (amq-dev-1, ID:cluster-51079-1458247119790-1:1)
> is shutting down
>
> broker-b which appeared to steal the lock:
> 2016-03-17 13:38:52,680 [JMX connector  ] INFO  ManagementContext
>    - JMX consoles can connect to
> service:jmx:rmi://localhost:2020/jndi/rmi://localhost:2020/jmxrmi
> 2016-03-18 00:02:23,593 [erSimpleAppMain] INFO  MessageDatabase
>    - KahaDB is version 6
> 2016-03-18 00:02:23,762 [erSimpleAppMain] INFO  MessageDatabase
>    - Recovering from the journal @1:63912
> 2016-03-18 00:02:24,043 [erSimpleAppMain] INFO  MessageDatabase
>    - Recovery replayed 7130 operations from the journal in 0.296 seconds.
> 2016-03-18 00:02:24,058 [erSimpleAppMain] INFO  PListStoreImpl
>     - PListStore:[/var/log/activemq/activemq-data/amq-dev-1/tmp_storage]
> started
>
Reply | Threaded
Open this post in threaded view
|

Re: NFS v4 locks "given up" w/o any logging?

James A. Robinson-2
Yes, the combination of settings in place right now could add up to 3
minutes:

On the client side:

nfsvers=4,proto=tcp,
hard,timeo=600,retrans=2,ac,acregmin=3,acregmax=60,acdirmin=30,fg,retry=120,
sharecache,lookupcache=all,cto

So right now it's got a 60 second timeo value, and it will retry up to 2
times.  I'll see if I can find any OS level messages about the NFS server
lock, or if the NFS server reported anything.

On the server side:

read delegation, 30 second lease for locks, grace period 45 seconds


On Fri, Mar 18, 2016 at 6:30 AM, Tim Bain <[hidden email]> wrote:

> I'd say it's more likely that either 1) NFS gave away the lock when it
> shouldn't have, or 2) network conditions were such that your master lost
> connectivity and NFS rightly allowed the slave to take it.  In either case,
> useful logging could only come from your NFS server.
>
> Separately from the question of why this happened, I'm concerned that it
> took 3 minutes for the master to recognize it had lost the lock (during
> which time you'd have had a dual-master situation).  Can that be explained
> by your specific NFS settings?
>
> Tim
Reply | Threaded
Open this post in threaded view
|

Re: NFS v4 locks "given up" w/o any logging?

James A. Robinson
Yes indeed there was a problem w/ the underlying NFS connection, logged at
the OS level.  It's funny, this service wasn't even under load when the
timeout happened, so NFS is living up to my expectations already.

So I could either lower the client side timeouts to fit within the 30
second lease, or I could raise the server side lease time to match the 180
seconds the client will try for.


On Fri, Mar 18, 2016 at 8:26 AM, James A. Robinson <[hidden email]>
wrote:

> Yes, the combination of settings in place right now could add up to 3
> minutes:
>
> On the client side:
>
> nfsvers=4,proto=tcp,
> hard,timeo=600,retrans=2,ac,acregmin=3,acregmax=60,acdirmin=30,fg,retry=120,
> sharecache,lookupcache=all,cto
>
> So right now it's got a 60 second timeo value, and it will retry up to 2
> times.  I'll see if I can find any OS level messages about the NFS server
> lock, or if the NFS server reported anything.
>
> On the server side:
>
> read delegation, 30 second lease for locks, grace period 45 seconds
>
>
> On Fri, Mar 18, 2016 at 6:30 AM, Tim Bain <[hidden email]> wrote:
>
>> I'd say it's more likely that either 1) NFS gave away the lock when it
>> shouldn't have, or 2) network conditions were such that your master lost
>> connectivity and NFS rightly allowed the slave to take it.  In either
>> case,
>> useful logging could only come from your NFS server.
>>
>> Separately from the question of why this happened, I'm concerned that it
>> took 3 minutes for the master to recognize it had lost the lock (during
>> which time you'd have had a dual-master situation).  Can that be explained
>> by your specific NFS settings?
>>
>> Tim
>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: NFS v4 locks "given up" w/o any logging?

James A. Robinson
Lowering timeo on the client side doesn't appear to do jack to help with
this
situation.  I'm reluctant to switch from "hard" to "soft" because of the
warning
that it can easily lead to corruption issues, but I do see some discussions
where people say flipping to that mode helped them detect lost locks.

Anyone who has an HA configuration using NFS that they know works for
failover care to share exactly what mount settings they are using?


On Fri, Mar 18, 2016 at 8:51 AM, James A. Robinson <[hidden email]>
wrote:

> Yes indeed there was a problem w/ the underlying NFS connection, logged at
> the OS level.  It's funny, this service wasn't even under load when the
> timeout happened, so NFS is living up to my expectations already.
>
> So I could either lower the client side timeouts to fit within the 30
> second lease, or I could raise the server side lease time to match the 180
> seconds the client will try for.
>
Reply | Threaded
Open this post in threaded view
|

Re: NFS v4 locks "given up" w/o any logging?

James A. Robinson
Running a packet trace while the problem was occurring let me see that the
problem is that the NFS layer isn't even being involved for 3 minutes due
to the TCP layer retrying for 3 minute before notifying the caller of the
error.

The client sends SYN packets, doesn't get an ACK, and finally times out
after 3 minutes.  The documentation for Linux (the client in this case)
indicates the following default:

tcp_syn_retries (integer; default: 5; since Linux 2.2)
              The maximum number of times initial SYNs for an active TCP
              connection attempt will be retransmitted.  This value should
              not be higher than 255.  The default value is 5, which
              corresponds to approximately 180 seconds.

I think this means that it's not possible to get faster failover w/o
modifying the default values for the TCP layer.  This particular error is
always going to take at least 3 minutes to surface.

Jim
Reply | Threaded
Open this post in threaded view
|

Re: NFS v4 locks "given up" w/o any logging?

James A. Robinson
I suppose I should pose the question to folks that use NFS, do you use TCP
or UDP for your NFS stack?

On Mon, Mar 21, 2016 at 3:29 PM James A. Robinson <[hidden email]>
wrote:

> Running a packet trace while the problem was occurring let me see that the
> problem is that the NFS layer isn't even being involved for 3 minutes due
> to the TCP layer retrying for 3 minute before notifying the caller of the
> error.
>
> The client sends SYN packets, doesn't get an ACK, and finally times out
> after 3 minutes.  The documentation for Linux (the client in this case)
> indicates the following default:
>
> tcp_syn_retries (integer; default: 5; since Linux 2.2)
>               The maximum number of times initial SYNs for an active TCP
>               connection attempt will be retransmitted.  This value should
>               not be higher than 255.  The default value is 5, which
>               corresponds to approximately 180 seconds.
>
> I think this means that it's not possible to get faster failover w/o
> modifying the default values for the TCP layer.  This particular error is
> always going to take at least 3 minutes to surface.
>
> Jim
>
>
Reply | Threaded
Open this post in threaded view
|

Re: NFS v4 locks "given up" w/o any logging?

James A. Robinson
To answer my own question, no (at least not NFS 4):

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Storage_Administration_Guide/ch-nfs.html


On Mon, Mar 21, 2016 at 3:46 PM James A. Robinson <[hidden email]>
wrote:

> I suppose I should pose the question to folks that use NFS, do you use TCP
> or UDP for your NFS stack?
>
>
> On Mon, Mar 21, 2016 at 3:29 PM James A. Robinson <[hidden email]>
> wrote:
>
>> Running a packet trace while the problem was occurring let me see that
>> the problem is that the NFS layer isn't even being involved for 3 minutes
>> due to the TCP layer retrying for 3 minute before notifying the caller of
>> the error.
>>
>> The client sends SYN packets, doesn't get an ACK, and finally times out
>> after 3 minutes.  The documentation for Linux (the client in this case)
>> indicates the following default:
>>
>> tcp_syn_retries (integer; default: 5; since Linux 2.2)
>>               The maximum number of times initial SYNs for an active TCP
>>               connection attempt will be retransmitted.  This value should
>>               not be higher than 255.  The default value is 5, which
>>               corresponds to approximately 180 seconds.
>>
>> I think this means that it's not possible to get faster failover w/o
>> modifying the default values for the TCP layer.  This particular error is
>> always going to take at least 3 minutes to surface.
>>
>> Jim
>>
>>
Reply | Threaded
Open this post in threaded view
|

Re: NFS v4 locks "given up" w/o any logging?

paulgale
In reply to this post by James A. Robinson
We're using the following with our NetApp device:

-rw,soft,intr,noexec,nosuid,nodev,users,rsize=65535,wsize=65535,proto=tcp

​Whilst 'soft' is generally not recommended I've found that using 'hard'
causes the broker to lock up immediately.
The settings we have in place above were chosen by our storage team.

Thanks,
Paul

On Fri, Mar 18, 2016 at 3:57 PM, James A. Robinson <[hidden email]>
wrote:

> Lowering timeo on the client side doesn't appear to do jack to help with
> this
> situation.  I'm reluctant to switch from "hard" to "soft" because of the
> warning
> that it can easily lead to corruption issues, but I do see some discussions
> where people say flipping to that mode helped them detect lost locks.
>
> Anyone who has an HA configuration using NFS that they know works for
> failover care to share exactly what mount settings they are using?
>
>
> On Fri, Mar 18, 2016 at 8:51 AM, James A. Robinson <[hidden email]
> >
> wrote:
>
> > Yes indeed there was a problem w/ the underlying NFS connection, logged
> at
> > the OS level.  It's funny, this service wasn't even under load when the
> > timeout happened, so NFS is living up to my expectations already.
> >
> > So I could either lower the client side timeouts to fit within the 30
> > second lease, or I could raise the server side lease time to match the
> 180
> > seconds the client will try for.
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: NFS v4 locks "given up" w/o any logging?

James A. Robinson-2
Thank you for the information.  Would you be able to tell me if your NetApp
settings for the lock lease are the standard values?  I'm told that it's 30
seconds,  I was thinking about whether or not I needed to have that raised
much higher to prevent the particular issue I'm seeing (the tcp layer
trying to send SYN packets for 3 minutes before giving up)?

On Tue, Mar 22, 2016 at 5:27 PM Paul Gale <[hidden email]> wrote:

> We're using the following with our NetApp device:
>
> -rw,soft,intr,noexec,nosuid,nodev,users,rsize=65535,wsize=65535,proto=tcp
>
> ​Whilst 'soft' is generally not recommended I've found that using 'hard'
> causes the broker to lock up immediately.
> The settings we have in place above were chosen by our storage team.
>
> Thanks,
> Paul
>
Reply | Threaded
Open this post in threaded view
|

Re: NFS v4 locks "given up" w/o any logging?

James A. Robinson-2
To close the loop on this for anyone that digs it up later...

While the issue of lock handoff is still an open one, and one that I think
may require  a new Locker scheme, the underlying cause of the issue I was
seeing with my activemq cluster was due to a difference in understanding on
how the network was supposed to be configured on the NFS server side.

I had isolated the symptom down to the fact that writes of over 16 MiB
caused the NFS layer to lock up when the NFS server admins corrected the
network routes on the server side, and I'm confident this has resolved the
immediate issues I was seeing.

Jim


On Wed, Mar 23, 2016 at 7:21 AM James A. Robinson <[hidden email]> wrote:

> Thank you for the information.  Would you be able to tell me if your
> NetApp settings for the lock lease are the standard values?  I'm told that
> it's 30 seconds,  I was thinking about whether or not I needed to have that
> raised much higher to prevent the particular issue I'm seeing (the tcp
> layer trying to send SYN packets for 3 minutes before giving up)?
>
> On Tue, Mar 22, 2016 at 5:27 PM Paul Gale <[hidden email]> wrote:
>
>> We're using the following with our NetApp device:
>>
>> -rw,soft,intr,noexec,nosuid,nodev,users,rsize=65535,wsize=65535,proto=tcp
>>
>> ​Whilst 'soft' is generally not recommended I've found that using 'hard'
>> causes the broker to lock up immediately.
>> The settings we have in place above were chosen by our storage team.
>>
>> Thanks,
>> Paul
>>
>