[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

jbertram-2
GitHub user TomasHofman opened a pull request:

    https://github.com/apache/activemq-artemis/pull/2287

    ARTEMIS-2069 Backup doesn't activate after shared store is reconnected

    https://issues.apache.org/jira/browse/ARTEMIS-2069
    https://issues.jboss.org/browse/WFLY-10968
    https://issues.jboss.org/browse/JBEAP-15343
   
    Fix tries to prevent a server activation thread from terminating when FileLockNodeManager.tryLock() throws an IOException, e.g. because temporarily inaccessible NFS directory.
    The node manager will repeat tryLock() call every two seconds.
    WARN message with stack trace will be printed on first failure, DEBUG messages will be printed on recurring failures.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/TomasHofman/activemq-artemis JBEAP-14032-master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/activemq-artemis/pull/2287.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2287
   
----
commit 162e71816e54137534c0bc8b2c3d6c85f941917d
Author: Tomas Hofman <thofman@...>
Date:   2018-09-03T13:47:03Z

    ARTEMIS-2069 Backup doesn't activate after shared store is reconnected

----


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis issue #2287: ARTEMIS-2069 Backup doesn't activate after sha...

jbertram-2
Github user mnovak1 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2287
 
    @TomasHofman Could you update the state of this PR, please?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

jbertram-2
In reply to this post by jbertram-2
Github user franz1981 commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r231511486
 
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException {
     
        protected FileLock lock(final long lockPosition) throws Exception {
           long start = System.currentTimeMillis();
    +      boolean isRecurringFailure = false;
     
           while (!interrupted) {
    -         FileLock lock = tryLock(lockPosition);
    -
    -         if (lock == null) {
    -            try {
    -               Thread.sleep(500);
    -            } catch (InterruptedException e) {
    -               return null;
    -            }
    -
    -            if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    -               throw new Exception("timed out waiting for lock");
    +         try {
    +            FileLock lock = tryLock(lockPosition);
    +            isRecurringFailure = false;
    +
    +            if (lock == null) {
    +               logger.debug("lock is null");
    +               try {
    +                  Thread.sleep(500);
    +               } catch (InterruptedException e) {
    +                  return null;
    +               }
    +
    +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +            } else {
    +               return lock;
                 }
    -         } else {
    -            return lock;
    +         } catch (IOException e) {
    +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
    +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
    +                    "Failure when accessing a lock file", e);
    +            isRecurringFailure = true;
    --- End diff --
   
    This would continue to attempt to lock regardless the `lockAcquisitionTimeout` (if any): i suppose it would be more robust if you will check the timeout anyway. There are tests that are using that timeout and I suppose it should be honored


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

jbertram-2
In reply to this post by jbertram-2
Github user franz1981 commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r231513243
 
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException {
     
        protected FileLock lock(final long lockPosition) throws Exception {
           long start = System.currentTimeMillis();
    +      boolean isRecurringFailure = false;
     
           while (!interrupted) {
    -         FileLock lock = tryLock(lockPosition);
    -
    -         if (lock == null) {
    -            try {
    -               Thread.sleep(500);
    -            } catch (InterruptedException e) {
    -               return null;
    -            }
    -
    -            if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    -               throw new Exception("timed out waiting for lock");
    +         try {
    +            FileLock lock = tryLock(lockPosition);
    +            isRecurringFailure = false;
    +
    +            if (lock == null) {
    +               logger.debug("lock is null");
    +               try {
    +                  Thread.sleep(500);
    +               } catch (InterruptedException e) {
    +                  return null;
    +               }
    +
    +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +            } else {
    +               return lock;
                 }
    -         } else {
    -            return lock;
    +         } catch (IOException e) {
    +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
    +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
    +                    "Failure when accessing a lock file", e);
    +            isRecurringFailure = true;
    +            Thread.sleep(LOCK_ACCESS_FAILURE_WAIT_TIME);
              }
           }
     
           // todo this is here because sometimes channel.lock throws a resource deadlock exception but trylock works,
           // need to investigate further and review
    -      FileLock lock;
    +      FileLock lock = null;
    --- End diff --
   
    Same thing as the comment above.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

jbertram-2
In reply to this post by jbertram-2
Github user TomasHofman commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r231822854
 
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException {
     
        protected FileLock lock(final long lockPosition) throws Exception {
           long start = System.currentTimeMillis();
    +      boolean isRecurringFailure = false;
     
           while (!interrupted) {
    -         FileLock lock = tryLock(lockPosition);
    -
    -         if (lock == null) {
    -            try {
    -               Thread.sleep(500);
    -            } catch (InterruptedException e) {
    -               return null;
    -            }
    -
    -            if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    -               throw new Exception("timed out waiting for lock");
    +         try {
    +            FileLock lock = tryLock(lockPosition);
    +            isRecurringFailure = false;
    +
    +            if (lock == null) {
    +               logger.debug("lock is null");
    +               try {
    +                  Thread.sleep(500);
    +               } catch (InterruptedException e) {
    +                  return null;
    +               }
    +
    +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +            } else {
    +               return lock;
                 }
    -         } else {
    -            return lock;
    +         } catch (IOException e) {
    +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
    +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
    +                    "Failure when accessing a lock file", e);
    +            isRecurringFailure = true;
    +            Thread.sleep(LOCK_ACCESS_FAILURE_WAIT_TIME);
              }
           }
     
           // todo this is here because sometimes channel.lock throws a resource deadlock exception but trylock works,
           // need to investigate further and review
    -      FileLock lock;
    +      FileLock lock = null;
    --- End diff --
   
    Now when I look at this again, this whole second loop in the _original code_ doesn't make sense - the only way execution could get here is when the ```interrupted``` flag was set to true, in which case we should exit immediately. The comment mentions "deadlock exception", but any exception in the first loop would terminate the method.
   
    I'm gonna remove this second loop altogether.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

jbertram-2
In reply to this post by jbertram-2
Github user TomasHofman commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r231830652
 
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException {
     
        protected FileLock lock(final long lockPosition) throws Exception {
           long start = System.currentTimeMillis();
    +      boolean isRecurringFailure = false;
     
           while (!interrupted) {
    -         FileLock lock = tryLock(lockPosition);
    -
    -         if (lock == null) {
    -            try {
    -               Thread.sleep(500);
    -            } catch (InterruptedException e) {
    -               return null;
    -            }
    -
    -            if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    -               throw new Exception("timed out waiting for lock");
    +         try {
    +            FileLock lock = tryLock(lockPosition);
    +            isRecurringFailure = false;
    +
    +            if (lock == null) {
    +               logger.debug("lock is null");
    +               try {
    +                  Thread.sleep(500);
    +               } catch (InterruptedException e) {
    +                  return null;
    +               }
    +
    +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +            } else {
    +               return lock;
                 }
    -         } else {
    -            return lock;
    +         } catch (IOException e) {
    +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
    +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
    +                    "Failure when accessing a lock file", e);
    +            isRecurringFailure = true;
    --- End diff --
   
    Modified to exit if timeout already reached, and don't sleep longer then remaining time to timeout.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

jbertram-2
In reply to this post by jbertram-2
Github user franz1981 commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r231870842
 
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -299,44 +303,52 @@ protected FileLock tryLock(final long lockPos) throws IOException {
     
        protected FileLock lock(final long lockPosition) throws Exception {
           long start = System.currentTimeMillis();
    +      boolean isRecurringFailure = false;
     
           while (!interrupted) {
    -         FileLock lock = tryLock(lockPosition);
    -
    -         if (lock == null) {
    -            try {
    -               Thread.sleep(500);
    -            } catch (InterruptedException e) {
    -               return null;
    +         try {
    +            FileLock lock = tryLock(lockPosition);
    +            isRecurringFailure = false;
    +
    +            if (lock == null) {
    +               try {
    +                  Thread.sleep(500);
    +               } catch (InterruptedException e) {
    +                  return null;
    +               }
    +
    +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +            } else {
    +               return lock;
                 }
    -
    -            if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    -               throw new Exception("timed out waiting for lock");
    +         } catch (IOException e) {
    +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
    +
    +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
    +                    "Failure when accessing a lock file", e);
    +            isRecurringFailure = true;
    +
    +            long waitTime = LOCK_ACCESS_FAILURE_WAIT_TIME;
    +            if (lockAcquisitionTimeout != -1) {
    +               final long remainingTime = lockAcquisitionTimeout - (System.currentTimeMillis() - start);
    +               if (remainingTime <= 0) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +               waitTime = Collections.min(Arrays.asList(waitTime, remainingTime));
    --- End diff --
   
    `Math.min` will do the same as `Collections.min(Arrays.asList(waitTime, remainingTime));`


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

jbertram-2
In reply to this post by jbertram-2
Github user TomasHofman commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r231878791
 
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -299,44 +303,52 @@ protected FileLock tryLock(final long lockPos) throws IOException {
     
        protected FileLock lock(final long lockPosition) throws Exception {
           long start = System.currentTimeMillis();
    +      boolean isRecurringFailure = false;
     
           while (!interrupted) {
    -         FileLock lock = tryLock(lockPosition);
    -
    -         if (lock == null) {
    -            try {
    -               Thread.sleep(500);
    -            } catch (InterruptedException e) {
    -               return null;
    +         try {
    +            FileLock lock = tryLock(lockPosition);
    +            isRecurringFailure = false;
    +
    +            if (lock == null) {
    +               try {
    +                  Thread.sleep(500);
    +               } catch (InterruptedException e) {
    +                  return null;
    +               }
    +
    +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +            } else {
    +               return lock;
                 }
    -
    -            if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    -               throw new Exception("timed out waiting for lock");
    +         } catch (IOException e) {
    +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
    +
    +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
    +                    "Failure when accessing a lock file", e);
    +            isRecurringFailure = true;
    +
    +            long waitTime = LOCK_ACCESS_FAILURE_WAIT_TIME;
    +            if (lockAcquisitionTimeout != -1) {
    +               final long remainingTime = lockAcquisitionTimeout - (System.currentTimeMillis() - start);
    +               if (remainingTime <= 0) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +               waitTime = Collections.min(Arrays.asList(waitTime, remainingTime));
    --- End diff --
   
    Good point, updated.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis issue #2287: ARTEMIS-2069 Backup doesn't activate after sha...

jbertram-2
In reply to this post by jbertram-2
Github user TomasHofman commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2287
 
    Retest this please.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis issue #2287: ARTEMIS-2069 Backup doesn't activate after sha...

jbertram-2
In reply to this post by jbertram-2
Github user jbertram commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2287
 
    @TomasHofman you'll need to update your branch somehow to force the PR build to run again.  I recommend something like `git commit --amend` & `push -f`.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis issue #2287: ARTEMIS-2069 Backup doesn't activate after sha...

jbertram-2
In reply to this post by jbertram-2
Github user TomasHofman commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2287
 
    No automation? Thanks, will do. :)


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis issue #2287: ARTEMIS-2069 Backup doesn't activate after sha...

jbertram-2
In reply to this post by jbertram-2
Github user jbertram commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2287
 
    The PR build is automated. You just need to update your branch so it knows to test it again.  A couple of git commands aren't really much harder than a GitHub comment.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis issue #2287: ARTEMIS-2069 Backup doesn't activate after sha...

jbertram-2
In reply to this post by jbertram-2
Github user TomasHofman commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2287
 
    Everything looks OK now, so I will wait for merging and porting to 2.6.x, and then will update downstream PR for 2.6.3.jbossorg-x (will cherry-pick -x from 2.6.x port).


---