[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

classic Classic list List threaded Threaded
19 messages Options
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

asfgit
GitHub user TomasHofman opened a pull request:

    https://github.com/apache/activemq-artemis/pull/2287

    ARTEMIS-2069 Backup doesn't activate after shared store is reconnected

    https://issues.apache.org/jira/browse/ARTEMIS-2069
    https://issues.jboss.org/browse/WFLY-10968
    https://issues.jboss.org/browse/JBEAP-15343
   
    Fix tries to prevent a server activation thread from terminating when FileLockNodeManager.tryLock() throws an IOException, e.g. because temporarily inaccessible NFS directory.
    The node manager will repeat tryLock() call every two seconds.
    WARN message with stack trace will be printed on first failure, DEBUG messages will be printed on recurring failures.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/TomasHofman/activemq-artemis JBEAP-14032-master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/activemq-artemis/pull/2287.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2287
   
----
commit 162e71816e54137534c0bc8b2c3d6c85f941917d
Author: Tomas Hofman <thofman@...>
Date:   2018-09-03T13:47:03Z

    ARTEMIS-2069 Backup doesn't activate after shared store is reconnected

----


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis issue #2287: ARTEMIS-2069 Backup doesn't activate after sha...

asfgit
Github user mnovak1 commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2287
 
    @TomasHofman Could you update the state of this PR, please?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

asfgit
In reply to this post by asfgit
Github user franz1981 commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r231511486
 
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException {
     
        protected FileLock lock(final long lockPosition) throws Exception {
           long start = System.currentTimeMillis();
    +      boolean isRecurringFailure = false;
     
           while (!interrupted) {
    -         FileLock lock = tryLock(lockPosition);
    -
    -         if (lock == null) {
    -            try {
    -               Thread.sleep(500);
    -            } catch (InterruptedException e) {
    -               return null;
    -            }
    -
    -            if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    -               throw new Exception("timed out waiting for lock");
    +         try {
    +            FileLock lock = tryLock(lockPosition);
    +            isRecurringFailure = false;
    +
    +            if (lock == null) {
    +               logger.debug("lock is null");
    +               try {
    +                  Thread.sleep(500);
    +               } catch (InterruptedException e) {
    +                  return null;
    +               }
    +
    +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +            } else {
    +               return lock;
                 }
    -         } else {
    -            return lock;
    +         } catch (IOException e) {
    +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
    +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
    +                    "Failure when accessing a lock file", e);
    +            isRecurringFailure = true;
    --- End diff --
   
    This would continue to attempt to lock regardless the `lockAcquisitionTimeout` (if any): i suppose it would be more robust if you will check the timeout anyway. There are tests that are using that timeout and I suppose it should be honored


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

asfgit
In reply to this post by asfgit
Github user franz1981 commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r231513243
 
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException {
     
        protected FileLock lock(final long lockPosition) throws Exception {
           long start = System.currentTimeMillis();
    +      boolean isRecurringFailure = false;
     
           while (!interrupted) {
    -         FileLock lock = tryLock(lockPosition);
    -
    -         if (lock == null) {
    -            try {
    -               Thread.sleep(500);
    -            } catch (InterruptedException e) {
    -               return null;
    -            }
    -
    -            if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    -               throw new Exception("timed out waiting for lock");
    +         try {
    +            FileLock lock = tryLock(lockPosition);
    +            isRecurringFailure = false;
    +
    +            if (lock == null) {
    +               logger.debug("lock is null");
    +               try {
    +                  Thread.sleep(500);
    +               } catch (InterruptedException e) {
    +                  return null;
    +               }
    +
    +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +            } else {
    +               return lock;
                 }
    -         } else {
    -            return lock;
    +         } catch (IOException e) {
    +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
    +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
    +                    "Failure when accessing a lock file", e);
    +            isRecurringFailure = true;
    +            Thread.sleep(LOCK_ACCESS_FAILURE_WAIT_TIME);
              }
           }
     
           // todo this is here because sometimes channel.lock throws a resource deadlock exception but trylock works,
           // need to investigate further and review
    -      FileLock lock;
    +      FileLock lock = null;
    --- End diff --
   
    Same thing as the comment above.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

asfgit
In reply to this post by asfgit
Github user TomasHofman commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r231822854
 
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException {
     
        protected FileLock lock(final long lockPosition) throws Exception {
           long start = System.currentTimeMillis();
    +      boolean isRecurringFailure = false;
     
           while (!interrupted) {
    -         FileLock lock = tryLock(lockPosition);
    -
    -         if (lock == null) {
    -            try {
    -               Thread.sleep(500);
    -            } catch (InterruptedException e) {
    -               return null;
    -            }
    -
    -            if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    -               throw new Exception("timed out waiting for lock");
    +         try {
    +            FileLock lock = tryLock(lockPosition);
    +            isRecurringFailure = false;
    +
    +            if (lock == null) {
    +               logger.debug("lock is null");
    +               try {
    +                  Thread.sleep(500);
    +               } catch (InterruptedException e) {
    +                  return null;
    +               }
    +
    +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +            } else {
    +               return lock;
                 }
    -         } else {
    -            return lock;
    +         } catch (IOException e) {
    +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
    +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
    +                    "Failure when accessing a lock file", e);
    +            isRecurringFailure = true;
    +            Thread.sleep(LOCK_ACCESS_FAILURE_WAIT_TIME);
              }
           }
     
           // todo this is here because sometimes channel.lock throws a resource deadlock exception but trylock works,
           // need to investigate further and review
    -      FileLock lock;
    +      FileLock lock = null;
    --- End diff --
   
    Now when I look at this again, this whole second loop in the _original code_ doesn't make sense - the only way execution could get here is when the ```interrupted``` flag was set to true, in which case we should exit immediately. The comment mentions "deadlock exception", but any exception in the first loop would terminate the method.
   
    I'm gonna remove this second loop altogether.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

asfgit
In reply to this post by asfgit
Github user TomasHofman commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r231830652
 
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -299,36 +301,57 @@ protected FileLock tryLock(final long lockPos) throws IOException {
     
        protected FileLock lock(final long lockPosition) throws Exception {
           long start = System.currentTimeMillis();
    +      boolean isRecurringFailure = false;
     
           while (!interrupted) {
    -         FileLock lock = tryLock(lockPosition);
    -
    -         if (lock == null) {
    -            try {
    -               Thread.sleep(500);
    -            } catch (InterruptedException e) {
    -               return null;
    -            }
    -
    -            if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    -               throw new Exception("timed out waiting for lock");
    +         try {
    +            FileLock lock = tryLock(lockPosition);
    +            isRecurringFailure = false;
    +
    +            if (lock == null) {
    +               logger.debug("lock is null");
    +               try {
    +                  Thread.sleep(500);
    +               } catch (InterruptedException e) {
    +                  return null;
    +               }
    +
    +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +            } else {
    +               return lock;
                 }
    -         } else {
    -            return lock;
    +         } catch (IOException e) {
    +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
    +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
    +                    "Failure when accessing a lock file", e);
    +            isRecurringFailure = true;
    --- End diff --
   
    Modified to exit if timeout already reached, and don't sleep longer then remaining time to timeout.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

asfgit
In reply to this post by asfgit
Github user franz1981 commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r231870842
 
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -299,44 +303,52 @@ protected FileLock tryLock(final long lockPos) throws IOException {
     
        protected FileLock lock(final long lockPosition) throws Exception {
           long start = System.currentTimeMillis();
    +      boolean isRecurringFailure = false;
     
           while (!interrupted) {
    -         FileLock lock = tryLock(lockPosition);
    -
    -         if (lock == null) {
    -            try {
    -               Thread.sleep(500);
    -            } catch (InterruptedException e) {
    -               return null;
    +         try {
    +            FileLock lock = tryLock(lockPosition);
    +            isRecurringFailure = false;
    +
    +            if (lock == null) {
    +               try {
    +                  Thread.sleep(500);
    +               } catch (InterruptedException e) {
    +                  return null;
    +               }
    +
    +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +            } else {
    +               return lock;
                 }
    -
    -            if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    -               throw new Exception("timed out waiting for lock");
    +         } catch (IOException e) {
    +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
    +
    +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
    +                    "Failure when accessing a lock file", e);
    +            isRecurringFailure = true;
    +
    +            long waitTime = LOCK_ACCESS_FAILURE_WAIT_TIME;
    +            if (lockAcquisitionTimeout != -1) {
    +               final long remainingTime = lockAcquisitionTimeout - (System.currentTimeMillis() - start);
    +               if (remainingTime <= 0) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +               waitTime = Collections.min(Arrays.asList(waitTime, remainingTime));
    --- End diff --
   
    `Math.min` will do the same as `Collections.min(Arrays.asList(waitTime, remainingTime));`


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

asfgit
In reply to this post by asfgit
Github user TomasHofman commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r231878791
 
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -299,44 +303,52 @@ protected FileLock tryLock(final long lockPos) throws IOException {
     
        protected FileLock lock(final long lockPosition) throws Exception {
           long start = System.currentTimeMillis();
    +      boolean isRecurringFailure = false;
     
           while (!interrupted) {
    -         FileLock lock = tryLock(lockPosition);
    -
    -         if (lock == null) {
    -            try {
    -               Thread.sleep(500);
    -            } catch (InterruptedException e) {
    -               return null;
    +         try {
    +            FileLock lock = tryLock(lockPosition);
    +            isRecurringFailure = false;
    +
    +            if (lock == null) {
    +               try {
    +                  Thread.sleep(500);
    +               } catch (InterruptedException e) {
    +                  return null;
    +               }
    +
    +               if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +            } else {
    +               return lock;
                 }
    -
    -            if (lockAcquisitionTimeout != -1 && (System.currentTimeMillis() - start) > lockAcquisitionTimeout) {
    -               throw new Exception("timed out waiting for lock");
    +         } catch (IOException e) {
    +            // IOException during trylock() may be a temporary issue, e.g. NFS volume not being accessible
    +
    +            logger.log(isRecurringFailure ? Logger.Level.DEBUG : Logger.Level.WARN,
    +                    "Failure when accessing a lock file", e);
    +            isRecurringFailure = true;
    +
    +            long waitTime = LOCK_ACCESS_FAILURE_WAIT_TIME;
    +            if (lockAcquisitionTimeout != -1) {
    +               final long remainingTime = lockAcquisitionTimeout - (System.currentTimeMillis() - start);
    +               if (remainingTime <= 0) {
    +                  throw new Exception("timed out waiting for lock");
    +               }
    +               waitTime = Collections.min(Arrays.asList(waitTime, remainingTime));
    --- End diff --
   
    Good point, updated.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis issue #2287: ARTEMIS-2069 Backup doesn't activate after sha...

asfgit
In reply to this post by asfgit
Github user TomasHofman commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2287
 
    Retest this please.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis issue #2287: ARTEMIS-2069 Backup doesn't activate after sha...

asfgit
In reply to this post by asfgit
Github user jbertram commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2287
 
    @TomasHofman you'll need to update your branch somehow to force the PR build to run again.  I recommend something like `git commit --amend` & `push -f`.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis issue #2287: ARTEMIS-2069 Backup doesn't activate after sha...

asfgit
In reply to this post by asfgit
Github user TomasHofman commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2287
 
    No automation? Thanks, will do. :)


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis issue #2287: ARTEMIS-2069 Backup doesn't activate after sha...

asfgit
In reply to this post by asfgit
Github user jbertram commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2287
 
    The PR build is automated. You just need to update your branch so it knows to test it again.  A couple of git commands aren't really much harder than a GitHub comment.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis issue #2287: ARTEMIS-2069 Backup doesn't activate after sha...

asfgit
In reply to this post by asfgit
Github user TomasHofman commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2287
 
    Everything looks OK now, so I will wait for merging and porting to 2.6.x, and then will update downstream PR for 2.6.3.jbossorg-x (will cherry-pick -x from 2.6.x port).


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis issue #2287: ARTEMIS-2069 Backup doesn't activate after sha...

asfgit
In reply to this post by asfgit
Github user TomasHofman commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2287
 
    @jbertram @franz1981 we could really use this...


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

asfgit
In reply to this post by asfgit
Github user clebertsuconic commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r240232734
 
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -49,6 +49,8 @@
     
        private static final byte NOT_STARTED = 'N';
     
    +   private static final long LOCK_ACCESS_FAILURE_WAIT_TIME = 2000;
    --- End diff --
   
    You are here effectively ignoring the acquire timeout where the user can configure it. I'm not sure this is correct.
   
    your test should play with a configured timeout and see if you get the expected behaviour. Adding this you are forcing your own timeout bypassing the configured one.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

asfgit
In reply to this post by asfgit
Github user TomasHofman commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r240293147
 
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -49,6 +49,8 @@
     
        private static final byte NOT_STARTED = 'N';
     
    +   private static final long LOCK_ACCESS_FAILURE_WAIT_TIME = 2000;
    --- End diff --
   
    @clebertsuconic no, the `lockAcquisitionTimeout` is not ignored.
   
    If you check the line 337, actual waiting time is either `LOCK_ACCESS_FAILURE_WAIT_TIME` or time remaining until `lockAcquisitionTimeout` runs out, whichever is lower, so acquisition timeout set by user is honored.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis pull request #2287: ARTEMIS-2069 Backup doesn't activate af...

asfgit
In reply to this post by asfgit
Github user TomasHofman commented on a diff in the pull request:

    https://github.com/apache/activemq-artemis/pull/2287#discussion_r240294762
 
    --- Diff: artemis-server/src/main/java/org/apache/activemq/artemis/core/server/impl/FileLockNodeManager.java ---
    @@ -49,6 +49,8 @@
     
        private static final byte NOT_STARTED = 'N';
     
    +   private static final long LOCK_ACCESS_FAILURE_WAIT_TIME = 2000;
    --- End diff --
   
    I added another test to verify that `lockAcquisitionTimeout` is honored.
   
    The only hard waiting time is 500ms on line 313, but that's the original behavior. I can make that honor `lockAcquisitionTimeout` as well if desired.


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis issue #2287: ARTEMIS-2069 Backup doesn't activate after sha...

asfgit
In reply to this post by asfgit
Github user clebertsuconic commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2287
 
    My point is.  -1 is forever.  If you set it forever it will be failing at 2 seconds.  
   
   
    What happens if you just configure the system at 2 seconds instead of -1?


---
Reply | Threaded
Open this post in threaded view
|

[GitHub] activemq-artemis issue #2287: ARTEMIS-2069 Backup doesn't activate after sha...

asfgit
In reply to this post by asfgit
Github user TomasHofman commented on the issue:

    https://github.com/apache/activemq-artemis/pull/2287
 
    No, if `lockAcquisitionTimeout` is set to -1, it will wait 2 seconds, and then retry the lock. I don't see what you are referring to.
   
    There are two situations:
    * `tryLock(pos)` returns null (= lock is already taken), *this hasn't changed*.
    * `tryLock(pos)` throws IOException:
      * If `lockAcquisitionTimeout == -1` it will wait 2 seconds and then retry, forever.
      * If `lockAcquisitionTimeout != -1`, it will wait 2 seconds or remaining time until timeout and retry, or if remaining time is <= 0 exception is thrown.


---