We have an active-active messaging solution across two data centres, with ActiveMQ running master/slave/slave on GFS2. The ActiveMQ instances are connected by a network of brokers between the data centres so that messages follow consumers on failover.
We are using ActiveMQ 5.13.2, and the system has been fast and stable for 12 months (monthly upgrades, so monthly broker restarts) at about 50-100 messages per second.
Yesterday about 5% of the broker enqueues/dequeues slowed down at one data centre. We failed the traffic from this DC over to our other data centre to investigate, but the problem followed the traffic. After failback, the second data centre returned to normal operation and the problem followed the traffic back to the original data centre.
We could not conclusively find the cause, but we did notice some odd remote connections (shown as null) after the failover - see screenshot.
Restarting the broker has returned things to normal, but it's a mystery why the problem followed the traffic and didn't resolve on failover.
There were no signs of an increase in network traffic while the problem was occurring.
I'm posting this in case anyone has ideas about the null remote connections: what could cause them, and whether they are a plausible cause of the slowdown. Any other suggestions for what to check would also be welcome.
There's nothing in the logs to indicate any problems (logging is at INFO level).
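In case it helps anyone reproduce the check, here is a rough sketch of how the connections with a null remote address could be listed over JMX instead of from a screenshot. It is not something we ran during the incident. It assumes JMX is enabled on the broker, that connection MBeans are registered under the 5.8+ naming scheme with a `connectionViewType=remoteAddress` key, and that they expose a `RemoteAddress` attribute. The class name, the pattern constant and the JMX service URL passed on the command line are all illustrative.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Hypothetical helper: lists broker connections whose remote address looks null.
class NullRemoteConnectionCheck {

    // ASSUMPTION: ActiveMQ 5.8+ style ObjectName layout for connection views.
    static final String CONNECTION_PATTERN =
        "org.apache.activemq:type=Broker,brokerName=*,connectionViewType=remoteAddress,*";

    // True when the value is missing, blank, or contains the text "null".
    static boolean isNullRemote(Object remoteAddress) {
        if (remoteAddress == null) {
            return true;
        }
        String s = remoteAddress.toString().trim();
        return s.isEmpty() || s.toLowerCase().contains("null");
    }

    // Queries every MBean matching the pattern and keeps the suspicious ones.
    static List<ObjectName> findNullRemotes(MBeanServerConnection conn, String pattern)
            throws Exception {
        List<ObjectName> suspects = new ArrayList<>();
        Set<ObjectName> names = conn.queryNames(new ObjectName(pattern), null);
        for (ObjectName name : names) {
            // ASSUMPTION: the connection view exposes a "RemoteAddress" attribute.
            Object remote = conn.getAttribute(name, "RemoteAddress");
            if (isNullRemote(remote)) {
                suspects.add(name);
            }
        }
        return suspects;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.err.println("usage: NullRemoteConnectionCheck <jmx-service-url>");
            return;
        }
        // Example URL (assumed): service:jmx:rmi:///jndi/rmi://localhost:1099/jmxrmi
        JMXServiceURL url = new JMXServiceURL(args[0]);
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            for (ObjectName name : findNullRemotes(conn, CONNECTION_PATTERN)) {
                System.out.println(name);
            }
        }
    }
}
```

Running this against each broker before and after a failover would at least show whether the null entries appear only on the side currently carrying the traffic.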