Message Group Limitations (how many simultaneous groups are supported?)


Message Group Limitations (how many simultaneous groups are supported?)

derek_mlist

Hello,

We are investigating using message groups as a technique to ensure "in order" processing of various IDs. In the past (2010), I found posts mentioning that more than 1024 message groups will trigger an overflow in the default implementations: http://scott.cranton.com/2010/09/activemq-message-groups.html

I can't find anything more recent on the topic except "there could be a massive number of individual message groups we use hash buckets rather than the actual JMSXGroupID string." The word "massive" is ambiguous and relative. Say we are using social security numbers as our IDs and have millions of customers. Are 2 million groups going to be supportable by default?

I also understand hash maps pretty well, but I don't understand what is meant by using "hash buckets" rather than the JMSXGroupID string. In a usual hash map implementation a bucket stores a certain number of hashed keys. What does it matter whether the key or a hash of the key is stored, and what is the relation to the bucket?

Thanks in advance.

Derek

Re: Message Group Limitations (how many simultaneous groups are supported?)

paulgale
The default limit is indeed 1024 message groups per destination.

However, you can increase that by adding the following configuration to your
broker's activemq.xml:

<destinationPolicy>
  <policyMap>
    <policyEntries>
      <policyEntry queue=">">
        <messageGroupMapFactory>
           <cachedMessageGroupMapFactory cacheSize="2048"/>
        </messageGroupMapFactory>
      </policyEntry>
    </policyEntries>
  </policyMap>
</destinationPolicy>

In addition to cachedMessageGroupMapFactory (the default), there are also the
messageGroupHashBucketFactory and simpleMessageGroupMapFactory factories.
Just dig through the source code; they're all in the
org.apache.activemq.broker.region.group package.

However, it's not advisable to simply increase the cache size to allow for
millions of entries, not least because there's no real way of knowing the
upper limit of how many you'll need.

Therefore you're better off computing a poor man's hash of the underlying
value, mod 1024, to derive the group id. With such a large many-to-one
mapping, many values will map to the same group id; however, that doesn't
matter. As long as the hash of a given value is the same each time it's
computed, that's all that matters. The groups are only needed to divide the
work up amongst your consumers. Exactly how many values map to each group id
is irrelevant.

Example: an input value of 111-22-1234 might hash to 0xabcd1234. Use that
hash value to calculate the group id:

0xabcd1234 % 1024 = group id

The only thing to watch for is ensuring that the distribution of hash
values to group ids is uniform. Otherwise some consumers will end up
handling much more than their fair share of the workload.
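In Java this scheme might look like the sketch below; the class and method names are purely illustrative (none of this is ActiveMQ API), and the producer would then set the result as the message's JMSXGroupID property.

```java
// Sketch of the "poor man's hash" group-id scheme described above.
// Class, method, and constant names are illustrative, not any ActiveMQ API.
public class GroupIdHasher {
    // Match the broker's default message-group map size.
    static final int NUM_GROUPS = 1024;

    // Map an unbounded key (e.g. an SSN) to one of NUM_GROUPS stable ids.
    static String groupIdFor(String key) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        int bucket = Math.floorMod(key.hashCode(), NUM_GROUPS);
        return "group-" + bucket;
    }

    public static void main(String[] args) {
        // The same input always maps to the same group, which is all that
        // in-order processing requires; the many-to-one mapping is harmless.
        System.out.println(groupIdFor("111-22-1234"));
    }
}
```

Any stable hash works here; `String.hashCode()` is merely convenient, and a better-mixing hash can be swapped in if the distribution across groups turns out to be skewed.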

It's worth noting that if you plan to use message group sequence numbers
(for whatever reason), you'll have to increment the value yourself, as
the broker doesn't manage the sequence id by default.

I hope this makes sense.

Thanks,
Paul

On Mon, Jan 25, 2016 at 4:55 PM, derek_mlist <[hidden email]> wrote:


Re: Message Group Limitations (how many simultaneous groups are supported?)

artnaseef
In reply to this post by derek_mlist
The default Message Group Map implementation was recently changed to use an LRU Cache of message groups.

Here's the issue with message groups - the broker does not know the set of message group IDs ahead of time and must allow for any number of group IDs to be used.  If the total set of possible message group IDs is small, this is not a major concern, but if it is large, then tracking message group owners over time acts like a memory leak (consider the memory needed to maintain a mapping of 1 million message groups).

The cache implementation attempts to address the concern by limiting the map to retain only 1024 group ID mappings (by default; the number appears to be configurable from looking at the code). This means that once more than 1024 group IDs exist at once within the broker, assignments will get lost (and new consumers attached to the dropped assignments as needed, leading to more dropped assignments, and so on).
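The eviction behavior can be sketched with a plain-Java LRU map (this is not ActiveMQ's actual class, just an illustration of why assignments "get lost" past the cap):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative LRU map of group id -> consumer, capped like the broker's cache.
// Not ActiveMQ code; a minimal sketch of the eviction behavior described above.
public class LruGroupMap extends LinkedHashMap<String, String> {
    private final int capacity;

    public LruGroupMap(int capacity) {
        super(16, 0.75f, true); // access-order iteration makes this an LRU
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
        return size() > capacity; // evict the least-recently-used assignment
    }

    public static void main(String[] args) {
        LruGroupMap assignments = new LruGroupMap(2); // tiny cap for the demo
        assignments.put("group-1", "consumerA");
        assignments.put("group-2", "consumerB");
        assignments.put("group-3", "consumerC"); // "group-1" is evicted here
        System.out.println(assignments.containsKey("group-1")); // prints false
    }
}
```

With the broker's real cap of 1024, the same thing happens to whichever group has gone longest without traffic once a 1025th group appears.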

On the other hand, the previous default implementation used a Hash Map so that the hash value of each group ID determined a "bucket" to which the group was assigned; that bucket can then be assigned to any number of groups.  The bucket is assigned to a single consumer.  Like the LRU cache, the number of buckets is limited, thereby eliminating the possibility of a "pseudo-leak".  However, this leads to the issue that assignments may not be fair and a single consumer may be assigned any combination of groups entirely based on the hash of the group IDs.  If selectors are added to the mix, this easily leads to messages assigned to consumers that cannot consume the messages.  Yuck.  Add in the max page size limitation and messages start getting stuck all over the place - double yuck.
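For reference, the hash-bucket implementation can be selected per destination in activemq.xml, in the same way as the cached factory shown earlier. The `bucketCount` attribute here is an assumption based on the factory's configurable properties; check the MessageGroupHashBucketFactory source for the exact names and defaults.

```xml
<destinationPolicy>
  <policyMap>
    <policyEntries>
      <policyEntry queue=">">
        <messageGroupMapFactory>
          <!-- bucketCount is assumed configurable; verify against the source -->
          <messageGroupHashBucketFactory bucketCount="1024"/>
        </messageGroupMapFactory>
      </policyEntry>
    </policyEntries>
  </policyMap>
</destinationPolicy>
```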

The best practice in general is to look for ways to avoid order dependencies (e.g. attaching sequence numbers to messages so that the processor can determine when messages are received out-of-order and then suspend processing until the late messages are received).  Camel's aggregator and/or resequencer processors can help here.

Using a key such as social security number for message groups is going to be challenging simply due to the number of groups involved, and the memory leak concern mentioned above.  If guarantees can be met, such as "no more than 1000 SS numbers will ever have pending messages at a time," then the concerns can be eliminated.  Probably the hash map solution will be the best bet here - at the expense of reduced fairness of mappings (one consumer can easily carry more than its share) and eliminating the feasibility of selectors (although I usually recommend against using selectors with message groups anyway).

Re: Message Group Limitations (how many simultaneous groups are supported?)

paulgale
> Using a key such as social security number for message groups is going to
> be challenging simply due to the number of groups involved

This can be overcome by using the hashing technique I described earlier,
which has worked out nicely for us (YMMV). In practice we've found that,
more often than not, the number of items we wish to group on greatly
exceeds 1024, hence the effectiveness of the hashing method.

> The default Message Group Map implementation was recently changed to use
> an LRU Cache of message groups.

When did this change? Perhaps I'm missing something, but I've looked at the
source for 5.11, 5.12, and 5.13, and they all appear to be using
CachedMessageGroupMapFactory - and its implementation doesn't appear
to have changed either. Just wondering.

Thanks,
Paul

On Mon, Jan 25, 2016 at 5:49 PM, artnaseef <[hidden email]> wrote:


Re: Message Group Limitations (how many simultaneous groups are supported?)

artnaseef
Searching git history, it appears the following commit introduced the change back in 2013:

468e69765145ddad199963260e4774d179ad5555

That first appears in 5.9.0.  So, it was longer ago than I realized ;-).

Cheers!