Improve paging performance when there are lots of subscribers

Improve paging performance when there are lots of subscribers

wei yang
Hi, folks

This is the discussion about "ARTEMIS-2399 Fix performance degradation when there are a lot of subscribers".

First, I apologize that I didn't clarify our thoughts earlier.

As noted in the Environment section, page-max-cache-size is set to 1, meaning at most one page is allowed in the softValueCache. We also tested with the default page-max-cache-size of 5: it takes some time to see the performance degradation, because at the start the cursor positions of the 100 subscribers are close together and all message reads hit the softValueCache. After some time the cursor positions diverge, and once they span more than 5 pages, some pages are read back and forth. This can be seen in the trace log "adding pageCache pageNr=xxx into cursor = test-topic" in PageCursorProviderImpl, where some pages are read many times for the same subscriber. From that point on, performance starts to degrade. So we set page-max-cache-size to 1 here just to make the test faster; it doesn't change the final result.

The softValueCache can be evicted when memory is really low, and also when the map size reaches its capacity (default 5). In most cases the subscribers are tailing reads served by the softValueCache (no need to touch disk), so we want to keep it. But when some subscribers fall behind, they need to read pages that are no longer in the softValueCache. After looking at the code, we found that in most situations one depage round follows at most MAX_SCHEDULED_RUNNERS delivery rounds; that is, at most MAX_DELIVERIES_IN_LOOP * MAX_SCHEDULED_RUNNERS messages are depaged at a time. If you set the QueueImpl logger to debug level, you will see logs like "Queue Memory Size after depage on queue=sub4 is 53478769 with maxSize = 52428800. Depaged 68 messages, pendingDelivery=1002, intermediateMessageReferences= 23162, queueDelivering=0". To depage fewer than 2000 messages, each subscriber has to read a whole page, which is unnecessary and wasteful. In our test, where one page (50MB) contains ~40000 messages, one subscriber may read a page 40000/2000 = 20 times to finish delivering it if the softValueCache has been evicted. This drastically slows down the process and burdens the disk. So we added PageIndexCacheImpl and read one message at a time rather than all the messages of a page. This way, each subscriber reads each page only once to finish delivering it.
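
To make this concrete, here is a minimal sketch of the idea, with illustrative names only (not the exact code in the PR): one file offset per message, captured when the page is first scanned, so a lagging subscriber can read a single message by position instead of deserializing the whole 50MB page.

// Illustrative sketch only: names and layout are assumptions, not the
// actual PageIndexCacheImpl from the PR.
import java.io.IOException;
import java.io.RandomAccessFile;

final class PageIndexCacheSketch {
   private final RandomAccessFile pageFile; // the page file on disk
   private final long[] offsets;            // offsets[i] = start of message i

   PageIndexCacheSketch(RandomAccessFile pageFile, long[] offsets) {
      this.pageFile = pageFile;
      this.offsets = offsets;
   }

   // Read one message by number instead of loading the whole page.
   byte[] readMessage(int messageNumber) throws IOException {
      long start = offsets[messageNumber];
      long end = (messageNumber + 1 < offsets.length)
            ? offsets[messageNumber + 1]
            : pageFile.length();
      byte[] bytes = new byte[(int) (end - start)];
      pageFile.seek(start);
      pageFile.readFully(bytes);
      return bytes;
   }
}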

Having said that, the softValueCache is still used for tailing reads. Once it's evicted, it isn't reloaded, to prevent the issue illustrated above; the pageIndexCache is used instead.

Regarding implementation details, we noted that before delivering a page, a pageCursorInfo is constructed, which requires reading the whole page. We can take this opportunity to construct the pageIndexCache; it's very simple to code. We also thought about building an offset index file, but several concerns came up:
  1. When should the index file be written and synced? Would that have performance implications?
  2. With an index file we could construct the pageCursorInfo from it (no need to read the page as before), but we would need to write the total message count into it first, and it seems a little weird to put that in an index file.
  3. After a hard crash, a recovery mechanism would be needed to repair the page and page index files, e.g. truncating them to the valid size. So how do we know which files need to be sanity checked?
  4. A variant of binary search may be needed (a sketch of such a lookup follows this list); see https://github.com/apache/kafka/blob/70ddd8af71938b4f5f6d1bb3df6243ef13359bcf/core/src/main/scala/kafka/log/AbstractIndex.scala .
  5. Unlike Kafka, where the user fetches lots of messages at once and the broker only needs to look up the start offset in the index file once, Artemis delivers messages one by one, which means we would have to look up the index every time we deliver a message. Although the index file is likely in the OS page cache, there is still a chance of a cache miss.
  6. Compatibility with old files.
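
For item 4, the lookup in question is a "floor" binary search: find the last index entry whose key is <= the target. A hedged sketch (plain array for illustration; Kafka's AbstractIndex does the equivalent over a mmapped file):

// Sketch of the "variant binary search" from item 4: returns the index of
// the last entry whose key is <= target, or -1 if target precedes them all.
final class FloorSearch {
   static int floorEntry(long[] sortedKeys, long target) {
      int lo = 0, hi = sortedKeys.length - 1, ans = -1;
      while (lo <= hi) {
         int mid = (lo + hi) >>> 1;
         if (sortedKeys[mid] <= target) {
            ans = mid;      // candidate floor; look for a later one
            lo = mid + 1;
         } else {
            hi = mid - 1;
         }
      }
      return ans;
   }
}
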
To sum up, Kafka uses a mmapped index file and we use an index cache. Both are designed to find the physical file position from an offset (Kafka) or a message number (Artemis). We prefer the index cache because it's easier to understand and maintain.

We also tested the single-subscriber case with the same setup.
The original:
consumer tps (11000 msg/s) and latency: [image: orig_single_subscriber.png]
producer tps (30000 msg/s) and latency: [image: orig_single_producer.png]
The PR:
consumer tps (14000 msg/s) and latency: [image: pr_single_consumer.png]
producer tps (30000 msg/s) and latency: [image: pr_single_producer.png]
The results are similar, and even a little better, in the single-subscriber case.

We used our internal test platform; I think JMeter could also be used to test this.

Re: Improve paging performance when there are lots of subscribers

michaelpearce
Hi

First of all, I think this is an excellent effort, and it could be a potentially massive positive change.

Before making any change on such a scale, I do think we need to ensure we have sufficient benchmarks across a number of scenarios, not just one use case, and the benchmark tool used needs to be openly available so that others can verify the measurements and check them on their own setups.

Some additional scenarios I would want/need covered are:

PageCache set to 5, and all consumers keeping up, but lagging enough to be reading from the same first page cache; latency and throughput need to be measured for all.
PageCache set to 5, and all consumers but one keeping up (lagging enough to be reading from the same first page cache) while the one falls off the end, causing page-cache swapping; measure latency and throughput of those keeping up in the first page cache, not caring about the one.

Regarding the solution, some alternative approaches to discuss.

In your scenario, if I understand correctly, each subscriber effectively has its own queue (a 1-to-1 mapping), not shared.
You mention Kafka and say multiple consumers don't read serially on the address, and this is true, but per-queue processing through messages (dispatch) is still serial, even with multiple shared consumers on a queue.

What about keeping the existing mechanism but having a queue hold a reference to the page cache that the queue is currently on, kept from GC (i.e. not soft)? That way the page cache isn't swapped around when you have queues (in your case subscribers) swapping page caches back and forth, avoiding the constant re-read issue.
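
Roughly, as a sketch (illustrative names, not Artemis' actual API), the pinning could look like:

// Hedged sketch of the idea above: each queue pins a strong (non-soft)
// reference to the page it is currently consuming, so that page can't be
// collected and re-read while the queue is still on it.
final class QueuePagePin<P> {
   interface PageLoader<P> {
      P load(long pageNr);
   }

   private long pinnedPageNr = -1;
   private P pinnedPage; // strong reference, untouched by soft-value eviction

   P pageFor(long pageNr, PageLoader<P> loader) {
      if (pageNr != pinnedPageNr) {
         pinnedPage = loader.load(pageNr); // swap the pin as the cursor moves
         pinnedPageNr = pageNr;
      }
      return pinnedPage;
   }
}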

Also, I think Franz had an excellent idea: do away with the page cache in its current form entirely, keep the offset with the reference, and rely on OS caching to keep hot blocks/data.

Best
Michael




Re: Improve paging performance when there are lots of subscribers

wei yang
Hi,

We deployed on 20 machines for this stress test and used the Grinder runner/JMeter mainly for aggregating results. We created 560 threads in total, each one randomly picking a queue to consume from, so the consumer count per queue was randomly distributed. This imitates our business scenario, where each queue has its own consumer concurrency and response time.

Regarding tools, you could also use the artemis command-line producer/consumer to run the test on one powerful machine (but first we would need to add some metrics, like tps and latency, to the code).

I agree we need to cover enough scenarios. For your scenario #1, do you mean they all read the same page (maybe the oldest, or one in the middle, but not the latest)? Isn't that similar to the single-subscription case? For your scenario #2, do you mean, e.g., 99 queues read page #10 (use count 99, won't be evicted) while 1 queue reads page #1 (use count 1, which may be evicted, since producers create lots of page caches whose use count is also 1)?

With regard to "having a queue hold a reference to a page cache", I'm not sure I understand it correctly. If each queue holds its page cache and their cursor positions spread over 100 different pages, would 100 queues hold 50MB * 100 = 5GB of cache in memory?



Re: Improve paging performance when there are lots of subscribers

wei yang
In reply to this post by michaelpearce
Sorry, I missed the PageReference part.

The lifecycle of a PageReference is: depage (into intermediateMessageReferences)
-> deliver (into messageReferences) -> wait for ack (in deliveringRefs) ->
removed. Every queue creates its own PageReferences, which means the
offset info is 100 times larger than the shared page index cache.
If we keep 51MB of PageReferences in memory, then, as I said in the PR: "For
multiple subscribers to the same address, just one executor is responsible
for delivering, which means at any given moment only one queue is delivering.
Thus a queue may be stalled for a long time. We get queueMemorySize
messages into memory, and when we deliver them after a long time, we
probably need to query the message and read the page file again." In the end,
one message may be read twice: first we read the page and create the
PageReference; second we re-query the message after its reference is removed.

With the shared page index cache design, each message only needs to be read
from file once.
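
As a rough sketch of the difference (illustrative names only, not the PR's API): a reference can keep just its coordinates and resolve the message lazily through the shared per-page index, so nothing forces a whole-page read per subscriber:

// Illustrative only: a reference that stores coordinates and resolves the
// message bytes on demand through a shared lookup (e.g. the page index
// cache), rather than each queue materializing its own copy up front.
import java.io.IOException;

final class LazyPagedReference {
   interface MessageReader {
      byte[] read(long pageNr, int messageNr) throws IOException;
   }

   private final long pageNr;
   private final int messageNr;

   LazyPagedReference(long pageNr, int messageNr) {
      this.pageNr = pageNr;
      this.messageNr = messageNr;
   }

   byte[] getMessage(MessageReader shared) throws IOException {
      return shared.read(pageNr, messageNr); // one positional read per request
   }
}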


Re: Improve paging performance when there are lots of subscribers

nigro_franz
> which means the offset info is 100 times larger than the shared
> page index cache.


I would check with the JOL plugin for exact numbers.
With it I see that we would have an increase of 4 bytes for each
PagedReferenceImpl, fully decentralized, versus a centralized approach (the
cache). In the economy of a fully loaded broker, if we care about scaling,
we need to understand whether the memory tradeoff is important enough
to choose one of the two approaches.
My point is that paging could be based entirely on the OS page cache,
without the GC getting in the middle, deleting any previous mechanism of page
caching... simplifying the process as it is.
Using a 2-level cache with such a centralized approach can work, but it adds
a level of complexity that IMO could be saved...
What do you think the benefit of the decentralized solution would be,
compared with the one proposed in the PR?
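
For what it's worth, a minimal sketch of that direction (purely illustrative, not Artemis code): keep only a file offset with each reference and serve reads with positional reads, letting the OS page cache keep hot blocks:

// Hedged sketch: no broker-side page cache at all. Each reference carries a
// file offset; reads are positional (pread-style), so hot blocks stay warm
// in the OS page cache with no GC-managed copies on the heap.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

final class OsCachedRead {
   static ByteBuffer readAt(FileChannel pageFile, long offset, int size) throws IOException {
      ByteBuffer buf = ByteBuffer.allocate(size);
      int read = 0;
      while (read < size) {
         int n = pageFile.read(buf, offset + read); // positional read, no shared file pointer
         if (n < 0) {
            throw new IOException("EOF before reading " + size + " bytes");
         }
         read += n;
      }
      buf.flip();
      return buf;
   }
}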



Re: Improve paging performance when there are lots of subscribers

wei yang
"At last for one message we maybe read twice: first we read page and create
pagereference; second we requery message after its reference is removed.  "

I just realized it was wrong. One message maybe read many times. Think of
this: When #1~#2000 msg is delivered, need to depage #2001-#4000 msg,
reading the whole page; When #2001~#4000 msg is deliverd, need to depage
#4001~#6000 msg, reading page again, etc.

One message may be read three times even if we don't depage until all messages
are delivered. For example, take 3 pages p1, p2, p3 and a message m1 near the
top of p2. In our case (max-size-bytes=51MB, a little bigger than the page
size), the first depage round reads the bottom half of p1 and the top part of
p2; the second depage round reads the bottom half of p2 and the top part of p3.
Therefore p2 is read twice, and m1 may be read three times if re-queried.
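
A toy count of this pattern (made-up small numbers, scaled down from the 50MB/51MB setup above, just to show the overlap):

// Toy illustration of the re-read pattern: each depage round pulls a window
// of messages, and every page overlapping that window must be read. With a
// round size of ~one page and the cursor starting mid-page, interior pages
// are read twice, exactly like p2 above. Numbers are illustrative only.
import java.util.TreeMap;

public class DepageReadCount {
   public static void main(String[] args) {
      int msgsPerPage = 10; // messages per page (toy value)
      int roundSize = 10;   // messages depaged per round, ~= one page
      int totalMsgs = 30;   // three pages: p1, p2, p3
      int offset = 5;       // first round starts halfway into p1
      TreeMap<Integer, Integer> readsPerPage = new TreeMap<>();
      for (int start = offset; start < totalMsgs; start += roundSize) {
         int end = Math.min(start + roundSize, totalMsgs) - 1;
         for (int p = start / msgsPerPage; p <= end / msgsPerPage; p++) {
            readsPerPage.merge(p + 1, 1, Integer::sum);
         }
      }
      System.out.println(readsPerPage); // prints {1=1, 2=2, 3=2}: p2 read twice
   }
}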

To be honest, I don't know how to fix the problem above with the
decentralized approach. The point is not how we rely on the OS cache; it's
that we do it the wrong way: we shouldn't read a whole page (50MB) just for
~2000 messages. Also, there is no need to keep 51MB of PagedReferenceImpl in
memory. When 100 queues occupy 5100MB of memory, the message references are
very likely to be evicted.



Re: Improve paging performance when there are lots of subscribers

michael.andre.pearce
I think some of that is down to configuration. You could configure paging to use much smaller page files but hold many more of them; that way the reference sizes are far smaller, and pages dropping in and out cost less. E.g. if you expect 100 pages being read, allow 100 in the cache but make the page sizes smaller, so the overhead is far less.




> > > So
> > > >    how do we know which files need to be sanity checked?
> > > >    4. A variant binary search algorithm maybe needed, see
> > > >
> > >
> >
> https://github.com/apache/kafka/blob/70ddd8af71938b4f5f6d1bb3df6243ef13359bcf/core/src/main/scala/kafka/log/AbstractIndex.scala
> > > >     .
> > > >    5. Unlike kafka from which user fetches lots of messages at once
> and
> > > >    broker just needs to look up start offset from the index file
> once,
> > > artemis
> > > >    delivers message one by one and that means we have to look up the
> > > index
> > > >    every time we deliver a message. Although the index file is
> possibly
> > > in
> > > >    page cache, there are still chances we miss cache.
> > > >    6. Compatibility with old files.
> > > >
> > > > To sum that, kafka uses a mmaped index file and we use a index cache.
> > > Both
> > > > are designed to find physical file position according offset(kafka)
> or
> > > > message number(artemis). And we prefer the index cache bcs it's easy
> to
> > > > understand and maintain.
> > > >
> > > > We also tested the one subscriber case with the same setup.
> > > > The original:
> > > > consumer tps(11000msg/s) and latency:
> > > > [image: orig_single_subscriber.png]
> > > > producer tps(30000msg/s) and latency:
> > > > [image: orig_single_producer.png]
> > > > The pr:
> > > > consumer tps(14000msg/s) and latency:
> > > > [image: pr_single_consumer.png]
> > > > producer tps(30000msg/s) and latency:
> > > > [image: pr_single_producer.png]
> > > > It showed result is similar and event a little better in the case of
> > > > single subscriber.
> > > >
> > > > We used our inner test platform and i think jmeter can also be used
> to
> > > > test again it.
> > > >
> > >
> >
>





Reply | Threaded
Open this post in threaded view
|

Re: Improve paging performance when there are lots of subscribers

wei yang
Hi,
We ran a test against your suggested configuration:
<page-size-bytes>5Mb</page-size-bytes><page-max-cache-size>100</page-max-cache-size><max-size-bytes>10Mb</max-size-bytes>.
The current code: 7000 msg/s sent and 18000 msg/s received.
The PR code: 8200 msg/s sent and 16000 msg/s received.
As you said, the current code's performance improves greatly with a much
smaller page file and many more pages held in cache.

We're not sure what implications a smaller page file would have: producer
performance may drop since file switching becomes more frequent, and the
number of open file handles would increase.

Also, the consumer in our test just echoes and does nothing after receiving
a message, whereas a real-world consumer may be busy with business logic.
That means references and page caches stay in memory longer and are more
likely to be evicted while producers keep sending.

Since we don't know in advance how many subscribers there will be, this is
not a scalable approach: we can't shrink the page file size without limit to
fit the subscriber count. The code should accommodate all kinds of
configurations; configuration is for making trade-offs as needed, not for
working around problems, IMO.
In our company, ~200 queues (60% of them belonging to a handful of
addresses) are deployed on one broker. We can't set them all to e.g. 100
page caches (too much memory), nor set a different size per address pattern
(hard to operate). In our multi-tenant cluster we prefer availability, so to
avoid memory exhaustion we set page-size-bytes to 30MB, page-max-cache-size
to 1 and max-size-bytes to 31MB. It's running well in one of our clusters
now :)
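
For reference, a minimal address-setting sketch with those values (the
match pattern "#" is illustrative, not our actual address filter):

    <address-settings>
       <address-setting match="#">
          <max-size-bytes>31MB</max-size-bytes>
          <page-size-bytes>30MB</page-size-bytes>
          <page-max-cache-size>1</page-max-cache-size>
          <address-full-policy>PAGE</address-full-policy>
       </address-setting>
    </address-settings>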

On Sat, Jun 29, 2019 at 2:35 AM, <[hidden email]> wrote:

> I think some of that is down to configuration. If you configure paging to
> have much smaller page files but hold many more of them, the reference
> sizes will be far smaller and pages will drop in and out less. E.g. if you
> expect 100 pages being read, allow 100 in the cache but make the page
> sizes smaller so the overhead is far less.

Re: Improve paging performance when there are lots of subscribers

michael.andre.pearce
The point, though, is that an extra index cache layer is needed. Its
overhead means the total paged capacity will be more limited, since that
overhead isn't just an extra int per reference. E.g. the current impl in
the PR isn't very memory optimised; could an int array be used, or at worst
an open-addressing primitive int-to-int hashmap?

This is why I really prefer Franz's approach.

Also, whatever we do, we need the new behaviour to be configurable, so that
a use case we haven't thought of won't be impacted. The change should not
be a surprise; it should be something you toggle on.

Re: Improve paging performance when there are lots of subscribers

wei yang
Hi, Michael

Thanks for the advice. For the current PR, we can optimize memory usage by
using two parallel arrays, one recording the message number and the other
the corresponding file offset, roughly as sketched below. For Franz's
approach, we will also work on an early prototype implementation. After
that, we will run some basic tests in different scenarios.
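
A rough sketch of the two-array idea (class and method names here are
hypothetical, not the PR code):

    import java.util.Arrays;

    // Two parallel primitive arrays: entry i maps messageNumbers[i] to the
    // byte offset of that message within the page file. No boxed entries,
    // no per-entry object overhead.
    final class PageOffsetIndex {
        private final int[] messageNumbers; // sorted ascending within the page
        private final int[] fileOffsets;    // fileOffsets[i] pairs with messageNumbers[i]

        PageOffsetIndex(int[] messageNumbers, int[] fileOffsets) {
            this.messageNumbers = messageNumbers;
            this.fileOffsets = fileOffsets;
        }

        // Binary search keeps lookups O(log n) with minimal memory footprint.
        int offsetOf(int messageNumber) {
            int i = Arrays.binarySearch(messageNumbers, messageNumber);
            return i >= 0 ? fileOffsets[i] : -1; // -1 = message not in this page
        }
    }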


Re: Improve paging performance when there are lots of subscribers

wei yang
Hi,

I have finished work on the new implementation (tests and configuration are
not yet done) as suggested by Franz.

I put the fileOffset in the PagePosition and added a new class, PageReader,
which wraps the page and implements the PageCache interface. The PageReader
is used to read the page file when the cache has been evicted. For details,
see
https://github.com/wy96f/activemq-artemis/commit/3f388c2324738f01f53ce806b813220d28d40987
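
Conceptually, the single-message read path looks something like this
minimal sketch (it uses a plain RandomAccessFile and a hypothetical
length-prefixed record layout, not the actual Artemis file and encoding
classes):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Minimal sketch: each position carries the byte offset where its message
    // record starts, so a queue whose page cache was evicted can read back
    // exactly one message instead of re-reading the whole 50MB page.
    final class PageReader implements AutoCloseable {
        private final RandomAccessFile page;

        PageReader(String pageFile) throws IOException {
            this.page = new RandomAccessFile(pageFile, "r");
        }

        // Hypothetical record layout: [int length][payload bytes].
        byte[] readMessageAt(long fileOffset) throws IOException {
            page.seek(fileOffset);
            int length = page.readInt();
            byte[] body = new byte[length];
            page.readFully(body);
            return body;
        }

        @Override
        public void close() throws IOException {
            page.close();
        }
    }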

I ran some tests, with results below:
1. 51MB page size and 1 page cache, with 100 multicast queues:
https://filebin.net/wnyan7d2n1qgfsvg
2. 5MB page size and 100 page caches, with 100 multicast queues:
https://filebin.net/re0989vz7ib1c5mc
3. 51MB page size and 1 page cache, with 1 queue:
https://filebin.net/3qndct7f11qckrus

The results look good, similar to the implementation in the PR. Most
importantly, the index cache data is gone, so there is no worry about its
extra overhead :)

yw yw <[hidden email]> 于2019年7月4日周四 下午5:38写道:

> Hi,  michael
>
> Thanks for the advise. For the current pr, we can use two arrays where one
> records the message number and the other one corresponding offset to
> optimize the memory usage. For the franz's approch, we will also work on
> a early prototyping implementation. After that, we would take some basic
> tests in different scenarios.
>
> <[hidden email]> 于2019年7月2日周二 上午7:08写道:
>
>> Point though is an extra index cache layer is needed. The overhead of
>> that means the total paged capacity will be more limited as that overhead
>> isnt just an extra int per reference. E.g. in the pr the current impl isnt
>> very memory optimised, could an int array be used or at worst an open
>> primitive int int hashmap.
>>
>>
>>
>>
>> This is why i really prefer franz's approach.
>>
>>
>>
>>
>> Also what ever we do, we need the new behaviour configurable, so should a
>> use case not thought about they won't be impacted. E.g. the change should
>> not be a surprise, it should be something you toggle on.
>>
>>
>>
>>
>> Get Outlook for Android
>>
>>
>>
>>
>>
>>
>>
>> On Mon, Jul 1, 2019 at 1:01 PM +0100, "yw yw" <[hidden email]> wrote:
>>
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> Hi,
>> We've took a test against your configuration:
>> 5Mb10010Mb.
>> The current code: 7000msg/s sent and 18000msg/s received.
>> Pr code:16000msg/s received and 8200msg/s sent.
>> Like you said, the performance boosts by using much smaller page file and
>> holding many more for current code.
>>
>> Not sure what implications would have using smaller page file, the
>> producer
>> performance may reduce since switching files is more frequent, number of
>> file handle would increase?
>>
>> While our consumer in the test just echos, nothing to do after receiving
>> message, the consumer in the real world may be busy doing business. This
>> means references and page caches reside in memory longer and may be
>> evicted
>> more easily when producers are sending all the time.
>>
>> Since We don't know how many subscribers there are, it is not a scalable
>> approch. We can't reduce page file size unlimited to fit the number of
>> subscribers. The code should accommodate to all kinds of configurations.
>> We
>> adjust configuration for trade off as needed, not work around IMO.
>> In our company, ~200 queues(60% are owned by some addresses) are deployed
>> in the broker. We can't set all to e.g. 100 page caches(too much memory),
>> and neither set different size according to address pattern(hard for
>> operation). In the multi tenants cluster, we prefer availability and to
>> avoid memory exhausted, we set pageSize to 30MB, max cache size to 1 and
>> max size to 31MB. It's running well in one of our clusters now:)

Re: Improve paging performance when there are lots of subscribers

michael.andre.pearce
Could a squashed PR be sent?

On Fri, Jul 12, 2019 at 2:23 PM +0100, "yw yw" <[hidden email]> wrote:

Hi,

I have finished work on the new implementation (tests and configuration
are still pending), as suggested by Franz.

I put the file offset in the PagePosition and added a new class,
PageReader, which wraps a page and implements the PageCache interface.
The PageReader is used to read the page file when the cache has been
evicted. For details, see
https://github.com/wy96f/activemq-artemis/commit/3f388c2324738f01f53ce806b813220d28d40987
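
To give a feel for the idea, here is a minimal sketch of such a reader.
It is only a sketch under stated assumptions: the class name, method
names and length-prefixed record layout below are illustrative, not the
actual code in the commit; the point it shows is seeking straight to a
message's stored file offset instead of loading the whole page.

import java.io.IOException;
import java.io.RandomAccessFile;

// Sketch only: reads a single paged message at a known file offset,
// rather than materializing the entire page in memory.
class PageReaderSketch implements AutoCloseable {
    private final RandomAccessFile pageFile;

    PageReaderSketch(String pagePath) throws IOException {
        this.pageFile = new RandomAccessFile(pagePath, "r");
    }

    // Assumes length-prefixed records; the real page format differs.
    byte[] readMessage(long fileOffset) throws IOException {
        pageFile.seek(fileOffset);
        int size = pageFile.readInt();
        byte[] encoded = new byte[size];
        pageFile.readFully(encoded);
        return encoded;
    }

    @Override
    public void close() throws IOException {
        pageFile.close();
    }
}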

I ran some tests; results are below:
1. 51MB page size and 1 page cache, with 100 multicast queues.
https://filebin.net/wnyan7d2n1qgfsvg
2. 5MB page size and 100 page caches, with 100 multicast queues.
https://filebin.net/re0989vz7ib1c5mc
3. 51MB page size and 1 page cache, with 1 queue.
https://filebin.net/3qndct7f11qckrus

The results look good, similar to the implementation in the PR. Most
importantly, the index cache data is gone, so there is no extra overhead
to worry about :)


Re: Improve paging performance when there are lots of subscribers

clebertsuconic
I just came back from a well-deserved two-week break and was looking at
this, and I can say it's well done. Nice job! It's a lot simpler!

However, there's one question now, which is probably a further
improvement: shouldn't the PageReader be instantiated at the
PageSubscription?

That means, if there's no page cache because the page has been evicted,
the subscription would create a new Page/PageReader pair and dispose of
it when done (meaning, when it has moved on to a different page).

As you are solving the case with many subscriptions, wouldn't you hit
a corner case where all Pages are instantiated as PageReaders?


I feel like it would be better to eventually duplicate a PageReader
and close it when done.


Or did you already consider that possibility and still think it's best
to keep this cache of PageReaders?
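
A rough sketch of the lifecycle described above, reusing the illustrative
PageReaderSketch from earlier in the thread (class and method names here
are assumptions, not the actual Artemis API): each subscription opens a
reader only while its cursor is on that page and closes it on moving to
the next page.

// Sketch only: per-subscription reader that is disposed when the
// cursor moves to a different page.
class SubscriptionCursorSketch {
    private long currentPageId = -1;
    private PageReaderSketch reader;

    byte[] read(long pageId, long fileOffset, PageDirectory dir)
            throws java.io.IOException {
        if (pageId != currentPageId) {
            if (reader != null) {
                reader.close();              // dispose when moving pages
            }
            reader = new PageReaderSketch(dir.pathFor(pageId));
            currentPageId = pageId;
        }
        return reader.readMessage(fileOffset);
    }

    // Assumed helper that maps a page id to its file path.
    interface PageDirectory {
        String pathFor(long pageId);
    }
}

With this shape, at most one reader exists per subscription, which bounds
open files by the number of queues rather than the number of pages.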



--
Clebert Suconic

Re: Improve paging performance when there are lots of subscribers

wei yang
I did consider the case where all pages are instantiated as PageReaders.
That's really a problem.

The pro of the PR is that every page is read only once to build a
PageReader, which is then shared by all the queues. The con is that many
PageReaders may be instantiated if consumers make slow or no progress in
several queues while progressing fast in others (I think that's the only
cause of the corner case, right?). This means too many open files and
too much memory.

The pro of a duplicated PageReader is that the number of PageReaders at
any moment is bounded by the number of queues. The con is that each
queue has to read the page once to build its own PageReader if the page
cache is evicted. I'm not sure how this will affect performance.

The point is that we need the number of messages in the page, which is
used by PageCursorInfo and PageSubscription::internalGetNext, so we have
to read the page file anyway. How about we cache only the number of
messages in each page instead of the PageReader, and build a PageReader
in each queue, as sketched below? If we hit the corner case, only <long,
int> pairs stay permanently in memory, which I assume is smaller than
the complete PageCursorInfo data. This way we achieve the performance
gain at a small price.
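
A minimal sketch of that shared count cache (the class and method names
are assumptions for illustration, not actual Artemis code): the broker
keeps one pageId -> messageCount map per address, and each queue builds
its own short-lived PageReader as needed.

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch only: <long, int> pairs are tiny compared to keeping whole
// PageReaders or page caches alive for every lagging subscriber.
class PageMessageCountCache {
    private final ConcurrentMap<Long, Integer> counts = new ConcurrentHashMap<>();

    // Recorded the first time any queue scans the page.
    void record(long pageId, int numberOfMessages) {
        counts.put(pageId, numberOfMessages);
    }

    // Returns the cached count, or -1 if the page must be scanned once.
    int numberOfMessages(long pageId) {
        return counts.getOrDefault(pageId, -1);
    }

    // Dropped when the page file itself is deleted.
    void pageDeleted(long pageId) {
        counts.remove(pageId);
    }
}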

> so
> > >> that
> > >> > > > others
> > >> > > > > can verify the measures and check on their setups.
> > >> > > > >
> > >> > > > > Some additional scenarios i would want/need covering are:
> > >> > > > >
> > >> > > > > PageCache set to 5, and all consumers keeping up, but lagging
> > >> enough
> > >> > to
> > >> > > > be
> > >> > > > > reading from the same 1st page cache, latency and throughput
> need
> > >> to
> > >> > be
> > >> > > > > measured for all.
> > >> > > > > PageCache set to 5 and all consumers but one keeping up but
> > >> lagging
> > >> > > > enough
> > >> > > > > to be reading from the same 1st page cahce, but the one is
> falling
> > >> > off
> > >> > > > the
> > >> > > > > end, causing the page cache swapping, measure latecy and
> > >> througput of
> > >> > > > those
> > >> > > > > keeping up in the 1st page cache not caring for the one.
> > >> > > > >
> > >> > > > > Regards to solution some alternative approach to discuss
> > >> > > > >
> > >> > > > > In your scenario if i understand correctly each subscriber is
> > >> > > effectivly
> > >> > > > > having their own queue (1 to 1 mapping) not sharing.
> > >> > > > > You mention kafka and say multiple consumers doent read
> serailly
> > >> on
> > >> > the
> > >> > > > > address and this is true, but per queue processing through
> > >> messages
> > >> > > > > (dispatch) is still serial even with multiple shared
> consumers on
> > >> a
> > >> > > > queue.
> > >> > > > >
> > >> > > > > What about keeping the existing mechanism but having a queue
> hold
> > >> > > > reference
> > >> > > > > to a page cache that the queue is currently on, being kept
> from gc
> > >> > > (e.g.
> > >> > > > > not soft) therefore meaning page cache isnt being swapped
> around,
> > >> > when
> > >> > > > you
> > >> > > > > have queues (in your case subscribers) swapping pagecaches
> back
> > >> and
> > >> > > forth
> > >> > > > > avoidning the constant re-read issue.
> > >> > > > >
> > >> > > > > Also i think Franz had an excellent idea, do away with
> pagecache
> > >> in
> > >> > its
> > >> > > > > current form entirely, ensure the offset is kept with the
> > >> reference
> > >> > and
> > >> > > > > rely on OS caching keeping hot blocks/data.
> > >> > > > >
> > >> > > > > Best
> > >> > > > > Michael
> > >> > > > >
> > >> > > > >
> > >> > > > >
> > >> > > > > On Thu, 27 Jun 2019 at 05:13, yw yw  wrote:
> > >> > > > >
> > >> > > > > > Hi, folks
> > >> > > > > >
> > >> > > > > > This is the discussion about "ARTEMIS-2399 Fix performance
> > >> > > degradation
> > >> > > > > > when there are a lot of subscribers".
> > >> > > > > >
> > >> > > > > > First apologize i didn't clarify our thoughts.
> > >> > > > > >
> > >> > > > > > As noted in the part of Environment, page-max-cache-size is
> set
> > >> to
> > >> > 1
> > >> > > > > > meaning at most one page is allowed in softValueCache. We
> have
> > >> > tested
> > >> > > > > with
> > >> > > > > > the default page-max-cache-size which is 5, it would take
> some
> > >> time
> > >> > > to
> > >> > > > > > see the performance degradation since at start the cursor
> > >> positions
> > >> > > of
> > >> > > > > 100
> > >> > > > > > subscribers are similar when all the messages read hits the
> > >> > > > > softValueCache.
> > >> > > > > > But after some time, the cursor positions are different.
> When
> > >> these
> > >> > > > > > positions are located more than 5 pages, it means some page
> > >> would
> > >> > be
> > >> > > > read
> > >> > > > > > back and forth. This can be proved by the trace log "adding
> > >> > pageCache
> > >> > > > > > pageNr=xxx into cursor = test-topic" in
> PageCursorProviderImpl
> > >> > where
> > >> > > > some
> > >> > > > > > pages are read a lot of times for the same subscriber. From
> the
> > >> > time
> > >> > > > on,
> > >> > > > > > the performance starts to degrade. So we set
> page-max-cache-size
> > >> > to 1
> > >> > > > > > here just to make the test process more fast and it doesn't
> > >> change
> > >> > > the
> > >> > > > > > final result.
> > >> > > > > >
> > >> > > > > > The softValueCache would be removed if memory is really
> low, in
> > >> > > > addition
> > >> > > > > > the map size reaches capacity(default 5). In most cases, the
> > >> > > > subscribers
> > >> > > > > > are tailing read which are served by softValueCache(no need
> to
> > >> > bother
> > >> > > > > > disk), thus we need to keep it. But When some subscribers
> fall
> > >> > > behind,
> > >> > > > > they
> > >> > > > > > need to read page not in softValueCache. After looking up
> code,
> > >> we
> > >> > > > found
> > >> > > > > one
> > >> > > > > > depage round is following at most MAX_SCHEDULED_RUNNERS
> deliver
> > >> > round
> > >> > > > in
> > >> > > > > > most situations, and that's to say at most
> > >> MAX_DELIVERIES_IN_LOOP *
> > >> > > > > > MAX_SCHEDULED_RUNNERS number of messages would be depaged
> next.
> > >> If
> > >> > > you
> > >> > > > > > adjust QueueImpl logger to debug level, you would see logs
> like
> > >> > > "Queue
> > >> > > > > > Memory Size after depage on queue=sub4 is 53478769 with
> maxSize
> > >> =
> > >> > > > > 52428800.
> > >> > > > > > Depaged 68 messages, pendingDelivery=1002,
> > >> > > > intermediateMessageReferences=
> > >> > > > > > 23162, queueDelivering=0". In order to depage less than 2000
> > >> > > messages,
> > >> > > > > > each subscriber has to read a whole page which is
> unnecessary
> > >> and
> > >> > > > > wasteful.
> > >> > > > > > In our test where one page(50MB) contains ~40000 messages,
> one
> > >> > > > subscriber
> > >> > > > > > maybe read 40000/2000=20 times of page if softValueCache is
> > >> evicted
> > >> > > to
> > >> > > > > > finish delivering it. This has drastically slowed down the
> > >> process
> > >> > > and
> > >> > > > > > burdened on the disk. So we add the PageIndexCacheImpl and
> read
> > >> one
> > >> > > > > message
> > >> > > > > > each time rather than read all messages of page. In this
> way,
> > >> for
> > >> > > each
> > >> > > > > > subscriber each page is read only once after finishing
> > >> delivering.
> > >> > > > > >
> > >> > > > > > Having said that, the softValueCache is used for tailing
> read.
> > >> If
> > >> > > it's
> > >> > > > > > evicted, it won't be reloaded to prevent from the issue
> > >> illustrated
> > >> > > > > above.
> > >> > > > > > Instead the pageIndexCache would be used.
> > >> > > > > >
> > >> > > > > > Regarding implementation details, we noted that before
> > >> delivering
> > >> > > > page, a
> > >> > > > > > pageCursorInfo is constructed which needs to read the whole
> > >> page.
> > >> > We
> > >> > > > can
> > >> > > > > > take this opportunity to construct the pageIndexCache. It's
> very
> > >> > > simple
> > >> > > > > to
> > >> > > > > > code. We also think of building a offset index file and some
> > >> > concerns
> > >> > > > > > stemed from following:
> > >> > > > > >
> > >> > > > > >    1. When to write and sync index file? Would it have some
> > >> > > performance
> > >> > > > > >    implications?
> > >> > > > > >    2. If we have a index file, we can construct
> pageCursorInfo
> > >> > > through
> > >> > > > > >    it(no need to read the page like before), but we need to
> > >> write
> > >> > the
> > >> > > > > total
> > >> > > > > >    message number into it first. Seems a little weird
> putting
> > >> this
> > >> > > into
> > >> > > > > the
> > >> > > > > >    index file.
> > >> > > > > >    3. If experiencing hard crash, a recover mechanism would
> be
> > >> > needed
> > >> > > > to
> > >> > > > > >    recover page and page index files, E.g. truncating to the
> > >> valid
> > >> > > > size.
> > >> > > > > So
> > >> > > > > >    how do we know which files need to be sanity checked?
> > >> > > > > >    4. A variant binary search algorithm maybe needed, see
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> https://github.com/apache/kafka/blob/70ddd8af71938b4f5f6d1bb3df6243ef13359bcf/core/src/main/scala/kafka/log/AbstractIndex.scala
> > >> > > > > >     .
> > >> > > > > >    5. Unlike kafka from which user fetches lots of messages
> at
> > >> once
> > >> > > and
> > >> > > > > >    broker just needs to look up start offset from the index
> file
> > >> > > once,
> > >> > > > > artemis
> > >> > > > > >    delivers message one by one and that means we have to
> look up
> > >> > the
> > >> > > > > index
> > >> > > > > >    every time we deliver a message. Although the index file
> is
> > >> > > possibly
> > >> > > > > in
> > >> > > > > >    page cache, there are still chances we miss cache.
> > >> > > > > >    6. Compatibility with old files.
> > >> > > > > >
> > >> > > > > > To sum that, kafka uses a mmaped index file and we use a
> index
> > >> > cache.
> > >> > > > > Both
> > >> > > > > > are designed to find physical file position according
> > >> offset(kafka)
> > >> > > or
> > >> > > > > > message number(artemis). And we prefer the index cache bcs
> it's
> > >> > easy
> > >> > > to
> > >> > > > > > understand and maintain.
> > >> > > > > >
> > >> > > > > > We also tested the one subscriber case with the same setup.
> > >> > > > > > The original:
> > >> > > > > > consumer tps(11000msg/s) and latency:
> > >> > > > > > [image: orig_single_subscriber.png]
> > >> > > > > > producer tps(30000msg/s) and latency:
> > >> > > > > > [image: orig_single_producer.png]
> > >> > > > > > The pr:
> > >> > > > > > consumer tps(14000msg/s) and latency:
> > >> > > > > > [image: pr_single_consumer.png]
> > >> > > > > > producer tps(30000msg/s) and latency:
> > >> > > > > > [image: pr_single_producer.png]
> > >> > > > > > It showed result is similar and event a little better in the
> > >> case
> > >> > of
> > >> > > > > > single subscriber.
> > >> > > > > >
> > >> > > > > > We used our inner test platform and i think jmeter can also
> be
> > >> used
> > >> > > to
> > >> > > > > > test again it.
> > >> > > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >>
> > >>
> > >>
> > >>
> > >>
> > >>
> >
> >
> >
> >
> >
>
>
> --
> Clebert Suconic
>
Reply | Threaded
Open this post in threaded view
|

Re: Improve paging performance when there are lots of subscribers

clebertsuconic
But the real problem here will be the number of open files. Each Page
will have an open file, which will keep a lot of open files on the
system. Correct?

I believe the impact of moving the files to the Subscription wouldn't be
that much, and we would fix the problem. We wouldn't need a cache at all,
as we would just keep the file we need at the current cursor.
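
To make that concrete, here is a minimal sketch of a per-subscription page
holder, assuming a hypothetical PageReader with close semantics (every name
below is an illustrative stand-in, not the actual Artemis API): each
subscription keeps at most one page open, so the number of open files is
bounded by the number of subscriptions rather than by the number of pages.

   // Hypothetical reader for one page file; closing it releases the file handle.
   interface PageReader extends AutoCloseable {
      Object readMessage(int messageNumber) throws Exception;
   }

   final class SubscriptionPageHolder {
      private final java.util.function.LongFunction<PageReader> opener;
      private PageReader current;      // the single reader this subscription holds
      private long currentPageId = -1; // -1 means no page is held

      SubscriptionPageHolder(java.util.function.LongFunction<PageReader> opener) {
         this.opener = opener;
      }

      // Returns a reader for pageId; on a page change the old reader (and its
      // file handle) is closed before the new page is opened.
      synchronized PageReader moveTo(long pageId) throws Exception {
         if (pageId != currentPageId) {
            release();
            current = opener.apply(pageId);
            currentPageId = pageId;
         }
         return current;
      }

      // Closes the held page when the cursor moves on or the subscription stops.
      synchronized void release() throws Exception {
         if (current != null) {
            current.close();
            current = null;
            currentPageId = -1;
         }
      }
   }

A subscription that falls behind simply walks its holder from page to page,
closing each file as it goes, which also avoids the corner case where every
page is instantiated as a PageReader at once.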

On Tue, Jul 16, 2019 at 10:40 PM yw yw <[hidden email]> wrote:

>
> I did consider the case where all pages are instantiated as PageReaders.
> That's really a problem.
>
> The pro of the pr is that every page is read only once to build a
> PageReader, which is shared by all the queues. The con of the pr is that
> many PageReaders are probably instantiated if consumers make slow or no
> progress in several queues while moving fast in other queues (I think
> that's the only cause leading to the corner case, right?). This means too
> many open files and too much memory.
>
> The pro of a duplicated PageReader is that the number of PageReaders is
> fixed, matching the number of queues, at any given time.
> The con is that each queue has to read the page once to build its own
> PageReader if the page cache is evicted. I'm not sure how this will
> affect performance.
>
> The point is that we need the number of messages in the page, which is
> used by PageCursorInfo and PageSubscription::internalGetNext, so we have
> to read the page file. How about we only cache the number of messages in
> each page instead of the PageReader, and build the PageReader in each
> queue? When we hit the corner case, only <long, int> pair data stays
> permanently in memory, which I assume is smaller than the complete
> PageCursorInfo data. This way we achieve the performance gain at a small
> price.
>
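
For illustration, a minimal sketch of the "<long, int> pair" idea quoted
above, assuming a shared map keyed by page number (all names here are
hypothetical, not the actual Artemis API): PageCursorInfo and
internalGetNext can get the message count without re-reading the page,
while each queue builds its own PageReader on demand.

   import java.util.concurrent.ConcurrentHashMap;
   import java.util.concurrent.ConcurrentMap;

   final class PageMessageCountCache {
      // page number -> number of messages in that page; even in the corner
      // case this holds one small entry per page instead of a whole PageReader.
      private final ConcurrentMap<Long, Integer> counts = new ConcurrentHashMap<>();

      // Recorded once, when the page is first scanned.
      void put(long pageId, int numberOfMessages) {
         counts.putIfAbsent(pageId, numberOfMessages);
      }

      // Returns the cached count, or -1 if the page has not been scanned yet.
      int get(long pageId) {
         return counts.getOrDefault(pageId, -1);
      }

      // Dropped once every subscription has moved past the page.
      void remove(long pageId) {
         counts.remove(pageId);
      }
   }

Even if every page ends up with an entry, the cache holds one boxed pair
per page instead of a whole PageReader and its open file.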



--
Clebert Suconic
Reply | Threaded
Open this post in threaded view
|

Re: Improve paging performance when there are lots of subscribers

michael.andre.pearce
+1 for having one per queue. Definitely a better idea than having to hold a cache.




Get Outlook for Android







On Fri, Jul 19, 2019 at 4:37 AM +0100, "Clebert Suconic" <[hidden email]> wrote:










But the real problem here will be the number of openFiles. Each Page
will have an Open File, what will keep a lot of open files on the
system. Correct?

I believe the impact of having the files moving to the Subscription
wouldn't be that much, and we would fix the problem. WE wouldn't need
a cache at all, as we just keep the File we need at the current
cursor.

On Tue, Jul 16, 2019 at 10:40 PM yw yw  wrote:

>
> I did consider the case where all pages are instantiated as PageReaders.
> That's really a problem.
>
> The pros of pr is every page is read only once to build PageReader and
> shared by all the queues. The cons of pr is many PageReaders are probably
> instantiated if consumers make slow/no progress in several queues whereas
> fast in other queues(I think it's the only cause leading to the corner
> case, right?). This means too many open files and too much memory.
>
> The pros of duplicated PageReader is there are fixed number of PageReaders
> as with queues at the same time.
> The cons is each queue has to read the page once to build their own
> PageReader if page cache is evicted. I'm not sure how this will affect
> performance.
>
> The point is we need the number of messages in the page which is used by
> PageCursorInfo and PageSubscription::internalGetNext, so we have to read
> the page file. How about we only cache the number of messages in each page
> instead of PageReader and build PageReader in each queue. While we
> encounter the corner case, only  pair data is permanently in
> memory that I assume is smaller than completed PageCursorInfo data. This
> way we achieve the performance gain at a small price.
>
> Clebert Suconic  于2019年7月16日周二 下午10:18写道:
>
> > I just came back after a 2 weeks deserved break and I was looking at
> > this.. and I can say. it's well done.. nice job! it's a lot simpler!
> >
> > However there's one question now. which is probably a further
> > improvement. Shouldn't the pageReader be instantiated at the
> > PageSubscription.
> >
> > That means.. if there's no page cache, in case of the page been
> > evicted, the Subscription would then create a new Page/PageReader
> > pair. and dispose it when it's done (meaning, moved to a different
> > page).
> >
> > As you are solving the case with many subscriptions, wouldn't you hit
> > a corner case where all Pages are instantiated as PageReaders?
> >
> >
> > I feel like it would be better to eventually duplicate a PageReader
> > and close it when done.
> >
> >
> > Or did you already consider that possibility and still think it's best
> > to keep this cache of PageReaders?
> >
> > On Sat, Jul 13, 2019 at 12:15 AM
> > wrote:
> > >
> > > Could a squashed PR be sent?
> > >
> > >
> > >
> > >
> > > Get Outlook for Android
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Jul 12, 2019 at 2:23 PM +0100, "yw yw"
> > wrote:
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > Hi,
> > >
> > > I have finished work on the new implementation(not yet tests and
> > > configuration) as suggested by franz.
> > >
> > > I put fileOffsetset in the PagePosition and add a new class PageReader
> > > which is a wrapper of the page that implements PageCache interface. The
> > > PageReader class is used to read page file if cache is evicted. For
> > detail,
> > > see
> > >
> > https://github.com/wy96f/activemq-artemis/commit/3f388c2324738f01f53ce806b813220d28d40987
> > >
> > > I deployed some tests and results below:
> > > 1. Running in 51MB size page and 1 page cache in the case of 100
> > multicast
> > > queues.
> > > https://filebin.net/wnyan7d2n1qgfsvg
> > > 2. Running in 5MB size page and 100 page cache in the case of 100
> > multicast
> > > queues.
> > > https://filebin.net/re0989vz7ib1c5mc
> > > 3. Running in 51MB size page and 1 page cache in the case of 1 queue.
> > > https://filebin.net/3qndct7f11qckrus
> > >
> > > The results seem good, similar with the implementation in the pr. The
> > most
> > > important is the index cache data is removed, no worry about extra
> > overhead
> > > :)
> > >
> > > yw yw  于2019年7月4日周四 下午5:38写道:
> > >
> > > > Hi,  michael
> > > >
> > > > Thanks for the advise. For the current pr, we can use two arrays where
> > one
> > > > records the message number and the other one corresponding offset to
> > > > optimize the memory usage. For the franz's approch, we will also work
> > on
> > > > a early prototyping implementation. After that, we would take some
> > basic
> > > > tests in different scenarios.
> > > >
> > > >  于2019年7月2日周二 上午7:08写道:
> > > >
> > > >> Point though is an extra index cache layer is needed. The overhead of
> > > >> that means the total paged capacity will be more limited as that
> > overhead
> > > >> isnt just an extra int per reference. E.g. in the pr the current impl
> > isnt
> > > >> very memory optimised, could an int array be used or at worst an open
> > > >> primitive int int hashmap.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> This is why i really prefer franz's approach.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> Also what ever we do, we need the new behaviour configurable, so
> > should a
> > > >> use case not thought about they won't be impacted. E.g. the change
> > should
> > > >> not be a surprise, it should be something you toggle on.
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> Get Outlook for Android
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> On Mon, Jul 1, 2019 at 1:01 PM +0100, "yw yw"  wrote:
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >>
> > > >> Hi,
> > > >> We've took a test against your configuration:
> > > >> 5Mb10010Mb.
> > > >> The current code: 7000msg/s sent and 18000msg/s received.
> > > >> Pr code:16000msg/s received and 8200msg/s sent.
> > > >> Like you said, the performance boosts by using much smaller page file
> > and
> > > >> holding many more for current code.
> > > >>
> > > >> Not sure what implications would have using smaller page file, the
> > > >> producer
> > > >> performance may reduce since switching files is more frequent, number
> > of
> > > >> file handle would increase?
> > > >>
> > > >> While our consumer in the test just echos, nothing to do after
> > receiving
> > > >> message, the consumer in the real world may be busy doing business.
> > This
> > > >> means references and page caches reside in memory longer and may be
> > > >> evicted
> > > >> more easily when producers are sending all the time.
> > > >>
> > > >> Since We don't know how many subscribers there are, it is not a
> > scalable
> > > >> approch. We can't reduce page file size unlimited to fit the number of
> > > >> subscribers. The code should accommodate to all kinds of
> > configurations.
> > > >> We
> > > >> adjust configuration for trade off as needed, not work around IMO.
> > > >> In our company, ~200 queues(60% are owned by some addresses) are
> > deployed
> > > >> in the broker. We can't set all to e.g. 100 page caches(too much
> > memory),
> > > >> and neither set different size according to address pattern(hard for
> > > >> operation). In the multi tenants cluster, we prefer availability and
> > to
> > > >> avoid memory exhausted, we set pageSize to 30MB, max cache size to 1
> > and
> > > >> max size to 31MB. It's running well in one of our clusters now:)
> > > >>
> > > >>  于2019年6月29日周六 上午2:35写道:
> > > >>
> > > >> > I think some of that is down to configuration. If you think you
> > could
> > > >> > configure paging to have much smaller page files but have many more
> > > >> held.
> > > >> > That way the reference sizes will be far smaller and pages dropping
> > in
> > > >> and
> > > >> > out would be less. E.g. if you expect 100 being read make it 100 but
> > > >> make
> > > >> > the page sizes smaller so the overhead is far less
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > Get Outlook for Android
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > On Thu, Jun 27, 2019 at 11:10 AM +0100, "yw yw"  wrote:
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > "At last for one message we maybe read twice: first we read page and
> > > >> create
> > > >> > pagereference; second we requery message after its reference is
> > > >> removed.  "
> > > >> >
> > > >> > I just realized it was wrong. One message maybe read many times.
> > Think
> > > >> of
> > > >> > this: When #1~#2000 msg is delivered, need to depage #2001-#4000
> > msg,
> > > >> > reading the whole page; When #2001~#4000 msg is deliverd, need to
> > depage
> > > >> > #4001~#6000 msg, reading page again, etc.
> > > >> >
> > > >> > One message maybe read three times if we don't depage until all
> > messages
> > > >> > are delivered. For example, we have 3 pages p1, p2,p3 and message m1
> > > >> which
> > > >> > is at top part of the p2. In our case(max-size-bytes=51MB, a little
> > > >> bigger
> > > >> > than page size), first depage round reads bottom half of p1 and top
> > > >> part of
> > > >> > p2; second depage round reads bottom half of p2 and top part of p3.
> > > >> > Therforce p2 is read twice and m1 maybe read three times if
> > requeryed.
> > > >> >
> > > >> > Be honest, i don't know how to fix the problem above with the
> > > >> > decrentralized approch. The point is not how we rely on os cache,
> > it's
> > > >> that
> > > >> > we do it the wrong way, shouldn't read whole page(50MB) just for
> > ~2000
> > > >> > messages. Also there is no need to save 51MB PagedReferenceImpl in
> > > >> memory.
> > > >> > When 100 queues occupy 5100MB memory, the message references are
> > very
> > > >> > likely to be removed.
> > > >> >
> > > >> >
> > > >> > Francesco Nigro  于2019年6月27日周四 下午5:05写道:
> > > >> >
> > > >> > > >
> > > >> > > >  which means the offset info is 100 times large compared to the
> > > >> shared
> > > >> > > > page index cache.
> > > >> > >
> > > >> > >
> > > >> > > I would check with JOL plugin for exact numbers..
> > > >> > > I see with it that we would have an increase of 4 bytes for each
> > > >> > > PagedRefeferenceImpl, totally decrentralized vs
> > > >> > > a centralized approach (the cache). In the economy of a fully
> > loaded
> > > >> > > broker, if we care about scaling need to understand if the memory
> > > >> > tradeoff
> > > >> > > is important enough
> > > >> > > to choose one of the 2 approaches.
> > > >> > > My point is that paging could be made totally based on the OS page
> > > >> cache
> > > >> > if
> > > >> > > GC would get in the middle, deleting any previous mechanism of
> > page
> > > >> > > caching...simplifying the process at it is.
> > > >> > > Using a 2 level cache with such centralized approach can work, but
> > > >> will
> > > >> > add
> > > >> > > a level of complexity that IMO could be saved...
> > > >> > > What do you think could be the benefit of the decentralized
> > solution
> > > >> if
> > > >> > > compared with the one proposed in the PR?
> > > >> > >
> > > >> > >
> > > >> > > Il giorno gio 27 giu 2019 alle ore 10:41 yw yw  ha
> > > >> > > scritto:
> > > >> > >
> > > >> > > > Sorry, I missed the PageReferece part.
> > > >> > > >
> > > >> > > > The lifecyle of PageReference is: depage(in
> > > >> > > intermediateMessageReferences)
> > > >> > > > -> deliver(in messageReferences) -> waiting for ack(in
> > > >> deliveringRefs)
> > > >> > ->
> > > >> > > > removed. Every queue would create it's own PageReference which
> > means
> > > >> > the
> > > >> > > > offset info is 100 times large compared to the shared page index
> > > >> cache.
> > > >> > > > If we keep 51MB pageReference size in memory, as i said in pr,
> > "For
> > > >> > > > multiple subscribers to the same address, just one executor is
> > > >> > > responsible
> > > >> > > > for delivering which means at the same moment only one queue is
> > > >> > > delivering.
> > > >> > > > Thus the queue maybe stalled for a long time. We get
> > queueMemorySize
> > > >> > > > messages into memory, and when we deliver these after a long
> > time,
> > > >> we
> > > >> > > > probably need to query message and read page file again.".  At
> > last
> > > >> for
> > > >> > > one
> > > >> > > > message we maybe read twice: first we read page and create
> > > >> > pagereference;
> > > >> > > > second we requery message after its reference is removed.
> > > >> > > >
> > > >> > > > For the shared page index cache design, each message just need
> > to be
> > > >> > read
> > > >> > > > from file once.
> > > >> > > >
> > > >> > > > Michael Pearce  于2019年6月27日周四 下午3:03写道:
> > > >> > > >
> > > >> > > > > Hi
> > > >> > > > >
> > > >> > > > > First of all i think this is an excellent effort, and could
> > be a
> > > >> > > > potential
> > > >> > > > > massive positive change.
> > > >> > > > >
> > > >> > > > > Before making any change on such scale, i do think we need to
> > > >> ensure
> > > >> > we
> > > >> > > > > have sufficient benchmarks on a number of scenarios, not just
> > one
> > > >> use
> > > >> > > > case,
> > > >> > > > > and the benchmark tool used does need to be available openly
> > so
> > > >> that
> > > >> > > > others
> > > >> > > > > can verify the measures and check on their setups.
> > > >> > > > >
> > > >> > > > > Some additional scenarios i would want/need covering are:
> > > >> > > > >
> > > >> > > > > PageCache set to 5, and all consumers keeping up, but lagging
> > > >> enough
> > > >> > to
> > > >> > > > be
> > > >> > > > > reading from the same 1st page cache, latency and throughput
> > need
> > > >> to
> > > >> > be
> > > >> > > > > measured for all.
> > > >> > > > > PageCache set to 5 and all consumers but one keeping up but
> > > >> lagging
> > > >> > > > enough
> > > >> > > > > to be reading from the same 1st page cahce, but the one is
> > falling
> > > >> > off
> > > >> > > > the
> > > >> > > > > end, causing the page cache swapping, measure latecy and
> > > >> througput of
> > > >> > > > those
> > > >> > > > > keeping up in the 1st page cache not caring for the one.
> > > >> > > > >
> > > >> > > > > Regards to solution some alternative approach to discuss
> > > >> > > > >
> > > >> > > > > In your scenario if i understand correctly each subscriber is
> > > >> > > effectivly
> > > >> > > > > having their own queue (1 to 1 mapping) not sharing.
> > > >> > > > > You mention kafka and say multiple consumers doent read
> > serailly
> > > >> on
> > > >> > the
> > > >> > > > > address and this is true, but per queue processing through
> > > >> messages
> > > >> > > > > (dispatch) is still serial even with multiple shared
> > consumers on
> > > >> a
> > > >> > > > queue.
> > > >> > > > >
> > > >> > > > > What about keeping the existing mechanism but having a queue
> > hold
> > > >> > > > reference
> > > >> > > > > to a page cache that the queue is currently on, being kept
> > from gc
> > > >> > > (e.g.
> > > >> > > > > not soft) therefore meaning page cache isnt being swapped
> > around,
> > > >> > when
> > > >> > > > you
> > > >> > > > > have queues (in your case subscribers) swapping pagecaches
> > back
> > > >> and
> > > >> > > forth
> > > >> > > > > avoidning the constant re-read issue.
> > > >> > > > >
> > > >> > > > > Also i think Franz had an excellent idea, do away with
> > pagecache
> > > >> in
> > > >> > its
> > > >> > > > > current form entirely, ensure the offset is kept with the
> > > >> reference
> > > >> > and
> > > >> > > > > rely on OS caching keeping hot blocks/data.
> > > >> > > > >
> > > >> > > > > Best
> > > >> > > > > Michael
> > > >> > > > >
> > > >> > > > >
> > > >> > > > >



--
Clebert Suconic






Re: Improve paging performance when there are lots of subscribers

wei yang
> But the real problem here will be the number of openFiles. Each Page
> will have an Open File, what will keep a lot of open files on the
> system. Correct?

I'm not sure I made it clear enough.
My thought is this: since PageCursorInfo decides whether the entire page
has been consumed based on numberOfMessages, and PageSubscriptionImpl
decides whether to move to the next page based on the current cursor page
position and numberOfMessages, we store a map of <pageId,
numberOfMessages> in PageCursorProviderImpl after a page is evicted. From
numberOfMessages, each PageSubscriptionImpl can build its PageCursorInfo,
and if the current cursor page position falls within the range of the
current page, a PageReader can be built to help read the messages. So no
page files are actually held open in PageCursorProviderImpl.
Without this map, each PageSubscriptionImpl would have to read the page
file first just to get numberOfMessages, and only then build its
PageCursorInfo/PageReader.
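
A minimal sketch of the idea in Java, assuming hypothetical method and
field names; only the class names (PageCursorProviderImpl,
PageSubscriptionImpl, PageCursorInfo, PageReader) come from the actual
discussion:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Sketch of the proposed map inside the page cursor provider.
    // Nothing here is the real Artemis API beyond the idea itself.
    class PageCursorProviderSketch {

       // <pageId, numberOfMessages>, recorded when a page cache is
       // evicted, so the provider itself keeps no page files open.
       private final Map<Long, Integer> messagesPerPage = new ConcurrentHashMap<>();

       void onPageEvicted(long pageId, int numberOfMessages) {
          messagesPerPage.put(pageId, numberOfMessages);
       }

       // A subscription would call this to build its PageCursorInfo
       // without re-reading the whole page file; null means the page
       // was never evicted through the cache and must be read from disk.
       Integer numberOfMessages(long pageId) {
          return messagesPerPage.get(pageId);
       }
    }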

I agree with putting the PageReader in PageSubscriptionImpl; I'm just not
sure about the specific implementation details :)


Re: Improve paging performance when there are lots of subscribers

michael.andre.pearce
Surely you keep the file open; otherwise you incur the performance
penalty of having to open the file constantly.




It would be faster to have the reader hold the file open, one per queue,
avoiding the constant opening and closing of a file and all the OS-level
overhead that comes with it.
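
A minimal sketch of that shape, assuming a hypothetical reader class
(the real PageReader in the PR wraps Artemis's own file abstractions
rather than RandomAccessFile):

    import java.io.IOException;
    import java.io.RandomAccessFile;

    // Hypothetical per-queue reader: the file handle stays open across
    // reads and is only swapped when the cursor moves to another page.
    class PerQueuePageReader implements AutoCloseable {

       private RandomAccessFile file;   // held open between reads
       private long currentPageId = -1;

       byte[] read(long pageId, long offset, int length) throws IOException {
          if (pageId != currentPageId) {
             close();                   // cursor moved to a different page
             file = new RandomAccessFile(pageId + ".page", "r");
             currentPageId = pageId;
          }
          byte[] data = new byte[length];
          file.seek(offset);
          file.readFully(data);
          return data;
       }

       @Override
       public void close() throws IOException {
          if (file != null) {
             file.close();
             file = null;
          }
       }
    }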
















Reply | Threaded
Open this post in threaded view
|

Re: Improve paging performance when there are lots of subscribers

wei yang
Yes, the reader per queue would not close the page file until moving to the
next page. However, there are cases where files might still be opened
constantly:
1. Paged transactions. Suppose the current cursor position is at page 2 and
page transactions are at page 1; when the transactions are committed, page 1
might be opened repeatedly to read the messages.
2. Scheduled or rolled-back messages. The positions of these messages might
fall behind the current cursor position, leading to page files being opened
repeatedly.
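
To illustrate point 1, here is a minimal Java sketch (hypothetical names,
not the actual Artemis classes): a per-queue reader keeps only the current
page file open, so any read whose position is behind the cursor forces an
older file to be reopened:

    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    class SubscriptionPageReader {
        private final File pageDir;
        private long openPageId = -1;
        private RandomAccessFile openFile; // held open while the cursor stays on this page

        SubscriptionPageReader(File pageDir) { this.pageDir = pageDir; }

        byte[] read(long pageId, long offset, int length) throws IOException {
            if (pageId != openPageId) {
                // The cursor moved, or a page-transaction commit / rollback points
                // at an older page: close the held file and open the other one.
                if (openFile != null) openFile.close();
                openFile = new RandomAccessFile(new File(pageDir, pageId + ".page"), "r");
                openPageId = pageId;
            }
            byte[] buf = new byte[length];
            openFile.seek(offset);
            openFile.readFully(buf);
            return buf;
        }
    }

With this shape, deliveries that alternate between page 1 (committed page
transactions) and page 2 (the current cursor) close and reopen a file on
every switch, which is exactly the overhead described above.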

<[hidden email]> wrote on Fri, Jul 19, 2019 at 3:45 PM:

> Surely you keep the file open, else you will incur the perf penalty of
> having to open the file constantly.
>
> It would be faster to have the reader hold the file open and have one per
> queue, avoiding the constant opening and closing of a file, and all the
> overhead of that at the OS level.
Reply | Threaded
Open this post in threaded view
|

Re: Improve paging performance when there are lots of subscribers

clebertsuconic
It shouldn't need to read it unless it is a rollback and the message is
being redelivered. That would be an exceptional case.
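
As a sketch of that trade-off (a hypothetical helper using the same java.io
imports as the sketch above, not the actual Artemis code), the exceptional
rollback/redelivery read can be served by a short-lived handle that is
closed immediately, so the tailing page file stays the only one held open:

    // Hypothetical: exceptional reads behind the cursor use a short-lived
    // file handle instead of keeping older page files open.
    static byte[] readBehindCursor(File pageDir, long pageId, long offset, int length)
            throws IOException {
        try (RandomAccessFile old =
                new RandomAccessFile(new File(pageDir, pageId + ".page"), "r")) {
            old.seek(offset);
            byte[] buf = new byte[length];
            old.readFully(buf);
            return buf;
        }
    }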

On Mon, Jul 22, 2019 at 5:46 AM yw yw <[hidden email]> wrote:

> Yes, the reader per queue would not close the page file until moving to
> the next page. However, there are cases where files might still be opened
> constantly:
> 1. Paged transactions. Suppose the current cursor position is at page 2
> and page transactions are at page 1; when the transactions are committed,
> page 1 might be opened repeatedly to read the messages.
> 2. Scheduled or rolled-back messages. The positions of these messages
> might fall behind the current cursor position, leading to page files
> being opened repeatedly.
>
> <[hidden email]> 于2019年7月19日周五 下午3:45写道:
>
> > Surly you keep the file open, else you will incur perf penalty of having
> > to open the file constantly.
> >
> >
> >
> >
> > Would be faster to have the reader hold the file open and have one per
> > queue. Avoiding constant opening and closing of a file. And all the
> > overhead of that at the os level
> >
> >
> >
> >
> > Get Outlook for Android
> >
> >
> >
> >
> >
> >
> >
> > On Fri, Jul 19, 2019 at 8:30 AM +0100, "yw yw" <[hidden email]>
> wrote:
> >
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > >  But the real problem here will be the number of openFiles. Each Page
> > will have an Open File, what will keep a lot of open files on the
> > system. Correct?
> > Not sure I made it clear enough.
> > My thought is: Since PageCursorInfo decides whether entire page is
> consumed
> > based on numberOfMessages and PageSubscriptionImpl decides whether to
> move
> > to next page based on the current cursor page position and
> > numberOfMessages, we store a map of  in
> > PageCursorProviderImpl after page is evicted. According to
> > numberOfMessageseach PageSubscriptionImpl can build PageCursorInfo and if
> > current cursor page position is in the range of current page, PageReader
> > can be built to help read messages. So there are really no opened page
> > files in PageCursorProviderImpl.
> > Without this map, each PageSubscriptionImpl has to first read the page
> file
> > to get numberOfMessages, then build PageCursorInfo/PageReader.
> >
> > I agree to put the PageReader to PageSubscriptionImpl, just not sure
> > specific implementation details :)
> >
> >  于2019年7月19日周五 下午2:10写道:
> >
> > > +1 for having one per queue. Def a better idea than having to hold a
> > > cache.
> > >
> > >
> > >
> > >
> > > Get Outlook for Android
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > On Fri, Jul 19, 2019 at 4:37 AM +0100, "Clebert Suconic" <
> > > [hidden email]> wrote:
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > But the real problem here will be the number of openFiles. Each Page
> > > will have an Open File, what will keep a lot of open files on the
> > > system. Correct?
> > >
> > > I believe the impact of having the files moving to the Subscription
> > > wouldn't be that much, and we would fix the problem. WE wouldn't need
> > > a cache at all, as we just keep the File we need at the current
> > > cursor.
> > >
> > > On Tue, Jul 16, 2019 at 10:40 PM yw yw  wrote:
> > > >
> > > > I did consider the case where all pages are instantiated as
> > PageReaders.
> > > > That's really a problem.
> > > >
> > > > The pros of pr is every page is read only once to build PageReader
> and
> > > > shared by all the queues. The cons of pr is many PageReaders are
> > probably
> > > > instantiated if consumers make slow/no progress in several queues
> > whereas
> > > > fast in other queues(I think it's the only cause leading to the
> corner
> > > > case, right?). This means too many open files and too much memory.
> > > >
> > > > The pros of duplicated PageReader is there are fixed number of
> > > PageReaders
> > > > as with queues at the same time.
> > > > The cons is each queue has to read the page once to build their own
> > > > PageReader if page cache is evicted. I'm not sure how this will
> affect
> > > > performance.
> > > >
> > > > The point is we need the number of messages in the page which is used
> > by
> > > > PageCursorInfo and PageSubscription::internalGetNext, so we have to
> > read
> > > > the page file. How about we only cache the number of messages in each
> > > page
> > > > instead of PageReader and build PageReader in each queue. While we
> > > > encounter the corner case, only  pair data is permanently in
> > > > memory that I assume is smaller than completed PageCursorInfo data.
> > This
> > > > way we achieve the performance gain at a small price.
> > > >
> > > > Clebert Suconic  于2019年7月16日周二 下午10:18写道:
> > > >
> > > > > I just came back after a 2 weeks deserved break and I was looking
> at
> > > > > this.. and I can say. it's well done.. nice job! it's a lot
> simpler!
> > > > >
> > > > > However there's one question now. which is probably a further
> > > > > improvement. Shouldn't the pageReader be instantiated at the
> > > > > PageSubscription.
> > > > >
> > > > > That means.. if there's no page cache, in case of the page been
> > > > > evicted, the Subscription would then create a new Page/PageReader
> > > > > pair. and dispose it when it's done (meaning, moved to a different
> > > > > page).
> > > > >
> > > > > As you are solving the case with many subscriptions, wouldn't you
> hit
> > > > > a corner case where all Pages are instantiated as PageReaders?
> > > > >
> > > > >
> > > > > I feel like it would be better to eventually duplicate a PageReader
> > > > > and close it when done.
> > > > >
> > > > >
> > > > > Or did you already consider that possibility and still think it's
> > best
> > > > > to keep this cache of PageReaders?
> > > > >
> > > > > On Sat, Jul 13, 2019 at 12:15 AM
> > > > > wrote:
> > > > > >
> > > > > > Could a squashed PR be sent?
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Get Outlook for Android
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > On Fri, Jul 12, 2019 at 2:23 PM +0100, "yw yw"
> > > > > wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I have finished work on the new implementation(not yet tests and
> > > > > > configuration) as suggested by franz.
> > > > > >
> > > > > > I put fileOffsetset in the PagePosition and add a new class
> > > PageReader
> > > > > > which is a wrapper of the page that implements PageCache
> interface.
> > > The
> > > > > > PageReader class is used to read page file if cache is evicted.
> For
> > > > > detail,
> > > > > > see
> > > > > >
> > > > >
> > >
> >
> https://github.com/wy96f/activemq-artemis/commit/3f388c2324738f01f53ce806b813220d28d40987
> > > > > >
> > > > > > I deployed some tests and results below:
> > > > > > 1. Running in 51MB size page and 1 page cache in the case of 100
> > > > > multicast
> > > > > > queues.
> > > > > > https://filebin.net/wnyan7d2n1qgfsvg
> > > > > > 2. Running in 5MB size page and 100 page cache in the case of 100
> > > > > multicast
> > > > > > queues.
> > > > > > https://filebin.net/re0989vz7ib1c5mc
> > > > > > 3. Running in 51MB size page and 1 page cache in the case of 1
> > queue.
> > > > > > https://filebin.net/3qndct7f11qckrus
> > > > > >
> > > > > > The results seem good, similar with the implementation in the pr.
> > The
> > > > > most
> > > > > > important is the index cache data is removed, no worry about
> extra
> > > > > overhead
> > > > > > :)
> > > > > >
> > > > > > yw yw  于2019年7月4日周四 下午5:38写道:
> > > > > >
> > > > > > > Hi,  michael
> > > > > > >
> > > > > > > Thanks for the advise. For the current pr, we can use two
> arrays
> > > where
> > > > > one
> > > > > > > records the message number and the other one corresponding
> offset
> > > to
> > > > > > > optimize the memory usage. For the franz's approch, we will
> also
> > > work
> > > > > on
> > > > > > > a early prototyping implementation. After that, we would take
> > some
> > > > > basic
> > > > > > > tests in different scenarios.
> > > > > > >
> > > > > > >  于2019年7月2日周二 上午7:08写道:
> > > > > > >
> > > > > > >> Point though is an extra index cache layer is needed. The
> > > overhead of
> > > > > > >> that means the total paged capacity will be more limited as
> that
> > > > > overhead
> > > > > > >> isnt just an extra int per reference. E.g. in the pr the
> current
> > > impl
> > > > > isnt
> > > > > > >> very memory optimised, could an int array be used or at worst
> an
> > > open
> > > > > > >> primitive int int hashmap.
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> This is why i really prefer franz's approach.
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> Also what ever we do, we need the new behaviour configurable,
> so
> > > > > should a
> > > > > > >> use case not thought about they won't be impacted. E.g. the
> > change
> > > > > should
> > > > > > >> not be a surprise, it should be something you toggle on.
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> Get Outlook for Android
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> On Mon, Jul 1, 2019 at 1:01 PM +0100, "yw yw"  wrote:
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> Hi,
> > > > > > >> We've took a test against your configuration:
> > > > > > >> 5Mb10010Mb.
> > > > > > >> The current code: 7000msg/s sent and 18000msg/s received.
> > > > > > >> Pr code:16000msg/s received and 8200msg/s sent.
> > > > > > >> Like you said, the performance boosts by using much smaller
> page
> > > file
> > > > > and
> > > > > > >> holding many more for current code.
> > > > > > >>
> > > > > > >> Not sure what implications would have using smaller page file,
> > the
> > > > > > >> producer
> > > > > > >> performance may reduce since switching files is more frequent,
> > > number
> > > > > of
> > > > > > >> file handle would increase?
> > > > > > >>
> > > > > > >> While our consumer in the test just echos, nothing to do after
> > > > > receiving
> > > > > > >> message, the consumer in the real world may be busy doing
> > > business.
> > > > > This
> > > > > > >> means references and page caches reside in memory longer and
> may
> > > be
> > > > > > >> evicted
> > > > > > >> more easily when producers are sending all the time.
> > > > > > >>
> > > > > > >> Since We don't know how many subscribers there are, it is not
> a
> > > > > scalable
> > > > > > >> approch. We can't reduce page file size unlimited to fit the
> > > number of
> > > > > > >> subscribers. The code should accommodate to all kinds of
> > > > > configurations.
> > > > > > >> We
> > > > > > >> adjust configuration for trade off as needed, not work around
> > IMO.
> > > > > > >> In our company, ~200 queues(60% are owned by some addresses)
> are
> > > > > deployed
> > > > > > >> in the broker. We can't set all to e.g. 100 page caches(too
> much
> > > > > memory),
> > > > > > >> and neither set different size according to address
> pattern(hard
> > > for
> > > > > > >> operation). In the multi tenants cluster, we prefer
> availability
> > > and
> > > > > to
> > > > > > >> avoid memory exhausted, we set pageSize to 30MB, max cache
> size
> > > to 1
> > > > > and
> > > > > > >> max size to 31MB. It's running well in one of our clusters
> now:)
> > > > > > >>
> > > > > > >>  于2019年6月29日周六 上午2:35写道:
> > > > > > >>
> > > > > > >> > I think some of that is down to configuration. If you think
> > you
> > > > > could
> > > > > > >> > configure paging to have much smaller page files but have
> many
> > > more
> > > > > > >> held.
> > > > > > >> > That way the reference sizes will be far smaller and pages
> > > dropping
> > > > > in
> > > > > > >> and
> > > > > > >> > out would be less. E.g. if you expect 100 being read make it
> > > 100 but
> > > > > > >> make
> > > > > > >> > the page sizes smaller so the overhead is far less
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Get Outlook for Android
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > On Thu, Jun 27, 2019 at 11:10 AM +0100, "yw yw"  wrote:
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > "At last for one message we maybe read twice: first we read
> > > page and
> > > > > > >> create
> > > > > > >> > pagereference; second we requery message after its reference
> > is
> > > > > > >> removed.  "
> > > > > > >> >
> > > > > > >> > I just realized it was wrong. One message maybe read many
> > times.
> > > > > Think
> > > > > > >> of
> > > > > > >> > this: When #1~#2000 msg is delivered, need to depage
> > #2001-#4000
> > > > > msg,
> > > > > > >> > reading the whole page; When #2001~#4000 msg is deliverd,
> need
> > > to
> > > > > depage
> > > > > > >> > #4001~#6000 msg, reading page again, etc.
> > > > > > >> >
> > > > > > >> > One message maybe read three times if we don't depage until
> > all
> > > > > messages
> > > > > > >> > are delivered. For example, we have 3 pages p1, p2,p3 and
> > > message m1
> > > > > > >> which
> > > > > > >> > is at top part of the p2. In our case(max-size-bytes=51MB, a
> > > little
> > > > > > >> bigger
> > > > > > >> > than page size), first depage round reads bottom half of p1
> > and
> > > top
> > > > > > >> part of
> > > > > > >> > p2; second depage round reads bottom half of p2 and top part
> > of
> > > p3.
> > > > > > >> > Therforce p2 is read twice and m1 maybe read three times if
> > > > > requeryed.
> > > > > > >> >
> > > > > > >> > Be honest, i don't know how to fix the problem above with
> the
> > > > > > >> > decrentralized approch. The point is not how we rely on os
> > > cache,
> > > > > it's
> > > > > > >> that
> > > > > > >> > we do it the wrong way, shouldn't read whole page(50MB) just
> > for
> > > > > ~2000
> > > > > > >> > messages. Also there is no need to save 51MB
> > PagedReferenceImpl
> > > in
> > > > > > >> memory.
> > > > > > >> > When 100 queues occupy 5100MB memory, the message references
> > are
> > > > > very
> > > > > > >> > likely to be removed.
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Francesco Nigro  于2019年6月27日周四 下午5:05写道:
> > > > > > >> >
> > > > > > >> > > >
> > > > > > >> > > >  which means the offset info is 100 times large compared
> > to
> > > the
> > > > > > >> shared
> > > > > > >> > > > page index cache.
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > I would check with JOL plugin for exact numbers..
> > > > > > >> > > I see with it that we would have an increase of 4 bytes
> for
> > > each
> > > > > > >> > > PagedRefeferenceImpl, totally decrentralized vs
> > > > > > >> > > a centralized approach (the cache). In the economy of a
> > fully
> > > > > loaded
> > > > > > >> > > broker, if we care about scaling need to understand if the
> > > memory
> > > > > > >> > tradeoff
> > > > > > >> > > is important enough
> > > > > > >> > > to choose one of the 2 approaches.
> > > > > > >> > > My point is that paging could be made totally based on the
> > OS
> > > page
> > > > > > >> cache
> > > > > > >> > if
> > > > > > >> > > GC would get in the middle, deleting any previous
> mechanism
> > of
> > > > > page
> > > > > > >> > > caching...simplifying the process at it is.
> > > > > > >> > > Using a 2 level cache with such centralized approach can
> > > work, but
> > > > > > >> will
> > > > > > >> > add
> > > > > > >> > > a level of complexity that IMO could be saved...
> > > > > > >> > > What do you think could be the benefit of the
> decentralized
> > > > > solution
> > > > > > >> if
> > > > > > >> > > compared with the one proposed in the PR?
> > > > > > >> > >
> > > > > > >> > >
> > > > > > >> > > Il giorno gio 27 giu 2019 alle ore 10:41 yw yw  ha
> > > > > > >> > > scritto:
> > > > > > >> > >
> > > > > > >> > > > Sorry, I missed the PageReferece part.
> > > > > > >> > > >
> > > > > > >> > > > The lifecyle of PageReference is: depage(in
> > > > > > >> > > intermediateMessageReferences)
> > > > > > >> > > > -> deliver(in messageReferences) -> waiting for ack(in
> > > > > > >> deliveringRefs)
> > > > > > >> > ->
> > > > > > >> > > > removed. Every queue would create it's own PageReference
> > > which
> > > > > means
> > > > > > >> > the
> > > > > > >> > > > offset info is 100 times large compared to the shared
> page
> > > index
> > > > > > >> cache.
> > > > > > >> > > > If we keep 51MB pageReference size in memory, as i said
> in
> > > pr,
> > > > > "For
> > > > > > >> > > > multiple subscribers to the same address, just one
> > executor
> > > is
> > > > > > >> > > responsible
> > > > > > >> > > > for delivering which means at the same moment only one
> > > queue is
> > > > > > >> > > delivering.
> > > > > > >> > > > Thus the queue maybe stalled for a long time. We get
> > > > > queueMemorySize
> > > > > > >> > > > messages into memory, and when we deliver these after a
> > long
> > > > > time,
> > > > > > >> we
> > > > > > >> > > > probably need to query message and read page file
> again.".
> > > At
> > > > > last
> > > > > > >> for
> > > > > > >> > > one
> > > > > > >> > > > message we maybe read twice: first we read page and
> create
> > > > > > >> > pagereference;
> > > > > > >> > > > second we requery message after its reference is
> removed.
> > > > > > >> > > >
> > > > > > >> > > > For the shared page index cache design, each message
> just
> > > need
> > > > > to be
> > > > > > >> > read
> > > > > > >> > > > from file once.
> > > > > > >> > > >
> > > > > > >> > > > Michael Pearce  于2019年6月27日周四 下午3:03写道:
> > > > > > >> > > >
> > > > > > >> > > > > Hi
> > > > > > >> > > > >
> > > > > > >> > > > > First of all i think this is an excellent effort, and
> > > could
> > > > > be a
> > > > > > >> > > > potential
> > > > > > >> > > > > massive positive change.
> > > > > > >> > > > >
> > > > > > >> > > > > Before making any change on such scale, i do think we
> > > need to
> > > > > > >> ensure
> > > > > > >> > we
> > > > > > >> > > > > have sufficient benchmarks on a number of scenarios,
> not
> > > just
> > > > > one
> > > > > > >> use
> > > > > > >> > > > case,
> > > > > > >> > > > > and the benchmark tool used does need to be available
> > > openly
> > > > > so
> > > > > > >> that
> > > > > > >> > > > others
> > > > > > >> > > > > can verify the measures and check on their setups.
> > > > > > >> > > > >
> > > > > > >> > > > > Some additional scenarios i would want/need covering
> > are:
> > > > > > >> > > > >
> > > > > > >> > > > > PageCache set to 5, and all consumers keeping up, but
> > > lagging
> > > > > > >> enough
> > > > > > >> > to
> > > > > > >> > > > be
> > > > > > >> > > > > reading from the same 1st page cache, latency and
> > > throughput
> > > > > need
> > > > > > >> to
> > > > > > >> > be
> > > > > > >> > > > > measured for all.
> > > > > > >> > > > > PageCache set to 5 and all consumers but one keeping
> up
> > > but
> > > > > > >> lagging
> > > > > > >> > > > enough
> > > > > > >> > > > > to be reading from the same 1st page cahce, but the
> one
> > is
> > > > > falling
> > > > > > >> > off
> > > > > > >> > > > the
> > > > > > >> > > > > end, causing the page cache swapping, measure latecy
> and
> > > > > > >> througput of
> > > > > > >> > > > those
> > > > > > >> > > > > keeping up in the 1st page cache not caring for the
> one.
> > > > > > >> > > > >
> > > > > > >> > > > > Regards to solution some alternative approach to
> discuss
> > > > > > >> > > > >
> > > > > > >> > > > > In your scenario if i understand correctly each
> > > subscriber is
> > > > > > >> > > effectivly
> > > > > > >> > > > > having their own queue (1 to 1 mapping) not sharing.
> > > > > > >> > > > > You mention kafka and say multiple consumers doent
> read
> > > > > serailly
> > > > > > >> on
> > > > > > >> > the
> > > > > > >> > > > > address and this is true, but per queue processing
> > through
> > > > > > >> messages
> > > > > > >> > > > > (dispatch) is still serial even with multiple shared
> > > > > consumers on
> > > > > > >> a
> > > > > > >> > > > queue.
> > > > > > >> > > > >
> > > > > > >> > > > > What about keeping the existing mechanism but having a
> > > queue
> > > > > hold
> > > > > > >> > > > reference
> > > > > > >> > > > > to a page cache that the queue is currently on, being
> > kept
> > > > > from gc
> > > > > > >> > > (e.g.
> > > > > > >> > > > > not soft) therefore meaning page cache isnt being
> > swapped
> > > > > around,
> > > > > > >> > when
> > > > > > >> > > > you
> > > > > > >> > > > > have queues (in your case subscribers) swapping
> > pagecaches
> > > > > back
> > > > > > >> and
> > > > > > >> > > forth
> > > > > > >> > > > > avoidning the constant re-read issue.
> > > > > > >> > > > >
> > > > > > >> > > > > Also i think Franz had an excellent idea, do away with
> > > > > pagecache
> > > > > > >> in
> > > > > > >> > its
> > > > > > >> > > > > current form entirely, ensure the offset is kept with
> > the
> > > > > > >> reference
> > > > > > >> > and
> > > > > > >> > > > > rely on OS caching keeping hot blocks/data.
> > > > > > >> > > > >
> > > > > > >> > > > > Best
> > > > > > >> > > > > Michael
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > > On Thu, 27 Jun 2019 at 05:13, yw yw  wrote:
> > > > > > >> > > > >
> > [...]
> >
> >    4. A variant of the binary search algorithm may be needed (a sketch
> >    follows this list), see
> >    https://github.com/apache/kafka/blob/70ddd8af71938b4f5f6d1bb3df6243ef13359bcf/core/src/main/scala/kafka/log/AbstractIndex.scala
> >    5. Unlike Kafka, where the user fetches lots of messages at once and
> >    the broker only needs to look up the start offset in the index file
> >    once, Artemis delivers messages one by one, which means we would
> >    have to look up the index every time we deliver a message. Although
> >    the index file is probably in the OS page cache, there is still a
> >    chance of a cache miss.
> >    6. Compatibility with old files.
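
For item 4, a minimal sketch of such a variant binary search, in Java with
hypothetical names (this is not the code from the PR): given a sparse index
of (message number, file position) pairs, it finds the position of the
latest indexed message at or before the requested one, the same
"largest entry <= target" lookup that Kafka's AbstractIndex performs.

    // Hypothetical sketch; names are illustrative, not taken from the PR.
    final class SparsePageIndexLookup {

        private final long[] messageNumbers; // indexed message numbers, ascending
        private final long[] filePositions;  // filePositions[i] = position of messageNumbers[i]

        SparsePageIndexLookup(long[] messageNumbers, long[] filePositions) {
            this.messageNumbers = messageNumbers;
            this.filePositions = filePositions;
        }

        // Returns the file position of the latest indexed message at or
        // before n, or -1 if n precedes the first entry. The caller would
        // seek there and scan forward to message n.
        long lookup(long n) {
            int lo = 0, hi = messageNumbers.length - 1, best = -1;
            while (lo <= hi) {
                int mid = (lo + hi) >>> 1;   // overflow-safe midpoint
                if (messageNumbers[mid] <= n) {
                    best = mid;              // candidate; a closer one may lie to the right
                    lo = mid + 1;
                } else {
                    hi = mid - 1;
                }
            }
            return best == -1 ? -1 : filePositions[best];
        }
    }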
> >
> > To sum up, Kafka uses an mmapped index file and we use an index cache.
> > Both are designed to find the physical file position from an offset
> > (Kafka) or a message number (Artemis). We prefer the index cache
> > because it is easier to understand and maintain.
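
As an illustration of the index-cache side of that comparison, a minimal
sketch in Java (hypothetical names, not the actual PageIndexCacheImpl),
assuming the per-message file positions are recorded during a pass that
already reads the page, after which a lagging subscriber can seek straight
to a single message instead of re-reading the whole page:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch; names are illustrative, not from the PR.
    final class PageIndexCacheSketch {

        // pageId -> positions, where positions[i] is the file position of
        // message i within that page file
        private final Map<Long, long[]> positionsByPage = new ConcurrentHashMap<>();

        // Filled in while the page is being scanned sequentially anyway,
        // so building the index costs no extra disk read.
        void index(long pageId, long[] positions) {
            positionsByPage.put(pageId, positions);
        }

        // Seek target for reading exactly one message from the page;
        // returns -1 if the page was never indexed.
        long positionOf(long pageId, int messageNumber) {
            long[] positions = positionsByPage.get(pageId);
            if (positions == null || messageNumber < 0 || messageNumber >= positions.length) {
                return -1;
            }
            return positions[messageNumber];
        }
    }

The alternative raised earlier in the thread, keeping the offset on the
message reference itself and relying on the OS page cache, would give the
same single-message seek without a separate cache structure.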
> >
> > We also tested the single-subscriber case with the same setup.
> > The original:
> > consumer tps (11000 msg/s) and latency:
> > [image: orig_single_subscriber.png]
> > producer tps (30000 msg/s) and latency:
> > [image: orig_single_producer.png]
> > The PR:
> > consumer tps (14000 msg/s) and latency:
> > [image: pr_single_consumer.png]
> > producer tps (30000 msg/s) and latency:
> > [image: pr_single_producer.png]
> > The results are similar, and even a little better, in the
> > single-subscriber case.
> >
> > We used our internal test platform; I think JMeter could also be used
> > to run the same test.
--
Clebert Suconic