Conversation


@ShivsundarR ShivsundarR commented Nov 10, 2025

What

  • After KIP-1206 introduced record_limit mode, we ideally want to
    return no more records than the maxRecords field in the
    ShareFetchRequest asks for.

  • Currently, the client broadcasts share fetch requests to all nodes
    that host leaders of the partitions it is subscribed to.

  • The application thread is woken up as soon as the first response
    arrives, but in the meantime the responses from the other nodes can
    each bring in up to maxRecords more records, which then sit in the
    buffer. That means we are holding acquisition locks for records that
    are only waiting to be consumed.

  • Instead, we want to send the next request only when we poll again.

  • This PR sends the request to only one node at a time in record_limit
    mode.

  • We use partition rotation on each poll so that no partition is
    starved (see the sketch after this list).
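
A minimal sketch of this per-poll selection, assuming hypothetical names (RecordLimitFetchSketch, selectNodeForNextFetch, and leaderFor are illustrative, not the actual ShareConsumeRequestManager API): on each poll, pick the leader of the partition at the head of the rotation, send the single ShareFetchRequest there, and rotate the order for the next poll.

```java
// Hypothetical sketch only: illustrates one-node-at-a-time fetching with
// partition rotation; it is not the real Kafka client code.
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

class RecordLimitFetchSketch {
    private final Deque<String> partitions = new ArrayDeque<>();

    RecordLimitFetchSketch(List<String> assignedPartitions) {
        partitions.addAll(assignedPartitions);
    }

    // Called once per poll(): choose the node that leads the partition at the
    // head of the rotation, then rotate so the next poll starts elsewhere.
    String selectNodeForNextFetch() {
        String first = partitions.pollFirst();    // null if nothing is assigned
        if (first == null) {
            return null;
        }
        partitions.addLast(first);                // rotate so no partition is starved
        return leaderFor(first);                  // send the ShareFetchRequest to this node only
    }

    // Placeholder leader lookup; the real client resolves leaders from cluster metadata.
    private String leaderFor(String partition) {
        return "node-for-" + partition;
    }
}
```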

There were NCSS checkstyle errors in ShareConsumeRequestManagerTest,
so I added a few refactors there to reduce the method length.

Performance

  • When there are more consumers than partitions (i.e. when records in a
    partition are genuinely shared), the performance is almost the same
    as with the current approach. But when there are fewer consumers than
    partitions, we see a performance regression because the client waits
    for a node to return a response before it can send the next request.
  • Hence we have introduced this only for record_limit mode for now;
    future work will improve this area.

Reviewers: Andrew Schofield [email protected]

@github-actions github-actions bot added the triage (PRs from the community), consumer, and clients labels Nov 10, 2025
@AndrewJSchofield AndrewJSchofield added the KIP-932 (Queues for Kafka) and ci-approved labels and removed the triage (PRs from the community) label Nov 10, 2025
@AndrewJSchofield AndrewJSchofield self-requested a review November 10, 2025 16:04

@AndrewJSchofield AndrewJSchofield left a comment

Thanks for the PR. Generally looks pretty good, but I have some formatting feedback for the test, and one comment about perhaps a test case which might sometimes fail unexpectedly.

Please could you merge latest changes from trunk into your branch. Thanks.

Set<TopicPartition> partitions = new HashSet<>();
partitions.add(tp0);
partitions.add(tp1);
subscriptions.assignFromSubscribed(partitions);

I don't think this order is necessarily deterministic. I worry that this might be a new flaky test.


@ShivsundarR ShivsundarR Nov 18, 2025


Right, I looked into this and I think there are two ways to solve it:

  1. Change the test to be order-agnostic; this makes the test a bit bigger, but we can make it work.
  2. Use a LinkedHashMap in SubscriptionState to maintain insertion order, so that this test, and any future tests in other classes, can be deterministic.

I have updated the PR with option 2, as this was just a temporary map used in assignFromSubscribed before updating the actual assignment. Does this work?
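
For context on why option 2 makes the ordering deterministic, here is a small, self-contained Java illustration (it is not the SubscriptionState code itself): a HashSet gives no iteration-order guarantee, while a LinkedHashSet, which is backed by a LinkedHashMap, iterates in insertion order.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class IterationOrderDemo {
    public static void main(String[] args) {
        // HashSet iteration order depends on hash codes, not insertion order,
        // so assertions that assume a particular order can flake.
        Set<String> hashed = new HashSet<>();
        hashed.add("topic-1");
        hashed.add("topic-0");
        System.out.println(hashed);      // order unspecified

        // LinkedHashSet (backed by a LinkedHashMap) iterates in insertion order,
        // which keeps order-sensitive tests deterministic.
        Set<String> linked = new LinkedHashSet<>();
        linked.add("topic-1");
        linked.add("topic-0");
        System.out.println(linked);      // always [topic-1, topic-0]
    }
}
```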

@ShivsundarR

Thanks @AndrewJSchofield for the review.
I had made the refactors mainly to get checkstyle to pass; it was reporting the NCSS code-length limit as exceeded, so I tried combining the asserts and added some helper functions.
I have since reverted the assert changes, and checkstyle still passes.

@AndrewJSchofield AndrewJSchofield merged commit 108ad6e into apache:trunk Nov 18, 2025
24 checks passed
TaiJuWu pushed a commit to TaiJuWu/kafka that referenced this pull request Dec 3, 2025