
Conversation

@Zqy11 Zqy11 commented Nov 6, 2025

The DeepEP implementation for SBO (DeepEP #390) will be merged into the antgroup-opt branch.

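// token_start_idx / token_end_idx (below) select the block_m-sized slice of the current expert's tokens
// when overlap is enabled, and the full [offset, offset + num_tokens_to_send) range otherwise.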
for (int token_idx = offset + sub_warp_id; token_idx < offset + num_tokens_to_send; token_idx += num_warps_per_group) {
if (overlap or (not is_rank_masked<true>(mask_buffer_ptr, dst_rank))) {
auto token_start_idx = overlap ? local_expert_signal_idx * block_m : offset;
auto token_end_idx = overlap ? min((local_expert_signal_idx + 1) * block_m, num_tokens_per_expert) : (offset + num_tokens_to_send);
Hi~ I have tried this great feature, and found that combine_send was slower than in the non-overlap case. Maybe block_m is too big for one SM in each iteration?

@Zqy11 (Author) replied:

You may consider increasing num_sms from the default of 3 to 4-6. With block_m set to 64 and num_warps at 32: if num_token <= 32, one SM sends a single round; if 32 < num_token <= 64, one SM needs to send two rounds. Two rounds may take slightly longer than the original combine send, but you can increase parallelism by raising num_sms.
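
For intuition, here is a minimal host-side sketch of that arithmetic. It is not code from this PR: send_rounds_per_block is a hypothetical helper, and it assumes each of the num_warps warps sends one token per round.

#include <algorithm>
#include <cstdio>

// Hypothetical helper: rounds one SM needs to cover a block of tokens,
// assuming one token per warp per round.
int send_rounds_per_block(int tokens_in_block, int num_warps) {
    return (tokens_in_block + num_warps - 1) / num_warps;  // ceiling division
}

int main() {
    const int block_m = 64, num_warps = 32;
    const int samples[] = {16, 32, 33, 64};
    for (int num_token : samples) {
        int tokens_in_block = std::min(num_token, block_m);
        std::printf("num_token=%d -> %d round(s)\n",
                    num_token, send_rounds_per_block(tokens_in_block, num_warps));
    }
    return 0;
}

Under this assumption, num_token <= 32 fits in a single round, while anything in (32, 64] costs two rounds, which matches the extra latency seen versus the non-overlap path and is why raising num_sms helps.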

@wangfakang wangfakang (Collaborator) left a comment

LGTM. Thanks.
