-
Notifications
You must be signed in to change notification settings - Fork 148
fix: Improve mark_as_acked
#3056
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Or at least try.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3056 +/- ##
==========================================
- Coverage 93.41% 93.40% -0.01%
==========================================
Files 124 124
Lines 36234 36233 -1
Branches 36234 36233 -1
==========================================
- Hits 33847 33844 -3
Misses 1540 1540
- Partials 847 849 +2
|
|
| Branch | fix-mark_as_acked-opt |
| Testbed | On-prem |
Click to view all benchmark results
| Benchmark | Latency | Benchmark Result milliseconds (ms) (Result Δ%) | Upper Boundary milliseconds (ms) (Limit %) |
|---|---|---|---|
| google vs. neqo (cubic, paced) | 📈 view plot 🚷 view threshold | 279.02 ms(+0.26%)Baseline: 278.31 ms | 282.67 ms (98.71%) |
| Benchmark | Latency | Benchmark Result milliseconds (ms) (Result Δ%) | Upper Boundary milliseconds (ms) (Limit %) |
|---|---|---|---|
| msquic vs. neqo (cubic, paced) | 📈 view plot 🚷 view threshold | 216.98 ms(+9.06%)Baseline: 198.95 ms | 236.31 ms (91.82%) |
| Benchmark | Latency | Benchmark Result milliseconds (ms) (Result Δ%) | Upper Boundary milliseconds (ms) (Limit %) |
|---|---|---|---|
| neqo vs. google (cubic, paced) | 📈 view plot 🚷 view threshold | 758.15 ms(-0.19%)Baseline: 759.61 ms | 774.57 ms (97.88%) |
| Benchmark | Latency | Benchmark Result milliseconds (ms) (Result Δ%) | Upper Boundary milliseconds (ms) (Limit %) |
|---|---|---|---|
| neqo vs. msquic (cubic, paced) | 📈 view plot 🚷 view threshold | 157.38 ms(-0.25%)Baseline: 157.78 ms | 160.62 ms (97.99%) |
| Benchmark | Latency | Benchmark Result milliseconds (ms) (Result Δ%) | Upper Boundary milliseconds (ms) (Limit %) |
|---|---|---|---|
| neqo vs. neqo (cubic) | 📈 view plot 🚷 view threshold | 93.45 ms(+2.11%)Baseline: 91.53 ms | 96.85 ms (96.50%) |
| Benchmark | Latency | Benchmark Result milliseconds (ms) (Result Δ%) | Upper Boundary milliseconds (ms) (Limit %) |
|---|---|---|---|
| neqo vs. neqo (cubic, paced) | 📈 view plot 🚷 view threshold | 94.05 ms(+1.26%)Baseline: 92.88 ms | 98.08 ms (95.89%) |
| Benchmark | Latency | Benchmark Result milliseconds (ms) (Result Δ%) | Upper Boundary milliseconds (ms) (Limit %) |
|---|---|---|---|
| neqo vs. neqo (reno) | 📈 view plot 🚷 view threshold | 91.39 ms(-0.12%)Baseline: 91.50 ms | 96.65 ms (94.56%) |
| Benchmark | Latency | Benchmark Result milliseconds (ms) (Result Δ%) | Upper Boundary milliseconds (ms) (Limit %) |
|---|---|---|---|
| neqo vs. neqo (reno, paced) | 📈 view plot 🚷 view threshold | 92.77 ms(+0.01%)Baseline: 92.76 ms | 97.76 ms (94.90%) |
| Benchmark | Latency | Benchmark Result milliseconds (ms) (Result Δ%) | Upper Boundary milliseconds (ms) (Limit %) |
|---|---|---|---|
| neqo vs. quiche (cubic, paced) | 📈 view plot 🚷 view threshold | 195.11 ms(+0.76%)Baseline: 193.63 ms | 196.96 ms (99.06%) |
| Benchmark | Latency | Benchmark Result milliseconds (ms) (Result Δ%) | Upper Boundary milliseconds (ms) (Limit %) |
|---|---|---|---|
| neqo vs. s2n (cubic, paced) | 📈 view plot 🚷 view threshold | 220.50 ms(-0.28%)Baseline: 221.12 ms | 224.08 ms (98.40%) |
| Benchmark | Latency | Benchmark Result milliseconds (ms) (Result Δ%) | Upper Boundary milliseconds (ms) (Limit %) |
|---|---|---|---|
| quiche vs. neqo (cubic, paced) | 📈 view plot 🚷 view threshold | 154.38 ms(+0.85%)Baseline: 153.07 ms | 158.35 ms (97.50%) |
| Benchmark | Latency | Benchmark Result milliseconds (ms) (Result Δ%) | Upper Boundary milliseconds (ms) (Limit %) |
|---|---|---|---|
| s2n vs. neqo (cubic, paced) | 📈 view plot 🚷 view threshold | 170.73 ms(-1.76%)Baseline: 173.79 ms | 178.01 ms (95.91%) |
|
| Branch | fix-mark_as_acked-opt |
| Testbed | On-prem |
Click to view all benchmark results
| Benchmark | Latency | Benchmark Result nanoseconds (ns) (Result Δ%) | Upper Boundary nanoseconds (ns) (Limit %) |
|---|---|---|---|
| 1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client | 📈 view plot 🚷 view threshold | 206,680,000.00 ns(-0.03%)Baseline: 206,752,325.58 ns | 216,458,370.24 ns (95.48%) |
| 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client | 📈 view plot 🚷 view threshold | 199,000,000.00 ns(-0.94%)Baseline: 200,895,174.42 ns | 211,171,780.28 ns (94.24%) |
| 1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client | 📈 view plot 🚷 view threshold | 38,694,000.00 ns(+22.29%)Baseline: 31,641,276.16 ns | 42,885,724.77 ns (90.23%) |
| 1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client | 📈 view plot 🚷 view threshold | 289,460,000.00 ns(-0.66%)Baseline: 291,387,209.30 ns | 303,642,017.47 ns (95.33%) |
| 1-streams/each-1000-bytes/simulated-time | 📈 view plot 🚷 view threshold | 119,230,000.00 ns(+0.47%)Baseline: 118,673,575.58 ns | 120,713,746.14 ns (98.77%) |
| 1-streams/each-1000-bytes/wallclock-time | 📈 view plot 🚷 view threshold | 580,530.00 ns(-2.04%)Baseline: 592,644.62 ns | 615,533.84 ns (94.31%) |
| 1000-streams/each-1-bytes/simulated-time | 📈 view plot 🚷 view threshold | 2,330,900,000.00 ns(-82.76%)Baseline: 13,523,404,069.77 ns | 23,011,158,648.63 ns (10.13%) |
| 1000-streams/each-1-bytes/wallclock-time | 📈 view plot 🚷 view threshold | 12,555,000.00 ns(-9.32%)Baseline: 13,844,950.58 ns | 15,082,092.84 ns (83.24%) |
| 1000-streams/each-1000-bytes/simulated-time | 📈 view plot 🚷 view threshold | 16,354,000,000.00 ns(-12.39%)Baseline: 18,665,799,418.60 ns | 20,609,155,466.95 ns (79.35%) |
| 1000-streams/each-1000-bytes/wallclock-time | 📈 view plot 🚷 view threshold | 49,630,000.00 ns(-2.05%)Baseline: 50,669,459.30 ns | 57,081,615.58 ns (86.95%) |
| RxStreamOrderer::inbound_frame() | 📈 view plot 🚷 view threshold | 108,150,000.00 ns(-1.41%)Baseline: 109,691,191.86 ns | 111,596,019.96 ns (96.91%) |
| coalesce_acked_from_zero 1+1 entries | 📈 view plot 🚷 view threshold | 89.52 ns(+0.73%)Baseline: 88.87 ns | 90.13 ns (99.32%) |
| coalesce_acked_from_zero 10+1 entries | 📈 view plot 🚷 view threshold | 106.17 ns(+0.09%)Baseline: 106.08 ns | 107.22 ns (99.02%) |
| coalesce_acked_from_zero 1000+1 entries | 📈 view plot 🚷 view threshold | 91.50 ns(+1.31%)Baseline: 90.32 ns | 94.94 ns (96.38%) |
| coalesce_acked_from_zero 3+1 entries | 📈 view plot 🚷 view threshold | 106.21 ns(-0.36%)Baseline: 106.59 ns | 107.68 ns (98.64%) |
| decode 1048576 bytes, mask 3f | 📈 view plot 🚷 view threshold | 1,761,000.00 ns(+7.74%)Baseline: 1,634,541.86 ns | 1,808,886.75 ns (97.35%) |
| decode 1048576 bytes, mask 7f | 📈 view plot 🚷 view threshold | 5,045,400.00 ns(-0.42%)Baseline: 5,066,546.80 ns | 5,113,075.77 ns (98.68%) |
| decode 1048576 bytes, mask ff | 📈 view plot 🚷 view threshold | 3,006,100.00 ns(-0.77%)Baseline: 3,029,460.17 ns | 3,053,652.70 ns (98.44%) |
| decode 4096 bytes, mask 3f | 📈 view plot 🚷 view threshold | 6,233.30 ns(-15.31%)Baseline: 7,360.25 ns | 10,378.43 ns (60.06%) |
| decode 4096 bytes, mask 7f | 📈 view plot 🚷 view threshold | 19,593.00 ns(-1.06%)Baseline: 19,802.90 ns | 20,466.00 ns (95.73%) |
| decode 4096 bytes, mask ff | 📈 view plot 🚷 view threshold | 11,392.00 ns(+0.23%)Baseline: 11,365.60 ns | 12,523.87 ns (90.96%) |
| sent::Packets::take_ranges | 📈 view plot 🚷 view threshold | 4,634.10 ns(-1.84%)Baseline: 4,720.99 ns | 4,959.49 ns (93.44%) |
| transfer/pacing-false/same-seed/simulated-time/run | 📈 view plot 🚷 view threshold | 25,234,000,000.00 ns(-0.70%)Baseline: 25,410,716,374.27 ns | 26,030,182,060.52 ns (96.94%) |
| transfer/pacing-false/same-seed/wallclock-time/run | 📈 view plot 🚷 view threshold | 24,373,000.00 ns(-5.12%)Baseline: 25,687,356.73 ns | 27,106,753.12 ns (89.91%) |
| transfer/pacing-false/varying-seeds/simulated-time/run | 📈 view plot 🚷 view threshold | 25,200,000,000.00 ns(+0.10%)Baseline: 25,175,423,976.61 ns | 25,224,686,559.53 ns (99.90%) |
| transfer/pacing-false/varying-seeds/wallclock-time/run | 📈 view plot 🚷 view threshold | 24,704,000.00 ns(-4.14%)Baseline: 25,770,286.55 ns | 27,415,048.33 ns (90.11%) |
| transfer/pacing-true/same-seed/simulated-time/run | 📈 view plot 🚷 view threshold | 25,301,000,000.00 ns(-1.08%)Baseline: 25,578,292,397.66 ns | 25,883,203,543.21 ns (97.75%) |
| transfer/pacing-true/same-seed/wallclock-time/run | 📈 view plot 🚷 view threshold | 26,101,000.00 ns(-3.44%)Baseline: 27,031,976.61 ns | 28,606,041.22 ns (91.24%) |
| transfer/pacing-true/varying-seeds/simulated-time/run | 📈 view plot 🚷 view threshold | 24,997,000,000.00 ns(+0.01%)Baseline: 24,995,336,257.31 ns | 25,043,987,836.67 ns (99.81%) |
| transfer/pacing-true/varying-seeds/wallclock-time/run | 📈 view plot 🚷 view threshold | 24,932,000.00 ns(-5.10%)Baseline: 26,270,652.05 ns | 27,999,795.18 ns (89.04%) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR improves the mark_as_acked method in the TxBuffer implementation by replacing an inefficient rotate_left + truncate operation with a more efficient drain operation for removing acknowledged bytes from the front of a VecDeque buffer.
- Replaces rotate_left() + truncate() with drain() for better performance
- Adds detailed comments explaining the performance benefits of the new approach
- Maintains the same functionality while improving cache efficiency
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
Co-authored-by: Copilot <[email protected]> Signed-off-by: Lars Eggert <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
mxinden
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried this out in #2680. I was not able to get a clear performance improvement signal.
I don't think rotate_left "always performs a full rotation". Depending on previous operations on the VecDeque, it might simply update a pointer. See this #2680 (comment) for a detailed description.
CodSpeed Performance ReportMerging #3056 will improve performances by 22.36%Comparing Summary
Benchmarks breakdown
|
Failed Interop TestsQUIC Interop Runner, client vs. server, differences relative to b9c32c7. neqo-latest as client
neqo-latest as server
All resultsSucceeded Interop TestsQUIC Interop Runner, client vs. server neqo-latest as client
neqo-latest as server
Unsupported Interop TestsQUIC Interop Runner, client vs. server neqo-latest as client
neqo-latest as server
|
Benchmark resultsPerformance differences relative to b9c32c7. 1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: Change within noise threshold. time: [198.49 ms 199.00 ms 199.52 ms]
thrpt: [501.21 MiB/s 502.52 MiB/s 503.81 MiB/s]
change:
time: [-1.5209% -1.1745% -0.8448%] (p = 0.00 < 0.05)
thrpt: [+0.8520% +1.1885% +1.5444%]
1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: Change within noise threshold. time: [287.76 ms 289.46 ms 291.18 ms]
thrpt: [34.343 Kelem/s 34.547 Kelem/s 34.751 Kelem/s]
change:
time: [+0.6651% +1.4903% +2.3611%] (p = 0.00 < 0.05)
thrpt: [-2.3066% -1.4684% -0.6607%]
1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: No change in performance detected. time: [38.564 ms 38.694 ms 38.842 ms]
thrpt: [25.746 B/s 25.844 B/s 25.931 B/s]
change:
time: [-1.0880% -0.4442% +0.1598%] (p = 0.17 > 0.05)
thrpt: [-0.1596% +0.4462% +1.1000%]
1-conn/1-100mb-req/mtu-1504 (aka. Upload)/client: No change in performance detected. time: [206.39 ms 206.68 ms 206.99 ms]
thrpt: [483.12 MiB/s 483.85 MiB/s 484.52 MiB/s]
change:
time: [-0.2502% -0.0529% +0.1373%] (p = 0.60 > 0.05)
thrpt: [-0.1371% +0.0530% +0.2508%]
decode 4096 bytes, mask ff: No change in performance detected. time: [11.345 µs 11.392 µs 11.443 µs]
change: [-0.0593% +0.5699% +1.2459%] (p = 0.10 > 0.05)
decode 1048576 bytes, mask ff: No change in performance detected. time: [2.9933 ms 3.0061 ms 3.0223 ms]
change: [-1.4035% -0.3551% +0.5513%] (p = 0.51 > 0.05)
decode 4096 bytes, mask 7f: No change in performance detected. time: [19.551 µs 19.593 µs 19.642 µs]
change: [-6.6155% -2.3772% +0.0824%] (p = 0.26 > 0.05)
decode 1048576 bytes, mask 7f: No change in performance detected. time: [5.0302 ms 5.0454 ms 5.0638 ms]
change: [-0.7014% -0.1538% +0.3940%] (p = 0.59 > 0.05)
decode 4096 bytes, mask 3f: No change in performance detected. time: [6.2069 µs 6.2333 µs 6.2671 µs]
change: [-1.3761% +0.7968% +4.4202%] (p = 0.70 > 0.05)
decode 1048576 bytes, mask 3f: No change in performance detected. time: [1.7579 ms 1.7610 ms 1.7673 ms]
change: [+0.0005% +0.1835% +0.5370%] (p = 0.24 > 0.05)
1-streams/each-1000-bytes/wallclock-time: Change within noise threshold. time: [578.64 µs 580.53 µs 582.72 µs]
change: [-2.0295% -1.4732% -0.9183%] (p = 0.00 < 0.05)
1000-streams/each-1-bytes/wallclock-time: 💚 Performance has improved. time: [12.516 ms 12.555 ms 12.595 ms]
change: [-1.8889% -1.4423% -1.0138%] (p = 0.00 < 0.05)
1000-streams/each-1-bytes/simulated-time: No change in performance detected. time: [2.3274 s 2.3309 s 2.3344 s]
thrpt: [428.37 B/s 429.02 B/s 429.67 B/s]
change:
time: [-0.3595% -0.1247% +0.1066%] (p = 0.28 > 0.05)
thrpt: [-0.1065% +0.1248% +0.3608%]
1000-streams/each-1000-bytes/wallclock-time: 💚 Performance has improved. time: [49.530 ms 49.630 ms 49.729 ms]
change: [-1.8133% -1.5147% -1.2216%] (p = 0.00 < 0.05)
coalesce_acked_from_zero 1+1 entries: No change in performance detected. time: [89.218 ns 89.518 ns 89.818 ns]
change: [-0.5183% +0.0655% +0.6084%] (p = 0.83 > 0.05)
coalesce_acked_from_zero 3+1 entries: No change in performance detected. time: [105.88 ns 106.21 ns 106.55 ns]
change: [-0.4634% -0.0494% +0.3944%] (p = 0.82 > 0.05)
coalesce_acked_from_zero 10+1 entries: No change in performance detected. time: [105.68 ns 106.17 ns 106.76 ns]
change: [-0.8417% +0.3794% +1.7740%] (p = 0.61 > 0.05)
coalesce_acked_from_zero 1000+1 entries: No change in performance detected. time: [91.390 ns 91.502 ns 91.630 ns]
change: [-0.5768% +0.0949% +0.7523%] (p = 0.78 > 0.05)
RxStreamOrderer::inbound_frame(): 💚 Performance has improved. time: [108.09 ms 108.15 ms 108.22 ms]
change: [-2.2157% -1.9440% -1.7724%] (p = 0.00 < 0.05)
sent::Packets::take_ranges: No change in performance detected. time: [4.5195 µs 4.6341 µs 4.7421 µs]
change: [-2.0541% +1.2917% +4.6908%] (p = 0.47 > 0.05)
transfer/pacing-false/varying-seeds/wallclock-time/run: Change within noise threshold. time: [24.663 ms 24.704 ms 24.749 ms]
change: [-1.0255% -0.7961% -0.5502%] (p = 0.00 < 0.05)
transfer/pacing-false/varying-seeds/simulated-time/run: No change in performance detected. time: [25.164 s 25.200 s 25.237 s]
thrpt: [162.30 KiB/s 162.54 KiB/s 162.77 KiB/s]
change:
time: [-0.2619% -0.0528% +0.1555%] (p = 0.62 > 0.05)
thrpt: [-0.1552% +0.0528% +0.2625%]
transfer/pacing-true/varying-seeds/wallclock-time/run: Change within noise threshold. time: [24.863 ms 24.932 ms 25.010 ms]
change: [-3.0101% -2.6726% -2.2948%] (p = 0.00 < 0.05)
transfer/pacing-true/varying-seeds/simulated-time/run: No change in performance detected. time: [24.956 s 24.997 s 25.039 s]
thrpt: [163.58 KiB/s 163.86 KiB/s 164.13 KiB/s]
change:
time: [-0.2402% -0.0222% +0.2000%] (p = 0.85 > 0.05)
thrpt: [-0.1996% +0.0222% +0.2408%]
transfer/pacing-false/same-seed/wallclock-time/run: Change within noise threshold. time: [24.354 ms 24.373 ms 24.394 ms]
change: [-0.5640% -0.3759% -0.2191%] (p = 0.00 < 0.05)
transfer/pacing-false/same-seed/simulated-time/run: No change in performance detected. time: [25.234 s 25.234 s 25.234 s]
thrpt: [162.32 KiB/s 162.32 KiB/s 162.32 KiB/s]
change:
time: [+0.0000% +0.0000% +0.0000%] (p = NaN > 0.05)
thrpt: [+0.0000% +0.0000% +0.0000%]
transfer/pacing-true/same-seed/wallclock-time/run: Change within noise threshold. time: [26.087 ms 26.101 ms 26.117 ms]
change: [-1.3144% -1.1085% -0.9644%] (p = 0.00 < 0.05)
transfer/pacing-true/same-seed/simulated-time/run: No change in performance detected. time: [25.301 s 25.301 s 25.301 s]
thrpt: [161.89 KiB/s 161.89 KiB/s 161.89 KiB/s]
change:
time: [+0.0000% +0.0000% +0.0000%] (p = NaN > 0.05)
thrpt: [+0.0000% +0.0000% +0.0000%]
Download data for |
Client/server transfer resultsPerformance differences relative to b9c32c7. Transfer of 33554432 bytes over loopback, min. 100 runs. All unit-less numbers are in milliseconds.
Download data for |
|
Closing here; we should revisit this once #2695 becomes actionable. |
Or at least try.