Copy benchmark extremely inaccurate, varies radically at default 5s. #10534

@icscript

Description

Is there an existing issue?

  • I have searched the existing issues

Experiencing problems? Have you tried our Stack Exchange first?

  • This is not a support question.

Description of bug

[2nd UPDATE - December 4 PM]: Root Cause Identified - Insufficient Memory Benchmark Duration
After systematic testing, I've identified the root cause: the default 5-second memory benchmark duration is inadequate and produces unreliable results.
Key Finding: Extending --memory-duration from 5 to 10+ seconds completely eliminates the variance and failures.
Test Results (using --memory-duration X --disk-duration 1 --hash-duration 1 --verify-duration 1):
=== Memory Duration: 5s ===
| Memory | Copy | 17.10 GiBs | 11.49 GiBs | ✅ Pass (148.8 %) |
| Memory | Copy | 4.94 GiBs | 11.49 GiBs | ❌ Fail ( 43.0 %) |
| Memory | Copy | 5.14 GiBs | 11.49 GiBs | ❌ Fail ( 44.7 %) |

=== Memory Duration: 10s ===
| Memory | Copy | 17.37 GiBs | 11.49 GiBs | ✅ Pass (151.2 %) |
| Memory | Copy | 17.26 GiBs | 11.49 GiBs | ✅ Pass (150.2 %) |
| Memory | Copy | 17.11 GiBs | 11.49 GiBs | ✅ Pass (148.9 %) |

=== Memory Duration: 20s ===
| Memory | Copy | 16.52 GiBs | 11.49 GiBs | ✅ Pass (143.8 %) |
| Memory | Copy | 17.10 GiBs | 11.49 GiBs | ✅ Pass (148.8 %) |
| Memory | Copy | 17.14 GiBs | 11.49 GiBs | ✅ Pass (149.1 %) |

=== Memory Duration: 30s ===
| Memory | Copy | 17.13 GiBs | 11.49 GiBs | ✅ Pass (149.0 %) |
| Memory | Copy | 17.21 GiBs | 11.49 GiBs | ✅ Pass (149.8 %) |
| Memory | Copy | 16.69 GiBs | 11.49 GiBs | ✅ Pass (145.2 %) |
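The duration comparison above can be reproduced with a small sweep. This is a sketch, not the exact script used for these results: sweep_memory_duration, POLKADOT_BIN and BASE_PATH are names made up for illustration, and the benchmark flags are the ones quoted in this issue.

```shell
# Sketch: run 'benchmark machine' several times per memory duration and
# print only the Memory Copy row of each run.
# POLKADOT_BIN and BASE_PATH are illustrative variables (assumptions).
POLKADOT_BIN="${POLKADOT_BIN:-polkadot}"
BASE_PATH="${BASE_PATH:-/var/lib/polkadot}"

# sweep_memory_duration RUNS DURATION [DURATION...]
sweep_memory_duration() {
  sweep_runs="$1"
  shift
  for sweep_duration in "$@"; do
    echo "=== Memory Duration: ${sweep_duration}s ==="
    sweep_i=1
    while [ "$sweep_i" -le "$sweep_runs" ]; do
      # The other benchmarks are cut to 1 second since only Copy matters here.
      "$POLKADOT_BIN" benchmark machine \
        --base-path="$BASE_PATH" \
        --memory-duration "$sweep_duration" \
        --disk-duration 1 \
        --hash-duration 1 \
        --verify-duration 1 \
        2>&1 | grep "Memory.*Copy"
      sweep_i=$((sweep_i + 1))
    done
    echo
  done
}

# Example: three runs each at 5, 10, 20 and 30 seconds.
# sweep_memory_duration 3 5 10 20 30
```

Stop the polkadot service before running the sweep so the node itself does not compete for memory bandwidth.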

Statistics:
5-second duration testing (50+ total runs across multiple sessions):

  • Overall failure rate: ~25-30%

  • Results vary by 3.5-4x across runs (4.47-18.78 GiBs range)

  • Pass results are consistent: 16-18 GiBs

  • Representative samples:

    • Session 1: 67% failure rate (2/3 failed)
    • Session 2: 40% failure rate (2/5 failed)
    • Session 3: 20% failure rate (3/15 failed)
    • Session 4: 20% failure rate (2/10 failed)

10+ second duration testing:

  • 100% pass rate (9/9 passed)
  • Consistent results: 16.52-17.37 GiBs (5% variance)
  • No failures observed across any test session
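For anyone who wants to tally their own runs the same way, here is a minimal sketch. It assumes the table row layout shown above (score in the third column, the word Fail in the result column); summarize_copy is an illustrative name, not part of polkadot.

```shell
# summarize_copy reads "Memory | Copy" table rows on stdin and prints the
# number of runs, the number of failures, the failure rate and the
# lowest/highest score seen.
summarize_copy() {
  awk -F'|' '
    /Memory/ && /Copy/ {
      split($4, score, " ")       # " 17.10 GiBs " -> score[1] = "17.10"
      value = score[1] + 0
      runs++
      if ($0 ~ /Fail/) fails++
      if (runs == 1 || value < min) min = value
      if (runs == 1 || value > max) max = value
    }
    END {
      if (runs == 0) { print "no Copy rows found"; exit 1 }
      printf "runs=%d fails=%d fail_rate=%.0f%% min=%.2f max=%.2f\n",
             runs, fails, 100 * fails / runs, min, max
    }'
}
```

Feeding it the three 5-second rows from the test results above reports 2 failures out of 3 runs, matching the 67% figure for Session 1.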

Conclusion:
The hardware consistently delivers 16-17 GiBs memory copy performance when given adequate benchmark time. The 5-second default is insufficient, likely not allowing enough time for the test to stabilize past transient kernel activity or cache/memory subsystem warmup effects.
Recommendation:
The default --memory-duration should be increased from 5 to at least 10 seconds to ensure reliable results. The current default incorrectly flags capable hardware as inadequate.
Note: The startup benchmark appears to use the same 5-second default and does not accept custom duration parameters, which explains why validator nodes with adequate hardware are experiencing spurious Copy benchmark failures at startup.

Original issue details preserved below for reference. Initial hypothesis about startup process contention was reasonable given symptoms, but systematic testing revealed the actual issue is insufficient benchmark duration.

[1st UPDATE - Dec 4]: Initial disk benchmarks were comparing ramdisk (/tmp) to NVMe. After running benchmark machine --base-path against actual storage, Rnd Write aligns with startup values (~402 MiBs). However, this confirms the Copy (memory) benchmark shows significant non-deterministic behavior during startup, corroborated by community reports below.

The issue occurs on multiple nodes with different hardware specs, all of them fast machines that far exceed the minimum thresholds in 'benchmark machine'. It occurs after server reboots as well as after simple restarts of the polkadot service (ruling out external server startup contention as a cause).

It's possible the issue is due to concurrency/contention between the startup benchmark and other polkadot startup processes. I imagine the intention was for the startup benchmark to run with nothing else in parallel, however perhaps with the evolution of the code and the many processes involved, that's no longer the case.

I believe this because:

A) The polkadot log flow at startup shows output from startup processes preceding the benchmark result, meaning polkadot is doing other work before or while the benchmark runs. Example:

23:59:23.259  INFO main sc_cli::runner: 👤 Role: AUTHORITY
23:59:23.259  INFO main sc_cli::runner: 💾 Database: ParityDb at /var/lib/polkadot/chains/polkadot/paritydb/full
23:59:24.431 DEBUG main sc_client_db: Initializing shared trie cache with size 1073741824 bytes, 0.80070156685443% of total memory
23:59:24.607  INFO main txpool: Creating transaction pool txpool_type=ForkAware ready=Limit { count: 8192, total_bytes: 20971520 } future=Limit { count: 819, total_bytes: 2097152 }
23:59:25.071  INFO main sc_service::builder: 📦 Highest known block at #28805690
23:59:25.072  WARN main polkadot_service::builder: ⚠️  The hardware does not meet the minimal requirements Failed checks: Copy(expected: 11.49 GiBs, found: 5.58 GiBs),  for role 'Authority' find out more at:

You can see things starting and posting to the log before the benchmark posts its results. If the benchmark is to be accurate, nothing whatsoever should start until after the benchmark is complete and has posted its result. In reality it seems that, amongst other things, the database trie cache is initializing before the benchmark, which could imply the database is started before the benchmark is run.

B) The most logical explanation is concurrency and contention with other processes, given the variance between the startup benchmark and 'benchmark machine', including when run back to back. Data and examples:

Benchmark Machine (with --base-path=/var/lib/polkadot):
| Memory   | Copy                  | 16.45 GiBs  | 11.49 GiBs   | ✅ Pass (143.1 %) |

Startup benchmark examples:

Kernel:
⚠️ The hardware does not meet the minimal requirements Failed checks: Copy(expected: 11.49 GiBs, found: 7.83 GiBs), for role 'Authority'

⚠️ The hardware does not meet the minimal requirements Failed checks: Copy(expected: 11.49 GiBs, found: 5.58 GiBs), for role 'Authority' find out more at:

⚠️ The hardware does not meet the minimal requirements Failed checks: Copy(expected: 11.49 GiBs, found: 8.88 GiBs), for role 'Authority'
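To collect these startup values over many restarts, the warning lines can be pulled out of the log. A rough sketch, assuming the warning format shown above; extract_copy_failures is an illustrative helper name.

```shell
# extract_copy_failures reads log text on stdin and, for every failed
# startup Copy check, prints the measured value, the threshold and the
# measured value as a percentage of the threshold.
extract_copy_failures() {
  sed -n 's/.*Copy(expected: \([0-9.]*\) GiBs, found: \([0-9.]*\) GiBs).*/\2 \1/p' |
    awk '{ printf "found=%s expected=%s ratio=%.1f%%\n", $1, $2, 100 * $1 / $2 }'
}
```

If the node runs under systemd, something like `journalctl -u polkadot | extract_copy_failures` should work; the unit name polkadot is an assumption about the local setup.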

C) The startup benchmark appears to apply stricter pass/fail criteria than 'benchmark machine'. 'benchmark machine' explicitly applies a "10% fault tolerance" and would pass results within 10% which the startup benchmark would fail. This inconsistency suggests the startup benchmark may be using different thresholds or lacks the same tolerance logic, which could result in spurious failures even on hardware that genuinely meets requirements.
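To make the difference concrete, here is a sketch of a threshold check with and without a tolerance. How 'benchmark machine' actually applies its 10% tolerance internally is an assumption on my part; meets_threshold is an illustrative helper, not polkadot code.

```shell
# meets_threshold FOUND EXPECTED [TOLERANCE_PCT]
# Succeeds (exit status 0) when FOUND >= EXPECTED * (1 - TOLERANCE_PCT / 100).
# TOLERANCE_PCT defaults to 10; pass 0 for a strict comparison.
meets_threshold() {
  awk -v found="$1" -v expected="$2" -v tol="${3:-10}" \
    'BEGIN { exit !(found >= expected * (1 - tol / 100)) }'
}

# Hypothetical score just under the 11.49 GiBs threshold:
# meets_threshold 10.50 11.49 10   # succeeds with 10% tolerance
# meets_threshold 10.50 11.49 0    # fails with a strict comparison
```

Note that the startup values reported in this issue (5.58 to 8.88 GiBs) fall short under either rule, so tolerance alone would not explain them.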

Additional Data:

Command line options:

ExecStart=/usr/local/bin/polkadot \
    --validator \
    --database=ParityDb \
    --state-pruning=256 \
    --blocks-pruning=256 \
    --telemetry-url 'wss://telemetry-backend.w3f.community/submit 1' \
    --telemetry-url 'wss://telemetry.polkadot.io/submit 0' \
    --no-mdns \
    --no-private-ip \
    --db-cache 49152 \
    --base-path=/var/lib/polkadot \
    --network-backend litep2p \
    --sync=warp \
    --log=warn,sc_cli=info,txpool=info

Steps to reproduce

Stop polkadot. Run 'polkadot benchmark machine' several times and observe significantly different results from the Memory Copy test. Example of testing loop:

for i in {1..10}; do
  echo "Run $i:"
  polkadot benchmark machine \
    --base-path=/var/lib/polkadot \
    --memory-duration 5 \
    --disk-duration 1 \
    --hash-duration 1 \
    --verify-duration 1 \
    2>&1 | grep "Memory.*Copy"
  sleep 10
done
  • Tests we don't need are reduced to one second. Memory duration is set to 5 seconds, the same as the default.

Start or restart polkadot to catch a startup benchmark failure of 'Memory Copy', ideally more than once, and note the statistics.

Metadata

    Labels

      • I10-unconfirmed: Issue might be valid, but it's not yet known.
      • I2-bug: The node fails to follow expected behavior.
