Payload scaling chart: Photon Ring vs disruptor-rs across 8 B–4 KiB payloads

Environment

| Machine | CPU | OS | Rust |
| --- | --- | --- | --- |
| A (primary) | Intel Core i7-10700KF @ 3.80 GHz | Linux 6.8 | 1.93.1 |
| B (secondary) | Apple M1 Pro | macOS 26.3 | 1.92.0 |

Framework: Criterion, 100 samples, 3-second warmup, ring size 4096 slots.

Same-Thread Roundtrip

Producer and consumer run on the same thread, so the ring stays L1-hot; this measures pure instruction cost with no cache-coherence traffic.

| Payload | Latency (A) | Latency (B) | Cache lines | Notes |
| --- | --- | --- | --- | --- |
| 8 B | 2.4 ns | 8.6 ns | 1 | Stamp + value share one 64 B line |
| 16 B | 9.8 ns | 11.3 ns | 1 | |
| 32 B | 11.8 ns | 13.0 ns | 1 | |
| 64 B | 18.8 ns | 16.4 ns | 2 | Slot = 72 B, spills to 2 lines |
| 128 B | 23.3 ns | 25.4 ns | 3 | |
| 256 B | 34.4 ns | 41.2 ns | 5 | |
| 512 B | 55.9 ns | 69.6 ns | 9 | |
| 1 KiB | 88.1 ns | 127.9 ns | 17 | memcpy starts to dominate |
| 2 KiB | 149.6 ns | 244.6 ns | 33 | |
| 4 KiB | 361.6 ns | 500.9 ns | 65 | ~5.6 ns per cache line |
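The "Cache lines" column follows directly from the slot layout: an 8-byte stamp ahead of the payload, rounded up to 64-byte lines (the 8-byte stamp size is implied by the table notes, e.g. a 64 B payload giving a 72 B slot; the function name below is illustrative):

```rust
/// Number of 64-byte cache lines a slot occupies, assuming an 8-byte
/// stamp stored inline ahead of the payload (per the table notes).
fn slot_cache_lines(payload_bytes: usize) -> usize {
    const STAMP: usize = 8;
    const LINE: usize = 64;
    (payload_bytes + STAMP + LINE - 1) / LINE // ceiling division
}

fn main() {
    // Reproduces the "Cache lines" column of the table above.
    for (payload, expected) in [
        (8, 1), (16, 1), (32, 1), (64, 2), (128, 3),
        (256, 5), (512, 9), (1024, 17), (2048, 33), (4096, 65),
    ] {
        assert_eq!(slot_cache_lines(payload), expected);
    }
    println!("all sizes match");
}
```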

Cross-Thread Roundtrip vs disruptor-rs

Methodology note: The 117 ns at 8 B here, versus 95 ns in the main benchmarks, reflects differences in Criterion warm-up, iterator structure, and type-generic overhead. The disruptor-rs column is modeled rather than measured at each payload size: it takes the 133 ns baseline from the actual disruptor crate benchmark and adds estimated per-cache-line transfer costs.
| Payload | Photon Ring (A) | Photon Ring (B) | Disruptor (modeled, A) | Advantage (A) |
| --- | --- | --- | --- | --- |
| 8 B | 117 ns | 156.7 ns | 133 ns | 12% faster |
| 64 B | 125 ns | 195.8 ns | 145 ns | 14% faster |
| 256 B | 148 ns | 156.7 ns | 181 ns | 18% faster |
| 512 B | 163 ns | 167.6 ns | 229 ns | 29% faster |
| 1 KiB | 191 ns | 226.5 ns | 325 ns | 41% faster |
| 4 KiB | 342 ns | 369.7 ns | 901 ns | 62% faster |
*Figure: Cross-Thread Latency vs Payload Size (Intel i7-10700KF). Photon Ring vs the modeled disruptor-rs baseline; log-scale x-axis.*
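The modeled Disruptor column is consistent with the 133 ns measured baseline plus a fixed per-additional-cache-line cost. A sketch of that model, where the ~12 ns/line figure is inferred by fitting the modeled column, not a measurement from the source, and the slot layout (8-byte stamp, 64-byte lines) comes from the tables above:

```rust
// Cache lines occupied by an 8-byte stamp plus the payload (64 B lines).
fn lines(payload_bytes: usize) -> u64 {
    ((payload_bytes + 8 + 63) / 64) as u64
}

/// Modeled Disruptor latency: the measured 133 ns baseline plus an
/// estimated transfer cost per cache line beyond the first. The
/// ~12 ns/line constant is inferred from the modeled column, not measured.
fn modeled_disruptor_ns(payload_bytes: usize) -> u64 {
    133 + 12 * (lines(payload_bytes) - 1)
}

fn main() {
    // Reproduces the "Disruptor (modeled, A)" column.
    for (payload, expected) in
        [(8, 133), (64, 145), (256, 181), (512, 229), (1024, 325), (4096, 901)]
    {
        assert_eq!(modeled_disruptor_ns(payload), expected);
    }
    println!("model matches table");
}
```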

Key Observations

The memcpy is cheap relative to cache coherence

For payloads up to 56 bytes (one cache line including the 8-byte stamp), the memcpy costs roughly 2–3 ns against a ~96 ns cache-coherence transfer: the copy is roughly 3% of total latency.

Photon Ring outperforms at all tested payload sizes

The performance advantage grows with payload size because:

  1. The Disruptor pays the same cache coherence cost. Consumers must still transfer the same cache lines from the publisher, whether they read in-place or copy.
  2. The Disruptor has higher base overhead. Sequence barrier load + event handler dispatch + shared cursor contention adds ~37 ns over Photon Ring's stamp-only fast path.
  3. x86 memcpy is extremely efficient. rep movsb with ERMS (Enhanced REP MOVSB) achieves near-memory-bandwidth speeds. The 4 KiB copy costs ~200 ns, but the Disruptor's multi-line coherence transfer costs more.

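The stamp-only fast path described in point 2 can be sketched as a plain copy followed by a single release-store of the sequence stamp. This is an illustrative sketch of the publish pattern, not Photon Ring's actual code; the `Slot` type and method names are hypothetical:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical slot layout in the spirit of the "stamp + payload" design
// described above: an 8-byte sequence stamp ahead of an inline payload.
struct Slot<const N: usize> {
    stamp: AtomicU64,  // 8-byte sequence stamp
    payload: [u8; N],  // inline payload, copied on publish
}

impl<const N: usize> Slot<N> {
    fn publish(&mut self, seq: u64, data: &[u8; N]) {
        // The memcpy: cheap next to the cross-core coherence transfer
        // of the same cache lines.
        self.payload.copy_from_slice(data);
        // Release-store the stamp so a consumer that observes `seq`
        // also observes the payload bytes: one atomic store, no
        // sequence barrier or shared-cursor contention.
        self.stamp.store(seq, Ordering::Release);
    }

    fn try_read(&self, expected: u64) -> Option<[u8; N]> {
        // Acquire pairs with the Release store in `publish`.
        if self.stamp.load(Ordering::Acquire) != expected {
            return None;
        }
        Some(self.payload)
    }
}

fn main() {
    let mut slot = Slot::<64> { stamp: AtomicU64::new(0), payload: [0u8; 64] };
    slot.publish(1, &[0xAB; 64]);
    assert_eq!(slot.try_read(1), Some([0xAB; 64]));
    assert_eq!(slot.try_read(2), None); // stamp not yet at sequence 2
    println!("publish/read ok");
}
```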
Regenerating

```sh
cargo bench --bench payload_scaling
python3 scripts/plot_payload_scaling.py
```