Benchmark Methodology — Photon Ring

Hardware

Intel i7-10700KF (primary)

Property	Value
CPU	Intel Core i7-10700KF (Comet Lake)
Base frequency	3.80 GHz
Turbo frequency	Up to 5.10 GHz (single-core)
Cores / Threads	8 cores / 16 threads (SMT enabled)
L1d cache	32 KB per core, 8-way
L2 cache	256 KB per core, 4-way
L3 cache	16 MB shared, ring bus interconnect
Architecture	x86_64, Comet Lake (14 nm)
OS	Ubuntu (Linux 6.8)
Rust	1.93.1 stable

Apple M1 Pro (secondary)

Property	Value
CPU	Apple M1 Pro
Cores	8 (6 performance + 2 efficiency)
Architecture	aarch64 (ARMv8.5-A)
L1d cache	128 KB per P-core, 64 KB per E-core
L2 cache	12 MB P-cluster, 4 MB E-cluster
OS	macOS 26.3
Rust	1.92.0 stable

Criterion Configuration

Parameter	Value
Sample size	100 (Criterion default)
Warm-up time	3 seconds (Criterion default)
Measurement time	5 seconds (Criterion default)
Reported statistic	Median
Outlier detection	Criterion built-in classification (modified Tukey's method, based on IQR fences)

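Compiler flags: release-level optimization (opt-level 3). cargo bench builds with the bench profile, which inherits from release; the example is run with --release. No custom RUSTFLAGS, LTO, PGO, or target-cpu=native.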

What Is NOT Controlled

The following variables are not controlled and can cause variance between runs and machines:

CPU frequency governor. Left at OS default. Turbo boost is not disabled.
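SMT (Hyper-Threading). Enabled on Intel i7-10700KF. Cross-thread benchmarks may land on sibling hyperthreads (which share a core's L1 and L2 caches) or on separate physical cores (which communicate through the shared L3), and the two placements produce very different latencies.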
Core isolation. No isolcpus, nohz_full, or rcu_nocbs kernel parameters are set.
Core pinning. Criterion benchmarks do not pin threads. The rdtsc_oneway bench and the pinned_latency example do use core pinning where noted.
Background load. Benchmarks run on a developer workstation, not a dedicated bare-metal machine.
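
To record which of these uncontrolled variables applied to a given run, the relevant state can be read from Linux sysfs before benchmarking. The following is a minimal sketch, not part of the Photon Ring benchmark suite: the helper names (parse_cpu_list, read_sysfs, smt_siblings, scaling_governor) are hypothetical, and it assumes the standard Linux sysfs layout under /sys/devices/system/cpu. On other systems the reads simply return None.

```rust
use std::fs;

/// Parses a Linux CPU list such as "0,8" or "0-3,8-11" into CPU indices.
/// Returns None if any field is malformed (including an empty list).
fn parse_cpu_list(list: &str) -> Option<Vec<usize>> {
    let mut cpus = Vec::new();
    for part in list.trim().split(',') {
        match part.split_once('-') {
            Some((lo, hi)) => {
                let lo: usize = lo.trim().parse().ok()?;
                let hi: usize = hi.trim().parse().ok()?;
                if lo > hi {
                    return None;
                }
                cpus.extend(lo..=hi);
            }
            None => cpus.push(part.trim().parse().ok()?),
        }
    }
    Some(cpus)
}

/// Reads a sysfs file and trims it; None when the file cannot be read.
fn read_sysfs(path: &str) -> Option<String> {
    fs::read_to_string(path).ok().map(|s| s.trim().to_string())
}

/// Logical CPUs sharing a physical core with `cpu` (the list includes `cpu`).
fn smt_siblings(cpu: usize) -> Option<Vec<usize>> {
    let path = format!("/sys/devices/system/cpu/cpu{cpu}/topology/thread_siblings_list");
    parse_cpu_list(&read_sysfs(&path)?)
}

/// Frequency governor currently active for `cpu`, e.g. "powersave".
fn scaling_governor(cpu: usize) -> Option<String> {
    read_sysfs(&format!(
        "/sys/devices/system/cpu/cpu{cpu}/cpufreq/scaling_governor"
    ))
}

fn main() {
    println!("governor(cpu0)     = {:?}", scaling_governor(0));
    println!("smt siblings(cpu0) = {:?}", smt_siblings(0));
    // Empty when no isolcpus= parameter is set.
    println!("isolated cpus      = {:?}", read_sysfs("/sys/devices/system/cpu/isolated"));
}
```

Logging this output next to each result makes it easier to tell whether two runs differ because of the code or because of the environment.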

Cross-Thread Roundtrip Methodology

The roundtrip benchmark (benches/throughput.rs, function cross_thread_latency) measures the time for a message to travel from the publisher to a subscriber thread and for the subscriber to signal receipt back:

1. Publisher writes a u64 sequence number via publish(i).
2. Subscriber thread busy-spins on try_recv(). On receipt it stores the value into a shared AtomicU64 (seen) with Release ordering.
3. Publisher busy-spins on seen.load(Acquire) until it equals i.
4. Criterion measures steps 1–3.

Note: This is a roundtrip measurement. It includes one cache line transfer for the slot data (publisher → subscriber) and one for the seen atomic (subscriber → publisher). The reported 95 ns is therefore roughly twice the true one-way latency, plus the overhead of the AtomicU64 signal-back.
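
The shape of this measurement can be sketched with the standard library alone. This is an illustration of the signal-back pattern, not the code in benches/throughput.rs: a plain AtomicU64 named slot stands in for the ring (an assumption, since Photon Ring's publish and try_recv are not reproduced here), Instant replaces Criterion's timing loop, and the function name roundtrip is hypothetical.

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Runs `iters` publish/acknowledge roundtrips and returns the total elapsed time.
fn roundtrip(iters: u64) -> Duration {
    let slot = Arc::new(AtomicU64::new(0)); // stand-in for the ring: publisher -> subscriber
    let seen = Arc::new(AtomicU64::new(0)); // signal-back: subscriber -> publisher
    let stop = Arc::new(AtomicBool::new(false));

    let subscriber = {
        let (slot, seen, stop) = (slot.clone(), seen.clone(), stop.clone());
        thread::spawn(move || {
            let mut last = 0;
            // Busy-spin until a new sequence number appears, then acknowledge it.
            while !stop.load(Ordering::Relaxed) {
                let v = slot.load(Ordering::Acquire);
                if v != last {
                    last = v;
                    seen.store(v, Ordering::Release);
                } else {
                    std::hint::spin_loop();
                }
            }
        })
    };

    let start = Instant::now();
    // Sequence numbers start at 1 so the initial 0 is never mistaken for a message.
    for i in 1..=iters {
        slot.store(i, Ordering::Release); // step 1: publish
        while seen.load(Ordering::Acquire) != i {
            std::hint::spin_loop(); // step 3: wait for the acknowledgement
        }
    }
    let elapsed = start.elapsed();

    stop.store(true, Ordering::Relaxed);
    subscriber.join().unwrap();
    elapsed
}

fn main() {
    let iters = 10_000;
    let elapsed = roundtrip(iters);
    println!(
        "mean roundtrip: {:.1} ns",
        elapsed.as_nanos() as f64 / iters as f64
    );
}
```

Each iteration moves two cache lines between cores, one per direction, which is why a figure obtained this way should not be read as a one-way latency.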

One-Way Latency (RDTSC)

The one-way benchmark (benches/rdtsc_oneway.rs) eliminates signal-back overhead by embedding the publisher's TSC reading directly in the message payload:

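Publisher calls RDTSCP immediately before publish(). RDTSCP is not a fully serializing instruction, but it waits until all earlier instructions have executed before reading the TSC. The TSC value is stored in the message payload.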
Subscriber calls LFENCE; RDTSC immediately after try_recv() returns Ok.
The delta (subscriber_tsc - publisher_tsc) is recorded in raw cycles.
After 100,000 samples (10,000 warmup discarded), percentiles are computed and converted to nanoseconds using the known CPU base and turbo frequencies.
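
The post-processing in the last step can be sketched as follows. This is a hedged illustration rather than the code in benches/rdtsc_oneway.rs: the function names are hypothetical, the nearest-rank percentile definition is an assumption (the benchmark may interpolate differently), and the conversion assumes a TSC that ticks at a fixed rate equal to the frequency passed in. The 3.8 GHz in the usage example is the base frequency from the hardware table, used here only as an illustrative TSC rate, and the input deltas are synthetic.

```rust
/// Nearest-rank percentile of an ascending-sorted slice.
/// `permille` is the percentile in tenths of a percent: 500 = p50, 999 = p99.9.
fn percentile(sorted: &[u64], permille: usize) -> u64 {
    assert!(!sorted.is_empty() && (1..=1000).contains(&permille));
    // 1-based rank = ceil(n * permille / 1000), computed in integers.
    let rank = (sorted.len() * permille + 999) / 1000;
    sorted[rank - 1]
}

/// Converts TSC cycles to nanoseconds for a TSC ticking at `ghz` GHz.
fn cycles_to_ns(cycles: u64, ghz: f64) -> f64 {
    cycles as f64 / ghz
}

/// Discards the first `warmup` samples, sorts the rest,
/// and returns (p50, p99, p99.9) in raw cycles.
fn summarize(mut deltas: Vec<u64>, warmup: usize) -> (u64, u64, u64) {
    let mut kept = deltas.split_off(warmup.min(deltas.len()));
    kept.sort_unstable();
    (
        percentile(&kept, 500),
        percentile(&kept, 990),
        percentile(&kept, 999),
    )
}

fn main() {
    // Synthetic cycle deltas, purely illustrative: not measured data.
    let deltas: Vec<u64> = (0..1_000u64).map(|i| 150 + (i * 7) % 100).collect();
    let (p50, p99, p999) = summarize(deltas, 100);
    for (name, cycles) in [("p50", p50), ("p99", p99), ("p99.9", p999)] {
        println!(
            "{name}: {cycles} cycles = {:.1} ns at 3.8 GHz",
            cycles_to_ns(cycles, 3.8)
        );
    }
}
```

Keeping the raw cycle counts alongside the converted values lets a reader redo the conversion with a different frequency assumption.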

Disruptor Comparison

Both Photon Ring and disruptor-rs benchmarks run in the same Criterion binary, compiled with identical flags, in the same cargo bench invocation:

Same ring size: 4096 slots.
Same wait strategy: BusySpin (lowest-latency strategy in both libraries).
Publish-only: The Disruptor ring has a single BusySpin consumer attached (required by its API). The consumer stores received values into an atomic with Relaxed ordering, which the benchmark ignores.

Cross-thread Disruptor numbers are not available because the Disruptor's consumer thread is managed internally by its builder API. The roundtrip comparison uses same-thread Criterion iteration for both libraries.

How to Reproduce

Full benchmark suite (Criterion)

cargo bench --bench throughput
cargo bench --bench payload_scaling

Results are written to target/criterion/ as JSON and HTML reports.

One-way latency (RDTSC)

# x86_64 only -- uses inline RDTSCP/LFENCE+RDTSC
cargo bench --bench rdtsc_oneway

Pinned-core latency example

cargo run --release --example pinned_latency

Caveats

Self-benchmarks. All benchmarks are authored and run by the Photon Ring maintainers. They have not been independently verified by a third party.
Hardware-dependent. Numbers are specific to the tested hardware. Different CPUs, cache hierarchies, and interconnects will produce different results.
Disruptor comparison is against the Rust port. The disruptor crate (v4.0.0) is a Rust reimplementation of the LMAX Disruptor pattern. A direct comparison against the Java original on matched hardware has not been performed.
Median vs. tail latency. The README reports median (p50). Tail latency (p99, p999) is higher and more variable. The rdtsc_oneway benchmark reports full percentile distributions.
Single-socket only. All benchmarks run on single-socket machines. Cross-socket (NUMA) latency would be significantly higher for both libraries.