1 Introduction
1.1 The inter-thread communication bottleneck
In concurrent systems — from high-frequency trading engines and real-time audio pipelines to game simulation loops — inter-thread message passing lies on the critical path of nearly every latency-sensitive operation. The dominant cost is not lock acquisition or memory allocation, but the cache-coherence protocol round-trip imposed by hardware itself.
When a producer thread on core A writes a message, the cache line containing that message transitions to the Modified state in A's private L1 cache. Before a consumer thread on core B can read that message, the coherence protocol must transfer the cache line from A's cache hierarchy to B's. On Intel processors using a ring-bus L3 interconnect (Comet Lake), this intra-socket transfer takes approximately 40–55 ns.
This coherence latency represents a hard physical floor. No software optimization can deliver an inter-thread message faster than the time required for a single cache-line transfer between cores. For a naive messaging scheme that touches two cache lines per message (one for the data, one for a shared control variable), the floor doubles.
1.2 The LMAX Disruptor and its limitations
The LMAX Disruptor, introduced by Thompson, Farley, and Barker in 2011, represented a landmark in the mechanical-sympathy approach to concurrent systems design. By replacing bounded queues with a pre-allocated ring buffer, it eliminated per-message allocation. However, its reliance on sequence barriers introduces structural overhead that cannot be eliminated within its design framework.
On the consumer's hot path, receiving a single message requires two cache-line transfers: first, the consumer loads the shared sequence barrier to determine that a new message is available; second, it loads the slot containing the message payload. If the barrier and slot reside on different cache lines — which they almost always do — the consumer pays two L3 snoop latencies per message: approximately 80–110 ns of irreducible coherence traffic.
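The two-load pattern can be sketched as follows. The types and function here are illustrative stand-ins, not the disruptor-rs API; the point is only that the barrier and the slot occupy different cache lines, so every receive touches both.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Shared sequence barrier: cache line 1 (written by the producer).
#[repr(align(64))]
struct Barrier(AtomicU64);

// Payload slot: cache line 2 (also written by the producer).
#[repr(align(64))]
struct Slot(u64);

// Sequence-barrier receive: two loads, two distinct cache lines.
fn try_receive(barrier: &Barrier, slots: &[Slot], next: u64) -> Option<u64> {
    // Load 1: has sequence `next` been published? (first L3 snoop)
    if barrier.0.load(Ordering::Acquire) < next {
        return None;
    }
    // Load 2: read the payload slot (second snoop, different line)
    Some(slots[next as usize % slots.len()].0)
}

fn main() {
    let barrier = Barrier(AtomicU64::new(1)); // sequence 1 published
    let slots = [Slot(0), Slot(42)];
    assert_eq!(try_receive(&barrier, &slots, 1), Some(42));
    assert_eq!(try_receive(&barrier, &slots, 2), None); // not yet published
    println!("ok");
}
```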
1.3 Our contribution
Photon Ring eliminates the sequence-barrier load from the consumer hot path. The key insight
is stamp-in-slot co-location: by embedding a seqlock sequence stamp directly
in the same #[repr(C, align(64))] slot structure as the message payload, both
ownership metadata and data reside within a single 64-byte cache line for payloads up to
56 bytes.
2 Background: Cache Coherence and Seqlocks
2.1 Cache coherence protocols
The MESI protocol assigns each cache line one of four states: Modified (dirty, present only in this core's cache), Exclusive (clean, only in this cache), Shared (clean, may be in multiple caches), and Invalid (not present).
The critical path for inter-thread communication is the Modified-to-Shared transition. On Intel desktop processors with a ring-bus L3 interconnect (Skylake through Comet Lake), the end-to-end latency for this sequence is approximately 40–55 ns, dominated by ring-bus traversal time.
2.2 Seqlocks in the Linux kernel
The seqlock, introduced in Linux 2.5.60, is a reader-writer synchronization mechanism optimized for workloads where reads vastly outnumber writes. Readers proceed without acquiring any lock, instead performing an optimistic read-and-verify protocol:
Writer:
    write_seqlock(&seq);
    // modify protected data
    write_sequnlock(&seq);

Reader:
    do {
        s = read_seqbegin(&seq);
        // copy protected data
    } while (read_seqretry(&seq, s));
If the reader's two counter samples differ, or the initial sample is odd (write in progress),
the reader discards the copy and retries. This is sound only when the protected data has no
pointers and no destructor — exactly the Pod constraint Photon Ring enforces.
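The kernel pattern above translates naturally to Rust. The following is a minimal single-writer seqlock sketch — not Photon Ring's implementation — with an even/odd counter and an optimistic copy-and-verify read:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{fence, AtomicU64, Ordering};

// Minimal single-writer seqlock over plain (Copy) data.
struct SeqLock<T: Copy> {
    seq: AtomicU64, // even = stable, odd = write in progress
    data: UnsafeCell<T>,
}

// Safe to share because readers only ever copy and then validate.
unsafe impl<T: Copy + Send> Sync for SeqLock<T> {}

impl<T: Copy> SeqLock<T> {
    fn new(v: T) -> Self {
        Self { seq: AtomicU64::new(0), data: UnsafeCell::new(v) }
    }

    // Writer: bump to odd, mutate, bump to even.
    fn write(&self, v: T) {
        let s = self.seq.load(Ordering::Relaxed);
        self.seq.store(s + 1, Ordering::Release); // odd: write begins
        fence(Ordering::Release);                 // counter before data
        unsafe { *self.data.get() = v };
        self.seq.store(s + 2, Ordering::Release); // even: write complete
    }

    // Reader: optimistic copy, retry if the counter moved or is odd.
    fn read(&self) -> T {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            if s1 % 2 == 1 {
                continue; // write in progress
            }
            let v = unsafe { *self.data.get() };
            fence(Ordering::Acquire); // keep the copy before the re-check
            let s2 = self.seq.load(Ordering::Relaxed);
            if s1 == s2 {
                return v; // copy observed a stable snapshot
            }
        }
    }
}

fn main() {
    let lock = SeqLock::new([1u64, 2]);
    lock.write([3, 4]);
    assert_eq!(lock.read(), [3, 4]);
    println!("{:?}", lock.read());
}
```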
3 Design: Stamp-in-Slot
3.1 Slot layout
#[repr(C, align(64))]
pub struct Slot<T> {
stamp: AtomicU64, // seqlock sequence number
value: UnsafeCell<T>,
// padding to align(64) if needed
}
// For T <= 56 bytes: sizeof(Slot<T>) == 64 (one cache line)
// For T > 56 bytes: sizeof(Slot<T>) == ceil((8 + sizeof(T)) / 64) * 64
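The two size comments can be checked directly with std::mem::size_of. A sketch, assuming only the layout shown above:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::AtomicU64;

#[repr(C, align(64))]
#[allow(dead_code)]
struct Slot<T> {
    stamp: AtomicU64,      // 8 bytes at offset 0
    value: UnsafeCell<T>,  // payload at offset 8 (for align(T) <= 8)
}

fn main() {
    use std::mem::size_of;
    // 48-byte payload: 8 + 48 = 56 <= 64, one cache line.
    assert_eq!(size_of::<Slot<[u8; 48]>>(), 64);
    // 56-byte payload: 8 + 56 = 64, still exactly one cache line.
    assert_eq!(size_of::<Slot<[u8; 56]>>(), 64);
    // 57-byte payload spills over: ceil((8 + 57) / 64) * 64 = 128.
    assert_eq!(size_of::<Slot<[u8; 57]>>(), 128);
    println!("layout checks pass");
}
```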
3.2 Write protocol
1. stamp.store(seq * 2 + 1, Release)   // odd = write in progress
2. fence(Release)                      // stamp visible before data
3. ptr::write(slot.value, data)        // write payload (T: Pod)
4. stamp.store(seq * 2 + 2, Release)   // even = write complete
5. cursor.store(seq, Release)          // consumers can proceed
3.3 Read protocol
1. s1 = stamp.load(Acquire)
2. if s1 is odd: spin (write in progress)
3. if s1 < expected*2+2: return Empty
4. if s1 > expected*2+2: return Lagged (ring wrapped)
5. value = ptr::read(slot.value)       // optimistic copy
6. s2 = stamp.load(Acquire)
7. if s1 == s2: return Ok(value)
8. else: retry from step 1
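Both protocols can be sketched together on a single slot. This is illustrative, not the crate's code: T: Copy stands in for the Pod bound, the producer cursor store (write step 5) is omitted, and the second stamp load is ordered with an acquire fence.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{fence, AtomicU64, Ordering};

#[repr(C, align(64))]
struct Slot<T> {
    stamp: AtomicU64,
    value: UnsafeCell<T>,
}

#[derive(Debug, PartialEq)]
enum Recv<T> {
    Ok(T),
    Empty,
    Lagged,
}

impl<T: Copy> Slot<T> {
    fn new(v: T) -> Self {
        Self { stamp: AtomicU64::new(0), value: UnsafeCell::new(v) }
    }

    // Producer: write-protocol steps 1-4.
    fn write(&self, seq: u64, data: T) {
        self.stamp.store(seq * 2 + 1, Ordering::Release); // odd: in progress
        fence(Ordering::Release);                         // stamp before data
        unsafe { *self.value.get() = data };
        self.stamp.store(seq * 2 + 2, Ordering::Release); // even: complete
    }

    // Consumer: read-protocol steps 1-8.
    fn read(&self, expected: u64) -> Recv<T> {
        loop {
            let s1 = self.stamp.load(Ordering::Acquire);
            if s1 % 2 == 1 {
                continue; // write in progress
            }
            if s1 < expected * 2 + 2 {
                return Recv::Empty; // not yet published
            }
            if s1 > expected * 2 + 2 {
                return Recv::Lagged; // ring wrapped past us
            }
            let v = unsafe { *self.value.get() }; // optimistic copy
            fence(Ordering::Acquire);             // copy before re-check
            let s2 = self.stamp.load(Ordering::Relaxed);
            if s1 == s2 {
                return Recv::Ok(v); // stamp unchanged: copy is consistent
            }
        }
    }
}

fn main() {
    let slot = Slot::new(0u64);
    assert_eq!(slot.read(0), Recv::Empty); // nothing published yet
    slot.write(0, 99);
    assert_eq!(slot.read(0), Recv::Ok(99));
    slot.write(1, 100);
    assert_eq!(slot.read(0), Recv::Lagged); // producer lapped this reader
    println!("protocol checks pass");
}
```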
3.4 Why one cache-line transfer suffices
When T fits in 56 bytes, the stamp at offset 0 and the value at offset 8 reside in the same
64-byte cache line. The consumer's stamp.load(Acquire) in step 1 triggers exactly
one L3 snoop. The ptr::read in step 5 reads from the same line, which is already
in the consumer's L1 cache. The total coherence traffic for a successful receive: one snoop,
~40–55 ns.
The Disruptor requires: one snoop for the sequence barrier, then one snoop for the slot data (different cache line) = ~80–110 ns minimum.
3.5 Per-consumer cursors eliminate shared state
Each Subscriber<T> holds a private, non-atomic u64 cursor.
No cache line is shared between subscribers. The producer cursor is consulted only on the
lag-detection slow path. On the common-case fast path, the consumer goes directly to the
expected slot index and checks its stamp, with no cross-core atomic load at all.
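A minimal sketch of that fast path, with illustrative names, the payload elided so only the stamp load remains, and the simplifying assumption that message sequence k lands in slot k % N with stamp k*2+2:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Each subscriber owns a plain u64 cursor. On the fast path, the only
// cross-core load is the stamp of the slot it expects to read next.
struct Subscriber<'a> {
    slots: &'a [AtomicU64], // stand-in: stamps only, payloads elided
    cursor: u64,            // private, non-atomic per-consumer state
}

impl<'a> Subscriber<'a> {
    fn try_advance(&mut self) -> bool {
        let idx = self.cursor as usize % self.slots.len();
        let stamp = self.slots[idx].load(Ordering::Acquire); // one snoop
        if stamp == self.cursor * 2 + 2 {
            self.cursor += 1; // local bump, no shared write
            true
        } else {
            false // Empty or Lagged: consult the producer cursor (slow path)
        }
    }
}

fn main() {
    // Slot 0 stamped for sequence 0; slot 1 not yet written.
    let slots = [AtomicU64::new(2), AtomicU64::new(0)];
    let mut sub = Subscriber { slots: &slots, cursor: 0 };
    assert!(sub.try_advance());  // stamp 2 == 0*2+2
    assert!(!sub.try_advance()); // slot 1 still empty
    assert_eq!(sub.cursor, 1);
    println!("cursor = {}", sub.cursor);
}
```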
4 Safety and the Pod Constraint
The optimistic read in step 5 may observe a partially overwritten slot. If T had a destructor
(Drop) or held pointers, a torn read could produce invalid memory states before
the stamp check had a chance to discard the value. Photon Ring avoids this by requiring
T: Pod.
Pod (Plain Old Data) is an unsafe marker trait meaning every possible
bit pattern of T is a valid value. Under this constraint, a torn read produces some valid T
value (just not the one the producer wrote). The stamp mismatch in step 7 discards it before
it reaches user code. No UB occurs because:
- No invalid bit patterns exist for T (Pod guarantee).
- No destructor runs on the discarded value (Pod implies no Drop).
- No pointer is dereferenced before the stamp check (value is copied, not accessed through).
Common types that are not Pod: bool (only 0 and 1 are valid),
char (must be a valid Unicode scalar value), NonZero<u32> (0 is invalid),
Option<T> (the discriminant has invalid patterns), any enum,
any reference or pointer, String, Vec. Use primitive numeric
types or #[repr(C)] structs with Pod fields instead.
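As a sketch, the Pod constraint can be modeled as an unsafe marker trait (crates such as bytemuck define an equivalent); the Tick struct and its field names here are hypothetical:

```rust
// Unsafe marker: every bit pattern of Self is a valid value.
unsafe trait Pod: Copy + 'static {}

unsafe impl Pod for u64 {}
unsafe impl Pod for [u8; 16] {}

// A #[repr(C)] struct of Pod fields with no invalid bit patterns
// can itself be marked Pod (hypothetical example type).
#[derive(Clone, Copy)]
#[repr(C)]
struct Tick {
    price: f64,
    qty: u32,
    venue: u32,
}
unsafe impl Pod for Tick {}

fn main() {
    // Any 16 bytes form a valid Tick: the worst a torn read can do is
    // yield a value the producer never wrote, which the stamp check
    // then discards.
    let garbage = [0xABu8; 16];
    let torn: Tick = unsafe { std::mem::transmute(garbage) };
    assert_eq!(torn.qty, 0xABABABAB); // same on either endianness
    println!("qty = {}", torn.qty);
}
```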
5 Advanced Features
5.1 SubscriberGroup: batched fanout
When N logical consumers are polled on the same thread, SubscriberGroup<T, N>
performs one ring slot read and advances N cursors in a compiler-unrolled loop.
Independent fanout to N subscribers costs approximately N × 1.1 ns;
a group reduces this to a single seqlock read plus ~0.2 ns per logical consumer
— a 5.5x improvement at N=10.
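The amortization idea — one slot read, then N private cursor bumps in a fixed-size loop the compiler can unroll — can be sketched as follows (illustrative, not the SubscriberGroup API):

```rust
// After a single seqlock read establishes that `published` messages are
// visible, N logical cursors are advanced with no further shared loads.
fn poll_group<const N: usize>(cursors: &mut [u64; N], published: u64) -> usize {
    let mut delivered = 0;
    for c in cursors.iter_mut() {
        if *c < published {
            *c += 1; // private bump: no atomic, no coherence traffic
            delivered += 1;
        }
    }
    delivered
}

fn main() {
    let mut cursors = [0u64; 10];
    assert_eq!(poll_group(&mut cursors, 1), 10); // one read serves all ten
    assert_eq!(poll_group(&mut cursors, 1), 0);  // nothing new published
    println!("ok");
}
```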
5.2 MPMC path
MpPublisher<T> is Clone + Send + Sync and uses atomic
sequence claiming (fetch_add on the head cursor) for concurrent producers.
Measured cost: 12.1 ns on Intel (vs 2.8 ns for SPMC), reflecting the atomic
read-modify-write overhead on the write side.
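Atomic sequence claiming itself is a one-line fetch_add. The sketch below (illustrative, not MpPublisher's code) shows why concurrent producers never collide: each claim returns a sequence number owned by exactly one thread.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Reserve the next slot sequence; fetch_add guarantees uniqueness even
// under contention from many producers.
fn claim(head: &AtomicU64) -> u64 {
    head.fetch_add(1, Ordering::Relaxed)
}

fn main() {
    let head = Arc::new(AtomicU64::new(0));
    let mut handles = Vec::new();
    for _ in 0..4 {
        let head = Arc::clone(&head);
        handles.push(thread::spawn(move || {
            (0..1000).map(|_| claim(&head)).collect::<Vec<u64>>()
        }));
    }
    let mut all: Vec<u64> = handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect();
    all.sort_unstable();
    all.dedup();
    // 4 producers x 1000 claims, no duplicates.
    assert_eq!(all.len(), 4000);
    println!("4000 unique sequences claimed");
}
```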
5.3 Pipeline topology builder
topology::Pipeline builds dedicated-thread processing graphs. Each stage runs
on its own thread with a ring buffer connecting it to the next stage. Fan-out (diamond)
topologies are supported via .fan_out(). The builder is gated to platforms
with OS thread support.
5.4 Hugepages and NUMA affinity
With the hugepages feature on Linux, Publisher::mlock prevents
paging and Publisher::prefault fault-maps all ring pages at startup, eliminating
page-fault jitter on the hot path. NUMA placement helpers
(set_numa_preferred, reset_numa_policy) allow the ring to be
allocated on the publisher's NUMA node, reducing cross-socket coherence costs.
6 Benchmark Results
All measurements on Intel i7-10700KF (Comet Lake), Linux 6.8, Rust 1.93.1, --release:
- Publish only: 2.8 ns (vs 30.6 ns for disruptor-rs) — 10.9x faster
- Cross-thread roundtrip: 95 ns (vs 138 ns) — 1.45x faster
- One-way latency p50 (RDTSC): 48 ns — within 20% of the bare L3 snoop floor
- One-way latency p99 (RDTSC): 66 ns
- Sustained throughput: ~300M msg/s
- Fanout to 10 subscribers: 17.0 ns total, 1.7 ns per subscriber
- SubscriberGroup (10 logical): ~4 ns total, 0.2 ns per logical consumer
The 48 ns p50 one-way figure is consistent with theoretical expectation: the L3 snoop latency on Comet Lake is ~40–55 ns. Photon Ring adds approximately 5–10 ns of software overhead above the hardware floor (stamp check, cursor increment, function call).
7 Conclusion
Stamp-in-slot co-location eliminates the second cache-line transfer that sequence-barrier
designs pay on every receive. Combined with per-consumer local cursors (no shared read-path
state) and the Pod constraint (torn reads are safe to discard), Photon Ring
achieves near-hardware latency for broadcast inter-thread messaging in Rust.
The design is sound under the Rust memory model, no_std compatible with
alloc, and scales from embedded targets to server-class NUMA systems.
The SubscriberGroup fanout mechanism and the Pipeline topology
builder extend the primitive to multi-stage, multi-consumer architectures without
sacrificing the fundamental one-cache-line-per-receive invariant.