1 Introduction
1.1 The inter-thread communication bottleneck
In concurrent systems — from high-frequency trading engines and real-time audio pipelines to game simulation loops — inter-thread message passing lies on the critical path of nearly every latency-sensitive operation. The dominant cost is not lock acquisition or memory allocation, but the cache-coherence protocol round-trip imposed by hardware itself.
When a producer thread on core A writes a message, the cache line containing that message transitions to the Modified state in A's private L1 cache. Before a consumer thread on core B can read that message, the coherence protocol must transfer the cache line from A's cache hierarchy to B's. On Intel processors using a ring-bus L3 interconnect (Comet Lake), this intra-socket transfer takes approximately 40–55 ns.
This coherence latency represents a hard physical floor. No software optimization can deliver an inter-thread message faster than the time required for a single cache-line transfer between cores. For a naive messaging scheme that touches two cache lines per message (one for the data, one for a shared control variable), the floor doubles.
1.2 The LMAX Disruptor and its limitations
The LMAX Disruptor, introduced by Thompson, Farley, and Barker in 2011, represented a landmark in the mechanical-sympathy approach to concurrent systems design. By replacing bounded queues with a pre-allocated ring buffer, it eliminated per-message allocation. However, its reliance on sequence barriers introduces structural overhead that cannot be eliminated within its design framework.
On the consumer's hot path, receiving a single message requires two cache-line transfers: first, the consumer loads the shared sequence barrier to determine that a new message is available; second, it loads the slot containing the message payload. If the barrier and slot reside on different cache lines — which they almost always do — the consumer pays two L3 snoop latencies per message: approximately 80–110 ns of irreducible coherence traffic.
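The two-load pattern can be sketched as follows. The types and function here are illustrative stand-ins, not the disruptor-rs API; the point is only that the barrier and the slot occupy different cache lines, so every receive touches both.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Shared sequence barrier: cache line 1 (written by the producer).
#[repr(align(64))]
struct Barrier(AtomicU64);

// Payload slot: cache line 2 (also written by the producer).
#[repr(align(64))]
struct Slot(u64);

// Sequence-barrier receive: two loads, two distinct cache lines.
fn try_receive(barrier: &Barrier, slots: &[Slot], next: u64) -> Option<u64> {
    // Load 1: has sequence `next` been published? (first L3 snoop)
    if barrier.0.load(Ordering::Acquire) < next {
        return None;
    }
    // Load 2: read the payload slot (second snoop, different line)
    Some(slots[next as usize % slots.len()].0)
}

fn main() {
    let barrier = Barrier(AtomicU64::new(1)); // sequence 1 published
    let slots = [Slot(0), Slot(42)];
    assert_eq!(try_receive(&barrier, &slots, 1), Some(42));
    assert_eq!(try_receive(&barrier, &slots, 2), None); // not yet published
    println!("ok");
}
```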
1.3 Our contribution
Photon Ring eliminates the sequence-barrier load from the consumer hot path. The key insight
is stamp-in-slot co-location: by embedding a seqlock sequence stamp directly
in the same #[repr(C, align(64))] slot structure as the message payload, both
ownership metadata and data reside within a single 64-byte cache line for payloads up to
56 bytes.
2 Background: Cache Coherence and Seqlocks
2.1 Cache coherence protocols
The MESI protocol assigns each cache line one of four states: Modified (dirty, present only in this core's cache), Exclusive (clean, only in this cache), Shared (clean, may be in multiple caches), and Invalid (not present).
The critical path for inter-thread communication is the Modified-to-Shared transition. On Intel desktop processors with a ring-bus L3 interconnect (Skylake through Comet Lake), the end-to-end latency for this sequence is approximately 40–55 ns, dominated by ring-bus traversal time.
2.2 Seqlocks in the Linux kernel
The seqlock, introduced in Linux 2.5.60, is a reader-writer synchronization mechanism optimized for workloads where reads vastly outnumber writes. Readers proceed without acquiring any lock, instead performing an optimistic read-and-verify protocol:
Writer:
    write_seqlock(&seq);
    // modify protected data
    write_sequnlock(&seq);

Reader:
    do {
        s = read_seqbegin(&seq);
        // copy protected data
    } while (read_seqretry(&seq, s));
If the reader's two counter samples differ, or the initial sample is odd (write in progress),
the reader discards the copy and retries. This is sound only when the protected data has no
pointers and no destructor — exactly the Pod constraint Photon Ring enforces.
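The kernel pattern above translates naturally to Rust. The following is a minimal single-writer seqlock sketch — not Photon Ring's implementation — with an even/odd counter and an optimistic copy-and-verify read:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{fence, AtomicU64, Ordering};

// Minimal single-writer seqlock over plain (Copy) data.
struct SeqLock<T: Copy> {
    seq: AtomicU64, // even = stable, odd = write in progress
    data: UnsafeCell<T>,
}

// Safe to share because readers only ever copy and then validate.
unsafe impl<T: Copy + Send> Sync for SeqLock<T> {}

impl<T: Copy> SeqLock<T> {
    fn new(v: T) -> Self {
        Self { seq: AtomicU64::new(0), data: UnsafeCell::new(v) }
    }

    // Writer: bump to odd, mutate, bump to even.
    fn write(&self, v: T) {
        let s = self.seq.load(Ordering::Relaxed);
        self.seq.store(s + 1, Ordering::Release); // odd: write begins
        fence(Ordering::Release);                 // counter before data
        unsafe { *self.data.get() = v };
        self.seq.store(s + 2, Ordering::Release); // even: write complete
    }

    // Reader: optimistic copy, retry if the counter moved or is odd.
    fn read(&self) -> T {
        loop {
            let s1 = self.seq.load(Ordering::Acquire);
            if s1 % 2 == 1 {
                continue; // write in progress
            }
            let v = unsafe { *self.data.get() };
            fence(Ordering::Acquire); // keep the copy before the re-check
            let s2 = self.seq.load(Ordering::Relaxed);
            if s1 == s2 {
                return v; // copy observed a stable snapshot
            }
        }
    }
}

fn main() {
    let lock = SeqLock::new([1u64, 2]);
    lock.write([3, 4]);
    assert_eq!(lock.read(), [3, 4]);
    println!("{:?}", lock.read());
}
```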
3 Design: Stamp-in-Slot
3.1 Slot layout
#[repr(C, align(64))]
pub struct Slot<T> {
stamp: AtomicU64, // seqlock sequence number
value: UnsafeCell<T>,
// padding to align(64) if needed
}
// For T <= 56 bytes: sizeof(Slot<T>) == 64 (one cache line)
// For T > 56 bytes: sizeof(Slot<T>) == ceil((8 + sizeof(T)) / 64) * 64
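The two size comments can be checked directly with std::mem::size_of. A sketch, assuming only the layout shown above:

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::AtomicU64;

#[repr(C, align(64))]
#[allow(dead_code)]
struct Slot<T> {
    stamp: AtomicU64,      // 8 bytes at offset 0
    value: UnsafeCell<T>,  // payload at offset 8 (for align(T) <= 8)
}

fn main() {
    use std::mem::size_of;
    // 48-byte payload: 8 + 48 = 56 <= 64, one cache line.
    assert_eq!(size_of::<Slot<[u8; 48]>>(), 64);
    // 56-byte payload: 8 + 56 = 64, still exactly one cache line.
    assert_eq!(size_of::<Slot<[u8; 56]>>(), 64);
    // 57-byte payload spills over: ceil((8 + 57) / 64) * 64 = 128.
    assert_eq!(size_of::<Slot<[u8; 57]>>(), 128);
    println!("layout checks pass");
}
```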
3.2 Write protocol
1. stamp.store(seq * 2 + 1, Release)   // odd = write in progress
2. fence(Release)                      // stamp visible before data
3. ptr::write(slot.value, data)        // write payload (T: Pod)
4. stamp.store(seq * 2 + 2, Release)   // even = write complete
5. cursor.store(seq, Release)          // consumers can proceed
3.3 Read protocol
1. s1 = stamp.load(Acquire)
2. if s1 is odd: spin (write in progress)
3. if s1 < expected*2+2: return Empty
4. if s1 > expected*2+2: return Lagged (ring wrapped)
5. value = ptr::read(slot.value)       // optimistic copy
6. s2 = stamp.load(Acquire)
7. if s1 == s2: return Ok(value)
8. else: retry from step 1
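Both protocols can be sketched together on a single slot. This is illustrative, not the crate's code: T: Copy stands in for the Pod bound, the producer cursor store (write step 5) is omitted, and the second stamp load is ordered with an acquire fence.

```rust
use std::cell::UnsafeCell;
use std::sync::atomic::{fence, AtomicU64, Ordering};

#[repr(C, align(64))]
struct Slot<T> {
    stamp: AtomicU64,
    value: UnsafeCell<T>,
}

#[derive(Debug, PartialEq)]
enum Recv<T> {
    Ok(T),
    Empty,
    Lagged,
}

impl<T: Copy> Slot<T> {
    fn new(v: T) -> Self {
        Self { stamp: AtomicU64::new(0), value: UnsafeCell::new(v) }
    }

    // Producer: write-protocol steps 1-4.
    fn write(&self, seq: u64, data: T) {
        self.stamp.store(seq * 2 + 1, Ordering::Release); // odd: in progress
        fence(Ordering::Release);                         // stamp before data
        unsafe { *self.value.get() = data };
        self.stamp.store(seq * 2 + 2, Ordering::Release); // even: complete
    }

    // Consumer: read-protocol steps 1-8.
    fn read(&self, expected: u64) -> Recv<T> {
        loop {
            let s1 = self.stamp.load(Ordering::Acquire);
            if s1 % 2 == 1 {
                continue; // write in progress
            }
            if s1 < expected * 2 + 2 {
                return Recv::Empty; // not yet published
            }
            if s1 > expected * 2 + 2 {
                return Recv::Lagged; // ring wrapped past us
            }
            let v = unsafe { *self.value.get() }; // optimistic copy
            fence(Ordering::Acquire);             // copy before re-check
            let s2 = self.stamp.load(Ordering::Relaxed);
            if s1 == s2 {
                return Recv::Ok(v); // stamp unchanged: copy is consistent
            }
        }
    }
}

fn main() {
    let slot = Slot::new(0u64);
    assert_eq!(slot.read(0), Recv::Empty); // nothing published yet
    slot.write(0, 99);
    assert_eq!(slot.read(0), Recv::Ok(99));
    slot.write(1, 100);
    assert_eq!(slot.read(0), Recv::Lagged); // producer lapped this reader
    println!("protocol checks pass");
}
```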
3.4 Why one cache-line transfer suffices
When T fits in 56 bytes, the stamp at offset 0 and the value at offset 8 reside in the same
64-byte cache line. The consumer's stamp.load(Acquire) in step 1 triggers exactly
one L3 snoop. The ptr::read in step 5 reads from the same line, which is already
in the consumer's L1 cache. The total coherence traffic for a successful receive: one snoop,
~40–55 ns.
The Disruptor requires: one snoop for the sequence barrier, then one snoop for the slot data (different cache line) = ~80–110 ns minimum.
3.5 Per-consumer cursors eliminate shared state
Each Subscriber<T> holds a private, non-atomic u64 cursor.
No cache line is shared between subscribers. The producer cursor is consulted only on the
lag-detection slow path. On the common-case fast path, the consumer goes directly to the
expected slot index and checks its stamp, with no cross-core atomic load at all.
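A minimal sketch of that fast path, with illustrative names, the payload elided so only the stamp load remains, and the simplifying assumption that message sequence k lands in slot k % N with stamp k*2+2:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Each subscriber owns a plain u64 cursor. On the fast path, the only
// cross-core load is the stamp of the slot it expects to read next.
struct Subscriber<'a> {
    slots: &'a [AtomicU64], // stand-in: stamps only, payloads elided
    cursor: u64,            // private, non-atomic per-consumer state
}

impl<'a> Subscriber<'a> {
    fn try_advance(&mut self) -> bool {
        let idx = self.cursor as usize % self.slots.len();
        let stamp = self.slots[idx].load(Ordering::Acquire); // one snoop
        if stamp == self.cursor * 2 + 2 {
            self.cursor += 1; // local bump, no shared write
            true
        } else {
            false // Empty or Lagged: consult the producer cursor (slow path)
        }
    }
}

fn main() {
    // Slot 0 stamped for sequence 0; slot 1 not yet written.
    let slots = [AtomicU64::new(2), AtomicU64::new(0)];
    let mut sub = Subscriber { slots: &slots, cursor: 0 };
    assert!(sub.try_advance());  // stamp 2 == 0*2+2
    assert!(!sub.try_advance()); // slot 1 still empty
    assert_eq!(sub.cursor, 1);
    println!("cursor = {}", sub.cursor);
}
```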
4 Safety and the Pod Constraint
The optimistic read in step 5 may observe a partially overwritten slot. If T had a destructor
(Drop) or held pointers, a torn read could produce invalid memory states before
the stamp check had a chance to discard the value. Photon Ring avoids this by requiring
T: Pod.
Pod (Plain Old Data) is an unsafe marker trait meaning every possible
bit pattern of T is a valid value. Under this constraint, a torn read produces some valid T
value (just not the one the producer wrote). The stamp mismatch in step 7 discards it before
it reaches user code. No UB occurs because:
- No invalid bit patterns exist for T (Pod guarantee).
- No destructor runs on the discarded value (Pod implies no Drop).
- No pointer is dereferenced before the stamp check (value is copied, not accessed through).
Common types that are not Pod: bool (only 0 and 1 are valid),
char (must be a valid Unicode scalar value), NonZero<u32> (0 is invalid),
Option<T> (the discriminant has invalid patterns), any enum,
any reference or pointer, String, Vec. Use primitive numeric
types or #[repr(C)] structs with Pod fields instead.
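As a sketch, the Pod constraint can be modeled as an unsafe marker trait (crates such as bytemuck define an equivalent); the Tick struct and its field names here are hypothetical:

```rust
// Unsafe marker: every bit pattern of Self is a valid value.
unsafe trait Pod: Copy + 'static {}

unsafe impl Pod for u64 {}
unsafe impl Pod for [u8; 16] {}

// A #[repr(C)] struct of Pod fields with no invalid bit patterns
// can itself be marked Pod (hypothetical example type).
#[derive(Clone, Copy)]
#[repr(C)]
struct Tick {
    price: f64,
    qty: u32,
    venue: u32,
}
unsafe impl Pod for Tick {}

fn main() {
    // Any 16 bytes form a valid Tick: the worst a torn read can do is
    // yield a value the producer never wrote, which the stamp check
    // then discards.
    let garbage = [0xABu8; 16];
    let torn: Tick = unsafe { std::mem::transmute(garbage) };
    assert_eq!(torn.qty, 0xABABABAB); // same on either endianness
    println!("qty = {}", torn.qty);
}
```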
5 Advanced Features
5.1 SubscriberGroup: batched fanout
When N logical consumers are polled on the same thread, SubscriberGroup<T, N>
performs one ring slot read and advances N cursors in a compiler-unrolled loop.
Independent fanout to N subscribers costs approximately N × 1.1 ns;
a group reduces this to a single seqlock read plus ~0.2 ns per logical consumer
— a 5.5x improvement at N=10.
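The amortization idea — one slot read, then N private cursor bumps in a fixed-size loop the compiler can unroll — can be sketched as follows (illustrative, not the SubscriberGroup API):

```rust
// After a single seqlock read establishes that `published` messages are
// visible, N logical cursors are advanced with no further shared loads.
fn poll_group<const N: usize>(cursors: &mut [u64; N], published: u64) -> usize {
    let mut delivered = 0;
    for c in cursors.iter_mut() {
        if *c < published {
            *c += 1; // private bump: no atomic, no coherence traffic
            delivered += 1;
        }
    }
    delivered
}

fn main() {
    let mut cursors = [0u64; 10];
    assert_eq!(poll_group(&mut cursors, 1), 10); // one read serves all ten
    assert_eq!(poll_group(&mut cursors, 1), 0);  // nothing new published
    println!("ok");
}
```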
5.2 MPMC path
MpPublisher<T> is Clone + Send + Sync and uses atomic
sequence claiming (fetch_add on the head cursor) for concurrent producers.
Measured cost: 12.1 ns on Intel (vs 2.8 ns for SPMC), reflecting the atomic
read-modify-write overhead on the write side.
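Atomic sequence claiming itself is a one-line fetch_add. The sketch below (illustrative, not MpPublisher's code) shows why concurrent producers never collide: each claim returns a sequence number owned by exactly one thread.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

// Reserve the next slot sequence; fetch_add guarantees uniqueness even
// under contention from many producers.
fn claim(head: &AtomicU64) -> u64 {
    head.fetch_add(1, Ordering::Relaxed)
}

fn main() {
    let head = Arc::new(AtomicU64::new(0));
    let mut handles = Vec::new();
    for _ in 0..4 {
        let head = Arc::clone(&head);
        handles.push(thread::spawn(move || {
            (0..1000).map(|_| claim(&head)).collect::<Vec<u64>>()
        }));
    }
    let mut all: Vec<u64> = handles
        .into_iter()
        .flat_map(|h| h.join().unwrap())
        .collect();
    all.sort_unstable();
    all.dedup();
    // 4 producers x 1000 claims, no duplicates.
    assert_eq!(all.len(), 4000);
    println!("4000 unique sequences claimed");
}
```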
5.3 Pipeline topology builder
topology::Pipeline builds dedicated-thread processing graphs. Each stage runs
on its own thread with a ring buffer connecting it to the next stage. Fan-out (diamond)
topologies are supported via .fan_out(). The builder is gated to platforms
with OS thread support.
5.4 Hugepages and NUMA affinity
With the hugepages feature on Linux, Publisher::mlock prevents
paging and Publisher::prefault fault-maps all ring pages at startup, eliminating
page-fault jitter on the hot path. NUMA placement helpers
(set_numa_preferred, reset_numa_policy) allow the ring to be
allocated on the publisher's NUMA node, reducing cross-socket coherence costs.
6 Benchmark Results
All measurements on Intel i7-10700KF (Comet Lake), Linux 6.8, Rust 1.93.1, --release:
- Publish only: 2.8 ns (vs 30.6 ns for disruptor-rs) — 10.9x faster
- Cross-thread roundtrip: 95 ns (vs 138 ns) — 1.45x faster
- One-way latency p50 (RDTSC): 48 ns — within 20% of the bare L3 snoop floor
- One-way latency p99 (RDTSC): 66 ns
- Sustained throughput: ~300M msg/s
- Fanout to 10 subscribers: 17.0 ns total, 1.7 ns per subscriber
- SubscriberGroup (10 logical): ~4 ns total, 0.2 ns per logical consumer
The 48 ns p50 one-way figure is consistent with theoretical expectation: the L3 snoop latency on Comet Lake is ~40–55 ns. Photon Ring adds approximately 5–10 ns of software overhead above the hardware floor (stamp check, cursor increment, function call).
7 Conclusion
Stamp-in-slot co-location eliminates the second cache-line transfer that sequence-barrier
designs pay on every receive. Combined with per-consumer local cursors (no shared read-path
state) and the Pod constraint (torn reads are safe to discard), Photon Ring
achieves near-hardware latency for broadcast inter-thread messaging in Rust.
The design is sound under the Rust memory model, no_std compatible with
alloc, and scales from embedded targets to server-class NUMA systems.
The SubscriberGroup fanout mechanism and the Pipeline topology
builder extend the primitive to multi-stage, multi-consumer architectures without
sacrificing the fundamental one-cache-line-per-receive invariant.