在高並發系統中實現無鎖資料結構

無鎖資料結構:包含 C++、Java 和 .NET 範例的完整指南

內部網路 2026 年 6 月 10 日 , , ,

A mutex is simple. You lock it, modify shared state, unlock it. Every thread that wants access waits its turn. The problem is “waiting its turn”, at high core counts and high throughput, that serialization becomes the ceiling. A thread holding a lock can be preempted, causing all other threads to sit idle. Priority inversion can cause low-priority threads to block high-priority ones. In systems processing millions of operations per second across dozens of cores, mutex contention does not just slow things down; it collapses throughput.

Analyze Concurrency Logic

SMART TS XL traces atomic operation paths, shared memory access patterns, and cross-thread data.

了解更多

Lock-free data structures replace mutual exclusion with atomic operations. Instead of preventing conflict, they tolerate and recover from it. A thread that fails a compare-and-swap retries with updated state. No thread ever blocks another. At least one thread always makes progress. The result is dramatically better throughput under high contention, predictable latency without convoy effects, and the elimination of deadlock and priority inversion by design. The price is implementation complexity: the ABA problem, memory reclamation hazards, false sharing, and subtle memory ordering requirements are all traps that do not exist in lock-based code. This article covers all of them with concrete code.

Lock-Free vs Wait-Free vs Mutex: Which to Use?

Before choosing an implementation approach, the right question is which progress guarantee the system actually needs. The three models differ in what they promise when threads compete.

型號Progress Guarantee典型潛伏期複雜最適合
Mutex / lock-basedBlocking, threads waitUnpredictable under contentionLow-contention shared state, simple correctness requirements
Lock-freeSystem-wide, at least one thread progressesLow under contentionHigh-throughput queues, stacks, counters
Wait-freePer-thread, every thread finishes in bounded stepsBounded worst case很高Real-time systems, safety-critical, strict latency SLAs
Obstruction-freeSolo progress, only when uncontestedLow without contention媒材Transactional memory prototypes, research contexts

Lock-free is the practical default for high-concurrency production systems. The Michael-Scott queue, the Treiber stack, and most production MPMC ring buffers are lock-free. Individual threads can starve under extreme contention, but the system as a whole makes progress.

Wait-free guarantees per-thread bounded progress but requires significantly more complex algorithms. Universal constructions exist but carry high overhead. Wait-free algorithms are appropriate for hard real-time contexts where tail latency matters more than average throughput.

Obstruction-free is rarely used directly in production. It appears in some transactional memory implementations and serves as a stepping stone when proving algorithm correctness.

For most high-concurrency systems: use a mutex when contention is low and correctness is simple, use lock-free when throughput under contention matters, use wait-free only when worst-case per-thread latency is a hard requirement.

What Is a Lock-Free Data Structure?

A data structure is 無鎖 if it guarantees that at least one thread among all threads operating on it will complete its operation in a finite number of steps, regardless of what other threads are doing or how they are scheduled. This formal definition has a precise implication: no lock-free algorithm can deadlock. If one thread is halted, suspended, or runs slowly, other threads continue to make progress.

The mechanism is atomic operations, CPU instructions that execute as a single indivisible unit. The universal primitive is Compare-and-Swap (CAS):

CAS(location, expected, new_value):
  if *location == expected:
    *location = new_value
    return true
  else:
    return false  // someone else changed it first

CAS is atomic at the hardware level. On x86 it is the CMPXCHG instruction. On ARM it is implemented via LDXR/STXR (load-exclusive/store-exclusive), which is the LL/SC (Load-Linked/Store-Conditional) variant. A thread reads a value, computes a new value, and uses CAS to install it only if nobody else has changed it in between. If CAS fails, the thread retries with the fresh value.

In C++11 and later, this is exposed through std::atomic:

CPP

#include <atomic>

// Atomic increment using CAS loop
void atomic_add(std::atomic<int>& counter, int delta) {
    int expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_weak(
        expected,
        expected + delta,
        std::memory_order_release,
        std::memory_order_relaxed))
    {
        // expected is updated on failure -- retry with fresh value
    }
}

compare_exchange_weak is the CAS primitive. On failure it updates expected to the current value automatically, making the retry loop idiomatic.

The ABA Problem: Lock-Free’s Most Dangerous Pitfall

The ABA problem is the most counterintuitive correctness hazard in lock-free programming. CAS checks whether a memory location still contains the expected value before installing a new one. It cannot detect whether that value was changed and changed back between the read and the CAS. The location still looks like the original value, but the underlying state of the system has changed in ways CAS cannot see.

The Scenario Step by Step

Consider a lock-free stack using a single pointer top:

  1. Thread A reads top = Node1. Node1’s next pointer is Node2.
  2. Thread A is preempted before completing its pop.
  3. Thread B pops Node1 (top becomes Node2), then pops Node2 (top becomes null).
  4. Thread B pushes a new node, which happens to be allocated at the same memory address as Node1 (common with free-list allocators). Top = Node1 (same pointer, different content).
  5. Thread A resumes. Its CAS sees top == Node1 (expected value), succeeds, and sets top = Node2, but Node2 was already freed. Disaster.

The ABA problem manifests differently in each data structure:

  • : CAS succeeds but installs a stale next pointer, causing use-after-free
  • 隊列: Head or tail pointer appears unchanged but the structure has changed
  • 鍊錶: A node appears to still be in position but has been removed and reallocated

Does LL/SC Avoid the ABA Problem?

Yes, LL/SC (Load-Linked/Store-Conditional) provides semantics than CAS and naturally avoids ABA. LL marks the loaded memory address as “linked.” SC on the same address fails if any store to that address occurred since the LL, even if the value was restored to its original state. The modification history is tracked at the hardware level, not just the current value.

However, LL/SC implementations on real hardware have practical limitations. On ARM and POWER, spurious LL/SC failures can occur even without contention (context switches, cache evictions). Code must account for retry loops even without real conflicts. On x86, there is no native LL/SC; CAS is the hardware primitive. For x86 code, ABA must be prevented in software.

Fixing ABA: Three Approaches

1. Version counters (tagged pointers)

Pack a version counter into the same atomic word as the pointer. Every successful CAS increments the counter. Even if a pointer returns to its original value, the counter differs:

CPP

#include <atomic>
#include <cstdint>

struct TaggedPtr {
    uintptr_t ptr : 48;  // pointer (48-bit virtual address space)
    uintptr_t tag : 16;  // version counter
};

std::atomic<TaggedPtr> top;

bool cas_with_tag(std::atomic<TaggedPtr>& loc,
                  TaggedPtr expected,
                  void* new_ptr) {
    TaggedPtr desired = { (uintptr_t)new_ptr, expected.tag + 1 };
    return loc.compare_exchange_strong(expected, desired);
}

This is the standard approach in practice. The 16-bit tag wraps around after 65,536 operations, which is theoretically unsafe but practically adequate for most systems, a thread is extremely unlikely to be preempted for exactly 65,536 CAS cycles on the same location.

2. Hazard pointers

Each thread maintains a small set of “hazard pointers”, pointers to nodes it is currently accessing. Before freeing any node, a thread checks all hazard pointer registries to confirm no other thread is accessing it. A node that appears in a hazard pointer is deferred until that pointer is cleared:

CPP

// Simplified hazard pointer pattern
thread_local void* hazard_ptr = nullptr;

void* safe_load(std::atomic<void*>& head) {
    void* ptr;
    do {
        ptr = head.load(std::memory_order_acquire);
        hazard_ptr = ptr;                          // announce we are using this
        // memory fence ensures announcement is visible before validation
        std::atomic_thread_fence(std::memory_order_seq_cst);
    } while (ptr != head.load(std::memory_order_acquire)); // validate still valid
    return ptr;
}

void safe_release() {
    hazard_ptr = nullptr;
}

Hazard pointers prevent ABA at the memory reclamation level: a node cannot be reused while any thread holds a hazard pointer to it. This requires scanning all thread hazard pointers before freeing, which adds overhead proportional to thread count.

3. Epoch-based reclamation (EBR)

Threads operate in globally tracked “epochs.” Retired nodes are buffered per-epoch and only freed when all threads have advanced past the epoch in which the node was retired. EBR is simpler to use than hazard pointers and has lower per-operation overhead, at the cost of bounded-but-unpredictable memory growth during quiescent periods.

在Java中, AtomicStampedReference directly addresses the ABA problem by pairing a reference with an integer stamp:

Java的

import java.util.concurrent.atomic.AtomicStampedReference;

AtomicStampedReference<Node> top =
    new AtomicStampedReference<>(null, 0);

void push(Node newNode) {
    int[] stamp = new int[1];
    Node current;
    do {
        current = top.get(stamp);
        newNode.next = current;
    } while (!top.compareAndSet(current, newNode, stamp[0], stamp[0] + 1));
}

Lock-Free Queue Implementations

Queues are the most commonly needed lock-free structure. The canonical lock-free queue is the Michael-Scott queue, which uses two pointers (head and tail) and a sentinel node to allow concurrent enqueue and dequeue operations.

SPSC Queue: Maximum Performance with Minimal Synchronization

A Single-Producer Single-Consumer queue eliminates all write-write and read-read contention. One thread writes to the tail; one thread reads from the head. Only the head pointer is shared between producer and consumer:

CPP

#include <atomic>
#include <array>

template<typename T, size_t Capacity>
class SPSCQueue {
    std::array<T, Capacity> buffer;
    alignas(64) std::atomic<size_t> head{0}; // consumer reads head
    alignas(64) std::atomic<size_t> tail{0}; // producer writes tail
    // alignas(64) puts each on a separate cache line -- prevents false sharing

public:
    bool push(const T& item) {
        size_t t = tail.load(std::memory_order_relaxed);
        size_t next = (t + 1) % Capacity;
        if (next == head.load(std::memory_order_acquire))
            return false; // full
        buffer[t] = item;
        tail.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& item) {
        size_t h = head.load(std::memory_order_relaxed);
        if (h == tail.load(std::memory_order_acquire))
            return false; // empty
        item = buffer[h];
        head.store((h + 1) % Capacity, std::memory_order_release);
        return true;
    }
};

alignas(64) on head and tail is critical. Without it, both fit on the same cache line. Every write to tail triggers a cache line invalidation visible to the core reading head, and vice versa, false sharing that serializes what should be independent operations.

MPMC Queue: Multi-Producer Multi-Consumer

A fully concurrent MPMC queue is significantly more complex. Production implementations use a sequence number per slot to coordinate producers and consumers without a lock:

CPP

#include <atomic>
#include <array>

template<typename T, size_t Capacity>
class MPMCQueue {
    struct Slot {
        std::atomic<size_t> sequence;
        T data;
    };

    alignas(64) std::array<Slot, Capacity> slots;
    alignas(64) std::atomic<size_t> enqueue_pos{0};
    alignas(64) std::atomic<size_t> dequeue_pos{0};

public:
    MPMCQueue() {
        for (size_t i = 0; i < Capacity; ++i)
            slots[i].sequence.store(i, std::memory_order_relaxed);
    }

    bool push(const T& item) {
        size_t pos = enqueue_pos.fetch_add(1, std::memory_order_relaxed);
        Slot& slot = slots[pos % Capacity];
        size_t seq = slot.sequence.load(std::memory_order_acquire);
        // wait for the slot to be ready for writing
        while (seq != pos) {
            seq = slot.sequence.load(std::memory_order_acquire);
        }
        slot.data = item;
        slot.sequence.store(pos + 1, std::memory_order_release);
        return true;
    }

    bool pop(T& item) {
        size_t pos = dequeue_pos.fetch_add(1, std::memory_order_relaxed);
        Slot& slot = slots[pos % Capacity];
        size_t seq = slot.sequence.load(std::memory_order_acquire);
        while (seq != pos + 1) {
            seq = slot.sequence.load(std::memory_order_acquire);
        }
        item = slot.data;
        slot.sequence.store(pos + Capacity, std::memory_order_release);
        return true;
    }
};

.NET ConcurrentQueue: How It Works Internally

.NET’s ConcurrentQueue<T> in .NET 5+ uses a segmented array structure. Each segment is a fixed-size array with a head and tail index managed with Interlocked operations (which map to CAS on the underlying hardware). Segments are linked via a volatile _nextSegment pointer. Enqueue appends to the current tail segment, growing the segment chain when full. Dequeue reads from the head segment and advances the head index atomically.

The key design decision is avoiding a global lock. Producers compete only on the tail index of the current segment. Consumers compete only on the head index. No producer ever contends with a consumer. This makes ConcurrentQueue<T> highly efficient for producer-consumer pipelines:

尖銳的

using System.Collections.Concurrent;
using System.Threading;

// .NET ConcurrentQueue -- lock-free, thread-safe, FIFO
var queue = new ConcurrentQueue<int>();

// Producer thread
var producer = Task.Run(() => {
    for (int i = 0; i < 1_000_000; i++)
        queue.Enqueue(i);
});

// Consumer thread
var consumer = Task.Run(() => {
    long sum = 0;
    while (!queue.IsEmpty || !producer.IsCompleted) {
        if (queue.TryDequeue(out int item))
            sum += item;
        else
            Thread.SpinWait(1);  // brief spin before retry
    }
    return sum;
});

await Task.WhenAll(producer, consumer);

Cache Coherence and False Sharing in Lock-Free Code

Cache coherence is the invisible performance variable in lock-free implementations. When a core writes to a memory location, the entire 64-byte cache line containing that location is invalidated in all other cores’ caches. They must fetch the updated line before reading. In lock-free code with frequent atomic updates, this cache-line invalidation can become the dominant cost.

虛假分享 occurs when two threads modify different variables that happen to share a cache line. Neither thread is accessing the same data, but the cache coherence protocol treats the entire cache line as contended:

CPP

// BAD: head and tail on the same cache line
struct BadQueue {
    std::atomic<size_t> head;   // bytes 0-7
    std::atomic<size_t> tail;   // bytes 8-15
    // both fit in the same 64-byte cache line
    // producer writing tail invalidates head in consumer's cache
};

// GOOD: head and tail on separate cache lines
struct GoodQueue {
    alignas(64) std::atomic<size_t> head;
    alignas(64) std::atomic<size_t> tail;
    // each on its own cache line -- no false sharing
};

The MESI protocol (Modified, Exclusive, Shared, Invalid) describes how x86 CPUs coordinate cache line ownership. When a core wants to write a location it shares with other cores (Shared state), it must broadcast a write request, wait for acknowledgment from all sharers, and transition to Modified state. This round-trip to the coherence protocol takes 50-300+ cycles on a multi-socket system. False sharing turns independent thread-local operations into cache-coherence traffic, collapsing performance that should scale linearly with cores.

Detecting false sharing: 使用 perf stat -e cache-misses,L1-dcache-load-misses on Linux or Visual Studio’s CPU performance profiler on Windows. A high rate of L1 cache misses in a multi-threaded program that fits in cache is a strong indicator of false sharing.

Memory Ordering: Why memory_order_relaxed Is Not Always Enough

C + +中 std::atomic operations expose the full range of memory ordering from the C++11 memory model. Choosing the wrong ordering is the second most common bug in lock-free code after the ABA problem.

CPP

// The five memory orderings and when each applies:

// relaxed: no synchronization, only atomicity
// Use for: counters where only the final value matters
counter.fetch_add(1, std::memory_order_relaxed);

// acquire: reads see all writes by threads that released this location
// Use for: reading shared state protected by a flag
if (ready.load(std::memory_order_acquire)) {
    use(data); // guaranteed to see all writes made before ready was set
}

// release: writes are visible to threads that acquire this location
// Use for: publishing shared state
data = compute();
ready.store(true, std::memory_order_release);

// acq_rel: acquire + release in one operation
// Use for: read-modify-write operations like CAS in the middle of a chain
node.compare_exchange_strong(expected, desired,
    std::memory_order_acq_rel,   // success
    std::memory_order_acquire);  // failure

// seq_cst: total order across all seq_cst operations, all cores
// Use for: when you need a global consistent view (but slowest)
flag.store(true, std::memory_order_seq_cst);

A common mistake: using relaxed for a release flag. A thread that sets ready = true - relaxed ordering does not guarantee that preceding writes to data are visible to other threads that read ready. The acquire/release pair creates the happens-before relationship that connects writer and reader.

The Treiber Stack: Lock-Free Stack in C++

The Treiber stack is the simplest non-trivial lock-free data structure. It uses a CAS loop on a single top pointer:

CPP

#include <atomic>

template<typename T>
class TreiberStack {
    struct Node {
        T data;
        Node* next;
        explicit Node(T d) : data(std::move(d)), next(nullptr) {}
    };

    std::atomic<Node*> top{nullptr};

public:
    void push(T data) {
        Node* node = new Node(std::move(data));
        node->next = top.load(std::memory_order_relaxed);
        while (!top.compare_exchange_weak(
            node->next,          // expected -- updated on failure
            node,                // desired
            std::memory_order_release,
            std::memory_order_relaxed))
        { /* retry */ }
    }

    bool pop(T& result) {
        Node* node = top.load(std::memory_order_acquire);
        while (node) {
            if (top.compare_exchange_weak(
                node,
                node->next,      // new top
                std::memory_order_acquire,
                std::memory_order_relaxed))
            {
                result = std::move(node->data);
                // WARNING: cannot free node here safely without hazard pointers or EBR
                // delete node; -- ABA hazard if another thread reads node->next
                return true;
            }
        }
        return false; // empty
    }
};

的評論 delete node is critical: naive deletion is the ABA hazard described earlier. In a production Treiber stack, node reclamation requires hazard pointers, epoch-based reclamation, or garbage collection (Java/C# handles this automatically).

Java Lock-Free Data Structures

Java’s standard library provides production-quality lock-free implementations in java.util.concurrent:

課程結構體進展筆記
AtomicReference<T>單值Lock-freeCAS on object reference
AtomicStampedReference<T>Value + stampLock-freeABA prevention via version counter
ConcurrentLinkedQueue<T>Michael-Scott queueLock-freeFIFO, unbounded
ConcurrentLinkedDeque<T>Lock-free dequeLock-free兩端
LongAdder計數器Lock-freeStriped, high-throughput counting

LongAdder is worth noting specifically. Rather than a single atomic counter competing across all threads, it maintains a striped array of counters, each accessed by a subset of threads. Contention is spread across stripes rather than concentrated on one location. The total is summed lazily on sum(). For high-frequency increment operations across many threads, LongAdder dramatically outperforms AtomicLong.incrementAndGet():

Java的

import java.util.concurrent.atomic.LongAdder;

// BAD for high-concurrency counting: all threads contend on one location
AtomicLong counter = new AtomicLong(0);
counter.incrementAndGet(); // single CAS -- all threads collide

// GOOD for high-concurrency counting: contention distributed across stripes
LongAdder adder = new LongAdder();
adder.increment();  // updates thread-local stripe -- minimal contention
long total = adder.sum(); // sum all stripes lazily

SMART TS XL Supports Lock-Free System Development

Lock-free code is among the hardest code to get right. The bugs are non-deterministic, appear only under specific interleavings, and often only in production at scale. Static analysis addresses this by examining the code’s structural properties before execution rather than waiting for a race condition to manifest.

SMART TS XL analyzes the full execution paths of concurrent code, tracing how atomic operations relate to the memory locations they guard. For systems that mix lock-free code with legacy components or multi-language architectures, it provides the cross-boundary visibility that single-language tools cannot: how a shared memory region accessed by a lock-free C++ component relates to the Java service that reads from it, or how a concurrent queue feeds into a COBOL-based processing pipeline.

靜態程式碼分析 capability identifies patterns associated with concurrency hazards: atomic loads without corresponding acquire semantics, CAS loops that do not update the expected value on failure, shared data structures where fields accessed by different threads are co-located without alignment padding. The 影響分析 capability traces what downstream components depend on a shared concurrent data structure, so that changes to the structure’s interface or reclamation strategy can be scoped correctly before the change is made.

For enterprise systems where lock-free components must coexist with legacy batch processing, message queuing middleware, and modern service layers, SMART TS XL“ 依賴映射 provides the architectural view of how concurrent data paths connect components written in different languages across the full system stack.

When Lock-Free Is the Wrong Choice

Lock-free is not always better than a mutex. The case for lock-free structures depends on the actual contention profile of the workload. At low thread counts or low contention, a well-implemented mutex is faster because it avoids the overhead of atomic retry loops and cache-line invalidation across cores.

Use a mutex when:

  • Thread count is low (below 4-8) or contention is infrequent
  • The critical section is long and complex (lock-free loops become expensive relative to the section)
  • Correctness is more important than throughput and the algorithm is simpler with a lock
  • The platform has efficient futex-based locking (Linux pthread_mutex, Windows SRWLOCK)

Use lock-free when:

  • Thread count is high and contention is sustained
  • Operations are short (CAS loop overhead is small relative to the operation)
  • Blocking is unacceptable (real-time threads, interrupt handlers, signal handlers)
  • Composing with other lock-free structures where a lock would introduce lock-order dependencies

Use wait-free when:

  • Every thread must complete in bounded time regardless of other threads
  • Hard real-time or safety-critical requirements where starvation is unacceptable
  • The algorithm can be structured to support bounded-step completion per thread

The discipline of lock-free programming is not choosing it everywhere but choosing it precisely where its properties, non-blocking progress, elimination of deadlock, tolerance of preemption, match the actual requirements of the system.