Queue and Stack Implementations: ArrayDeque Dominates

The JDK Javadoc for ArrayDeque states: “This class is likely to be faster than Stack when used as a stack, and faster than LinkedList when used as a queue.” This is an understatement. ArrayDeque is faster by a factor that grows with collection size, uses a fraction of the memory, and generates a fraction of the garbage. There is no workload where LinkedList is the correct choice for queue or stack operations.

ArrayDeque Internal Mechanics

ArrayDeque is a resizable circular buffer backed by an Object[]. It maintains two indices: head and tail. Elements occupy positions head through tail - 1 (wrapping around the array boundary).

// Simplified ArrayDeque layout (JDK 21 source, conceptual)
public class ArrayDeque<E> {
    Object[] elements;  // backing array, always power of 2 in size
    int head;           // index of first element
    int tail;           // index just past last element

    // addLast: store at tail, increment tail (mod length)
    public void addLast(E e) {
        elements[tail] = e;
        tail = (tail + 1) & (elements.length - 1); // bitwise mod
        if (tail == head) doubleCapacity();
    }

    // pollFirst: read from head, increment head (mod length)
    public E pollFirst() {
        int h = head;
        E result = (E) elements[h];
        elements[h] = null; // help GC
        head = (h + 1) & (elements.length - 1);
        return result;
    }
}

Key properties:

addFirst, addLast, pollFirst, pollLast: all O(1), no pointer allocation, no node creation
Resize: doubles the backing array when full, copies all elements. Amortized O(1)
Memory: one Object[] plus two int fields. No per-element overhead beyond the reference slot
Cache: the circular buffer is contiguous memory. Sequential access patterns hit the prefetcher

Contrast with LinkedList’s queue operations:

addLast: allocates a new Node (32 bytes), links it. O(1) but with allocation
pollFirst: unlinks a Node, which becomes garbage. O(1) but creates GC work
Memory: 32 bytes per element overhead in Node objects
Cache: each Node is at a random heap address. Following next pointers defeats the prefetcher

FIFO Benchmark: Offer and Poll

@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 5)
@Fork(2)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class FifoBenchmark {

    @Param({"100", "10000", "1000000"})
    private int operations;

    @Benchmark
    public int arrayDequeFifo() {
        ArrayDeque<String> deque = new ArrayDeque<>(operations);
        for (int i = 0; i < operations; i++) {
            deque.offer("item-" + i);
        }
        int count = 0;
        String item;
        while ((item = deque.poll()) != null) {
            count += item.length();
        }
        return count;
    }

    @Benchmark
    public int linkedListFifo() {
        LinkedList<String> list = new LinkedList<>();
        for (int i = 0; i < operations; i++) {
            list.offer("item-" + i);
        }
        int count = 0;
        String item;
        while ((item = list.poll()) != null) {
            count += item.length();
        }
        return count;
    }
}

Results:

Operations	ArrayDeque (ns/op)	LinkedList (ns/op)	Ratio	GC Alloc (ArrayDeque)	GC Alloc (LinkedList)
100	2,800	4,500	1.6x slower	2.4 KB	5.6 KB
10,000	310,000	720,000	2.3x slower	160 KB	480 KB
1,000,000	38,000,000	128,000,000	3.4x slower	16 MB	48 MB

LinkedList allocates 3x more memory because every offer creates a 32-byte Node object, and every poll turns that Node into garbage. ArrayDeque allocates the backing array once (with potential resizes) and reuses slots via index arithmetic.

The GC allocation column (measured with -prof gc) shows the real damage. At 1M operations, LinkedList creates 1M Node objects that the GC must trace and collect. ArrayDeque creates one large array and a few resize copies.

LIFO Benchmark: Push and Pop

The stack pattern is identical in outcome. LinkedList’s push maps to addFirst, which allocates a Node. pop maps to removeFirst, which frees a Node. ArrayDeque’s push maps to addFirst (decrement head index), and pop maps to removeFirst (increment head index).

@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 2)
@Measurement(iterations = 5, time = 5)
@Fork(2)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@State(Scope.Benchmark)
public class LifoBenchmark {

    @Param({"100", "10000", "1000000"})
    private int operations;

    @Benchmark
    public int arrayDequeLifo() {
        ArrayDeque<String> deque = new ArrayDeque<>(operations);
        for (int i = 0; i < operations; i++) {
            deque.push("item-" + i);
        }
        int count = 0;
        while (!deque.isEmpty()) {
            count += deque.pop().length();
        }
        return count;
    }

    @Benchmark
    public int linkedListLifo() {
        LinkedList<String> list = new LinkedList<>();
        for (int i = 0; i < operations; i++) {
            list.push("item-" + i);
        }
        int count = 0;
        while (!list.isEmpty()) {
            count += list.pop().length();
        }
        return count;
    }

    @Benchmark
    public int stackLifo() {
        Stack<String> stack = new Stack<>();
        for (int i = 0; i < operations; i++) {
            stack.push("item-" + i);
        }
        int count = 0;
        while (!stack.isEmpty()) {
            count += stack.pop().length();
        }
        return count;
    }
}

Operations	ArrayDeque (ns)	LinkedList (ns)	Stack (ns)	Best
100	2,600	4,200	3,200	ArrayDeque
10,000	280,000	680,000	350,000	ArrayDeque
1,000,000	34,000,000	120,000,000	42,000,000	ArrayDeque

Stack is faster than LinkedList because it extends Vector, which uses a backing array (like ArrayList). But every method is synchronized, adding monitor acquisition overhead even in single-threaded use. ArrayDeque has no synchronization and no per-element allocation, making it the clear winner for both patterns.

The JDK itself recommends against Stack: “A more complete and consistent set of LIFO stack operations is provided by the Deque interface and its implementations, which should be used in preference to this class.”

Content Platform Event Pipeline

The content platform processes article view events through a pipeline: events arrive from the web tier, get buffered, batched, and written to the analytics store. The buffer is a FIFO queue:

// SLOW: LinkedList as event buffer
public class EventBuffer {

    private final Queue<ViewEvent> buffer = new LinkedList<>(); // SLOW
    private static final int BATCH_SIZE = 500;

    public void onEvent(ViewEvent event) {
        buffer.offer(event);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    private void flush() {
        List<ViewEvent> batch = new ArrayList<>(BATCH_SIZE);
        for (int i = 0; i < BATCH_SIZE && !buffer.isEmpty(); i++) {
            batch.add(buffer.poll());
        }
        writeBatch(batch);
    }

    private void writeBatch(List<ViewEvent> batch) {
        // Write to analytics store
    }
}

Every offer allocates a Node. Every poll creates garbage. At 10,000 events per second, that is 10,000 Node allocations and 10,000 Node deallocations per second, or 320KB of garbage per second from the queue alone.

// FAST: ArrayDeque as event buffer
public class EventBuffer {

    private final ArrayDeque<ViewEvent> buffer = new ArrayDeque<>(1024); // FAST
    private static final int BATCH_SIZE = 500;

    public void onEvent(ViewEvent event) {
        buffer.offer(event);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    private void flush() {
        List<ViewEvent> batch = new ArrayList<>(BATCH_SIZE);
        for (int i = 0; i < BATCH_SIZE && !buffer.isEmpty(); i++) {
            batch.add(buffer.poll());
        }
        writeBatch(batch);
    }

    private void writeBatch(List<ViewEvent> batch) {
        // Write to analytics store
    }
}

Same API. Zero Node allocations. The ArrayDeque reuses its backing array slots. At steady state (offer and poll at equal rates), the backing array never grows and generates zero garbage.

Concurrent Queues: When Multiple Threads Compete

The single-threaded event buffer works for a single-writer, single-flusher pattern. The content platform’s production event pipeline has multiple web handler threads producing events and a dedicated flusher thread consuming them. This requires a thread-safe queue.

ArrayDeque and LinkedList are not thread-safe. The JDK provides several concurrent queue implementations:

ConcurrentLinkedQueue

Lock-free, unbounded, FIFO. Uses a non-blocking algorithm based on CAS (Compare-And-Swap) operations on linked nodes.

// FAST for high-contention multi-producer single-consumer
Queue<ViewEvent> queue = new ConcurrentLinkedQueue<>();

// Multiple producer threads
executor.submit(() -> {
    queue.offer(new ViewEvent(articleId, timestamp));
});

// Single consumer thread
executor.submit(() -> {
    ViewEvent event;
    while ((event = queue.poll()) != null) {
        process(event);
    }
});

Trade-off: Lock-free means no thread blocks, but CAS retries under contention add CPU cost. Each element still allocates a Node object (similar to LinkedList’s overhead). Unbounded means no backpressure: a slow consumer lets the queue grow without limit, risking OutOfMemoryError.

LinkedBlockingQueue

Lock-based, optionally bounded, FIFO. Uses two separate locks (one for put, one for take) so producers and consumers do not contend with each other.

// FAST for bounded producer-consumer with backpressure
BlockingQueue<ViewEvent> queue = new LinkedBlockingQueue<>(10_000);

// Producer: blocks if queue is full (backpressure)
queue.put(event); // blocks if size == 10,000

// Consumer: blocks if queue is empty
ViewEvent event = queue.take(); // blocks if empty

Trade-off: Bounded capacity provides backpressure but can cause producer threads to block. Still allocates Node objects per element. Two-lock design reduces contention between producers and consumers but producers still contend with each other.

ArrayBlockingQueue

Lock-based, bounded, FIFO. Uses a single lock and a circular array (like ArrayDeque but thread-safe).

// Best for bounded queue with low per-element overhead
BlockingQueue<ViewEvent> queue = new ArrayBlockingQueue<>(10_000);

Trade-off: Single lock means producers and consumers contend with each other. But no Node allocation: elements stored directly in the backing array. Fixed capacity, cannot resize. Best when queue size is predictable and you want minimal GC pressure.

Benchmark: Concurrent Queue Throughput

@BenchmarkMode(Mode.Throughput)
@Warmup(iterations = 3, time = 3)
@Measurement(iterations = 5, time = 5)
@Fork(2)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
@State(Scope.Benchmark)
@Threads(8)
public class ConcurrentQueueBenchmark {

    private ConcurrentLinkedQueue<String> concurrentLinked;
    private LinkedBlockingQueue<String> linkedBlocking;
    private ArrayBlockingQueue<String> arrayBlocking;

    @Setup(Level.Iteration)
    public void setup() {
        concurrentLinked = new ConcurrentLinkedQueue<>();
        linkedBlocking = new LinkedBlockingQueue<>(1_000_000);
        arrayBlocking = new ArrayBlockingQueue<>(1_000_000);
    }

    @Benchmark
    public void concurrentLinkedOfferPoll(Blackhole bh) {
        concurrentLinked.offer("event");
        bh.consume(concurrentLinked.poll());
    }

    @Benchmark
    public void linkedBlockingOfferPoll(Blackhole bh) {
        linkedBlocking.offer("event");
        bh.consume(linkedBlocking.poll());
    }

    @Benchmark
    public void arrayBlockingOfferPoll(Blackhole bh) {
        arrayBlocking.offer("event");
        bh.consume(arrayBlocking.poll());
    }
}

Queue	Throughput (ops/ms)	GC Alloc Rate (MB/s)
ConcurrentLinkedQueue	12,400	38.2
LinkedBlockingQueue	8,600	26.4
ArrayBlockingQueue	9,200	0.8

ConcurrentLinkedQueue has the highest raw throughput (lock-free), but the highest allocation rate (Node objects). ArrayBlockingQueue has near-zero allocation (fixed backing array, no Node objects) at competitive throughput. LinkedBlockingQueue falls in between: two-lock design reduces contention vs single lock, but Node allocation adds GC pressure.

For the content platform’s event pipeline, the recommendation:

// Production event pipeline: bounded ArrayBlockingQueue
// - Fixed capacity provides backpressure (prevent OOM)
// - Zero per-element allocation (events reuse array slots)
// - Predictable GC behavior under load
public class ProductionEventPipeline {

    private final BlockingQueue<ViewEvent> queue =
        new ArrayBlockingQueue<>(50_000);

    public boolean submit(ViewEvent event) {
        return queue.offer(event); // non-blocking, returns false if full
    }

    public void startConsumer() {
        Thread.startVirtualThread(() -> {
            List<ViewEvent> batch = new ArrayList<>(500);
            while (!Thread.currentThread().isInterrupted()) {
                batch.clear();
                try {
                    batch.add(queue.take()); // block until one available
                    queue.drainTo(batch, 499); // drain up to 499 more
                    writeBatch(batch);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
    }

    private void writeBatch(List<ViewEvent> batch) {
        // Write to analytics store
    }
}

drainTo is a bulk operation that transfers multiple elements in a single lock acquisition, reducing per-element lock overhead compared to calling poll in a loop.

Decision Matrix

Pattern	Best Choice	Avoid
Single-thread FIFO	ArrayDeque	LinkedList, ArrayList
Single-thread LIFO	ArrayDeque	Stack, LinkedList
Multi-producer, single-consumer, unbounded	ConcurrentLinkedQueue	synchronized + LinkedList
Multi-producer, multi-consumer, bounded	ArrayBlockingQueue	LinkedBlockingQueue (if GC-sensitive)
Multi-producer, multi-consumer, high contention	LinkedBlockingQueue	ArrayBlockingQueue (single lock)
Priority ordering	PriorityQueue (single-thread) or PriorityBlockingQueue	sorted LinkedList

The theme is consistent: array-backed implementations outperform node-based implementations for the same access pattern. The difference is cache locality and allocation overhead. Node-based structures pay a per-element allocation cost that compounds into GC pressure, cache pollution, and memory fragmentation. Array-backed structures pay a resize cost when they grow, which is amortized and avoidable with capacity hints.