Skip to main content

On This Page

Characterizing AWS Graviton Memory Subsystems: Graviton2 vs. Graviton4 Performance

3 min read
Share

These articles are AI-generated summaries. Please check the original sources for full details.

Characterize the AWS Graviton memory subsystem using ASCT

The Arm System Characterization Tool (ASCT) provides an empirical method to analyze the cache hierarchy of AWS Graviton instances. Benchmarks show that Graviton4’s Neoverse V2 cores double the private L2 cache size to 2 MB per core compared to Graviton2.

Why This Matters

For memory-bound workloads, CPU clock speed is secondary to how efficiently code accesses the cache hierarchy and DRAM. Ignoring these boundaries leads to ‘performance cliffs’ where a slight increase in the working set size can trigger a jump from L2 latency (4.0 ns on Graviton4) to LLC or DRAM latency, drastically reducing throughput despite available compute cycles.

Key Insights

  • Graviton4 (Neoverse V2) achieves significantly higher architectural efficiency than Graviton2 (Neoverse N1), specifically providing 114.64 B/cycle at L1 versus 63.76 B/cycle (2026 analysis).
  • The transition from DDR4 in Graviton2 to DDR5 in Graviton4 increases unloaded baseline DRAM latency from 95 ns to 110 ns, but provides higher throughput via independent dual 32-bit subchannels.
  • Pointer chasing is used via ASCT to bypass hardware prefetchers and measure true round-trip memory latency by creating a dependent chain of load instructions.
  • Graviton4 demonstrates a 26% latency improvement in both L2 and LLC tiers compared to its predecessor.

Working Examples

Script to enumerate cache properties directly from sysfs.

for c in /sys/devices/system/cpu/cpu0/cache/index*; do
echo "=== $(basename $c) ==="
echo "Level: $(cat $c/level)"
echo "Type: $(cat $c/type)"
echo "Size: $(cat $c/size)"
echo "Shared CPU list: $(cat $c/shared_cpu_list)"
echo
done

Script to generate a structured summary of cache associativity and line sizes.

for cpu in 0; do
echo "=== CPU $cpu ==="
for idx in /sys/devices/system/cpu/cpu${cpu}/cache/index*; do
level=$(cat $idx/level)
type=$(cat $idx/type)
size=$(cat $idx/size)
ways=$(cat $idx/ways_of_associativity)
line=$(cat $idx/coherency_line_size)
shared=$(cat $idx/shared_cpu_list)
echo " L${level} ${type}: ${size}, ${ways}-way, ${line}B line, shared with CPUs: ${shared}"
done
done

Command to automate pointer chase benchmarks for cache and memory latency.

sudo asct run latency-sweep --output-dir latency_results_$(hostname)

Command to measure peak single-core throughput across different cache levels.

sudo asct run bandwidth-sweep --output-dir bandwidth_results_$(hostname)

Practical Applications

    • Use Case: High-throughput multi-threaded workloads on c8g instances benefit from the doubled private L2 cache (2 MB), which minimizes coherency traffic over the shared L3 interconnect.
  • Pitfall: Random memory access patterns that ignore the 64-byte cache line size lead to wasted bandwidth by fetching data that is never used.
    • Use Case: Application tuning for Neoverse V deplooyments using ASCT diff utility to quantify performance gains between instance generations.
  • Pitfall: Relying solely on cycle counts for comparison across different processor clock speeds; failure to normalize results into nanoseconds or bytes per cycle leads to inaccurate architectural assessments.

References:

Continue reading

Next article

Demystifying the JavaScript Event Loop: How Asynchronous Processing Works

Related Content