Characterizing AWS Graviton Memory Subsystems: Graviton2 vs. Graviton4 Performance
These articles are AI-generated summaries. Please check the original sources for full details.
Characterize the AWS Graviton memory subsystem using ASCT
The Arm System Characterization Tool (ASCT) provides an empirical method to analyze the cache hierarchy of AWS Graviton instances. Benchmarks show that Graviton4’s Neoverse V2 cores double the private L2 cache size to 2 MB per core compared to Graviton2.
Why This Matters
For memory-bound workloads, CPU clock speed is secondary to how efficiently code accesses the cache hierarchy and DRAM. Ignoring these boundaries leads to ‘performance cliffs’ where a slight increase in the working set size can trigger a jump from L2 latency (4.0 ns on Graviton4) to LLC or DRAM latency, drastically reducing throughput despite available compute cycles.
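To make the cliff concrete, the sketch below reports the smallest cache level that can hold a given working set. The helper name `fits_in` and its `label:size_KiB` argument format are hypothetical, introduced only for illustration. The 2048 KiB L2 size comes from the text above; the L1D and LLC sizes in the usage line are assumed placeholders and should be replaced with the values your instance reports in sysfs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first cache level large enough for a working set.
# Usage: fits_in <working_set_KiB> <label:size_KiB> [<label:size_KiB> ...]
# Levels must be listed from smallest to largest.
fits_in() {
    local ws_kib=$1
    shift
    local entry label size
    for entry in "$@"; do
        label=${entry%%:*}   # text before the first colon
        size=${entry##*:}    # text after the last colon
        if [ "$ws_kib" -le "$size" ]; then
            echo "$label"
            return 0
        fi
    done
    # Nothing was large enough, so the working set spills to main memory.
    echo "DRAM"
}
```

For example, `fits_in 1536 L1D:64 L2:2048 LLC:36864` prints `L2`, while raising the working set to 3072 KiB prints `LLC`. That change of level is the boundary where latency jumps.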
Key Insights
- Graviton4 (Neoverse V2) moves significantly more data per cycle than Graviton2 (Neoverse N1): 114.64 B/cycle at L1 versus 63.76 B/cycle (2026 analysis).
- The transition from DDR4 in Graviton2 to DDR5 in Graviton4 increases unloaded baseline DRAM latency from 95 ns to 110 ns, but provides higher throughput via independent dual 32-bit subchannels.
- ASCT uses pointer chasing to defeat hardware prefetchers and measure true round-trip memory latency: each load's address depends on the result of the previous load, so the accesses can be neither overlapped nor predicted.
- Graviton4 demonstrates a 26% latency improvement in both L2 and LLC tiers compared to its predecessor.
Working Examples
Script to enumerate cache properties directly from sysfs.
for c in /sys/devices/system/cpu/cpu0/cache/index*; do
    echo "=== $(basename "$c") ==="
    echo "Level: $(cat "$c/level")"
    echo "Type: $(cat "$c/type")"
    echo "Size: $(cat "$c/size")"
    echo "Shared CPU list: $(cat "$c/shared_cpu_list")"
    echo
done
Script to generate a structured summary of cache associativity and line sizes.
for cpu in 0; do
    echo "=== CPU $cpu ==="
    for idx in /sys/devices/system/cpu/cpu${cpu}/cache/index*; do
        level=$(cat "$idx/level")
        type=$(cat "$idx/type")
        size=$(cat "$idx/size")
        ways=$(cat "$idx/ways_of_associativity")
        line=$(cat "$idx/coherency_line_size")
        shared=$(cat "$idx/shared_cpu_list")
        echo " L${level} ${type}: ${size}, ${ways}-way, ${line}B line, shared with CPUs: ${shared}"
    done
done
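The size, associativity, and line size printed by that summary are related by sets = size / (ways × line size). The sketch below applies that formula. The function name `cache_sets` is hypothetical, and the numbers in the usage note are illustrative rather than measured on any Graviton instance.

```shell
#!/usr/bin/env bash
# Hypothetical helper: derive the number of sets in a set-associative cache.
# Usage: cache_sets <size_KiB> <ways> <line_size_bytes>
cache_sets() {
    local size_kib=$1 ways=$2 line=$3
    # Convert KiB to bytes, then divide by the bytes held in one set.
    echo $(( size_kib * 1024 / (ways * line) ))
}
```

For instance, `cache_sets 64 4 64` prints `256`. On Linux the result can be cross-checked against the `number_of_sets` file in the same sysfs cache directory, where the kernel exposes it.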
Command to automate pointer chase benchmarks for cache and memory latency.
sudo asct run latency-sweep --output-dir latency_results_$(hostname)
Command to measure peak single-core throughput across different cache levels.
sudo asct run bandwidth-sweep --output-dir bandwidth_results_$(hostname)
Practical Applications
- Use Case: High-throughput multi-threaded workloads on c8g instances benefit from the doubled private L2 cache (2 MB), which minimizes coherency traffic over the shared L3 interconnect.
- Pitfall: Random memory access patterns that ignore the 64-byte cache line size lead to wasted bandwidth by fetching data that is never used.
- Use Case: Application tuning for Neoverse V deployments, using the ASCT diff utility to quantify performance gains between instance generations.
- Pitfall: Relying solely on cycle counts for comparison across different processor clock speeds; failure to normalize results into nanoseconds or bytes per cycle leads to inaccurate architectural assessments.
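The normalization described in the last pitfall is simple arithmetic: latency in nanoseconds is cycles divided by the clock in GHz, and bandwidth in bytes per cycle is GB/s divided by the clock in GHz. The sketch below wraps both conversions. The function names are hypothetical, and the clock frequency must be supplied by you for the instance under test; the values in the usage note are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for clock-independent comparisons.
# awk is used because shell arithmetic is integer-only.

# Usage: cycles_to_ns <cycles> <clock_GHz>
cycles_to_ns() {
    awk -v c="$1" -v ghz="$2" 'BEGIN { printf "%.2f\n", c / ghz }'
}

# Usage: gbps_to_bytes_per_cycle <bandwidth_GBps> <clock_GHz>
# (1e9 bytes/s) / (1e9 cycles/s) = bytes per cycle
gbps_to_bytes_per_cycle() {
    awk -v bw="$1" -v ghz="$2" 'BEGIN { printf "%.2f\n", bw / ghz }'
}
```

With illustrative inputs, `cycles_to_ns 10 2.5` prints `4.00` and `gbps_to_bytes_per_cycle 128 2` prints `64.00`. Comparing two generations on these normalized figures separates clock-speed gains from microarchitectural ones.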