Noisy Neighbors and CPU Pinning

The main chapter showed the article service’s P99 jumping from 18ms to 95ms when a batch job landed on the same node, with L3 cache hit rate dropping from 94% to 72%. This section goes deeper: how to detect cache contention, what the cpuset cgroup controller actually does, how the kubelet’s static CPU manager policy allocates cores, how NUMA topology affects pinning decisions, and the benchmark results that prove the isolation works.

Anatomy of Cache Contention

A modern server CPU has three levels of cache. L1 and L2 are private to each core. L3 is shared across all cores on a socket. The contention problem lives in L3.

Intel Xeon Gold 6248 (common in Kubernetes clusters):

Per core:
  L1 instruction cache:   32 KB    (4-cycle latency)
  L1 data cache:          32 KB    (4-cycle latency)
  L2 unified cache:       1 MB     (12-cycle latency)

Shared per socket:
  L3 unified cache:       27.5 MB  (44-cycle latency)

Main memory:
  DDR4-2933:              ~80ns    (~200 cycles at 2.5 GHz)

Ratio of L3 miss penalty to L3 hit:
  80ns / 17.6ns = 4.5x slower

When two pods share a socket, their combined working sets compete for L3 space. The cache uses a pseudo-LRU replacement policy. A pod scanning a large dataset sequentially floods the cache with data it will never reuse, evicting hot data from other pods.
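
To make the cost of a falling hit rate concrete, here is a rough worked calculation using the latencies listed above and the 94% and 72% L3 hit rates from the main chapter. It is a simplification: it assumes every L3 miss costs exactly one 80ns trip to main memory and ignores L1/L2 hits, prefetching, and overlapping requests.

```python
# Rough model: average cost of a load that reaches L3.
# Assumption: an L3 miss costs one DRAM access (80 ns); L1/L2 are ignored.
CLOCK_GHZ = 2.5        # Xeon Gold 6248 base clock, from the table above
L3_HIT_CYCLES = 44     # from the table above
DRAM_NS = 80.0         # from the table above


def l3_hit_ns(cycles=L3_HIT_CYCLES, clock_ghz=CLOCK_GHZ):
    """Convert an L3 hit from cycles to nanoseconds."""
    return cycles / clock_ghz


def avg_access_ns(l3_hit_rate):
    """Hit-rate-weighted average of L3 hit latency and DRAM latency."""
    return l3_hit_rate * l3_hit_ns() + (1.0 - l3_hit_rate) * DRAM_NS


before = avg_access_ns(0.94)   # hit rate before the batch job arrived
after = avg_access_ns(0.72)    # hit rate with the batch job co-located

print(f"L3 hit: {l3_hit_ns():.1f} ns")
print(f"average at 94% hits: {before:.1f} ns")
print(f"average at 72% hits: {after:.1f} ns ({after / before:.2f}x)")
```

Under these assumptions the average access that reaches L3 gets roughly 1.6x slower, even though the hit rate only fell by 22 percentage points.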

Measuring the Contention

Intel Resource Director Technology (RDT) exposes per-process cache and memory bandwidth metrics through the resctrl filesystem. On nodes that support it, you can measure exactly how much cache each pod occupies:

# Check RDT support (CPU feature flags)
grep -o 'cat_l3\|cqm_llc\|cqm_occup_llc\|cqm_mbm_local\|mba' /proc/cpuinfo | sort -u

# Mount resctrl filesystem
mount -t resctrl resctrl /sys/fs/resctrl

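# Create a monitoring group and move every thread of the container's
# main process into it (writing one PID to tasks moves only that thread)
CONTAINER_PID=$(crictl inspect "$CONTAINER_ID" | jq .info.pid)
mkdir -p /sys/fs/resctrl/mon_groups/article-service
for TID in /proc/"$CONTAINER_PID"/task/*; do
  echo "${TID##*/}" > /sys/fs/resctrl/mon_groups/article-service/tasks
done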

# Read L3 cache occupancy
cat /sys/fs/resctrl/mon_groups/article-service/mon_data/mon_L3_00/llc_occupancy
# Output: 18874368  (bytes, ~18MB)

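# Read the memory bandwidth counter (cumulative bytes, not a rate)
cat /sys/fs/resctrl/mon_groups/article-service/mon_data/mon_L3_00/mbm_local_bytes
# Output: 2936012800  (cumulative bytes; sample twice and divide the
# difference by the interval to get bytes/second)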
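
Because mbm_local_bytes is a running byte count, a bandwidth figure needs two samples. The sketch below shows one way to do that; the function names are illustrative, and the path passed to sample_bandwidth is whatever monitoring group file you created above.

```python
import time


def bandwidth_bytes_per_sec(first_bytes, second_bytes, interval_s):
    """Turn two readings of a cumulative byte counter into a rate."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    delta = second_bytes - first_bytes
    if delta < 0:
        raise ValueError("counter went backwards between samples")
    return delta / interval_s


def read_counter(path):
    """Read one integer counter value from a resctrl monitoring file.

    int() raises ValueError if the file does not contain a number.
    """
    with open(path) as f:
        return int(f.read().strip())


def sample_bandwidth(path, interval_s=1.0):
    """Sample the counter twice, interval_s apart, and return bytes/second."""
    first = read_counter(path)
    time.sleep(interval_s)
    second = read_counter(path)
    return bandwidth_bytes_per_sec(first, second, interval_s)
```

A longer interval smooths out bursts; a shorter one shows them but is noisier.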

Without RDT, perf stat provides indirect measurement through hardware performance counters:

# Measure cache behavior for a container's processes
perf stat -e cache-references,cache-misses,LLC-loads,LLC-load-misses,\
LLC-stores,LLC-store-misses,instructions,cycles \
-p $CONTAINER_PID -- sleep 10

# Before batch job co-location:
#  LLC-load-misses:     1,247,000  (5.8% of LLC-loads)
#  cache-misses:        3,891,000  (6.2% of cache-references)
#  instructions:        48.2B
#  IPC:                 2.1

# After batch job co-location:
#  LLC-load-misses:     8,934,000  (28.2% of LLC-loads)   +616%
#  cache-misses:        14,221,000 (18.7% of cache-references)  +265%
#  instructions:        48.1B      (same workload)
#  IPC:                 1.3        (-38%)

The instruction count stays the same because the article service is doing the same work. But each instruction takes longer because data is not in cache. IPC (Instructions Per Cycle) drops by 38%, which directly translates to higher latency.
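
The relationship between IPC and time spent can be checked with a few lines of arithmetic on the perf numbers above. This is only a sanity check on the quoted figures, not a latency model.

```python
def pct_change(before, after):
    """Percentage change from before to after."""
    return (after - before) / before * 100.0


def cycles_needed(instructions, ipc):
    """Cycles required to retire a given number of instructions at a given IPC."""
    return instructions / ipc


# Figures from the perf stat output above.
before_cycles = cycles_needed(48.2e9, 2.1)
after_cycles = cycles_needed(48.1e9, 1.3)

print(f"IPC change:             {pct_change(2.1, 1.3):+.0f}%")
print(f"LLC-load-misses change: {pct_change(1_247_000, 8_934_000):+.0f}%")
print(f"cycles for same work:   {pct_change(before_cycles, after_cycles):+.0f}%")
```

A 38% drop in IPC means the same instruction stream needs about 61% more cycles, which is why the latency impact is larger than the IPC figure alone suggests.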

Common Noisy Neighbor Patterns

Not all co-located workloads cause contention equally. The worst offenders have specific access patterns:

Contention severity by workload pattern:

Pattern                          L3 Impact    Bandwidth Impact
───────────────────────────────────────────────────────────────
Sequential scan (log processing) SEVERE       HIGH
  Streams data through cache, evicts everything

Random access (key-value lookup)  LOW          LOW
  Small working set, high reuse, fits in L3

Large working set (ML inference)  SEVERE       MODERATE
  Model weights compete for cache space

Streaming writes (event logging)  MODERATE     HIGH
  Write-allocate policy fills cache lines

Small tight loops (crypto, JSON)  LOW          LOW
  Hot code stays in L1/L2, minimal L3 pressure

The analytics pipeline is the worst offender because it sequentially scans event data. Each 64-byte cache line is loaded, used once, and evicted. The cache replacement policy cannot distinguish between this scan traffic and the article service’s hot data. Both get the same treatment.
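
The eviction effect can be reproduced with a toy simulation. The sketch below models a single fully associative cache with true LRU replacement, which is simpler than a real set-associative pseudo-LRU L3; the sizes are arbitrary illustration values, not measurements.

```python
from collections import OrderedDict
from itertools import count


class LRUCache:
    """Toy fully associative cache with true LRU replacement."""

    def __init__(self, capacity_lines):
        self.capacity = capacity_lines
        self.lines = OrderedDict()

    def access(self, line):
        """Touch a cache line; return True on a hit, False on a miss."""
        if line in self.lines:
            self.lines.move_to_end(line)
            return True
        self.lines[line] = None
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)  # evict the least recently used line
        return False


def hot_hit_rate(capacity, hot_lines, rounds, scan_lines_per_hot_access):
    """Hit rate of a looping hot working set, optionally interleaved with a scan."""
    cache = LRUCache(capacity)
    scan_addr = count()  # scan addresses never repeat
    for line in range(hot_lines):  # warm the cache with the hot set
        cache.access(("hot", line))
    hits = total = 0
    for _ in range(rounds):
        for line in range(hot_lines):
            hits += cache.access(("hot", line))
            total += 1
            for _ in range(scan_lines_per_hot_access):
                cache.access(("scan", next(scan_addr)))
    return hits / total


print("hot set alone:     ", hot_hit_rate(100, 80, 10, 0))
print("hot set plus scan: ", hot_hit_rate(100, 80, 10, 1))
```

The hot set fits in the cache with room to spare, yet one scanned line per hot access is enough to push almost every hot line out before it is reused.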

The cpuset Cgroup Controller

Linux restricts where processes run through the cpuset cgroup controller. A process assigned to a cpuset can only execute on the cores listed in that set, and the kernel scheduler will not migrate it elsewhere. A cpuset by itself does not keep other processes off those cores; exclusivity only results when every other cpuset on the node omits them, which is what the kubelet arranges.

# View current cpuset for a container
cat /sys/fs/cgroup/cpuset/kubepods/pod$POD_UID/$CONTAINER_ID/cpuset.cpus
# Default (no pinning): 0-15
# With static policy:   4-7

# View memory nodes (NUMA) for the same container
cat /sys/fs/cgroup/cpuset/kubepods/pod$POD_UID/$CONTAINER_ID/cpuset.mems
# Default: 0-1 (both NUMA nodes)
# With single-numa-node and the Memory Manager Static policy: 0 (only local node)

When the kubelet pins a container to cores 4-7, it also removes cores 4-7 from the cpuset of all non-pinned containers. The shared pool shrinks. This is the enforcement mechanism: exclusion, not just assignment.

Core allocation with static CPU manager policy:

Node: 16 cores (0-15), 2 NUMA nodes (0: cores 0-7, 1: cores 8-15)

Reserved for system (kubelet --reserved-cpus):
  Core 0: kubelet, OS processes, kernel threads

Pinned pods (Guaranteed, integer CPU):
  article-service (4 CPU): cores 1-4
  search-api (2 CPU):      cores 5-6

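Shared pool (all remaining pods):
  cores 7-15
  (by default the kubelet also leaves reserved core 0 in this pool;
   the strict-cpu-reservation policy option removes it)

Result (with strict-cpu-reservation enabled):
  BestEffort pod cpuset:  7-15 (9 cores)
  Burstable pod cpuset:   7-15 (9 cores)
  article-service cpuset: 1-4  (exclusive)
  search-api cpuset:      5-6  (exclusive)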
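
The exclusion logic amounts to set subtraction over CPU lists. The sketch below is an illustrative model, not kubelet code: the function names are hypothetical, and it keeps the reserved core out of the shared pool as the layout above does.

```python
def parse_cpu_list(text):
    """Parse a cpuset-style list such as '0-3,8' into a set of CPU ids."""
    cpus = set()
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def format_cpu_list(cpus):
    """Format a set of CPU ids back into the compact cpuset notation."""
    ordered = sorted(cpus)
    ranges = []
    i = 0
    while i < len(ordered):
        start = end = ordered[i]
        while i + 1 < len(ordered) and ordered[i + 1] == end + 1:
            i += 1
            end = ordered[i]
        ranges.append(str(start) if start == end else f"{start}-{end}")
        i += 1
    return ",".join(ranges)


def shared_pool(all_cpus, reserved, exclusive_assignments):
    """CPUs left for non-pinned containers after reservations and pinning."""
    pool = set(all_cpus) - set(reserved)
    for cpus in exclusive_assignments.values():
        pool -= cpus
    return pool


pinned = {
    "article-service": parse_cpu_list("1-4"),
    "search-api": parse_cpu_list("5-6"),
}
pool = shared_pool(parse_cpu_list("0-15"), {0}, pinned)
print(format_cpu_list(pool))
```

Every new exclusive assignment shrinks the pool for everyone else, which is why pinning too many pods on one node starves the shared workloads.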

Configuring the Static Policy

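The kubelet reads the CPU manager policy at startup. Switching policies on a running node means draining the node, stopping the kubelet, deleting its CPU manager state file (/var/lib/kubelet/cpu_manager_state by default), and starting the kubelet again with the new policy.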

# kubelet configuration for the content platform's latency-sensitive nodes
kubelet \
  --cpu-manager-policy=static \
  --cpu-manager-reconcile-period=10s \
  --reserved-cpus=0 \
  --kube-reserved=cpu=1000m,memory=2Gi \
  --system-reserved=cpu=500m,memory=1Gi

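The --reserved-cpus=0 flag sets core 0 aside for system and Kubernetes daemons: the kubelet never hands it out as an exclusive core. The flag does not move any processes by itself, so confining system daemons to core 0 (for example with systemd’s CPUAffinity setting) is a separate step. An explicit --reserved-cpus list also supersedes the CPU amounts in --kube-reserved and --system-reserved, so only their memory values take effect here. The reconcile period controls how often the kubelet checks that cpuset assignments match the desired state. The default of 10 seconds is sufficient; shorter periods add overhead without benefit.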

Pods that qualify for CPU pinning must meet all conditions:

CPU pinning eligibility:

1. Pod QoS class = Guaranteed
   (requests == limits for all containers, both cpu and memory)

2. CPU request is an integer
   (cpu: "2" qualifies, cpu: "1500m" does NOT)

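3. Init containers also set requests == limits
   (otherwise the pod is not Guaranteed). The integer check in
   condition 2 applies per container: only containers with an
   integer CPU request get exclusive cores.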

A Guaranteed pod with cpu: "1500m" does not get pinned. It runs in the shared pool alongside Burstable and BestEffort pods. The integer requirement exists because a partial core cannot be exclusively assigned. You cannot give a process “1.5 cores” via cpuset; it is either on the core or not.

# SLOW: Guaranteed but not pinned (fractional CPU)
# Still shares cores with all other pods in the shared pool
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: article-service
    resources:
      requests:
        cpu: "1500m"    # Not an integer
        memory: "4Gi"
      limits:
        cpu: "1500m"
        memory: "4Gi"

# FAST: Guaranteed and pinned (integer CPU)
# Gets exclusive cores, no sharing
apiVersion: v1
kind: Pod
spec:
  containers:
  - name: article-service
    resources:
      requests:
        cpu: "2"        # Integer, eligible for pinning
        memory: "4Gi"
      limits:
        cpu: "2"
        memory: "4Gi"
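
The eligibility rules can be expressed as a small check. This is a simplified sketch, not the kubelet's implementation: the dictionary shape and function names are hypothetical, memory quantities are compared as plain strings, and it requires requests to be written out explicitly.

```python
from decimal import Decimal


def cpu_millicores(quantity):
    """Convert a CPU quantity such as '2', '0.5' or '1500m' to millicores."""
    q = quantity.strip()
    if q.endswith("m"):
        return int(q[:-1])
    return int(Decimal(q) * 1000)


def is_guaranteed(containers):
    """Simplified Guaranteed check: cpu and memory requests equal limits everywhere."""
    for c in containers:
        req, lim = c.get("requests", {}), c.get("limits", {})
        if not all(r in req and r in lim for r in ("cpu", "memory")):
            return False
        if cpu_millicores(req["cpu"]) != cpu_millicores(lim["cpu"]):
            return False
        if req["memory"] != lim["memory"]:  # string comparison is a simplification
            return False
    return True


def gets_exclusive_cores(container, all_containers):
    """True if this container would be pinned under the static policy."""
    millis = cpu_millicores(container["requests"]["cpu"])
    return is_guaranteed(all_containers) and millis > 0 and millis % 1000 == 0


fractional = {"requests": {"cpu": "1500m", "memory": "4Gi"},
              "limits": {"cpu": "1500m", "memory": "4Gi"}}
whole = {"requests": {"cpu": "2", "memory": "4Gi"},
         "limits": {"cpu": "2", "memory": "4Gi"}}

print(gets_exclusive_cores(fractional, [fractional]))
print(gets_exclusive_cores(whole, [whole]))
```

Note that "2000m" passes the integer check just as "2" does: what matters is the value, not how it is written.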

The trade-off is granularity. A service that needs 1.5 cores of CPU must request 2 to get pinning. The extra 0.5 core sits idle when the service is not bursting. For the content platform’s article service, the measured utilization is 65% of the 2 pinned cores, meaning 0.7 cores are wasted on average. The P99 improvement from 95ms to 19ms justifies the 35% idle overhead.

Benchmark: Shared vs Pinned CPU

The following benchmark ran on the content platform’s staging cluster, comparing article service performance under three configurations. The workload is identical: 200 concurrent users requesting rendered articles at a steady rate of 1000 requests/second.

Test environment:
  Node: 16 cores (Intel Xeon Gold 6248, 2.5 GHz)
  Memory: 64GB DDR4-2933
  L3 cache: 27.5MB
  Co-located workload: analytics batch job (4 CPU, sequential scan)

Configuration A: Default scheduling (no pinning, no anti-affinity)
  article-service: 2 CPU Guaranteed, shared cores
  analytics-job: 4 CPU Burstable, shared cores
  Both run on same 16-core shared pool

Configuration B: Anti-affinity only (no pinning)
  article-service: 2 CPU Guaranteed, shared cores, separate node
  analytics-job: runs on different node
  Article service still shares cores with monitoring/logging pods

Configuration C: CPU pinning + anti-affinity + NUMA alignment
  article-service: 2 CPU Guaranteed, pinned to cores 1-2, NUMA node 0
  analytics-job: runs on different node
  Monitoring pods restricted to shared pool (cores 8-15)

Results (10-minute sustained load, 600K total requests):

Metric              Config A      Config B      Config C
──────────────────────────────────────────────────────────
P50 latency         14ms          11ms          9ms
P90 latency         38ms          16ms          13ms
P99 latency         95ms          28ms          19ms
P999 latency        210ms         52ms          24ms
Max latency         380ms         95ms          31ms

L3 cache hit rate   72%           91%           97%
IPC                 1.3           1.9           2.2
Context switches/s  4,200         1,800         120
CPU throttle events 0             0             0

Throughput          1,000 rps     1,000 rps     1,000 rps
CPU utilization     78%           55%           62%

Configuration A to B: removing the noisy neighbor from the same node drops P99 from 95ms to 28ms. L3 cache hit rate recovers from 72% to 91%. The remaining gap from 91% to 97% comes from monitoring and logging sidecars that share the core pool.

Configuration B to C: pinning eliminates context switches almost entirely (1,800 to 120 per second) and pushes L3 hit rate to 97%. The remaining 3% misses are compulsory misses from new data. P99 drops from 28ms to 19ms, and P999 drops from 52ms to 24ms. The tail latency improvement is the most significant: the gap between P99 and P999 shrinks from 24ms to 5ms, indicating highly predictable performance.
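
The step-by-step improvements can be recomputed directly from the results table. The snippet below only restates the table's numbers; nothing in it is a new measurement.

```python
# Values copied from the results table above (latency in ms, switches per second).
results = {
    "A": {"p99": 95, "p999": 210, "ctx_switches": 4200},
    "B": {"p99": 28, "p999": 52, "ctx_switches": 1800},
    "C": {"p99": 19, "p999": 24, "ctx_switches": 120},
}


def reduction_pct(before, after):
    """Percentage reduction going from before to after."""
    return (before - after) / before * 100.0


def tail_spread(config):
    """Gap between P999 and P99 for one configuration, in ms."""
    return results[config]["p999"] - results[config]["p99"]


for step in (("A", "B"), ("B", "C")):
    old, new = (results[c] for c in step)
    print(f"{step[0]} -> {step[1]}: "
          f"P99 -{reduction_pct(old['p99'], new['p99']):.0f}%, "
          f"context switches -{reduction_pct(old['ctx_switches'], new['ctx_switches']):.0f}%, "
          f"tail spread {tail_spread(step[0])}ms -> {tail_spread(step[1])}ms")
```

Anti-affinity delivers the larger P99 reduction, while pinning delivers the larger reduction in context switches and in the spread between P99 and P999.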

NUMA Topology Management

CPU pinning assigns cores. NUMA alignment ensures the memory those cores access is physically close. Without NUMA awareness, a pod pinned to cores on Socket 0 might allocate memory pages on Socket 1, paying the cross-socket penalty on every access.

Topology Manager Policies

The kubelet’s topology manager coordinates resource allocation across CPU, memory, and devices:

kubelet \
  --topology-manager-policy=single-numa-node \
  --topology-manager-scope=pod

Four policies are available:

Topology manager policies:

none (default):
  No topology alignment. CPU and memory allocated independently.
  A pod might get cores on NUMA 0 and memory on NUMA 1.

best-effort:
  Try to align resources to a single NUMA node.
  If alignment is impossible, admit the pod anyway.
  No admission failures from topology.

restricted:
  Require the narrowest NUMA alignment the request allows for
  Guaranteed pods with integer CPU (one node if the request fits
  in one, otherwise the minimum number of nodes).
  Pod rejected at kubelet admission if that alignment is unavailable.

single-numa-node:
  ALL resources for an eligible pod must come from one NUMA node.
  Rejected at kubelet admission if no single NUMA node has enough
  free resources.
  Strictest policy, strongest isolation.

The content platform uses single-numa-node for its latency-sensitive node pool and best-effort for the batch pool. The batch pool does not need strict alignment because its workloads are throughput-oriented and tolerate variable memory latency.
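
The difference between the policies is easiest to see as an admission decision. The sketch below is a deliberately simplified, CPU-only model and not the kubelet's actual hint-merging algorithm; the function name and parameters are hypothetical.

```python
from itertools import combinations
from math import ceil


def admits(policy, request_cpus, free_per_node, capacity_per_node):
    """Would a pod asking for request_cpus exclusive CPUs be admitted?

    free_per_node: free CPUs currently available on each NUMA node.
    capacity_per_node: total CPUs per NUMA node (assumed equal across nodes).
    """
    if policy in ("none", "best-effort"):
        return True  # alignment is never a reason to reject
    if policy == "single-numa-node":
        return any(free >= request_cpus for free in free_per_node)
    if policy == "restricted":
        # Narrowest alignment that could ever hold this request.
        min_nodes = ceil(request_cpus / capacity_per_node)
        if min_nodes > len(free_per_node):
            return False
        return any(sum(combo) >= request_cpus
                   for combo in combinations(free_per_node, min_nodes))
    raise ValueError(f"unknown policy: {policy}")


# Fragmented node: 6 CPUs free in total, but only 3 on each NUMA node.
for policy in ("best-effort", "restricted", "single-numa-node"):
    print(policy, admits(policy, 4, [3, 3], 8))

# A 12-CPU request can never fit on one 8-CPU node.
print("restricted, 12 CPUs:", admits("restricted", 12, [8, 8], 8))
print("single-numa-node, 12 CPUs:", admits("single-numa-node", 12, [8, 8], 8))
```

The fragmented case shows why a node can have enough free CPUs in total and still reject a pod under the stricter policies.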

Memory Allocation with NUMA

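When a pod is pinned to cores on NUMA node 0 under the single-numa-node policy, its pages end up on NUMA node 0 for two reasons. The kernel’s default policy allocates a page on the node of the CPU that first touches it, and the pinned threads only run on node 0. Where the kubelet’s Memory Manager runs with its Static policy, the container’s cpuset.mems is additionally restricted to node 0, which makes the binding strict: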

# Verify NUMA memory binding for a pinned pod
numastat -p $CONTAINER_PID

# Correctly aligned (single-numa-node):
# Per-node process memory usage (in MBs)
# PID             Node 0    Node 1    Total
# 28451           3891.2    12.8      3904.0
#                 (99.7%)   (0.3%)

# Misaligned (no topology manager):
# PID             Node 0    Node 1    Total
# 28451           2194.6    1709.4    3904.0
#                 (56.2%)   (43.8%)

The 0.3% on Node 1 in the aligned case is typically shared pages, such as shared library or page cache pages that were already resident on the other node. The application’s own data pages all reside on the local node.
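
For automated checks, the per-node split can be turned into fractions. The sketch below parses the simplified one-row layout shown above (PID, one column per node, then the total); real numastat output has more rows, so treat the parsing as an illustration only.

```python
def node_shares(numastat_row):
    """Return (pid, [share of memory on each NUMA node]) for one summary row.

    Expects whitespace-separated fields: PID, MB on each node, total MB.
    """
    fields = numastat_row.split()
    pid = int(fields[0])
    *per_node_mb, total_mb = (float(x) for x in fields[1:])
    return pid, [mb / total_mb for mb in per_node_mb]


aligned_pid, aligned = node_shares("28451  3891.2  12.8  3904.0")
_, misaligned = node_shares("28451  2194.6  1709.4  3904.0")

print(f"aligned:    {aligned[0]:.1%} local, {aligned[1]:.1%} remote")
print(f"misaligned: {misaligned[0]:.1%} local, {misaligned[1]:.1%} remote")
```

A check like this could flag any pinned pod whose local share falls below a chosen threshold.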

Cross-Socket Penalty Measurement

The impact of NUMA misalignment depends on the memory access pattern. Random access workloads suffer more because each access independently hits local or remote memory. Sequential access workloads partially hide the latency through hardware prefetching.

Article service memory access benchmark:
  Workload: rendering article with 50KB template, 8KB cached data,
            120KB JSON payload from search results

Aligned (cores 1-4, memory NUMA 0):
  Average memory load latency:    41.8ns
  Memory bandwidth utilization:   3.1 GB/s (of 50 GB/s local channel)
  P99 per-request latency:        19ms

Misaligned (cores 1-4, memory split NUMA 0+1):
  Average memory load latency:    58.3ns   (+39%)
  Memory bandwidth utilization:   4.8 GB/s (cross-socket traffic)
  P99 per-request latency:        29ms     (+53%)

Fully remote (cores 1-4, memory mostly NUMA 1):
  Average memory load latency:    74.1ns   (+77%)
  Memory bandwidth utilization:   6.2 GB/s (all over the cross-socket UPI link)
  P99 per-request latency:        38ms     (+100%)

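The 39% increase in average memory load latency coincides with a 53% increase in service P99 latency. The two do not move in proportion. CPU-bound phases (JSON parsing, template rendering) are unaffected by NUMA placement, which dilutes the effect for a typical request. The tail behaves differently: the slowest requests are the ones dominated by memory-bound phases (data structure traversal, cache lookups), and with pages split across nodes some requests likely touch far more remote pages than the average suggests, so P99 grows faster than mean memory latency.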

Production Monitoring

Detecting noisy neighbor interference and NUMA misalignment in production requires metrics that standard Kubernetes monitoring does not provide. The content platform exports custom metrics from node-level agents:

# Prometheus metrics exported by the rdt-exporter DaemonSet
- name: node_llc_occupancy_bytes
  help: "L3 cache occupancy per container"
  type: gauge
  labels: [pod, container, numa_node]

- name: node_memory_bandwidth_bytes_per_second
  help: "Memory bandwidth per container"
  type: gauge
  labels: [pod, container, numa_node, direction]

- name: node_ipc
  help: "Instructions per cycle per container"
  type: gauge
  labels: [pod, container]

- name: node_context_switches_per_second
  help: "Involuntary context switches per container"
  type: gauge
  labels: [pod, container]

- name: node_llc_loads_total
  help: "LLC loads per container"
  type: counter
  labels: [pod, container]

- name: node_llc_load_misses_total
  help: "LLC load misses per container"
  type: counter
  labels: [pod, container]

Alert rules for the content platform:

# Alert when a latency-sensitive pod's cache hit rate drops
- alert: CacheContention
  expr: |
    (1 - rate(node_llc_load_misses_total{pod=~"article-service.*"}[5m])
        / rate(node_llc_loads_total{pod=~"article-service.*"}[5m])) < 0.90
  for: 2m
  labels:
    severity: warning
  annotations:
    summary: "L3 cache hit rate below 90% for {{ $labels.pod }}"

# Alert when IPC drops indicating interference
- alert: IPCDegradation
  expr: |
    node_ipc{pod=~"article-service.*|search-api.*"} < 1.5
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "IPC below 1.5 for {{ $labels.pod }}, possible noisy neighbor"
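
The CacheContention rule combines two ideas: a hit rate derived from counter deltas, and a hold time before firing. The sketch below models both in plain Python. It is not how Prometheus is implemented, and the number of consecutive evaluations standing in for the "for" clause is a hypothetical parameter that depends on the evaluation interval.

```python
def llc_hit_rate(loads_delta, load_misses_delta):
    """Hit rate over a window, from increases in the two counters."""
    if loads_delta <= 0:
        raise ValueError("no LLC loads in the window")
    return 1.0 - load_misses_delta / loads_delta


def first_firing_index(hit_rates, threshold=0.90, consecutive=5):
    """Index of the evaluation at which the alert fires, or None.

    The alert fires once the hit rate has stayed below the threshold
    for `consecutive` evaluations in a row; any healthy sample resets it.
    """
    streak = 0
    for i, rate in enumerate(hit_rates):
        streak = streak + 1 if rate < threshold else 0
        if streak >= consecutive:
            return i
    return None


# Miss ratios of 5.8% and 28.2% match the perf output earlier in this section.
healthy = llc_hit_rate(1000, 58)
contended = llc_hit_rate(1000, 282)
print(f"healthy {healthy:.3f}, contended {contended:.3f}")

series = [healthy, contended, contended, healthy, contended, contended, contended]
print(first_firing_index(series, consecutive=3))
```

The hold time is what keeps a brief dip, such as a cold start after a deploy, from paging anyone.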

These alerts have fired twice in production. Both times, a misconfigured job had bypassed the node taint and landed on the latency-sensitive pool. The IPC alert caught it within 5 minutes. Without these metrics, the issue would have appeared as unexplained latency regression in application-level monitoring, triggering a much longer investigation.

Trade-offs and Costs

CPU pinning and NUMA alignment are not free. The costs are concrete and measurable:

Resource efficiency impact:

Without pinning:
  Node CPU allocatable:    14 cores
  All pods share all cores via CFS
  Utilization efficiency:  75-85% (cores shared, high packing)

With pinning (content platform config):
  Pinned cores:            6 (article-service: 4, search-api: 2)
  Shared pool:             8 cores (for batch, monitoring, system)
  Pinned core utilization: 55-65% (cannot be shared when idle)
  Shared pool utilization: 70-80%
  Overall node efficiency: 62-72%

Cost: ~15% lower utilization efficiency per node
      Requires 1-2 additional nodes per cluster to maintain capacity
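
As a rough cross-check, the overall figure can be approximated as a core-weighted average of the two pools. This sketch uses only the core counts and utilization ranges quoted above and ignores reserved capacity and any other overhead.

```python
def weighted_utilization(pools):
    """Core-weighted average utilization across (cores, utilization) pools."""
    total_cores = sum(cores for cores, _ in pools)
    return sum(cores * util for cores, util in pools) / total_cores


# (cores, utilization) for the pinned cores and the shared pool.
low = weighted_utilization([(6, 0.55), (8, 0.70)])
high = weighted_utilization([(6, 0.65), (8, 0.80)])

print(f"estimated overall efficiency: {low:.0%} to {high:.0%}")
```

The simple average comes out at roughly 64% to 74%, a point or two above the quoted 62-72%; the quoted figure may account for overheads this sketch leaves out.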

Operational complexity:

  - kubelet restart required to change CPU manager policy
  - Node drain required before kubelet restart
  - Integer CPU requirement forces over-provisioning for some workloads
  - single-numa-node policy can cause pod admission failures (TopologyAffinityError) on fragmented nodes
  - Monitoring requires RDT-capable hardware and custom exporters

For the content platform, the math works: two additional nodes at $400/month each versus the latency improvement that affects every page load for every user. The operational complexity is a one-time setup cost amortized across the cluster lifetime.

The configuration is not appropriate for all workloads. Batch processing clusters should use the default CPU manager policy and let CFS share cores freely. Development and staging environments do not need pinning. The static policy is for production latency-sensitive services where P99 stability matters more than packing efficiency.

The combination of QoS classes (Section 1) and CPU pinning (this section) creates a two-layer defense: QoS protects against eviction during resource pressure, and pinning protects against interference during normal operation. Together, they transform Kubernetes from a best-effort scheduler into a platform capable of delivering predictable latency at the tail.