Skip to main content
the invisible-layer how abstraction is making software engineers dumber

The Machine: Memory, the Stack, and the Heap

12 min read Chapter 14 of 56
Summary

Traces what actually happens in memory when you...

Traces what actually happens in memory when you write x = 42 in Python versus C, exposing the stack/heap distinction, process memory layout, cache locality effects, and the 100x performance gap that abstraction hides from you.

The Machine: Memory, the Stack, and the Heap

One Line, Two Realities

Write this in Python:

x = 42

Now write this in C:

int x = 42;

They look equivalent. They are not. The distance between these two lines is the distance between knowing what your computer does and hoping that it does something reasonable.

In C, int x = 42 does exactly what it says. The compiler emits an instruction that places the 32-bit integer value 42 directly into a register or onto the current stack frame. Four bytes. No indirection. No metadata. The value sits on the stack, right next to the function’s return address and other local variables. When the function returns, that memory is reclaimed instantly — the stack pointer moves, and the space is gone. Cost: effectively zero.

In Python, x = 42 triggers a cascade you never see. CPython allocates a PyObject on the heap — a structure that looks like this in C:

typedef struct {
    Py_ssize_t ob_refcnt;    // 8 bytes: reference count
    PyTypeObject *ob_type;    // 8 bytes: pointer to the type object
} PyObject;

typedef struct {
    PyObject ob_base;         // 16 bytes from above
    long ob_ival;             // 8 bytes: the actual value (on 64-bit systems)
} PyLongObject;              // Total: at minimum 28 bytes, but CPython uses
                             // a variable-length representation

That 42 is not four bytes. It’s a minimum of 28 bytes of heap-allocated memory, wrapped in a type-tagged object with a reference count. The name x is then inserted into the local namespace dictionary — another hash table lookup, another heap allocation, another layer of indirection. The assignment x = 42 involves a heap allocation, a dictionary insertion, a reference count increment, and a type pointer assignment. The C version involves moving a literal into a register.

CPython does optimize small integers. Values from -5 to 256 are pre-allocated and cached, so x = 42 reuses an existing PyLongObject rather than creating a new one. But x = 257 triggers a fresh heap allocation every time. And none of this changes the fundamental cost: every access to x chases a pointer from the namespace dictionary to the PyObject on the heap.

You can see this directly:

import sys

x = 42
print(sys.getsizeof(x))   # 28 bytes on CPython 3.11+
print(id(x))               # Memory address of the PyObject

y = 257
z = 257
print(id(y) == id(z))      # False — two separate heap allocations

The C version:

#include <stdio.h>

int main() {
    int x = 42;
    printf("Size: %zu bytes\n", sizeof(x));  // 4 bytes. Always.
    printf("Address: %p\n", (void*)&x);       // Stack address
    return 0;
}

Four bytes vs twenty-eight. Stack vs heap. Direct access vs pointer chasing. This is where the 100x performance gap begins.

The Stack and the Heap

Every running process divides its virtual memory into segments with distinct purposes. Here’s what the layout actually looks like:

Linux Process Virtual Memory Layout showing text, data, BSS, heap, stack, and kernel space segments with virtual address ranges

Linux process virtual memory layout: the text segment holds read-only executable machine code shared between processes; the data segment (.data/.rodata) stores initialized global and static variables; BSS holds uninitialized globals zero-filled at load time; the heap grows upward from the BSS boundary via malloc/new (managed by the kernel’s brk syscall and mmap for large allocations); an unmapped region separates heap from stack and is used for shared libraries loaded via mmap; the stack grows downward from high addresses, storing local variables, return addresses, and function parameters in LIFO order; kernel space at the top is inaccessible to user processes. Every segfault, OOM kill, and “stack overflow” error is a violation of one of these boundaries.

The stack is a contiguous block of memory managed by the CPU itself. When you call a function, the CPU pushes a stack frame — space for return address, arguments, and local variables. When the function returns, the stack pointer moves back. Allocation is one instruction: subtract from the stack pointer. Deallocation is one instruction: add to the stack pointer. No fragmentation. No garbage collector. No overhead.

void example() {
    int a = 10;       // Stack: 4 bytes, instant
    double b = 3.14;  // Stack: 8 bytes, instant
    char buf[256];    // Stack: 256 bytes, instant
    // All reclaimed when example() returns
}

The heap is a pool of memory managed by a memory allocator (malloc in C, new in C++, the runtime in managed languages). Heap allocation is slow: the allocator must search for a free block of sufficient size, update its internal bookkeeping, and handle fragmentation. Deallocation requires either explicit calls (free) or a garbage collector.

void example_heap() {
    int *a = malloc(sizeof(int));  // Heap: allocator searches free list
    *a = 10;                       // Write through pointer indirection
    // ... use a ...
    free(a);                       // Explicit deallocation, or you leak
}

When does something go on the stack vs the heap?

Stack: Local variables of known, fixed size. Function arguments. Return addresses. In C, any variable declared inside a function with a type the compiler knows the size of. The stack is fast but limited — typically 1-8 MB per thread — and the data dies when the function returns.

Heap: Anything that needs to outlive the function that created it. Anything of dynamic or unknown size. In Python, everything — because every value is a PyObject whose lifetime is managed by reference counting, not by scope.

Here’s the critical distinction: stack allocation is O(1) — move a pointer. Heap allocation is O(n) in the worst case — search a free list, possibly request memory from the OS via mmap or sbrk. In practice, modern allocators like jemalloc or tcmalloc amortize this, but heap allocation is still 10-100x slower than stack allocation.

Python’s Memory Model: Everything Is an Object

Python doesn’t use the stack for your data. Every value — integers, floats, strings, lists, None, True — is a PyObject allocated on the heap. The stack holds only bytecode instruction pointers and frame metadata.

def add(a, b):
    result = a + b
    return result

When CPython executes this, the stack frame for add contains pointers to PyObjects, not the values themselves. The actual integers a and b point to are somewhere on the heap. The addition a + b involves:

  1. Dereference the pointer to a’s PyObject
  2. Check its type tag (ob_type) to determine it’s an integer
  3. Dereference the pointer to b’s PyObject
  4. Check its type tag
  5. Call the __add__ method for the integer type
  6. Allocate a new PyObject on the heap for the result
  7. Set the result’s reference count to 1
  8. Return a pointer to the new object

Compare this to what a C compiler produces for int result = a + b:

add     eax, ebx    ; One instruction. Done.

One instruction vs eight steps involving multiple pointer dereferences, type checks, and a heap allocation. This is not a minor difference. This is a structural consequence of Python’s object model.

And then there’s the GIL — the Global Interpreter Lock. Because reference counts must be updated atomically, and because CPython chose not to use fine-grained locking for every PyObject, a single mutex protects the entire interpreter. One thread runs Python bytecode at a time. Your 16-core machine runs Python on one core.

The Benchmark: 10 Million Integers

Let’s stop theorizing and measure.

Python:

import time

def sum_python():
    total = 0
    for i in range(10_000_000):
        total += i
    return total

start = time.perf_counter()
result = sum_python()
elapsed = time.perf_counter() - start
print(f"Python loop: {elapsed:.3f}s, result: {result}")

# Using built-in sum (C-implemented)
start = time.perf_counter()
result = sum(range(10_000_000))
elapsed = time.perf_counter() - start
print(f"Python sum(): {elapsed:.3f}s, result: {result}")

C:

#include <stdio.h>
#include <time.h>

int main() {
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);

    long total = 0;
    for (int i = 0; i < 10000000; i++) {
        total += i;
    }

    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("C loop: %.3fs, result: %ld\n", elapsed, total);
    return 0;
}

Typical results on a modern x86-64 machine (compiled with gcc -O2):

VersionTimeRelative
C loop0.005s1x
Python sum(range())0.12s24x
Python explicit loop0.52s104x

The Python explicit loop is 104x slower than C. Even Python’s optimized sum() — implemented in C inside CPython — is 24x slower because it still has to unbox each PyObject to extract the integer value.

This isn’t because Python is “interpreted.” It’s because every integer is a 28-byte heap object accessed through a pointer. Every addition allocates a new object. Every iteration increments and decrements reference counts. The CPU spends its time chasing pointers and managing metadata instead of doing arithmetic.

Cache Locality: The Hidden Multiplier

There’s another cost lurking beneath the heap allocation story, and it makes the gap even worse.

Modern CPUs don’t read memory one byte at a time. They read it in 64-byte cache lines. When you access one byte, the CPU fetches the surrounding 64 bytes into L1 cache. If your next access is within those 64 bytes, it hits the cache — about 1 nanosecond. If it’s somewhere else entirely, you wait for a cache miss — 100 nanoseconds or more. That’s a 100x penalty.

An array of integers stored contiguously in memory is cache-friendly. Access element 0, and elements 1 through 15 are already in the cache line. Iterate the array and almost every access is a cache hit.

A linked list is cache-hostile. Each node is allocated separately on the heap, scattered across memory. Every node->next dereference is a potential cache miss.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

// Array: contiguous memory
long sum_array(int *arr, int n) {
    long total = 0;
    for (int i = 0; i < n; i++) {
        total += arr[i];
    }
    return total;
}

// Linked list: scattered memory
struct Node {
    int value;
    struct Node *next;
};

long sum_linked_list(struct Node *head) {
    long total = 0;
    while (head) {
        total += head->value;
        head = head->next;
    }
    return total;
}

int main() {
    // Build array
    int *arr = malloc(N * sizeof(int));
    for (int i = 0; i < N; i++) arr[i] = i;

    // Build linked list (allocated in random order to simulate real heap)
    struct Node **nodes = malloc(N * sizeof(struct Node*));
    for (int i = 0; i < N; i++) {
        nodes[i] = malloc(sizeof(struct Node));
        nodes[i]->value = i;
    }
    // Shuffle to destroy allocation-order locality
    for (int i = N - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        struct Node *tmp = nodes[i];
        nodes[i] = nodes[j];
        nodes[j] = tmp;
    }
    for (int i = 0; i < N - 1; i++) nodes[i]->next = nodes[i + 1];
    nodes[N - 1]->next = NULL;

    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    long r1 = sum_array(arr, N);
    clock_gettime(CLOCK_MONOTONIC, &end);
    double t_array = (end.tv_sec - start.tv_sec) +
                     (end.tv_nsec - start.tv_nsec) / 1e9;

    clock_gettime(CLOCK_MONOTONIC, &start);
    long r2 = sum_linked_list(nodes[0]);
    clock_gettime(CLOCK_MONOTONIC, &end);
    double t_list = (end.tv_sec - start.tv_sec) +
                    (end.tv_nsec - start.tv_nsec) / 1e9;

    printf("Array:       %.3fs (result: %ld)\n", t_array, r1);
    printf("Linked list: %.3fs (result: %ld)\n", t_list, r2);
    printf("Ratio:       %.1fx\n", t_list / t_array);

    free(arr);
    for (int i = 0; i < N; i++) free(nodes[i]);
    free(nodes);
    return 0;
}

Typical results:

StructureTimeCache misses per iteration
Array0.008s~0 (sequential prefetch)
Linked list (shuffled)0.18s~1 (pointer chasing)
Ratio~22x

Same number of elements. Same summation. A 22x difference, purely from memory layout. The CPU is waiting for data to arrive from RAM on almost every linked list access, while the array version streams through cache at full speed.

This is why Python lists (which are arrays of PyObject* pointers) are still slower than C arrays even though they’re contiguous: each element is a pointer, and following that pointer to the actual PyObject is a cache miss. You get contiguous pointers but scattered objects.

What You Don’t Know Is Costing You

The Python community’s standard response to these benchmarks is: “Use NumPy.” And yes, NumPy stores data in contiguous C arrays, bypassing the PyObject overhead. That’s the right answer for numerical work. But it’s also an admission: Python’s default memory model is too expensive for serious computation.

You cannot reason about the performance of your code if you don’t know where your data lives. Is it on the stack or the heap? Is it contiguous or scattered? Is each access going to L1 cache or to main memory? These aren’t academic questions. They’re the difference between a service that responds in 2 milliseconds and one that responds in 200 milliseconds.

The abstraction tells you: a variable holds a value. The machine tells you: a name in a hash table points to a reference-counted, type-tagged object on the heap, accessed through multiple layers of indirection, with memory that is eventually reclaimed by a garbage collector that pauses your program at unpredictable intervals.

You’re building on the second reality whether or not you acknowledge the first.