What Docker Actually Is
Demystifies Docker by building a container from scratch using Linux namespaces, cgroups, and overlay filesystems, then maps each primitive to what docker run actually does, giving the reader a mental model for debugging container issues at the OS level.
Docker is not a virtual machine. If you take one thing from this section, take that. A virtual machine runs a complete operating system with its own kernel on emulated hardware. Docker runs your process on the host kernel with some clever isolation tricks. The difference isn’t semantic — it’s architectural, and confusing the two will lead you to wrong conclusions about performance, security, and debugging.
A container is a regular Linux process with three constraints applied to it:
- Namespaces — it can’t see certain things
- Cgroups — it can’t use more than certain amounts of resources
- An overlay filesystem — it sees a custom view of the filesystem
That’s it. No hypervisor. No guest kernel. No hardware emulation. Just a process with blinders on and a budget.
macOS/Windows note: Docker Desktop on macOS and Windows runs a Linux VM behind the scenes (using Apple’s Virtualization Framework or WSL2). Your container runs inside that Linux VM. You’re getting a VM whether you wanted one or not — but the container inside it still uses the primitives described here.
Namespaces: What Your Process Can See
A Linux namespace restricts what a process can see. There are several types, each isolating a different aspect of the system:
| Namespace | Isolates | Effect |
|---|---|---|
| PID | Process IDs | Container sees its own PID 1, can’t see host processes |
| NET | Network stack | Container gets its own IP, interfaces, routing table |
| MNT | Mount points | Container sees its own filesystem tree |
| UTS | Hostname | Container can have its own hostname |
| IPC | Inter-process communication | Separate shared memory, semaphores |
| USER | User/group IDs | UID 0 in container can map to unprivileged UID on host |
You can create namespaces manually with unshare:
# Create a new PID and mount namespace, run bash inside it
sudo unshare --pid --mount --fork bash
# Inside this new namespace:
echo $$
# 1 <-- this bash IS PID 1 in this namespace
# Mount a fresh /proc so ps works correctly in the new PID namespace
mount -t proc proc /proc
ps aux
# USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
# root 1 0.0 0.0 7236 4020 pts/0 S 12:00 0:00 bash
# root 2 0.0 0.0 10072 3312 pts/0 R+ 12:00 0:00 ps aux
# Only two processes visible. The host has hundreds.
# Exit the shell when done; the private mount namespace and its /proc mount go away with it
exit
From inside this namespace, the bash process believes it’s PID 1 — the init process. It can’t see any of the host’s processes. From the host’s perspective, this is still just a regular process with a regular PID (say, 45823). The namespace is a lens, not a wall.
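To see both sides of the lens at once, the kernel exposes an NSpid line in /proc/<pid>/status (Linux 4.1 or newer) that lists a process's PID in every PID namespace it belongs to, outermost first. The sketch below is a minimal helper for reading it; the name ns_pids is made up for illustration and is not a standard tool.

```shell
# ns_pids [STATUS_FILE]
# Prints the PID of a process in each PID namespace it belongs to,
# outermost first, by reading the NSpid line of a /proc/<pid>/status file.
# Defaults to the calling process when no file is given.
ns_pids() {
    awk '$1 == "NSpid:" { $1 = ""; sub(/^ +/, ""); print }' "${1:-/proc/self/status}"
}
```

Run on the host as ns_pids /proc/45823/status (using the example PID from above), this would be expected to print two numbers for the namespaced bash: its host PID followed by 1.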
You can inspect what namespaces a process belongs to:
# On the host, find the actual PID of our namespaced bash
# Then look at its namespace memberships
ls -la /proc/45823/ns/
# lrwxrwxrwx 1 root root 0 ... cgroup -> 'cgroup:[4026531835]'
# lrwxrwxrwx 1 root root 0 ... mnt -> 'mnt:[4026532589]' # different!
# lrwxrwxrwx 1 root root 0 ... pid -> 'pid:[4026532590]' # different!
# lrwxrwxrwx 1 root root 0 ... net -> 'net:[4026531840]' # same as host
Each namespace is identified by an inode number. Processes in the same namespace share the same inode. This is how you can tell whether two containers share a network namespace — compare their /proc/<pid>/ns/net links.
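That comparison is easy to script. Here is a minimal sketch, assuming a Linux /proc; same_ns is a hypothetical helper name, and reading the ns links of processes you don't own generally needs root.

```shell
# same_ns PID_A PID_B TYPE
# Exits 0 when both processes are in the same namespace of the given
# type (net, pid, mnt, uts, ipc, user, cgroup), 1 when they differ,
# and 2 when either link cannot be read.
same_ns() (
    a=$(readlink "/proc/$1/ns/$3") || exit 2
    b=$(readlink "/proc/$2/ns/$3") || exit 2
    [ "$a" = "$b" ]
)
```

For example, same_ns $$ 1 net && echo "shared" tells you whether your shell shares a network namespace with PID 1.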
You can enter an existing namespace with nsenter:
# Enter the namespaces of a running container
# (docker exec joins these namespaces too, along with the container's
# cgroup and security settings)
CONTAINER_PID=$(docker inspect --format '{{.State.Pid}}' my_container)
sudo nsenter --target "$CONTAINER_PID" --mount --pid --net bash
That command joins the mount, PID, and network namespaces of the container’s main process. You’re now seeing what the container sees. This is more flexible than docker exec because you choose which namespaces to enter: you could join the network namespace but keep the host’s mount and PID namespaces, for instance, so the host’s debugging tools stay available.
Cgroups: What Your Process Can Use
Namespaces control visibility. Cgroups (control groups) control resource consumption. A cgroup sets hard limits on how much CPU, memory, I/O bandwidth, and other resources a process (or group of processes) can use.
Cgroup configuration lives in a filesystem, typically mounted at /sys/fs/cgroup/. You can create and configure cgroups by creating directories and writing to files:
# Create a cgroup (cgroups v2)
sudo mkdir /sys/fs/cgroup/my_container
# Set a memory limit of 100MB
echo $((100 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/my_container/memory.max
# Set a CPU limit: 50% of one core (50000 out of 100000 microseconds)
echo "50000 100000" | sudo tee /sys/fs/cgroup/my_container/cpu.max
# Add the current shell to this cgroup
echo $$ | sudo tee /sys/fs/cgroup/my_container/cgroup.procs
# Now this shell and all its children are limited to
# 100MB RAM and 50% of one CPU core
If the processes in this cgroup together use more than 100MB, and the kernel cannot reclaim enough memory to get back under the limit, the kernel’s OOM (Out Of Memory) killer terminates a process in the cgroup. This is the mechanism behind the dreaded “OOMKilled” status in Kubernetes: your container exceeded its cgroup memory limit, and the kernel killed it. It wasn’t a crash. It was an execution.
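The kernel keeps a record of these executions. On cgroups v2, each cgroup directory has a memory.events file whose oom_kill counter goes up every time the OOM killer takes out a process in that cgroup. A minimal sketch for reading it follows; oom_kill_count is an illustrative name, not an existing command.

```shell
# oom_kill_count CGROUP_DIR
# Prints how many processes the OOM killer has terminated in this cgroup,
# taken from the oom_kill line of the cgroup v2 memory.events file.
oom_kill_count() {
    awk '$1 == "oom_kill" { print $2 }' "$1/memory.events"
}
```

For the cgroup created above, that would be oom_kill_count /sys/fs/cgroup/my_container. A non-zero result means the limit was hit hard enough for the kernel to kill something.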
You can see a container’s cgroup limits:
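# Find the cgroup of a Docker container (cgroups v2, cgroupfs driver)
CONTAINER_ID=$(docker inspect --format '{{.Id}}' my_container)
cat /sys/fs/cgroup/docker/$CONTAINER_ID/memory.max
cat /sys/fs/cgroup/docker/$CONTAINER_ID/memory.current
# With the systemd cgroup driver the directory is instead
# /sys/fs/cgroup/system.slice/docker-$CONTAINER_ID.scope/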
When someone says “my container is using 250MB,” this is where that number comes from: memory.current in the cgroup filesystem. (docker stats reports a somewhat lower figure because it subtracts file cache from the total.)
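Reading the two files side by side gets old quickly, so here is a small sketch that turns them into one line. It assumes a cgroups v2 directory; mem_usage is a hypothetical helper name. Note that memory.max holds the literal string max when no limit is set.

```shell
# mem_usage CGROUP_DIR
# Prints "current/limit bytes (NN%)" for a cgroup v2 directory, or
# "current bytes (no limit)" when memory.max contains the literal "max".
mem_usage() (
    cur=$(cat "$1/memory.current") || exit 1
    max=$(cat "$1/memory.max") || exit 1
    if [ "$max" = "max" ]; then
        echo "$cur bytes (no limit)"
    else
        # Integer division, so the percentage is rounded down
        echo "$cur/$max bytes ($((cur * 100 / max))%)"
    fi
)
```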
The Overlay Filesystem: Layered Reality
A Docker image isn’t a single filesystem snapshot. It’s a stack of read-only layers with a thin writable layer on top.
When you write a Dockerfile like:
# Layer 1: base Ubuntu filesystem
FROM ubuntu:22.04
# Layer 2: adds python3 binaries
RUN apt-get update && apt-get install -y python3
# Layer 3: adds your code
COPY app.py /app/
Each filesystem-changing instruction (RUN, COPY, ADD) creates a layer: a directory containing only the files that changed, stacked on top of the layers that come with the base image. The overlay filesystem (OverlayFS) merges these layers into a single coherent view:
┌─────────────────────────┐
│ Writable Layer │ ← Container writes go here
├─────────────────────────┤
│ Layer 3: COPY app.py │ ← Read-only
├─────────────────────────┤
│ Layer 2: RUN apt-get │ ← Read-only
├─────────────────────────┤
│ Layer 1: ubuntu:22.04 │ ← Read-only
└─────────────────────────┘
When the container reads /app/app.py, OverlayFS looks down through the layers until it finds the file. When the container writes a file, the write goes to the top writable layer. When the container modifies an existing file from a lower layer, OverlayFS copies it to the writable layer first (copy-on-write again — the same principle as fork()).
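You can reproduce this without Docker by mounting an overlay by hand. The sketch below only builds the option string that mount -t overlay expects; overlay_opts is an illustrative name, and the /tmp/ovl paths in the usage comment are placeholders. Lower layers are listed topmost first, which is the order OverlayFS searches them.

```shell
# overlay_opts UPPER WORK LOWER...
# Builds the -o option string for `mount -t overlay`.
# UPPER is the writable layer, WORK is an empty scratch directory on the
# same filesystem as UPPER, and the LOWER layers are given topmost first.
overlay_opts() (
    upper=$1
    work=$2
    shift 2
    lower=$1
    shift
    for layer in "$@"; do
        lower="$lower:$layer"
    done
    echo "lowerdir=$lower,upperdir=$upper,workdir=$work"
)

# Usage (needs root; directories are placeholders):
#   mkdir -p /tmp/ovl/l1 /tmp/ovl/l2 /tmp/ovl/upper /tmp/ovl/work /tmp/ovl/merged
#   sudo mount -t overlay overlay \
#     -o "$(overlay_opts /tmp/ovl/upper /tmp/ovl/work /tmp/ovl/l2 /tmp/ovl/l1)" \
#     /tmp/ovl/merged
```

After mounting, edit a file that exists only in a lower directory and look in the upper directory: the modified copy shows up there while the lower one stays untouched.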
You can see the actual layer directories on disk:
docker inspect my_container --format '{{.GraphDriver.Data.MergedDir}}'
# /var/lib/docker/overlay2/abc123.../merged
docker inspect my_container --format '{{json .GraphDriver.Data}}' | python3 -m json.tool
# {
# "LowerDir": "/var/lib/docker/overlay2/layer3/diff:
# /var/lib/docker/overlay2/layer2/diff:
# /var/lib/docker/overlay2/layer1/diff",
# "MergedDir": "/var/lib/docker/overlay2/abc123/merged",
# "UpperDir": "/var/lib/docker/overlay2/abc123/diff",
# "WorkDir": "/var/lib/docker/overlay2/abc123/work"
# }
LowerDir is the stack of read-only layers. UpperDir is the writable layer. MergedDir is the combined view the container sees.
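Since LowerDir is one long colon-separated string, it helps to split it before reading it. A minimal sketch, with list_layers as a made-up helper name:

```shell
# list_layers LOWERDIR
# Splits a colon-separated LowerDir value and prints one layer per line,
# numbered from the topmost read-only layer (1) downwards.
list_layers() {
    printf '%s\n' "$1" | tr ':' '\n' | awk '{ printf "%d %s\n", NR, $0 }'
}
```

Feed it the real value with list_layers "$(docker inspect my_container --format '{{.GraphDriver.Data.LowerDir}}')".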
This is why images can share layers: if ten images are all FROM ubuntu:22.04, that base layer exists only once on disk. It’s also why write-heavy workloads inside a running container can be slow: the first write to a file from a lower layer copies the whole file up to the writable layer, and OverlayFS adds overhead compared to writing directly to ext4 or xfs. Databases inside containers suffer from this, which is why production database containers use volume mounts that bypass OverlayFS entirely.
Building a Container Without Docker
Let’s put the primitives together. Here’s a minimal “container” using only shell commands:
# 1. Create a root filesystem (just use Alpine's minimal rootfs)
mkdir -p /tmp/mycontainer/rootfs
cd /tmp/mycontainer
curl -o alpine.tar.gz https://dl-cdn.alpinelinux.org/alpine/v3.19/releases/x86_64/alpine-minirootfs-3.19.0-x86_64.tar.gz
tar xzf alpine.tar.gz -C rootfs
# 2. Set up a cgroup with resource limits
sudo mkdir /sys/fs/cgroup/mycontainer
echo $((50 * 1024 * 1024)) | sudo tee /sys/fs/cgroup/mycontainer/memory.max
# 3. Launch a process with new namespaces, in the cgroup,
# with the new root filesystem
echo $$ | sudo tee /sys/fs/cgroup/mycontainer/cgroup.procs
sudo unshare --pid --mount --uts --ipc --fork chroot rootfs /bin/sh -c '
mount -t proc proc /proc
hostname mycontainer
echo "Hello from inside a container!"
echo "PID: $$"
echo "Hostname: $(hostname)"
ps aux
exec /bin/sh
'
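This gives you an isolated process with its own PID namespace (it sees itself as PID 1), its own hostname, its own filesystem root, and a 50MB memory limit. It still shares the host’s network stack, because no NET namespace was requested, and it uses a plain chroot directory where Docker would mount an overlay. No Docker daemon. No image registry. Just the same kernel primitives composed by hand.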
What docker run Actually Does
When you run docker run -it --memory=512m --cpus=1.5 ubuntu bash, Docker:
- Pulls the image layers (if not cached) and assembles them into an overlay filesystem
- Creates a new set of namespaces: PID, NET, MNT, UTS, IPC (optionally USER)
- Creates a cgroup and sets memory.max to 512MB, cpu.max to 150000 100000
- Sets up a virtual ethernet pair (veth) connecting the container’s network namespace to the docker0 bridge
- Sets the overlay MergedDir as the root filesystem via pivot_root (similar to chroot but more secure)
- Drops Linux capabilities (the container’s root can’t load kernel modules, for example)
- Applies seccomp filters (blocks ~44 dangerous syscalls like reboot and mount)
- Executes bash as PID 1 inside the container
Every one of these steps uses the OS primitives discussed above. Docker is an orchestrator of existing kernel features, not a new virtualization technology.
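The cgroup values in that list are plain arithmetic on the command-line flags. The sketch below shows the conversion, assuming the default CPU period of 100000 microseconds and size suffixes in powers of 1024; cpus_to_cpu_max and mem_to_bytes are illustrative names, not Docker functions.

```shell
# cpus_to_cpu_max CPUS
# Converts a --cpus value (e.g. 1.5) into a cgroup v2 cpu.max line:
# "<quota> <period>", both in microseconds, with a 100000 period.
cpus_to_cpu_max() {
    awk -v cpus="$1" 'BEGIN { printf "%.0f 100000\n", cpus * 100000 }'
}

# mem_to_bytes SIZE
# Converts a size with an optional k/m/g suffix into bytes.
mem_to_bytes() {
    awk -v s="$1" 'BEGIN {
        n = s + 0                            # leading numeric part
        u = tolower(substr(s, length(s)))    # last character as the unit
        if (u == "k") n *= 1024
        else if (u == "m") n *= 1024 * 1024
        else if (u == "g") n *= 1024 * 1024 * 1024
        printf "%.0f\n", n
    }'
}
```

So --cpus=1.5 means the container’s processes may run for 150000 microseconds out of every 100000-microsecond window, which only adds up because the quota can be spent across more than one core.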
Why This Matters for Debugging
When a container “can’t reach the network,” the problem is in the NET namespace — check docker inspect for network settings, use nsenter --net to enter its network namespace and run ip addr / ip route. When a container gets OOMKilled, the problem is in its cgroup — check memory.max vs memory.current. When a container’s filesystem writes are slow, the problem might be OverlayFS — consider a volume mount.
The abstraction Docker provides is valuable. But debugging requires you to see through the abstraction to the primitives underneath. You can’t fix a namespace issue by restarting the container. You can’t fix a cgroup limit by upgrading your base image. You have to know what layer the problem lives on.
Docker didn’t invent containers. It made them convenient. Convenience is good — until it breaks, and you need to understand what’s actually happening.