the invisible-layer how abstraction is making software engineers dumber

DNS: The Internet's Achilles' Heel

Chapter 12 of 56
Summary

A deep dive into DNS resolution from getaddrinfo() through the full resolver chain, exposing JVM caching traps, TTL-related outages, debugging with dig and nslookup, and why DNS is the silent killer behind most 'mysterious' production incidents.

Every network request begins with a question: what is the IP address of this hostname? And the system responsible for answering that question — DNS — is a distributed, hierarchical, eventually-consistent database that was designed in 1983 and has been held together with caching, convention, and quiet desperation ever since.

DNS is one of the most common hidden causes of production incidents. The reason is not that DNS is unreliable (it’s remarkably robust for its age) but that engineers don’t understand its caching semantics, don’t monitor its behavior, and routinely build systems that assume name resolution is instantaneous, deterministic, and free.

It is none of those things.

The Resolution Chain: What Actually Happens

When your Python code calls requests.get("https://api.example.com/users"), the requests library eventually needs an IP address. Deep inside urllib3, a call is made to socket.getaddrinfo(), which is a thin wrapper around the POSIX C function getaddrinfo().

You can see this directly:

import socket

# This is what happens underneath requests.get()
results = socket.getaddrinfo("api.example.com", 443, socket.AF_UNSPEC, socket.SOCK_STREAM)
for family, socktype, proto, canonname, sockaddr in results:
    print(f"Family: {family.name}, Address: {sockaddr[0]}, Port: {sockaddr[1]}")

# Example output (addresses are illustrative):
# Family: AF_INET, Address: 93.184.216.34, Port: 443
# Family: AF_INET6, Address: 2606:2800:220:1:248:1893:25c8:1946, Port: 443

That innocent function call triggers a multi-step decision process inside the operating system. The order of operations is governed by a file most engineers have never opened: /etc/nsswitch.conf.

The nsswitch Decision Chain

Open /etc/nsswitch.conf on any glibc-based Linux system (musl-based distributions such as Alpine do not use this file) and look for the hosts line:

hosts:          files dns myhostname

This line is an instruction manual for the resolver. It reads left to right:

  1. files — Check /etc/hosts first. This flat file maps hostnames to IP addresses directly. If api.example.com has an entry here, resolution stops. No DNS query is sent. This is how localhost resolves to 127.0.0.1 — it’s hardcoded in /etc/hosts, not resolved through DNS.

  2. dns — If /etc/hosts has no match, the resolver reads /etc/resolv.conf to find the configured nameserver and sends a DNS query.

  3. myhostname — As a final fallback, the machine’s own hostname resolves to its own address.

The order matters. If you put dns before files, DNS is consulted first, so an /etc/hosts override only takes effect for names that DNS cannot resolve. Container orchestrators like Kubernetes write each pod’s /etc/resolv.conf to point at the cluster-internal DNS server, which is how my-service.default.svc.cluster.local resolves inside a pod. If you’ve ever wondered why your container can reach services by name, this is the mechanism.
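The left-to-right lookup order can be modeled in a few lines of Python. This is a simplified sketch, not glibc's actual implementation: the helper names (hosts_sources, parse_hosts_file, resolve) are hypothetical, the dns source is stubbed with a dictionary, and sources such as myhostname are not modeled.

```python
import re


def hosts_sources(nsswitch_text):
    """Return the ordered lookup sources from the 'hosts:' line."""
    for line in nsswitch_text.splitlines():
        line = line.split("#", 1)[0].strip()          # drop comments
        if line.startswith("hosts:"):
            spec = re.sub(r"\[[^\]]*\]", " ", line[len("hosts:"):])  # skip [NOTFOUND=return] actions
            return spec.split()
    return []


def parse_hosts_file(hosts_text):
    """Map every name and alias in /etc/hosts-style text to its address."""
    table = {}
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2:
            address, names = fields[0], fields[1:]
            for name in names:
                table.setdefault(name.lower(), address)  # first entry wins
    return table


def resolve(name, sources, hosts_table, dns_lookup):
    """Try each source in order; return (address, source) or (None, None)."""
    for source in sources:
        if source == "files":
            address = hosts_table.get(name.lower())
        elif source == "dns":
            address = dns_lookup(name)
        else:
            address = None  # other sources are not modeled in this sketch
        if address is not None:
            return address, source
    return None, None


# Illustrative data: a local override plus a stubbed DNS answer.
NSSWITCH = "hosts:          files dns myhostname\n"
HOSTS = "127.0.0.1 localhost\n10.0.0.5 api.example.com  # local override\n"
fake_dns = {"api.example.com": "93.184.216.34"}.get

table = parse_hosts_file(HOSTS)
print(resolve("api.example.com", hosts_sources(NSSWITCH), table, fake_dns))
# ('10.0.0.5', 'files')
print(resolve("api.example.com", ["dns", "files"], table, fake_dns))
# ('93.184.216.34', 'dns')
```

With files first, the /etc/hosts entry wins. With the order reversed, the same entry is never consulted for a name that DNS can answer.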

The Recursive Resolution

When the query hits the DNS recursive resolver (the server listed in /etc/resolv.conf), a layered lookup begins. Say you’re resolving api.example.com:

  1. The resolver checks its cache. If it has a fresh answer (within TTL), it returns immediately.
  2. If not cached, it asks a root nameserver (there are 13 named root servers, a through m, run by 12 organizations and replicated worldwide via anycast): “Who handles .com?”
  3. The root server responds with the address of a .com TLD (Top-Level Domain) nameserver.
  4. The resolver asks the TLD server: “Who handles example.com?”
  5. The TLD server responds with the authoritative nameserver for example.com (e.g., ns1.example.com at 198.51.100.1).
  6. The resolver asks the authoritative server: “What is the A record for api.example.com?”
  7. The authoritative server responds: 93.184.216.34, TTL 300 seconds.

The resolver caches the result for 300 seconds and returns it to your application.
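The referral walk in steps 2 through 7 can be sketched against in-memory data. Everything here is a toy model under stated assumptions: the SERVERS table, its server labels, and the resolve_iteratively helper are hypothetical, and no real DNS packets are sent.

```python
# Toy zone data; every name and address here is illustrative.
SERVERS = {
    "root": {"referrals": {"com.": "tld-com"}},
    "tld-com": {"referrals": {"example.com.": "ns1.example.com"}},
    "ns1.example.com": {"records": {"api.example.com.": ("93.184.216.34", 300)}},
}


def resolve_iteratively(name, servers=SERVERS, start="root"):
    """Follow referrals from the root until a server holds the record.

    Returns (address, ttl, path), where path lists the servers queried.
    """
    if not name.endswith("."):
        name += "."
    server, path = start, []
    for _ in range(10):  # bound the referral chain
        path.append(server)
        data = servers[server]
        if name in data.get("records", {}):
            address, ttl = data["records"][name]
            return address, ttl, path
        # Pick the referral whose zone is the longest suffix of the name.
        zones = [z for z in data.get("referrals", {})
                 if name == z or name.endswith("." + z)]
        if not zones:
            raise LookupError(f"{server} has no answer or referral for {name}")
        server = data["referrals"][max(zones, key=len)]
    raise LookupError("too many referrals")


print(resolve_iteratively("api.example.com"))
# ('93.184.216.34', 300, ['root', 'tld-com', 'ns1.example.com'])
```

The returned path mirrors the root, TLD, and authoritative hops described in the list above.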

You can trace this entire chain manually:

# Ask a root server directly, disabling recursive resolution
$ dig @a.root-servers.net api.example.com +norecurse

# You'll get a referral to .com TLD servers
;; AUTHORITY SECTION:
com.                172800  IN  NS  a.gtld-servers.net.

# Now ask the .com TLD server
$ dig @a.gtld-servers.net api.example.com +norecurse

# You'll get a referral to example.com's nameservers
;; AUTHORITY SECTION:
example.com.        172800  IN  NS  ns1.example.com.

# Now ask the authoritative nameserver
$ dig @ns1.example.com api.example.com

# You'll get the actual answer
;; ANSWER SECTION:
api.example.com.    300     IN  A   93.184.216.34

That 300 is the TTL — five minutes. After five minutes, your resolver will re-query the authoritative server. During those five minutes, if example.com’s operator changes the IP address, your resolver won’t know. Your traffic will continue flowing to the old address.
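The staleness window is easy to reproduce with a small TTL cache and a fake clock. This is a minimal sketch, assuming a hypothetical TtlCache class and a lookup callable that returns (address, ttl_seconds); it is not how any particular resolver is implemented.

```python
import time


class TtlCache:
    """Cache answers until their TTL expires, like a caching resolver."""

    def __init__(self, lookup, clock=time.monotonic):
        self._lookup = lookup    # callable: name -> (address, ttl_seconds)
        self._clock = clock      # injectable so the demo can control time
        self._entries = {}       # name -> (address, expires_at)

    def resolve(self, name):
        now = self._clock()
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]      # still fresh: the upstream is not asked
        address, ttl = self._lookup(name)
        self._entries[name] = (address, now + ttl)
        return address


# Fake clock and a mutable "authoritative" record (illustrative addresses).
now = [0.0]
record = {"api.example.com": ("93.184.216.34", 300)}
cache = TtlCache(lambda name: record[name], clock=lambda: now[0])

first = cache.resolve("api.example.com")
record["api.example.com"] = ("203.0.113.9", 300)   # operator changes the IP
now[0] = 299.0
stale = cache.resolve("api.example.com")           # TTL not yet expired
now[0] = 300.0
fresh = cache.resolve("api.example.com")           # TTL expired, re-queried
print(first, stale, fresh)
# 93.184.216.34 93.184.216.34 203.0.113.9
```

At second 299 the cache still returns the old address even though the record changed long before.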

The JVM DNS Caching Trap

Java’s InetAddress.getByName() doesn’t just call getaddrinfo() and trust the OS. The JVM maintains its own DNS cache, and the default behavior is pathological.

When a SecurityManager is installed, the JVM caches successful DNS lookups forever: the security property networkaddress.cache.ttl then defaults to -1, meaning infinite caching. This has historically affected enterprise deployments and application servers configured to run with a security manager (the SecurityManager itself has been deprecated for removal since Java 17). The TTL from the DNS response is ignored entirely.

// With a SecurityManager installed, this result is cached FOREVER in the JVM
InetAddress addr = InetAddress.getByName("api.example.com");
// Even if the DNS record changes, this JVM will keep hitting the old IP
// until the process restarts

Without a SecurityManager, the default is implementation-specific; OpenJDK caches successful lookups for 30 seconds. That is reasonable, but still independent of the actual DNS TTL. You can override it:

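// In code (must run before the JVM performs its first lookup)
java.security.Security.setProperty("networkaddress.cache.ttl", "60");

// Or via the legacy JVM system property (consulted only when
// networkaddress.cache.ttl is not set as a security property)
-Dsun.net.inetaddr.ttl=60

// Or in $JAVA_HOME/conf/security/java.security
// (JDK 9 and later; $JAVA_HOME/jre/lib/security/java.security on JDK 8)
networkaddress.cache.ttl=60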

AWS explicitly recommends a JVM TTL of no more than 60 seconds for applications using their services, because the IP addresses behind AWS hostnames change frequently. Their own SDK documentation includes this warning. The fact that they had to write it tells you how often this bites people.

Production Failure: The Ghost Traffic Incident

Here’s a real pattern that plays out in production regularly.

A team runs a backend service behind a load balancer at api.internal.company.com. The DNS record points to the load balancer’s IP: 10.0.1.50, TTL 300 seconds. During a migration, the team provisions a new load balancer at 10.0.2.75 and updates the DNS record to point to the new IP.

The old load balancer is decommissioned two minutes after the DNS change, well inside the 300-second TTL window.

What happens:

  • Clients that resolved DNS before the change have 10.0.1.50 cached.
  • For up to 300 seconds (the TTL), those clients continue sending traffic to 10.0.1.50 — the decommissioned host.
  • If the old host is still up but nothing is listening on the port, connections fail immediately with ECONNREFUSED. Noisy, but obvious. If the host is powered off or its IP is no longer routed, connection attempts instead hang until the client’s connect timeout fires.
  • If the old host’s IP has been reassigned to a different machine (common in cloud environments, where addresses are recycled quickly), traffic arrives at a machine that has no idea what to do with it. Requests hang, time out after 30 seconds, or worse, get processed by the wrong service entirely.
  • Java services that cache lookups forever (networkaddress.cache.ttl=-1) never recover without a restart.

The fix: always lower the TTL before migrations. Set it to 30 or 60 seconds several hours before the change, and wait for the old TTL to expire everywhere. Then change the record. Then wait again for the new, shorter TTL to expire across all caches. Only then decommission the old target.

# Step 1: Lower the TTL (do this 24 hours before migration)
# In your DNS provider, change the TTL for api.internal.company.com from 300 to 60

# Step 2: Verify the TTL has propagated
$ dig api.internal.company.com | grep -A1 "ANSWER SECTION"
api.internal.company.com.  60  IN  A  10.0.1.50

# Step 3: Change the IP record
# Update A record from 10.0.1.50 to 10.0.2.75

# Step 4: Wait for old caches to drain (at least 60 seconds)
# Step 5: Verify traffic is hitting the new host
# Step 6: After migration is confirmed stable, raise the TTL back

This procedure is documented nowhere in most migration runbooks. Engineers learn it after the incident.
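The waiting periods in that procedure reduce to simple arithmetic, sketched below. The function name migration_schedule and its parameters are hypothetical, and the result assumes every cache honors TTLs; clients that cache forever, such as a JVM with networkaddress.cache.ttl=-1, fall outside this calculation.

```python
from datetime import datetime, timedelta


def migration_schedule(ttl_lowered_at, old_ttl, new_ttl, margin_seconds=0):
    """Earliest safe times for each step, assuming caches honor TTLs.

    ttl_lowered_at: when the record's TTL was reduced from old_ttl to new_ttl.
    """
    # Caches may hold the record with the OLD TTL until this moment.
    change_record_at = ttl_lowered_at + timedelta(seconds=old_ttl + margin_seconds)
    # After the IP change, caches may hold the old IP for one NEW TTL.
    decommission_at = change_record_at + timedelta(seconds=new_ttl + margin_seconds)
    return {"change_record_at": change_record_at, "decommission_at": decommission_at}


plan = migration_schedule(datetime(2024, 1, 1, 9, 0, 0), old_ttl=300, new_ttl=60)
print(plan["change_record_at"].isoformat())   # 2024-01-01T09:05:00
print(plan["decommission_at"].isoformat())    # 2024-01-01T09:06:00
```

These are lower bounds. In practice a generous margin and a check that traffic to the old target has actually stopped are both worth adding before decommissioning.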

Debugging DNS: Your Toolkit

Three tools. That’s all you need.

dig — The standard DNS query tool. Shows the full response including TTL, authority, and additional sections:

# Basic query
$ dig api.example.com

# Query a specific nameserver
$ dig @8.8.8.8 api.example.com

# Query for a specific record type
$ dig api.example.com AAAA    # IPv6
$ dig example.com MX          # Mail servers
$ dig example.com NS          # Nameservers
$ dig example.com TXT         # TXT records (SPF, DKIM, verification)

# Trace the full resolution chain
$ dig +trace api.example.com

# Short output, just the answer
$ dig +short api.example.com
93.184.216.34

nslookup — Simpler, available everywhere, useful for quick checks:

$ nslookup api.example.com
Server:    8.8.8.8
Address:   8.8.8.8#53

Non-authoritative answer:
Name:    api.example.com
Address: 93.184.216.34

The “Non-authoritative answer” label means the response came from a resolver that is not authoritative for the zone, typically out of its cache, rather than from the authoritative nameserver directly.

resolvectl (systemd-resolved) — On modern Linux systems using systemd-resolved, this shows the current resolver configuration and cache statistics:

$ resolvectl status
Global
       Protocols: +LLMNR +mDNS -DNSOverTLS DNSSEC=no/unsupported
resolv.conf mode: stub

Link 2 (eth0)
    Current Scopes: DNS LLMNR/IPv4
         Protocols: +DefaultRoute +LLMNR -mDNS -DNSOverTLS DNSSEC=no/unsupported
Current DNS Server: 8.8.8.8
       DNS Servers: 8.8.8.8 8.8.4.4

# Show cache statistics (hits, misses, current cache size)
$ resolvectl statistics

# Flush the local DNS cache
$ resolvectl flush-caches

DNS Over HTTPS: What It Changes, What It Doesn’t

Traditional DNS queries travel over port 53 (UDP, with TCP as a fallback for large responses), unencrypted. Anyone on the network path (your ISP, a coffee shop’s Wi-Fi operator, a compromised router) can see every hostname you resolve. DNS over HTTPS (DoH) wraps DNS queries inside HTTPS connections to resolvers like https://dns.google/dns-query or https://cloudflare-dns.com/dns-query.

What DoH changes: privacy. Your DNS queries are encrypted. Your ISP can’t see which hostnames you’re resolving (though they can still see the IPs you connect to afterward, and often infer the hostname from SNI in the TLS handshake unless you’re also using Encrypted Client Hello).

What DoH doesn’t change: everything else about DNS resolution semantics. TTLs still apply. Caching still works the same way. Stale records still cause outages. The JVM still applies its own cache policy, regardless of the record’s TTL. The resolution chain is identical; only the transport between your stub resolver and the recursive resolver is encrypted.

DoH also introduces a dependency that plain DNS doesn’t have: your DNS resolution now requires a working HTTPS stack, including TLS certificate verification. If the DoH resolver’s certificate expires, DNS stops working entirely. This has happened. It’s a brutal failure mode because the error message — if there even is one — says something about TLS, not DNS.
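To see that only the transport changes, it helps to build a DoH request by hand. The sketch below constructs an ordinary DNS query message in wire format and encodes it for the GET form defined in RFC 8484 (base64url without padding, in a dns query parameter). The helper names build_query and doh_get_url are hypothetical, and nothing is sent over the network.

```python
import base64
import struct


def build_query(name, qtype=1):
    """Build a minimal DNS query message in standard wire format."""
    # Header: ID 0, flags 0x0100 (recursion desired), one question.
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label
        for label in name.rstrip(".").encode("ascii").split(b".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # QTYPE, QCLASS=IN


def doh_get_url(name, endpoint="https://cloudflare-dns.com/dns-query"):
    """Return the DoH GET URL carrying the query as unpadded base64url."""
    message = build_query(name)
    encoded = base64.urlsafe_b64encode(message).rstrip(b"=").decode("ascii")
    return f"{endpoint}?dns={encoded}"


print(doh_get_url("www.example.com"))
# https://cloudflare-dns.com/dns-query?dns=AAABAAABAAAAAAAAA3d3dwdleGFtcGxlA2NvbQAAAQAB
```

The bytes inside the dns parameter are the same message a stub resolver would send over port 53. A real client would send this URL with an accept: application/dns-message header and parse the binary response, including its TTLs, exactly as before.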

Why DNS Kills You

DNS is dangerous precisely because it works almost all of the time. Engineers build systems that assume DNS resolution is instantaneous and correct. They don’t add DNS resolution metrics to their monitoring. They don’t test failover scenarios where DNS records change. They don’t know the TTL of their own service’s DNS records.

Then a cloud provider reassigns an IP, or a DNS record is misconfigured during a migration, or a resolver goes down, and suddenly half the microservices in the cluster can’t find each other. The dashboards light up with HTTP 503s and connection timeouts, and the on-call engineer starts investigating the application code — because they don’t think to check DNS.

Run this right now on any production system you operate:

$ dig +short your-service.example.com
$ dig +noall +answer your-service.example.com   # the TTL is the second column

If you don’t know what IP addresses should come back, or what the TTL is, you have a blind spot in your operational understanding. When DNS breaks — and it will — you’ll be debugging in the dark.
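One way to start closing that blind spot is to time name resolution from inside the application and record the result. The sketch below is a minimal starting point under stated assumptions: timed_resolve is a hypothetical helper, it goes through the same socket.getaddrinfo() path shown earlier, and where the numbers get shipped (logs, metrics system) is left open.

```python
import socket
import time


def timed_resolve(hostname, port=443):
    """Resolve via the OS resolver; return (addresses, seconds, error)."""
    start = time.perf_counter()
    try:
        infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
        addresses = sorted({info[4][0] for info in infos})  # unique IPs
        error = None
    except socket.gaierror as exc:
        addresses, error = [], str(exc)
    elapsed = time.perf_counter() - start
    return addresses, elapsed, error


# "localhost" keeps the demo offline; substitute a hostname you operate.
addresses, seconds, error = timed_resolve("localhost")
print(f"addresses={addresses} elapsed_ms={seconds * 1000:.2f} error={error}")
```

Because this measures the whole getaddrinfo() path, a fast result may come from /etc/hosts or a local cache rather than from a DNS server. Tracking both the latency and the set of returned addresses over time makes record changes and resolver slowdowns visible.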