Infrastructure Repo Structure and Environment Overlays
The Failure
The infra repo started with flat directories: one folder per service, each containing full Kubernetes manifests for every environment. The checkout service had checkout-staging.yaml and checkout-production.yaml, two 200-line files that were 95% identical. When someone added a new environment variable to staging but forgot production, the two environments drifted apart. Six months later, the repo held 50 YAML files with no explicit relationship between the staging and production configurations.
The base/overlay pattern eliminates the duplication: each service has one base definition, and each environment gets a thin overlay that contains only what differs.
The Mechanism
Directory Convention
ecommerce-infra/
├── apps/                          # Application workloads
│   ├── catalog-service/
│   │   ├── base/
│   │   │   ├── kustomization.yaml
│   │   │   ├── deployment.yaml
│   │   │   ├── service.yaml
│   │   │   └── hpa.yaml
│   │   └── overlays/
│   │       ├── staging/
│   │       │   ├── kustomization.yaml
│   │       │   └── patches/
│   │       │       ├── replicas.yaml
│   │       │       └── env.yaml
│   │       └── production/
│   │           ├── kustomization.yaml
│   │           └── patches/
│   │               ├── replicas.yaml
│   │               ├── env.yaml
│   │               └── resources.yaml
│   ├── checkout-service/
│   │   └── (same structure)
│   └── ...
├── platform/                      # Shared infrastructure
│   ├── argocd/
│   ├── monitoring/
│   ├── ingress/
│   ├── cert-manager/
│   └── external-secrets/
├── clusters/                      # Cluster-level config
│   ├── staging/
│   │   └── kustomization.yaml     # References all staging overlays
│   └── production/
│       └── kustomization.yaml
└── CODEOWNERS
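The tree notes that clusters/staging/kustomization.yaml references all staging overlays, but that file is not shown. A minimal sketch of what it might contain follows, assuming each app overlay is pulled in as a Kustomize resource by relative path. The exact list of entries, and whether the platform/ components are wired in the same way, are assumptions.

```yaml
# clusters/staging/kustomization.yaml (sketch; contents are an assumption)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # One entry per application overlay for this environment
  - ../../apps/catalog-service/overlays/staging
  - ../../apps/checkout-service/overlays/staging
```

Running kustomize build clusters/staging would then render every staging overlay in one pass, which gives a single entry point per cluster.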
The Implementation
Base Definition
# apps/checkout-service/base/deployment.yaml
# HARDENED: Base deployment - environment-agnostic
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
  labels:
    app.kubernetes.io/name: checkout-service
    app.kubernetes.io/part-of: ecommerce
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: checkout-service
  template:
    metadata:
      labels:
        app.kubernetes.io/name: checkout-service
    spec:
      containers:
        - name: checkout-service
          image: ghcr.io/acme/checkout-service:latest
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            initialDelaySeconds: 15
            periodSeconds: 20

# apps/checkout-service/base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
  - hpa.yaml
commonLabels:
  app.kubernetes.io/managed-by: kustomize
Staging Overlay
# apps/checkout-service/overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: staging
resources:
  - ../../base
patches:
  - path: patches/replicas.yaml
  - path: patches/env.yaml

# apps/checkout-service/overlays/staging/patches/replicas.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: checkout-service
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
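The staging kustomization also lists patches/env.yaml, which is not shown. As a rough sketch, it could be a strategic merge patch shaped like the replicas patch. The variable names and values below are hypothetical and only illustrate the form.

```yaml
# apps/checkout-service/overlays/staging/patches/env.yaml (hypothetical contents)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  template:
    spec:
      containers:
        - name: checkout-service        # merged into the base container with the same name
          env:
            - name: LOG_LEVEL           # hypothetical variable
              value: debug
            - name: PAYMENT_API_URL     # hypothetical variable and URL
              value: https://payments.staging.example.com
```

Because the patch names the container, Kustomize merges the env list into the base container instead of replacing the whole container definition.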
Production Overlay
# apps/checkout-service/overlays/production/patches/replicas.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service
spec:
  replicas: 3
  template:
    spec:
      containers:
        - name: checkout-service
          resources:
            requests:
              cpu: 250m
              memory: 512Mi
            limits:
              cpu: 1000m
              memory: 1Gi
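The production kustomization.yaml itself is not shown. A minimal sketch follows, assuming it mirrors the staging one and lists the three patch files from the directory tree. The production namespace name and the image tag are assumptions. The images entry is one way to avoid deploying the base's :latest tag in production.

```yaml
# apps/checkout-service/overlays/production/kustomization.yaml (sketch; not from the original repo)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: production                  # assumed to mirror the staging overlay
resources:
  - ../../base
patches:
  - path: patches/replicas.yaml
  - path: patches/env.yaml
  - path: patches/resources.yaml       # listed in the directory tree; contents not shown
images:
  - name: ghcr.io/acme/checkout-service
    newTag: "1.4.2"                    # hypothetical pinned tag, replaces :latest from the base
```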
CODEOWNERS
# CODEOWNERS
# HARDENED: Require review for production changes
/apps/*/overlays/production/ @platform-team
/platform/ @platform-team
/clusters/production/ @platform-team @sre-team
/apps/*/overlays/staging/ @dev-team
Validate Overlays in CI
# ecommerce-infra/.github/workflows/validate.yml
# HARDENED: Validate all overlays render correctly
name: Validate Manifests
on:
  pull_request:
jobs:
  validate:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        env: [staging, production]
    steps:
      - uses: actions/checkout@v4
      - name: Validate kustomize build
        run: |
          for app in apps/*/overlays/${{ matrix.env }}; do
            echo "Validating: $app"
            kustomize build "$app" > /dev/null
          done
      - name: Kubeval validation
        run: |
          # pipefail: a failed kustomize build fails the step, not only a kubeval error
          set -o pipefail
          for app in apps/*/overlays/${{ matrix.env }}; do
            kustomize build "$app" | kubeval --strict
          done
The Gate
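CODEOWNERS is the gate for production changes. On its own, the file only requests reviews. Once branch protection requires review from code owners, any PR that modifies a production overlay needs approval from the platform team. Staging changes are self-service.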
The Recovery
Overlays diverge too far from base: If production and staging have very different configurations, the overlays become as large as the base. Refactor: move shared config to base, keep only true differences in overlays (replicas, resources, environment variables).
New service requires many files to bootstrap: Create a service template directory. Use cp -r apps/_template apps/new-service and fill in the service name.
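Kustomize build fails on merge: The validation CI catches this before merge. If a broken overlay slips through, ArgoCD cannot render the manifests, so it applies nothing new and the last good state keeps running. The Application surfaces the kustomize error message as a ComparisonError condition, typically with an Unknown sync status. Fix the overlay or revert the commit.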
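For context on the last item, ArgoCD renders an overlay through an Application resource that points at the overlay directory. The original does not show these manifests, so the sketch below is illustrative only: the repository URL, Application name, project, and sync policy are all assumptions.

```yaml
# Hypothetical ArgoCD Application for the production checkout overlay
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: checkout-service-production    # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/acme/ecommerce-infra.git   # hypothetical repository URL
    targetRevision: main
    path: apps/checkout-service/overlays/production         # ArgoCD runs kustomize build here
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true       # assumption: delete resources removed from Git
      selfHeal: true    # assumption: revert manual changes in the cluster
```

Since the path is the overlay directory, a kustomize error in that overlay surfaces on this one Application and leaves other services unaffected.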