Building a Production-Grade Async Job Queue: Engineering Resilience and Backpressure
These articles are AI-generated summaries. Please check the original sources for full details.
I Built a Production-Grade Async Job Queue from Scratch — Here’s Everything That Actually Happened
Macaulay Praise developed a custom async job queue using Python, FastAPI, and Redis Streams to bypass high-level abstractions. The final system features 47 passing tests and a services layer with ~92% coverage.
Why This Matters
While frameworks like Celery abstract infrastructure, they often obscure the mechanics of failure. In production environments, the critical challenge is not basic message delivery but handling ‘zombie jobs’—where workers crash mid-execution without sending an ACK—which can poison metrics and cause silent data loss if not managed by a dedicated recovery service.
Key Insights
- The ‘Reaper’ service ensures trustworthiness by implementing visibility timeout recovery via XPENDING queries in Redis Streams and zombie detection via PostgreSQL heartbeats.
- Backpressure oscillation is mitigated using a two-watermark band (high and low) rather than a single threshold to prevent rapid toggling between accepting and rejecting requests.
- Weighted Fair Scheduling prevents priority starvation by using runtime weights stored in Redis (e.g., critical: 60, high: 30, normal: 10) to distribute worker capacity.
- Thundering herd problems are solved through exponential backoff combined with full jitter (
* random.random()) to spread retry windows.
Working Examples
Exponential backoff with full jitter to prevent synchronized retries.
def _backoff_seconds(retry_count: int, base: float = 1.0, max_delay: float = 60.0) -> float:
return min(base * (2 ** retry_count), max_delay) * random.random()
Weighted scheduler implementation to ensure no priority level is starved.
async def pick_queue(redis) -> str:
weights = await get_weights(redis)
roll = random.randint(1, 100)
cumulative = 0
for priority, weight in weights.items():
cumulative += weight
if roll <= cumulative:
return f"queue:{priority}"
Practical Applications
- .env Management: Using .env.example for committed templates while keeping .env in .gitignore; failure to rotate credentials after accidental commits leads to security breaches.
- Docker Infrastructure: Utilizing
condition: service_healthyin depends_on; simply listing services causes app crashes because containers start before internal services are ready to accept connections. - Virtual Environment Mapping: Placing the venv at
/opt/venvinstead of/app/.venv; bind mounting the application directory overlays the venv and wipes installed packages.
References:
Continue reading
Next article
Implementing State-Based AI Workflows with LangGraph Templates
Related Content
Engineering a Real psql Terminal: PTY, Reverse WebSockets, and Redis Streams
Learn how to build a PTY-backed PostgreSQL console using Redis Streams to decouple I/O and reverse WebSockets to bypass NAT constraints for real terminal semantics.
Engineering Social Impact: Architecture Decisions for a UNICEF Child Development Platform
A technical deep dive into building a child development monitoring platform for UNICEF using Vue 3 and Atomic Design in Tarumã, São Paulo.
Engineering a Real-Time Robot Battle Simulator: Lessons in Performance and Language Design
A technical deep dive into Logic Arena, featuring a custom scripting language and the resolution of a 3,862ms scripting bottleneck.