Building Gigawatt-Scale AI Clusters with Backend Aggregation

These articles are AI-generated summaries. Please check the original sources for full details.

Building Prometheus: How Backend Aggregation Enables Gigawatt-Scale AI Clusters

Meta’s Prometheus AI cluster is being built to deliver 1 gigawatt of capacity, and backend aggregation (BAG) is the network layer that connects its thousands of GPUs across multiple data centers and regions. By combining modular hardware, eBGP-based routing, and resilient topologies, BAG is designed to provide both performance and reliability at this scale, with inter-BAG capacities reaching the petabit range.

Why This Matters

Gigawatt-scale AI clusters like Prometheus need a robust, scalable network fabric, and that requirement often conflicts with design models that prioritize simplicity and low cost. Getting the fabric wrong is expensive and limits how far the cluster can grow, because tens of thousands of GPUs have to be interconnected. For instance, if failure domains are not carefully bounded, a single misconfigured switch can affect an entire region, causing substantial downtime and lost revenue.

Key Insights

  • BAG is a centralized Ethernet-based super spine network layer that interconnects multiple spine layer fabrics across various data centers and regions, with inter-BAG capacities reaching 16-48 Pbps per region pair.
  • BAG is built on modular chassis equipped with Jericho3 (J3) ASIC line cards, which provide a high-capacity, scalable, and resilient interconnect with up to 432x800G ports per chassis.
  • Routing within BAG uses eBGP with link bandwidth attributes. This enables Unequal Cost Multipath (UCMP), which balances load in proportion to each path's available capacity and keeps traffic flowing when links or switches fail.
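
To make the UCMP idea concrete, here is a minimal sketch of how traffic shares could be derived from advertised link bandwidth. It is an illustration only, not Meta's implementation: the function name `ucmp_weights`, the next-hop names, and the bandwidth figures are all hypothetical.

```python
from fractions import Fraction


def ucmp_weights(link_bandwidth_gbps):
    """Return each next hop's traffic share, proportional to its advertised bandwidth.

    `link_bandwidth_gbps` maps a (hypothetical) next-hop name to the bandwidth
    it advertises, in Gbps. Next hops advertising zero bandwidth are skipped.
    """
    usable = {hop: bw for hop, bw in link_bandwidth_gbps.items() if bw > 0}
    total = sum(usable.values())
    if total == 0:
        return {}
    return {hop: Fraction(bw, total) for hop, bw in usable.items()}


# Assumed example: three paths, one of which has only half the capacity.
paths = {"bag-1": 800, "bag-2": 800, "bag-3": 400}
weights = ucmp_weights(paths)

# Simulate losing the path through bag-2: the remaining paths absorb its share.
paths_after_failure = dict(paths, **{"bag-2": 0})
weights_after_failure = ucmp_weights(paths_after_failure)

for hop, share in weights.items():
    print(f"{hop}: {float(share):.2f}")
```

With equal-cost multipath, all three paths would receive the same share regardless of capacity; weighting by bandwidth is what lets the lower-capacity path carry proportionally less.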

Working Example

# BAG Network Topology Example
## Planar Topology
* Connects BAG switches one-to-one between regions
* Offers simplified management but concentrates potential failure domains

## Spread Connection Topology
* Distributes links across multiple BAG switches/planes
* Enhances path diversity and resilience

Practical Applications

  • Use Case: Meta’s Prometheus AI cluster uses BAG to connect thousands of GPUs across multiple data centers and regions, so the cluster can scale beyond a single building while remaining one high-capacity network.
  • Pitfall: Poorly managed oversubscription ratios degrade performance and limit scalability, for example when oversubscription from L2 to BAG exceeds 4.5:1.
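
An oversubscription ratio is simply the capacity facing downstream divided by the capacity facing upstream. The sketch below checks a ratio against the 4.5:1 figure mentioned above; the port counts are invented for illustration and the function name is hypothetical.

```python
def oversubscription_ratio(downlink_gbps, uplink_gbps):
    """Return downstream capacity divided by upstream capacity."""
    if uplink_gbps <= 0:
        raise ValueError("uplink capacity must be positive")
    return downlink_gbps / uplink_gbps


# The 4.5:1 figure comes from the article; the port counts below are assumptions.
MAX_RATIO = 4.5

# Assumed example: 36 x 800G ports facing the L2 fabric, 8 x 800G ports facing BAG.
ratio = oversubscription_ratio(downlink_gbps=36 * 800, uplink_gbps=8 * 800)
within_budget = ratio <= MAX_RATIO

print(f"{ratio:.1f}:1, within budget: {within_budget}")
```

Running a check like this whenever ports are added or removed makes it easier to catch a ratio that drifts past the intended limit.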
