Skip to main content
data systems mechanics invariants in distributed architectures

Request Routing and Service Discovery

6 min read Chapter 8 of 28
Summary

This section establishes the core mechanisms by which...

This section establishes the core mechanisms by which clients locate partitions and replicas in dynamic distributed environments. Service discovery enables automatic identification of network locations for services, partitions, or replicas through centralized (ZooKeeper/etcd) or decentralized (gossip protocol) approaches. Request routing strategies determine how clients direct requests using direct queries, cached tables with TTLs, watch/stream notifications, or gossip dissemination. The section introduces critical tradeoffs: centralized approaches provide strong consistency at the cost of single points of failure, while decentralized approaches offer scalability with eventual consistency. Key technical implementations include ZooKeeper's ephemeral nodes and watches, etcd's gRPC watch streams and leases, and gossip's epidemic dissemination. The architecture shows how hybrid systems combine etcd for strong consistency on critical metadata with gossip for scalable state dissemination. Code artifacts demonstrate ZooKeeper partition routing with caching and etcd leader discovery with real-time updates. Comparison tables contrast consistency models, failure detection mechanisms, and client complexity across strategies.

Request Routing and Service Discovery

Distributed systems assume topology changes from node failures, requiring adaptive discovery mechanisms. When nodes fail, join, or rebalance, clients must locate the correct partition and replica to maintain availability and performance. Request routing and service discovery enforce the tradeoff between consistency and availability in dynamic environments.

Service Discovery Mechanisms

Service discovery implements one of two immutable architectures: centralized or decentralized. These mechanisms determine how clients locate services in a distributed system.

  • Centralized Service Discovery: Centralized service discovery provides strong consistency at the cost of introducing a single point of failure and potential bottleneck. It relies on a coordination service like ZooKeeper or etcd, which maintains a hierarchical namespace or key-value store mapping service names to network locations. Clients query this registry to resolve service addresses.

  • Decentralized Service Discovery: Decentralized service discovery achieves scalability and fault tolerance by distributing the registry across nodes via gossip protocols. Each node maintains a partial view of the system and exchanges state with peers. This approach provides eventual consistency, where updates propagate incrementally and system views may diverge temporarily.

Request Routing Strategies

Request routing strategies enforce specific tradeoffs based on system architecture, partitioning scheme, and discovery mechanism. Each strategy balances consistency, latency, and coordination overhead.

  • Direct Coordination Query: Direct coordination query ensures strong consistency by requiring clients to query the coordination service (e.g., ZooKeeper or etcd) for every request. This strategy increases latency and places sustained load on the coordination layer.

  • Cached with TTL: Cached routing with time-to-live reduces coordination load by allowing clients to reuse routing data until expiration. This strategy introduces the risk of stale routes during the TTL window, trading accuracy for performance.

  • Watch/Stream-based: Watch-based routing delivers immediate updates through push notifications from the coordination service. This strategy maintains strong consistency after propagation but increases client complexity and may strain coordination infrastructure under high churn.

  • Gossip-disseminated: Gossip-disseminated routing spreads topology changes through peer-to-peer exchange. This method scales efficiently and recovers quickly from node failures but guarantees only eventual consistency.

Implementation Examples

ZooKeeper-based Service Discovery

// ZooKeeper-based service discovery for partition routing
public class ZooKeeperPartitionRouter {
    private ZooKeeper zk;
    private Map<String, String> partitionCache = new ConcurrentHashMap<>();
    private String servicePath = "/services/database/partitions";

    public void init() throws Exception {
        zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> {
            if (event.getType() == EventType.NodeChildrenChanged) {
                refreshPartitionMap();
            }
        });
        zk.exists(servicePath, true);
        refreshPartitionMap();
    }

    private void refreshPartitionMap() {
        try {
            List<String> partitions = zk.getChildren(servicePath, false);
            for (String partition : partitions) {
                String nodePath = servicePath + "/" + partition;
                byte[] data = zk.getData(nodePath, false, null);
                String nodeAddress = new String(data);
                partitionCache.put(partition, nodeAddress);
            }
        } catch (Exception e) {
            // Log error, use stale cache with TTL
        }
    }

    public String routeRequest(String partitionKey) {
        String partition = calculatePartition(partitionKey);
        String nodeAddress = partitionCache.get(partition);
        if (nodeAddress == null) {
            // Fallback: query ZooKeeper directly (expensive)
            refreshPartitionMap();
            nodeAddress = partitionCache.get(partition);
        }
        return nodeAddress;
    }

    private String calculatePartition(String key) {
        // Consistent hashing or range-based partitioning
        int hash = key.hashCode() & 0x7fffffff;
        return "p" + (hash % 256); // 256 partitions
    }
}

etcd-based Leader Discovery with gRPC Watch Streams

// etcd-based leader discovery with gRPC watch streams
package main

import (
    "context"
    "log"
    "sync"
    "go.etcd.io/etcd/clientv3"
)

type LeaderWatcher struct {
    client *clientv3.Client
    leaders map[string]string // partition -> leader address
    mu sync.RWMutex
}

func NewLeaderWatcher(endpoints []string) (*LeaderWatcher, error) {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints: endpoints,
    })
    if err != nil {
        return nil, err
    }
    lw := &LeaderWatcher{
        client: cli,
        leaders: make(map[string]string),
    }
    go lw.watchLeaders()
    return lw, nil
}

func (lw *LeaderWatcher) watchLeaders() {
    watchChan := lw.client.Watch(context.Background(), "/leaders/", clientv3.WithPrefix())
    for resp := range watchChan {
        for _, ev := range resp.Events {
            partition := extractPartitionFromKey(string(ev.Kv.Key))
            if ev.Type == clientv3.EventTypePut {
                lw.mu.Lock()
                lw.leaders[partition] = string(ev.Kv.Value)
                lw.mu.Unlock()
            } else if ev.Type == clientv3.EventTypeDelete {
                lw.mu.Lock()
                delete(lw.leaders, partition)
                lw.mu.Unlock()
            }
        }
    }
}

func (lw *LeaderWatcher) GetLeader(partition string) (string, bool) {
    lw.mu.RLock()
    defer lw.mu.RUnlock()
    leader, ok := lw.leaders[partition]
    return leader, ok
}

Comparison of Service Discovery Mechanisms

ComponentPrimary RoleConsistency ModelFailure DetectionUse Case Example
ZooKeeperCentralized coordination serviceStrong consistency (linearizable)Session timeout (typically 5-30s)Kafka broker metadata, HDFS Namenode failover
etcdDistributed key-value storeStrong consistency (linearizable via Raft)Lease expiration + heartbeatKubernetes service discovery, CoreDNS backend
Gossip ProtocolDecentralized state disseminationEventual consistencyPhi Accrual or heartbeat-basedCassandra ring membership, Consul agent failure detection
Hybrid ApproachCombines coordination + gossipStrong for critical metadata, eventual for otherMultiple mechanismsConsul (Raft + gossip), CockroachDB (Raft + gossip)

Client Request Routing Strategies

Routing StrategyCoordination MechanismClient ComplexityFailure Recovery SpeedConsistency Guarantee
Direct coordination queryClient queries ZooKeeper/etcd for each requestLowSlow (network roundtrip)Strong
Cached with TTLClient caches routing table, refreshes periodicallyMediumMedium (TTL-dependent)Stale during TTL
Watch/Stream-basedClient receives push notifications on changesHighFast (immediate update)Strong after propagation
Gossip-disseminatedRouting info spread via gossip protocolMediumFast (gossip speed)Eventual
Anycast/DNSNetwork-level routingLowSlow (DNS TTL)None

Architecture Diagram: Hybrid Coordination with etcd and Gossip

graph LR;
    participant Client as "Client"
    participant etcd as "etcd Cluster"
    participant Gossip as "Gossip Protocol"
    participant Database as "Database Node"

    Client->>etcd: Request service discovery
    etcd->>Client: Return service location
    Client->>Database: Send request to service
    Database->>Gossip: Exchange state via gossip
    Gossip->>Database: Update local state
    Database->>Client: Return response

Sequence Diagram: Client Request Routing with ZooKeeper

sequenceDiagram
    participant Client as "Client"
    participant ZooKeeper as "ZooKeeper Ensemble"
    participant Database as "Database Node"

    Note over Client,ZooKeeper: Client connects to ZooKeeper
    Client->>ZooKeeper: Query /services/db/partitions
    ZooKeeper->>Client: Return partition mapping
    Client->>Client: Cache partition mapping
    Note over Client,ZooKeeper: Client sets watch on parent znode
    ZooKeeper->>Client: Notify on partition change
    Client->>Client: Refresh cache
    Client->>Database: Send request to partition
    Database->>Client: Return response

    alt Node failure scenario
        Database->>ZooKeeper: Session expires
        ZooKeeper->>Client: Notify of node failure
        Client->>Client: Update cache
    end

    alt New node joins
        Database->>ZooKeeper: Register with ZooKeeper
        ZooKeeper->>Client: Notify of new node
        Client->>Client: Update cache
    end

Request routing and service discovery enforce the tradeoff between consistency and availability in dynamic environments. By selecting mechanisms aligned with system invariants—such as linearizability, partition tolerance, or low-latency reads—engineers design for failure as the default state and ensure resilient operation under churn.