Request Routing and Service Discovery

Distributed systems assume topology changes from node failures, requiring adaptive discovery mechanisms. When nodes fail, join, or rebalance, clients must locate the correct partition and replica to maintain availability and performance. Request routing and service discovery enforce the tradeoff between consistency and availability in dynamic environments.

Service Discovery Mechanisms

Service discovery implements one of two immutable architectures: centralized or decentralized. These mechanisms determine how clients locate services in a distributed system.

Centralized Service Discovery: Centralized service discovery provides strong consistency at the cost of introducing a single point of failure and potential bottleneck. It relies on a coordination service like ZooKeeper or etcd, which maintains a hierarchical namespace or key-value store mapping service names to network locations. Clients query this registry to resolve service addresses.
Decentralized Service Discovery: Decentralized service discovery achieves scalability and fault tolerance by distributing the registry across nodes via gossip protocols. Each node maintains a partial view of the system and exchanges state with peers. This approach provides eventual consistency, where updates propagate incrementally and system views may diverge temporarily.

Request Routing Strategies

Request routing strategies enforce specific tradeoffs based on system architecture, partitioning scheme, and discovery mechanism. Each strategy balances consistency, latency, and coordination overhead.

Direct Coordination Query: Direct coordination query ensures strong consistency by requiring clients to query the coordination service (e.g., ZooKeeper or etcd) for every request. This strategy increases latency and places sustained load on the coordination layer.
Cached with TTL: Cached routing with time-to-live reduces coordination load by allowing clients to reuse routing data until expiration. This strategy introduces the risk of stale routes during the TTL window, trading accuracy for performance.
Watch/Stream-based: Watch-based routing delivers immediate updates through push notifications from the coordination service. This strategy maintains strong consistency after propagation but increases client complexity and may strain coordination infrastructure under high churn.
Gossip-disseminated: Gossip-disseminated routing spreads topology changes through peer-to-peer exchange. This method scales efficiently and recovers quickly from node failures but guarantees only eventual consistency.

Implementation Examples

ZooKeeper-based Service Discovery

// ZooKeeper-based service discovery for partition routing
public class ZooKeeperPartitionRouter {
    private ZooKeeper zk;
    private Map<String, String> partitionCache = new ConcurrentHashMap<>();
    private String servicePath = "/services/database/partitions";

    public void init() throws Exception {
        zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> {
            if (event.getType() == EventType.NodeChildrenChanged) {
                refreshPartitionMap();
            }
        });
        zk.exists(servicePath, true);
        refreshPartitionMap();
    }

    private void refreshPartitionMap() {
        try {
            List<String> partitions = zk.getChildren(servicePath, false);
            for (String partition : partitions) {
                String nodePath = servicePath + "/" + partition;
                byte[] data = zk.getData(nodePath, false, null);
                String nodeAddress = new String(data);
                partitionCache.put(partition, nodeAddress);
            }
        } catch (Exception e) {
            // Log error, use stale cache with TTL
        }
    }

    public String routeRequest(String partitionKey) {
        String partition = calculatePartition(partitionKey);
        String nodeAddress = partitionCache.get(partition);
        if (nodeAddress == null) {
            // Fallback: query ZooKeeper directly (expensive)
            refreshPartitionMap();
            nodeAddress = partitionCache.get(partition);
        }
        return nodeAddress;
    }

    private String calculatePartition(String key) {
        // Consistent hashing or range-based partitioning
        int hash = key.hashCode() & 0x7fffffff;
        return "p" + (hash % 256); // 256 partitions
    }
}

etcd-based Leader Discovery with gRPC Watch Streams

// etcd-based leader discovery with gRPC watch streams
package main

import (
    "context"
    "log"
    "sync"
    "go.etcd.io/etcd/clientv3"
)

type LeaderWatcher struct {
    client *clientv3.Client
    leaders map[string]string // partition -> leader address
    mu sync.RWMutex
}

func NewLeaderWatcher(endpoints []string) (*LeaderWatcher, error) {
    cli, err := clientv3.New(clientv3.Config{
        Endpoints: endpoints,
    })
    if err != nil {
        return nil, err
    }
    lw := &LeaderWatcher{
        client: cli,
        leaders: make(map[string]string),
    }
    go lw.watchLeaders()
    return lw, nil
}

func (lw *LeaderWatcher) watchLeaders() {
    watchChan := lw.client.Watch(context.Background(), "/leaders/", clientv3.WithPrefix())
    for resp := range watchChan {
        for _, ev := range resp.Events {
            partition := extractPartitionFromKey(string(ev.Kv.Key))
            if ev.Type == clientv3.EventTypePut {
                lw.mu.Lock()
                lw.leaders[partition] = string(ev.Kv.Value)
                lw.mu.Unlock()
            } else if ev.Type == clientv3.EventTypeDelete {
                lw.mu.Lock()
                delete(lw.leaders, partition)
                lw.mu.Unlock()
            }
        }
    }
}

func (lw *LeaderWatcher) GetLeader(partition string) (string, bool) {
    lw.mu.RLock()
    defer lw.mu.RUnlock()
    leader, ok := lw.leaders[partition]
    return leader, ok
}

Comparison of Service Discovery Mechanisms

Component	Primary Role	Consistency Model	Failure Detection	Use Case Example
ZooKeeper	Centralized coordination service	Strong consistency (linearizable)	Session timeout (typically 5-30s)	Kafka broker metadata, HDFS Namenode failover
etcd	Distributed key-value store	Strong consistency (linearizable via Raft)	Lease expiration + heartbeat	Kubernetes service discovery, CoreDNS backend
Gossip Protocol	Decentralized state dissemination	Eventual consistency	Phi Accrual or heartbeat-based	Cassandra ring membership, Consul agent failure detection
Hybrid Approach	Combines coordination + gossip	Strong for critical metadata, eventual for other	Multiple mechanisms	Consul (Raft + gossip), CockroachDB (Raft + gossip)

Client Request Routing Strategies

Routing Strategy	Coordination Mechanism	Client Complexity	Failure Recovery Speed	Consistency Guarantee
Direct coordination query	Client queries ZooKeeper/etcd for each request	Low	Slow (network roundtrip)	Strong
Cached with TTL	Client caches routing table, refreshes periodically	Medium	Medium (TTL-dependent)	Stale during TTL
Watch/Stream-based	Client receives push notifications on changes	High	Fast (immediate update)	Strong after propagation
Gossip-disseminated	Routing info spread via gossip protocol	Medium	Fast (gossip speed)	Eventual
Anycast/DNS	Network-level routing	Low	Slow (DNS TTL)	None

Architecture Diagram: Hybrid Coordination with etcd and Gossip

graph LR;
    participant Client as "Client"
    participant etcd as "etcd Cluster"
    participant Gossip as "Gossip Protocol"
    participant Database as "Database Node"

    Client->>etcd: Request service discovery
    etcd->>Client: Return service location
    Client->>Database: Send request to service
    Database->>Gossip: Exchange state via gossip
    Gossip->>Database: Update local state
    Database->>Client: Return response

Sequence Diagram: Client Request Routing with ZooKeeper

sequenceDiagram
    participant Client as "Client"
    participant ZooKeeper as "ZooKeeper Ensemble"
    participant Database as "Database Node"

    Note over Client,ZooKeeper: Client connects to ZooKeeper
    Client->>ZooKeeper: Query /services/db/partitions
    ZooKeeper->>Client: Return partition mapping
    Client->>Client: Cache partition mapping
    Note over Client,ZooKeeper: Client sets watch on parent znode
    ZooKeeper->>Client: Notify on partition change
    Client->>Client: Refresh cache
    Client->>Database: Send request to partition
    Database->>Client: Return response

    alt Node failure scenario
        Database->>ZooKeeper: Session expires
        ZooKeeper->>Client: Notify of node failure
        Client->>Client: Update cache
    end

    alt New node joins
        Database->>ZooKeeper: Register with ZooKeeper
        ZooKeeper->>Client: Notify of new node
        Client->>Client: Update cache
    end

Request routing and service discovery enforce the tradeoff between consistency and availability in dynamic environments. By selecting mechanisms aligned with system invariants—such as linearizability, partition tolerance, or low-latency reads—engineers design for failure as the default state and ensure resilient operation under churn.