Request Routing and Service Discovery
Summary
This section establishes the core mechanisms by which clients locate partitions and replicas in dynamic distributed environments. Service discovery enables automatic identification of network locations for services, partitions, or replicas through centralized (ZooKeeper/etcd) or decentralized (gossip protocol) approaches. Request routing strategies determine how clients direct requests using direct queries, cached tables with TTLs, watch/stream notifications, or gossip dissemination. The section introduces the central tradeoff: centralized approaches provide strong consistency but concentrate load and availability risk in the coordination service, while decentralized approaches offer scalability at the price of eventual consistency. Key technical implementations include ZooKeeper's ephemeral nodes and watches, etcd's gRPC watch streams and leases, and gossip's epidemic dissemination. The architecture diagram shows how hybrid systems combine etcd for strong consistency on critical metadata with gossip for scalable state dissemination. Code examples demonstrate ZooKeeper partition routing with caching and etcd leader discovery with real-time updates. Comparison tables contrast consistency models, failure detection mechanisms, and client complexity across strategies.
Distributed systems must assume that their topology changes. Nodes fail, join, and rebalance, so clients need adaptive discovery mechanisms to locate the correct partition and replica while maintaining availability and performance. How a system handles request routing and service discovery largely determines where it sits on the tradeoff between consistency and availability in dynamic environments.
Service Discovery Mechanisms
Service discovery generally follows one of two architectures: centralized or decentralized. The choice determines how clients locate services in a distributed system.
- Centralized Service Discovery: Centralized service discovery provides strong consistency, but it concentrates discovery traffic on one coordination service, which can become a bottleneck and, if its quorum is lost, a single point of failure. It relies on a coordination service like ZooKeeper or etcd, which maintains a hierarchical namespace or key-value store mapping service names to network locations. Clients query this registry to resolve service addresses.
- Decentralized Service Discovery: Decentralized service discovery achieves scalability and fault tolerance by distributing the registry across nodes via gossip protocols. Each node maintains a partial view of the system and exchanges state with peers. This approach provides eventual consistency, where updates propagate incrementally and system views may diverge temporarily.
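To make the decentralized case concrete, the following Go sketch shows one way two nodes might reconcile their partial views during a gossip exchange. It is a minimal illustration under stated assumptions, not the protocol of any particular system: the `Entry`, `View`, `Merge`, and `Exchange` names are hypothetical, and the per-node `Version` counter is an assumed conflict-resolution rule in which the higher version wins.

```go
package main

// Entry is one node's advertised state. Version is a per-node counter that
// the owning node increments whenever its state changes (assumed scheme).
type Entry struct {
	Addr    string
	Version uint64
	Alive   bool
}

// View is a node's local, possibly incomplete picture of the cluster,
// keyed by node ID.
type View map[string]Entry

// Merge folds a peer's view into the local one, keeping the entry with the
// higher version for each node. It returns how many local entries changed.
func (v View) Merge(remote View) int {
	changed := 0
	for id, r := range remote {
		local, ok := v[id]
		if !ok || r.Version > local.Version {
			v[id] = r
			changed++
		}
	}
	return changed
}

// Exchange performs one push-pull round between two views. Afterwards both
// hold the newest entry either side had seen for every node.
func Exchange(a, b View) {
	a.Merge(b)
	b.Merge(a)
}
```

Because `Merge` only ever moves an entry forward to a higher version, repeating an exchange is harmless, which is why views that diverge temporarily can still converge once every update has reached every node.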
Request Routing Strategies
Each request routing strategy makes a different tradeoff, shaped by the system architecture, partitioning scheme, and discovery mechanism. The strategies below balance consistency, latency, and coordination overhead in different ways.
- Direct Coordination Query: Direct coordination query ensures strong consistency by requiring clients to query the coordination service (e.g., ZooKeeper or etcd) for every request. This strategy increases latency and places sustained load on the coordination layer.
- Cached with TTL: Cached routing with time-to-live reduces coordination load by allowing clients to reuse routing data until expiration. This strategy introduces the risk of stale routes during the TTL window, trading accuracy for performance.
- Watch/Stream-based: Watch-based routing delivers immediate updates through push notifications from the coordination service. This strategy maintains strong consistency after propagation but increases client complexity and may strain coordination infrastructure under high churn.
- Gossip-disseminated: Gossip-disseminated routing spreads topology changes through peer-to-peer exchange. This method scales efficiently and recovers quickly from node failures but guarantees only eventual consistency.
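The cached-with-TTL strategy can be sketched in a few lines of Go. This is an illustrative sketch rather than a production client: `TTLRouter`, `LookupFunc`, and `Invalidate` are hypothetical names, and `LookupFunc` stands in for whatever call actually queries the coordination service.

```go
package main

import (
	"sync"
	"time"
)

// LookupFunc resolves a partition to a node address, for example by querying
// a coordination service. It is a placeholder for the real registry client.
type LookupFunc func(partition string) (string, error)

type cachedRoute struct {
	addr      string
	expiresAt time.Time
}

// TTLRouter caches routes and re-resolves them once they are older than ttl.
type TTLRouter struct {
	mu     sync.Mutex
	ttl    time.Duration
	lookup LookupFunc
	routes map[string]cachedRoute
}

func NewTTLRouter(ttl time.Duration, lookup LookupFunc) *TTLRouter {
	return &TTLRouter{ttl: ttl, lookup: lookup, routes: make(map[string]cachedRoute)}
}

// Route returns the cached address while it is fresh and falls back to the
// registry on a miss or after expiry. The lock is held during the lookup to
// keep the sketch short; a real client would likely avoid that.
func (r *TTLRouter) Route(partition string) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if c, ok := r.routes[partition]; ok && time.Now().Before(c.expiresAt) {
		return c.addr, nil
	}
	addr, err := r.lookup(partition)
	if err != nil {
		return "", err
	}
	r.routes[partition] = cachedRoute{addr: addr, expiresAt: time.Now().Add(r.ttl)}
	return addr, nil
}

// Invalidate drops a route early, for example after a connection error to
// the cached node, so the next Route call re-resolves it.
func (r *TTLRouter) Invalidate(partition string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.routes, partition)
}
```

The stale-route risk is visible in `Route`: if the registry changes while an entry is still fresh, the old address keeps being returned until the TTL elapses or the caller invalidates the entry after a failed request.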
Implementation Examples
ZooKeeper-based Service Discovery
```java
// ZooKeeper-based service discovery for partition routing
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperPartitionRouter {
    private ZooKeeper zk;
    private final Map<String, String> partitionCache = new ConcurrentHashMap<>();
    private final String servicePath = "/services/database/partitions";

    public void init() throws Exception {
        // The default watcher receives events for every watch set with watch=true.
        zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, event -> {
            EventType type = event.getType();
            if (type == EventType.NodeChildrenChanged || type == EventType.NodeDataChanged) {
                refreshPartitionMap();
            }
        });
        refreshPartitionMap();
    }

    private void refreshPartitionMap() {
        try {
            // ZooKeeper watches fire only once, so re-register them on every refresh.
            List<String> partitions = zk.getChildren(servicePath, true);
            for (String partition : partitions) {
                String nodePath = servicePath + "/" + partition;
                byte[] data = zk.getData(nodePath, true, null);
                partitionCache.put(partition, new String(data, StandardCharsets.UTF_8));
            }
            // Drop cached partitions whose znodes no longer exist.
            partitionCache.keySet().retainAll(partitions);
        } catch (Exception e) {
            // Log the error and keep serving from the (possibly stale) cache.
        }
    }

    public String routeRequest(String partitionKey) {
        String partition = calculatePartition(partitionKey);
        String nodeAddress = partitionCache.get(partition);
        if (nodeAddress == null) {
            // Fallback: query ZooKeeper directly (expensive)
            refreshPartitionMap();
            nodeAddress = partitionCache.get(partition);
        }
        return nodeAddress; // may still be null if the partition is unassigned
    }

    private String calculatePartition(String key) {
        // Simple hash-modulo placement, standing in for consistent hashing
        // or range-based partitioning.
        int hash = key.hashCode() & 0x7fffffff;
        return "p" + (hash % 256); // 256 partitions
    }
}
```
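The `calculatePartition` method above maps keys with a plain hash modulo, while its comment points to consistent hashing as an alternative. The Go sketch below shows roughly what that alternative could look like. It is an assumption-laden illustration, not part of the router above: the `Ring` type, the `NewRing` and `Locate` functions, the FNV-1a hash, and the virtual-node naming scheme (`node#i`) are all choices made for this example.

```go
package main

import (
	"hash/fnv"
	"sort"
	"strconv"
)

// Ring is a minimal consistent-hash ring with virtual nodes.
type Ring struct {
	points []uint32          // sorted hash positions on the ring
	owner  map[uint32]string // position -> node that owns it
}

// hashKey maps a string to a position on the ring using FNV-1a.
func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// NewRing places `replicas` virtual points per node on the ring.
func NewRing(nodes []string, replicas int) *Ring {
	r := &Ring{owner: make(map[uint32]string)}
	for _, n := range nodes {
		for i := 0; i < replicas; i++ {
			p := hashKey(n + "#" + strconv.Itoa(i))
			if _, taken := r.owner[p]; taken {
				continue // position collision: the earlier node keeps the point
			}
			r.owner[p] = n
			r.points = append(r.points, p)
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// Locate returns the node that owns key: the first point at or after the
// key's hash, wrapping around to the start of the ring if necessary.
func (r *Ring) Locate(key string) string {
	if len(r.points) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
	if i == len(r.points) {
		i = 0
	}
	return r.owner[r.points[i]]
}
```

The practical difference from hash-modulo placement shows up when membership changes: adding a node to the ring only takes over the keys that now fall on the new node's points, whereas changing the modulus typically reassigns most keys.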
etcd-based Leader Discovery with gRPC Watch Streams
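```go
// etcd-based leader discovery with gRPC watch streams
package main

import (
	"context"
	"log"
	"strings"
	"sync"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

const leaderPrefix = "/leaders/"

type LeaderWatcher struct {
	client  *clientv3.Client
	leaders map[string]string // partition -> leader address
	mu      sync.RWMutex
}

func NewLeaderWatcher(endpoints []string) (*LeaderWatcher, error) {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   endpoints,
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return nil, err
	}
	lw := &LeaderWatcher{client: cli, leaders: make(map[string]string)}

	// Load the current leaders first; a watch alone only reports later changes.
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	resp, err := cli.Get(ctx, leaderPrefix, clientv3.WithPrefix())
	if err != nil {
		cli.Close()
		return nil, err
	}
	for _, kv := range resp.Kvs {
		lw.leaders[extractPartitionFromKey(string(kv.Key))] = string(kv.Value)
	}

	// Watch from the revision after the snapshot so no update is missed.
	go lw.watchLeaders(resp.Header.Revision + 1)
	return lw, nil
}

func (lw *LeaderWatcher) watchLeaders(startRev int64) {
	watchChan := lw.client.Watch(context.Background(), leaderPrefix,
		clientv3.WithPrefix(), clientv3.WithRev(startRev))
	for resp := range watchChan {
		if err := resp.Err(); err != nil {
			log.Printf("leader watch error: %v", err)
			continue
		}
		for _, ev := range resp.Events {
			partition := extractPartitionFromKey(string(ev.Kv.Key))
			lw.mu.Lock()
			switch ev.Type {
			case clientv3.EventTypePut:
				lw.leaders[partition] = string(ev.Kv.Value)
			case clientv3.EventTypeDelete:
				delete(lw.leaders, partition)
			}
			lw.mu.Unlock()
		}
	}
}

// extractPartitionFromKey maps a key such as "/leaders/p42" to "p42".
func extractPartitionFromKey(key string) string {
	return strings.TrimPrefix(key, leaderPrefix)
}

func (lw *LeaderWatcher) GetLeader(partition string) (string, bool) {
	lw.mu.RLock()
	defer lw.mu.RUnlock()
	leader, ok := lw.leaders[partition]
	return leader, ok
}
```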
Comparison of Service Discovery Mechanisms
| Component | Primary Role | Consistency Model | Failure Detection | Use Case Example |
|---|---|---|---|---|
| ZooKeeper | Centralized coordination service | Linearizable writes; reads may lag unless preceded by `sync` | Session timeout (typically 5-30s) | Kafka broker metadata (pre-KRaft), HDFS NameNode failover |
| etcd | Distributed key-value store | Strong consistency (linearizable via Raft) | Lease expiration + heartbeat | Kubernetes service discovery, CoreDNS backend |
| Gossip Protocol | Decentralized state dissemination | Eventual consistency | Phi Accrual or heartbeat-based | Cassandra ring membership, Consul agent failure detection |
| Hybrid Approach | Combines coordination + gossip | Strong for critical metadata, eventual for other | Multiple mechanisms | Consul (Raft + gossip), CockroachDB (Raft + gossip) |
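The "lease expiration + heartbeat" style of failure detection in the table can be illustrated with a small in-memory model. The Go sketch below is not etcd's API; it is a hypothetical `LeaseRegistry` that only mimics the idea that a registration disappears unless its owner keeps renewing it. Time is passed in explicitly as milliseconds (an assumption made to keep the example deterministic).

```go
package main

import "sync"

// LeaseRegistry is an in-memory stand-in for a coordination service that
// drops registrations whose lease has not been renewed in time.
type LeaseRegistry struct {
	mu       sync.Mutex
	ttlMs    int64
	addrs    map[string]string // instance ID -> address
	expireAt map[string]int64  // instance ID -> lease expiry (ms)
}

func NewLeaseRegistry(ttlMs int64) *LeaseRegistry {
	return &LeaseRegistry{
		ttlMs:    ttlMs,
		addrs:    make(map[string]string),
		expireAt: make(map[string]int64),
	}
}

// Register records an instance and grants it a lease starting at nowMs.
func (r *LeaseRegistry) Register(id, addr string, nowMs int64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.addrs[id] = addr
	r.expireAt[id] = nowMs + r.ttlMs
}

// KeepAlive is the heartbeat: it extends a live lease. It returns false if
// the lease has already expired, in which case the instance must re-register.
func (r *LeaseRegistry) KeepAlive(id string, nowMs int64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.liveLocked(id, nowMs) {
		return false
	}
	r.expireAt[id] = nowMs + r.ttlMs
	return true
}

// Lookup returns the address only while the instance's lease is live.
func (r *LeaseRegistry) Lookup(id string, nowMs int64) (string, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if !r.liveLocked(id, nowMs) {
		return "", false
	}
	return r.addrs[id], true
}

// liveLocked reports whether id holds an unexpired lease and removes the
// registration if the lease has run out. The caller must hold r.mu.
func (r *LeaseRegistry) liveLocked(id string, nowMs int64) bool {
	exp, ok := r.expireAt[id]
	if !ok {
		return false
	}
	if nowMs >= exp {
		delete(r.addrs, id)
		delete(r.expireAt, id)
		return false
	}
	return true
}
```

The lease TTL bounds how long a crashed instance can remain visible: a shorter TTL detects failures sooner but requires more frequent heartbeats and is more likely to evict a healthy instance during a brief network stall.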
Client Request Routing Strategies
| Routing Strategy | Coordination Mechanism | Client Complexity | Failure Recovery Speed | Consistency Guarantee |
|---|---|---|---|---|
| Direct coordination query | Client queries ZooKeeper/etcd for each request | Low | Fast (always reads current registry state), but every request pays a network round trip | Strong |
| Cached with TTL | Client caches routing table, refreshes periodically | Medium | Medium (TTL-dependent) | Stale during TTL |
| Watch/Stream-based | Client receives push notifications on changes | High | Fast (immediate update) | Strong after propagation |
| Gossip-disseminated | Routing info spread via gossip protocol | Medium | Fast (gossip speed) | Eventual |
| Anycast/DNS | Network-level routing | Low | Slow (DNS TTL) | None |
Architecture Diagram: Hybrid Coordination with etcd and Gossip
```mermaid
graph LR
    Client["Client"]
    Etcd["etcd Cluster"]
    Database["Database Node"]
    Gossip["Gossip Protocol"]
    Client -->|"1. Request service discovery"| Etcd
    Etcd -->|"2. Return service location"| Client
    Client -->|"3. Send request to service"| Database
    Database -->|"4. Exchange state via gossip"| Gossip
    Gossip -->|"5. Update local state"| Database
    Database -->|"6. Return response"| Client
```
Sequence Diagram: Client Request Routing with ZooKeeper
```mermaid
sequenceDiagram
    participant Client as Client
    participant ZooKeeper as ZooKeeper Ensemble
    participant Database as Database Node
    Note over Client,ZooKeeper: Client connects to ZooKeeper
    Client->>ZooKeeper: Query /services/database/partitions
    ZooKeeper->>Client: Return partition mapping
    Client->>Client: Cache partition mapping
    Note over Client,ZooKeeper: Client sets watch on parent znode
    ZooKeeper->>Client: Notify on partition change
    Client->>Client: Refresh cache
    Client->>Database: Send request to partition
    Database->>Client: Return response
    opt Node failure scenario
        Note over ZooKeeper,Database: Node session expires and its ephemeral znode is removed
        ZooKeeper->>Client: Notify of node failure
        Client->>Client: Update cache
    end
    opt New node joins
        Database->>ZooKeeper: Register with ZooKeeper
        ZooKeeper->>Client: Notify of new node
        Client->>Client: Update cache
    end
```
The choice of request routing and service discovery mechanisms fixes how a system trades consistency against availability under churn. By selecting mechanisms aligned with the system's required invariants, such as linearizability, partition tolerance, or low-latency reads, engineers design for failure as the default state and keep the system operating as nodes come and go.