DNS Resolution, Hybrid Connectivity, and Network Debugging
DNS in AWS is more complex than it appears. Each VPC has a built-in DNS resolver at the VPC CIDR base + 2 (e.g., 10.0.0.2 for a 10.0.0.0/16 VPC). This resolver handles VPC-internal DNS, Route 53 private hosted zones, and forwarding to public DNS. When you add VPC peering, Transit Gateway, and on-premises networks, DNS resolution becomes a non-trivial routing problem.
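The "base + 2" rule is easy to check by hand. The following is a minimal sketch using only the Python standard library; the helper name `vpc_resolver_ip` is made up for illustration, and it assumes you pass the VPC's primary CIDR block.

```python
import ipaddress


def vpc_resolver_ip(vpc_cidr: str) -> str:
    """Return the VPC resolver address: the network base address plus 2."""
    network = ipaddress.ip_network(vpc_cidr)
    return str(network.network_address + 2)


print(vpc_resolver_ip("10.0.0.0/16"))    # 10.0.0.2
print(vpc_resolver_ip("172.31.0.0/16"))  # 172.31.0.2
```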
Route 53 Private Hosted Zones
A private hosted zone is associated with one or more VPCs. Only resources in those VPCs can resolve records in the zone:
import time
import boto3
route53 = boto3.client('route53')
# Create a private hosted zone for internal service discovery
zone_response = route53.create_hosted_zone(
    Name='internal.mycompany.com',
    CallerReference=f'internal-zone-{int(time.time())}',  # must be unique per request
    VPC={
        'VPCRegion': 'us-east-1',
        'VPCId': 'vpc-prod'
    },
    HostedZoneConfig={
        'Comment': 'Internal service discovery',
        'PrivateZone': True
    }
)
zone_id = zone_response['HostedZone']['Id']
# Associate additional VPCs (they can now resolve records in this zone)
route53.associate_vpc_with_hosted_zone(
    HostedZoneId=zone_id,
    VPC={'VPCRegion': 'us-east-1', 'VPCId': 'vpc-staging'}
)
# Create records for internal services
route53.change_resource_record_sets(
    HostedZoneId=zone_id,
    ChangeBatch={
        'Changes': [{
            'Action': 'UPSERT',
            'ResourceRecordSet': {
                'Name': 'payment-api.internal.mycompany.com',
                'Type': 'A',
                'AliasTarget': {
                    # Hosted zone ID of the load balancer, not of your own zone.
                    # NLBs in us-east-1 use Z26RNL4JYFTOTI
                    # (Z35SXDOTRQ7X7K is the ALB/Classic ELB zone).
                    'HostedZoneId': 'Z26RNL4JYFTOTI',
                    'DNSName': 'my-nlb-1234.elb.us-east-1.amazonaws.com',
                    'EvaluateTargetHealth': True
                }
            }
        }]
    }
)
Route 53 Resolver: Hybrid DNS
When you have on-premises networks connected via VPN/Direct Connect, DNS resolution must flow both ways:
- AWS resources need to resolve on-prem DNS names (ldap.corp.internal)
- On-prem resources need to resolve AWS private DNS names (payment-api.internal.mycompany.com)
# Route 53 Resolver Endpoints solve this bidirectional DNS problem
resolver = boto3.client('route53resolver')
# Outbound Endpoint: AWS → On-premises DNS resolution
# "I need my VPC resources to resolve corp.internal names"
outbound_endpoint = resolver.create_resolver_endpoint(
    CreatorRequestId='outbound-to-onprem',
    Name='outbound-to-corp-dns',
    SecurityGroupIds=['sg-resolver-outbound'],
    Direction='OUTBOUND',
    IpAddresses=[
        {'SubnetId': 'subnet-private-a', 'Ip': '10.0.16.10'},
        {'SubnetId': 'subnet-private-b', 'Ip': '10.0.80.10'}
    ]
)
# Forwarding rule: Send corp.internal queries to on-prem DNS servers
forward_rule = resolver.create_resolver_rule(
    CreatorRequestId='forward-corp-internal',
    Name='forward-to-corp-dns',
    RuleType='FORWARD',
    DomainName='corp.internal',
    TargetIps=[
        {'Ip': '172.16.0.53', 'Port': 53},  # On-prem DNS server 1
        {'Ip': '172.16.0.54', 'Port': 53}   # On-prem DNS server 2
    ],
    ResolverEndpointId=outbound_endpoint['ResolverEndpoint']['Id']
)
# A rule has no effect until it is associated with a VPC
resolver.associate_resolver_rule(
    ResolverRuleId=forward_rule['ResolverRule']['Id'],
    VPCId='vpc-prod'
)
# Inbound Endpoint: On-premises → AWS DNS resolution
# "I need on-prem servers to resolve my AWS private hosted zones"
inbound_endpoint = resolver.create_resolver_endpoint(
    CreatorRequestId='inbound-from-onprem',
    Name='inbound-from-corp',
    SecurityGroupIds=['sg-resolver-inbound'],
    Direction='INBOUND',
    IpAddresses=[
        {'SubnetId': 'subnet-private-a', 'Ip': '10.0.16.53'},
        {'SubnetId': 'subnet-private-b', 'Ip': '10.0.80.53'}
    ]
)
# On-prem DNS servers forward queries for internal.mycompany.com to these IPs
# (configure conditional forwarding on your on-prem DNS)
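When several forwarding rules overlap, Route 53 Resolver applies the rule whose domain name is the most specific match for the query. The sketch below models that selection locally so the behaviour is easier to reason about. It is not the Resolver's implementation; the function name `select_forward_rule` and the second rule domain `eng.corp.internal` are hypothetical.

```python
from typing import List, Optional


def select_forward_rule(query_name: str, rule_domains: List[str]) -> Optional[str]:
    """Return the most specific rule domain that matches query_name, or None."""
    name = query_name.rstrip(".").lower()
    best = None
    for domain in rule_domains:
        candidate = domain.rstrip(".").lower()
        # A rule matches the domain itself and every name beneath it.
        if name == candidate or name.endswith("." + candidate):
            # Among matching suffixes, the longest one is the most specific.
            if best is None or len(candidate) > len(best):
                best = candidate
    return best


# Hypothetical rule set: a broad corp.internal rule plus a narrower one.
rules = ["corp.internal", "eng.corp.internal"]
print(select_forward_rule("ldap.corp.internal", rules))
print(select_forward_rule("build.eng.corp.internal", rules))
print(select_forward_rule("payment-api.internal.mycompany.com", rules))
```

A result of `None` here stands for "no forwarding rule applies", in which case the query is answered by the VPC resolver itself (private hosted zones, then public DNS).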
VPC Flow Logs: Network Forensics
Flow Logs capture metadata about IP traffic flowing through ENIs. They don’t capture packet contents, only connection metadata (source, destination, ports, protocol, action, bytes).
# Enable VPC Flow Logs to CloudWatch Logs
ec2 = boto3.client('ec2')
ec2.create_flow_logs(  # note the plural: the boto3 method is create_flow_logs
    ResourceIds=['vpc-prod'],
    ResourceType='VPC',
    TrafficType='ALL',  # ACCEPT, REJECT, or ALL
    LogDestinationType='cloud-watch-logs',
    LogGroupName='/vpc/flow-logs/prod',
    # Required for CloudWatch Logs delivery; placeholder ARN, use your own role
    DeliverLogsPermissionArn='arn:aws:iam::123456789012:role/flow-logs-role',
    MaxAggregationInterval=60,  # 1 minute (or 600 for 10 min, cheaper)
    LogFormat='${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} '
              '${srcport} ${dstport} ${protocol} ${packets} ${bytes} '
              '${start} ${end} ${action} ${log-status} '
              '${vpc-id} ${subnet-id} ${tcp-flags} ${flow-direction}',
    TagSpecifications=[{
        'ResourceType': 'vpc-flow-log',
        'Tags': [{'Key': 'Name', 'Value': 'prod-flow-logs'}]
    }]
)
# Query flow logs with CloudWatch Logs Insights
# Find all REJECTED traffic to a specific instance
query = """
fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, protocol, action
| filter action = "REJECT"
| filter dstAddr = "10.0.16.45"
| sort @timestamp desc
| limit 100
"""
# Common patterns in flow log analysis:
# 1. SG blocking: REJECT with srcAddr=known-good → Missing security group rule
# 2. NACL blocking: REJECT on ephemeral port → NACL outbound not allowing return traffic
# 3. Route miss: No flow log entry at all → Packet never reached the ENI (routing issue)
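Because the custom `LogFormat` above is a space-separated list of fields, a raw record can be split back into named values in a few lines. This is a minimal sketch: the field order is copied from the `LogFormat` string in this section, while the `parse_flow_record` helper and the sample record are made up for illustration.

```python
# Field order must match the LogFormat string used when creating the flow log.
FIELDS = [
    "version", "account-id", "interface-id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log-status",
    "vpc-id", "subnet-id", "tcp-flags", "flow-direction",
]


def parse_flow_record(line: str) -> dict:
    """Split one space-separated flow log record into a field -> value dict."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


# Invented sample record: a rejected inbound TCP (protocol 6) attempt on port 443.
sample = (
    "5 123456789012 eni-0abc 203.0.113.7 10.0.16.45 50432 443 6 3 180 "
    "1700000000 1700000060 REJECT OK vpc-prod subnet-private-a 2 ingress"
)
record = parse_flow_record(sample)
print(record["action"], record["dstaddr"], record["dstport"])
```

All values stay as strings here; convert ports, byte counts, and timestamps to integers only where the analysis needs them.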
Systematic Network Debugging
When connectivity fails between two resources, follow this checklist:
def debug_connectivity(source_id: str, dest_id: str, dest_ip: str, dest_port: int):
    """
    Systematic checklist for debugging network connectivity in AWS.
    Check each layer in order; the first failure is usually your root cause.
    source_id and dest_id are resource IDs (instance, ENI, gateway);
    dest_ip is the address on the destination that you are trying to reach.
    """
    ec2 = boto3.client('ec2')
    print("=" * 60)
    print(f"Debugging: {source_id} → {dest_ip}:{dest_port}")
    print("=" * 60)
    # Layer 1: Route Table
    print("\n[1] Route Table Check")
    print(" - Does the source's subnet route table have a route to the destination CIDR?")
    print(" - For cross-VPC: Is there a peering/TGW route?")
    print(" - For internet: Is there an IGW (public) or NAT GW (private) route?")
    # Layer 2: Network ACLs (stateless!)
    print("\n[2] Network ACL Check")
    print(" - Source subnet NACL: Is OUTBOUND to dest_ip:dest_port ALLOWED?")
    print(" - Dest subnet NACL: Is INBOUND from source_ip:dest_port ALLOWED?")
    print(" - CRITICAL: Is OUTBOUND on dest NACL allowing ephemeral ports (1024-65535)?")
    print(" - NACLs are evaluated by rule number (lowest first), first match wins")
    # Layer 3: Security Groups (stateful)
    print("\n[3] Security Group Check")
    print(" - Source SG: Does it allow OUTBOUND to dest_ip:dest_port?")
    print(" - (Default SG allows all outbound, so this is usually not the issue)")
    print(" - Dest SG: Does it allow INBOUND from source_ip/source_sg:dest_port?")
    print(" - SGs are stateful: if inbound allowed, return traffic auto-allowed")
    # Layer 4: DNS Resolution
    print("\n[4] DNS Check (if using hostname instead of IP)")
    print(" - Can the source resolve the hostname?")
    print(" - Is the private hosted zone associated with the source's VPC?")
    print(" - For interface endpoints: Is PrivateDnsEnabled=true?")
    # Layer 5: Service-specific
    print("\n[5] Service-Specific Check")
    print(" - Lambda in VPC: Does it have a NAT Gateway for internet?")
    print(" - RDS: Is it publicly accessible? If not, must be in same VPC/peered")
    print(" - S3: Using gateway endpoint? Check endpoint policy.")
    print(" - Cross-account: Resource policy must allow the source principal")
    # Use Reachability Analyzer for automated checking.
    # Source and Destination take resource IDs; the IP goes in DestinationIp.
    print("\n[6] Automated: Use VPC Reachability Analyzer")
    analysis = ec2.create_network_insights_path(
        Source=source_id,
        Destination=dest_id,
        DestinationIp=dest_ip,
        DestinationPort=dest_port,
        Protocol='tcp'
    )
    print(f" Created path: {analysis['NetworkInsightsPath']['NetworkInsightsPathId']}")
    print(" Run: ec2.start_network_insights_analysis(NetworkInsightsPathId=...)")
// Using VPC Reachability Analyzer programmatically
import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.*;
public class NetworkDebugger {
    private final Ec2Client ec2 = Ec2Client.create();
    public void analyzeReachability(String sourceId, String destId, int port)
            throws InterruptedException {
        // Create analysis path
        CreateNetworkInsightsPathResponse pathResponse = ec2.createNetworkInsightsPath(
            CreateNetworkInsightsPathRequest.builder()
                .source(sourceId) // ENI, instance, or gateway ID
                .destination(destId)
                .destinationPort(port)
                .protocol(Protocol.TCP)
                .build());
        String pathId = pathResponse.networkInsightsPath().networkInsightsPathId();
        // Run the analysis
        StartNetworkInsightsAnalysisResponse analysisResponse =
            ec2.startNetworkInsightsAnalysis(
                StartNetworkInsightsAnalysisRequest.builder()
                    .networkInsightsPathId(pathId)
                    .build());
        String analysisId = analysisResponse.networkInsightsAnalysis()
            .networkInsightsAnalysisId();
        // Poll until the analysis leaves the RUNNING state.
        // Simple fixed-delay loop; in production, add a timeout or use async.
        NetworkInsightsAnalysis analysis;
        do {
            Thread.sleep(2000);
            analysis = ec2.describeNetworkInsightsAnalyses(
                    DescribeNetworkInsightsAnalysesRequest.builder()
                        .networkInsightsAnalysisIds(analysisId)
                        .build())
                .networkInsightsAnalyses().get(0);
        } while (analysis.status() == AnalysisStatus.RUNNING);
        // networkPathFound() is a Boolean and may be null if the analysis failed
        boolean reachable = Boolean.TRUE.equals(analysis.networkPathFound());
        System.out.println("Reachable: " + reachable);
        if (!reachable) {
            System.out.println("Explanations:");
            for (var explanation : analysis.explanations()) {
                // component() is an AnalysisComponent and is not set for every explanation
                String componentId = explanation.component() != null
                    ? explanation.component().id() : "(n/a)";
                System.out.printf(" Component: %s → %s%n",
                    componentId, explanation.explanationCode());
            }
        }
    }
}
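The NACL step of the checklist depends on two rules: entries are evaluated in ascending rule-number order, and the first match decides. The sketch below models that logic locally, assuming a simplified rule shape (one CIDR, one port range, allow or deny). The `NaclRule` class and the sample rules are invented for illustration and do not mirror the EC2 API's data model.

```python
import ipaddress
from dataclasses import dataclass
from typing import List


@dataclass
class NaclRule:
    number: int      # lower numbers are evaluated first
    cidr: str        # remote address range the rule applies to
    port_from: int
    port_to: int
    allow: bool      # True = ALLOW, False = DENY


def evaluate_nacl(rules: List[NaclRule], ip: str, port: int) -> bool:
    """Return True if the traffic is allowed; the first matching rule wins."""
    addr = ipaddress.ip_address(ip)
    for rule in sorted(rules, key=lambda r: r.number):
        in_cidr = addr in ipaddress.ip_network(rule.cidr)
        in_ports = rule.port_from <= port <= rule.port_to
        if in_cidr and in_ports:
            return rule.allow
    return False  # nothing matched: the implicit final deny applies


# Invented example: rule 90 denies one subnet before rule 100 allows the VPC.
rules = [
    NaclRule(100, "10.0.0.0/16", 443, 443, True),
    NaclRule(90, "10.0.16.0/20", 443, 443, False),
]
print(evaluate_nacl(rules, "10.0.16.45", 443))   # False: rule 90 matches first
print(evaluate_nacl(rules, "10.0.80.10", 443))   # True: only rule 100 matches
print(evaluate_nacl(rules, "10.0.80.10", 50000)) # False: implicit deny
```

Because NACLs are stateless, the same evaluation has to pass in the reverse direction for the reply, which is why the ephemeral port range appears in the checklist.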
Pro tip: When all else fails, check whether the instance has Source/Destination Check enabled (the default). This setting drops traffic whose source or destination IP doesn’t match the ENI’s own IP, which breaks NAT instances, network appliances, and container networking that rely on IP forwarding. Disable it for any instance acting as a router.
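Checking and disabling the setting can be sketched with two EC2 calls, `describe_instances` and `modify_instance_attribute`. The helper names below are hypothetical, and the functions take an already-created EC2 client as a parameter so the logic can be exercised without AWS credentials.

```python
def source_dest_check_enabled(ec2_client, instance_id: str) -> bool:
    """Return the instance's current SourceDestCheck value."""
    response = ec2_client.describe_instances(InstanceIds=[instance_id])
    instance = response["Reservations"][0]["Instances"][0]
    return instance["SourceDestCheck"]


def disable_source_dest_check(ec2_client, instance_id: str) -> None:
    """Turn the check off so the instance can forward traffic not addressed to it."""
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        SourceDestCheck={"Value": False},
    )


# Usage (assumes boto3 is installed and credentials are configured;
# the instance ID is a placeholder):
#   import boto3
#   ec2 = boto3.client("ec2")
#   if source_dest_check_enabled(ec2, "i-0123456789abcdef0"):
#       disable_source_dest_check(ec2, "i-0123456789abcdef0")
```

For instances with more than one network interface, the same attribute can be set per interface; this sketch only covers the instance-level call.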