aws in the trenches advanced cloud engineering for senior developers

DNS Resolution, Hybrid Connectivity, and Network Debugging

6 min read Chapter 15 of 21

DNS in AWS is more complex than it appears. Each VPC has a built-in DNS resolver at the VPC CIDR base + 2 (e.g., 10.0.0.2 for a 10.0.0.0/16 VPC). This resolver handles VPC-internal DNS, Route 53 private hosted zones, and forwarding to public DNS. When you add VPC peering, Transit Gateway, and on-premises networks, DNS resolution becomes a non-trivial routing problem.
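As a quick illustration of the "base + 2" rule, the resolver address can be derived from a VPC's primary CIDR with Python's standard `ipaddress` module. This is a minimal sketch; `vpc_resolver_ip` is a hypothetical helper name, not an AWS API.

```python
import ipaddress


def vpc_resolver_ip(vpc_cidr: str) -> str:
    """Return the VPC DNS resolver address: the network base address plus two."""
    network = ipaddress.ip_network(vpc_cidr, strict=True)
    return str(network.network_address + 2)


print(vpc_resolver_ip("10.0.0.0/16"))    # 10.0.0.2
print(vpc_resolver_ip("172.31.0.0/16"))  # 172.31.0.2
```

The helper only does address arithmetic; it does not check that DNS resolution is actually enabled on the VPC.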

Route 53 Private Hosted Zones

A private hosted zone is associated with one or more VPCs. Only resources in those VPCs can resolve records in the zone:

import time

import boto3

route53 = boto3.client('route53')

# Create a private hosted zone for internal service discovery
zone_response = route53.create_hosted_zone(
    Name='internal.mycompany.com',
    CallerReference=f'internal-zone-{int(time.time())}',
    VPC={
        'VPCRegion': 'us-east-1',
        'VPCId': 'vpc-prod'
    },
    HostedZoneConfig={
        'Comment': 'Internal service discovery',
        'PrivateZone': True
    }
)
zone_id = zone_response['HostedZone']['Id']

# Associate additional VPCs (they can now resolve records in this zone)
route53.associate_vpc_with_hosted_zone(
    HostedZoneId=zone_id,
    VPC={'VPCRegion': 'us-east-1', 'VPCId': 'vpc-staging'}
)

# Create records for internal services
route53.change_resource_record_sets(
    HostedZoneId=zone_id,
    ChangeBatch={
        'Changes': [{
            'Action': 'UPSERT',
            'ResourceRecordSet': {
                'Name': 'payment-api.internal.mycompany.com',
                'Type': 'A',
                'AliasTarget': {
                    'HostedZoneId': 'Z26RNL4JYFTOTI',  # NLB hosted zone ID for us-east-1 (Z35SXDOTRQ7X7K is for ALB/Classic)
                    'DNSName': 'my-nlb-1234.elb.us-east-1.amazonaws.com',
                    'EvaluateTargetHealth': True
                }
            }
        }]
    }
)
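Because only associated VPCs can resolve the zone, a common first check is whether a given VPC appears in the zone's association list. The sketch below works on a dictionary shaped like the output of `route53.get_hosted_zone(Id=zone_id)`; the helper name `zone_visible_from_vpc` and the sample values are assumptions for illustration.

```python
def zone_visible_from_vpc(hosted_zone_response: dict, vpc_id: str) -> bool:
    """Return True if vpc_id is listed among the zone's associated VPCs."""
    associated = hosted_zone_response.get("VPCs", [])
    return any(vpc.get("VPCId") == vpc_id for vpc in associated)


# Sample shaped like a get_hosted_zone response (all values are made up)
sample = {
    "HostedZone": {
        "Id": "/hostedzone/ZEXAMPLE",
        "Name": "internal.mycompany.com.",
        "Config": {"PrivateZone": True},
    },
    "VPCs": [
        {"VPCRegion": "us-east-1", "VPCId": "vpc-prod"},
        {"VPCRegion": "us-east-1", "VPCId": "vpc-staging"},
    ],
}

print(zone_visible_from_vpc(sample, "vpc-staging"))  # True
print(zone_visible_from_vpc(sample, "vpc-dev"))      # False
```

In real use, the dictionary would come from a live `get_hosted_zone` call rather than a hand-written sample.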

Route 53 Resolver: Hybrid DNS

When you have on-premises networks connected via VPN/Direct Connect, DNS resolution must flow both ways:

  • AWS resources need to resolve on-prem DNS names (ldap.corp.internal)
  • On-prem resources need to resolve AWS private DNS names (payment-api.internal.mycompany.com)

# Route 53 Resolver Endpoints solve this bidirectional DNS problem

ec2 = boto3.client('ec2')
resolver = boto3.client('route53resolver')

# Outbound Endpoint: AWS → On-premises DNS resolution
# "I need my VPC resources to resolve corp.internal names"
outbound_endpoint = resolver.create_resolver_endpoint(
    CreatorRequestId='outbound-to-onprem',
    Name='outbound-to-corp-dns',
    SecurityGroupIds=['sg-resolver-outbound'],
    Direction='OUTBOUND',
    IpAddresses=[
        {'SubnetId': 'subnet-private-a', 'Ip': '10.0.16.10'},
        {'SubnetId': 'subnet-private-b', 'Ip': '10.0.80.10'}
    ]
)

# Forwarding rule: Send corp.internal queries to on-prem DNS servers
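forward_rule = resolver.create_resolver_rule(
    CreatorRequestId='forward-corp-internal',
    Name='forward-to-corp-dns',
    RuleType='FORWARD',
    DomainName='corp.internal',
    TargetIps=[
        {'Ip': '172.16.0.53', 'Port': 53},  # On-prem DNS server 1
        {'Ip': '172.16.0.54', 'Port': 53}   # On-prem DNS server 2
    ],
    ResolverEndpointId=outbound_endpoint['ResolverEndpoint']['Id']
)

# A forwarding rule has no effect until it is associated with a VPC
resolver.associate_resolver_rule(
    ResolverRuleId=forward_rule['ResolverRule']['Id'],
    VPCId='vpc-prod'
)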

# Inbound Endpoint: On-premises → AWS DNS resolution
# "I need on-prem servers to resolve my AWS private hosted zones"
inbound_endpoint = resolver.create_resolver_endpoint(
    CreatorRequestId='inbound-from-onprem',
    Name='inbound-from-corp',
    SecurityGroupIds=['sg-resolver-inbound'],
    Direction='INBOUND',
    IpAddresses=[
        {'SubnetId': 'subnet-private-a', 'Ip': '10.0.16.53'},
        {'SubnetId': 'subnet-private-b', 'Ip': '10.0.80.53'}
    ]
)
# On-prem DNS servers forward queries for internal.mycompany.com to these IPs
# (configure conditional forwarding on your on-prem DNS)
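When several forwarding rules could match a query, Route 53 Resolver uses the rule with the most specific domain name. The sketch below models only that longest-suffix selection in plain Python; `pick_forward_rule` is a hypothetical helper, and it ignores rule types and other details of the real service.

```python
from typing import List, Optional


def pick_forward_rule(query_name: str, rule_domains: List[str]) -> Optional[str]:
    """Return the most specific rule domain that is a suffix of query_name."""
    labels = query_name.rstrip(".").lower().split(".")
    best: Optional[str] = None
    for domain in rule_domains:
        candidate = domain.rstrip(".").lower()
        candidate_labels = candidate.split(".")
        # Match on whole labels, so "notcorp.internal" does not match "corp.internal"
        if labels[-len(candidate_labels):] == candidate_labels:
            if best is None or len(candidate_labels) > len(best.split(".")):
                best = candidate
    return best


rules = ["internal", "corp.internal"]
print(pick_forward_rule("ldap.corp.internal", rules))  # corp.internal
print(pick_forward_rule("db.other.internal", rules))   # internal
print(pick_forward_rule("example.com", rules))         # None
```

Queries that match no forwarding rule fall through to the VPC resolver's default behavior.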

VPC Flow Logs: Network Forensics

Flow Logs capture metadata about IP traffic flowing through ENIs. They don’t capture packet contents — just connection metadata (source, dest, port, protocol, action, bytes).

# Enable VPC Flow Logs to CloudWatch Logs
ec2 = boto3.client('ec2')

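ec2.create_flow_logs(
    ResourceIds=['vpc-prod'],
    ResourceType='VPC',
    TrafficType='ALL',  # ACCEPT, REJECT, or ALL
    LogDestinationType='cloud-watch-logs',
    LogGroupName='/vpc/flow-logs/prod',
    # Required for CloudWatch Logs delivery; placeholder ARN, substitute your own role
    DeliverLogsPermissionArn='arn:aws:iam::123456789012:role/flow-logs-role',
    MaxAggregationInterval=60,  # 1 minute (or 600 for 10 min, cheaper)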
    LogFormat='${version} ${account-id} ${interface-id} ${srcaddr} ${dstaddr} '
             '${srcport} ${dstport} ${protocol} ${packets} ${bytes} '
             '${start} ${end} ${action} ${log-status} '
             '${vpc-id} ${subnet-id} ${tcp-flags} ${flow-direction}',
    TagSpecifications=[{
        'ResourceType': 'vpc-flow-log',
        'Tags': [{'Key': 'Name', 'Value': 'prod-flow-logs'}]
    }]
)

# Query flow logs with CloudWatch Logs Insights
# Find all REJECTED traffic to a specific instance
query = """
fields @timestamp, srcAddr, dstAddr, srcPort, dstPort, protocol, action
| filter action = "REJECT"
| filter dstAddr = "10.0.16.45"
| sort @timestamp desc
| limit 100
"""

# Common patterns in flow log analysis:
# 1. SG blocking: REJECT with srcAddr=known-good → Missing security group rule
# 2. NACL blocking: REJECT on ephemeral port → NACL outbound not allowing return traffic
# 3. Route miss: No flow log entry at all → Packet never reached the ENI (routing issue)
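For offline analysis, a raw record in the custom format above can be split into named fields. The following is a minimal sketch: the sample record is invented, and `looks_like_nacl_return_block` is a rough heuristic for pattern 2, not a definitive diagnosis.

```python
# Field order mirrors the custom LogFormat used when the flow log was created
FIELDS = [
    "version", "account_id", "interface_id", "srcaddr", "dstaddr",
    "srcport", "dstport", "protocol", "packets", "bytes",
    "start", "end", "action", "log_status",
    "vpc_id", "subnet_id", "tcp_flags", "flow_direction",
]


def parse_flow_record(line: str) -> dict:
    """Split one space-delimited flow log record into a field dictionary."""
    values = line.split()
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    return dict(zip(FIELDS, values))


def looks_like_nacl_return_block(record: dict) -> bool:
    """Heuristic: an egress REJECT toward an ephemeral port suggests a NACL issue."""
    return (
        record["action"] == "REJECT"
        and record["flow_direction"] == "egress"
        and record["dstport"].isdigit()
        and int(record["dstport"]) >= 1024
    )


# Invented sample record: a response from port 443 rejected on the way out
sample_line = (
    "5 123456789012 eni-0abc1234 10.0.16.45 10.0.32.10 443 49152 6 1 40 "
    "1700000000 1700000060 REJECT OK vpc-prod subnet-private-a 18 egress"
)
record = parse_flow_record(sample_line)
print(record["action"], record["dstport"])   # REJECT 49152
print(looks_like_nacl_return_block(record))  # True
```

Security groups are stateful and would not reject return traffic for an accepted connection, which is why an egress REJECT on an ephemeral port points toward the NACL.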

Systematic Network Debugging

When connectivity fails between two resources, follow this checklist:

def debug_connectivity(source_id: str, dest_ip: str, dest_port: int):
    """
    Systematic checklist for debugging network connectivity in AWS.
    Check each layer in order — the first failure is your root cause.
    """
    ec2 = boto3.client('ec2')

    print("=" * 60)
    print(f"Debugging: {source_id} → {dest_ip}:{dest_port}")
    print("=" * 60)

    # Layer 1: Route Table
    print("\n[1] Route Table Check")
    print("  - Does the source's subnet route table have a route to the destination CIDR?")
    print("  - For cross-VPC: Is there a peering/TGW route?")
    print("  - For internet: Is there an IGW (public) or NAT GW (private) route?")

    # Layer 2: Network ACLs (stateless!)
    print("\n[2] Network ACL Check")
    print("  - Source subnet NACL: Is OUTBOUND to dest_ip:dest_port ALLOWED?")
    print("  - Dest subnet NACL: Is INBOUND from source_ip:dest_port ALLOWED?")
    print("  - CRITICAL: Is OUTBOUND on dest NACL allowing ephemeral ports (1024-65535)?")
    print("  - NACLs are evaluated by rule number (lowest first), first match wins")

    # Layer 3: Security Groups (stateful)
    print("\n[3] Security Group Check")
    print("  - Source SG: Does it allow OUTBOUND to dest_ip:dest_port?")
    print("  - (Default SG allows all outbound — usually not the issue)")
    print("  - Dest SG: Does it allow INBOUND from source_ip/source_sg:dest_port?")
    print("  - SGs are stateful: if inbound allowed, return traffic auto-allowed")

    # Layer 4: DNS Resolution
    print("\n[4] DNS Check (if using hostname instead of IP)")
    print("  - Can the source resolve the hostname?")
    print("  - Is the private hosted zone associated with the source's VPC?")
    print("  - For interface endpoints: Is PrivateDnsEnabled=true?")

    # Layer 5: Service-specific
    print("\n[5] Service-Specific Check")
    print("  - Lambda in VPC: Does it have a NAT Gateway for internet?")
    print("  - RDS: Is it publicly accessible? If not, must be in same VPC/peered")
    print("  - S3: Using gateway endpoint? Check endpoint policy.")
    print("  - Cross-account: Resource policy must allow the source principal")

    # Use Reachability Analyzer for automated checking:
    print("\n[6] Automated: Use VPC Reachability Analyzer")
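    # Destination takes a resource ID or ARN, not an IP, so find the ENI that owns dest_ip
    enis = ec2.describe_network_interfaces(
        Filters=[{'Name': 'addresses.private-ip-address', 'Values': [dest_ip]}]
    )['NetworkInterfaces']
    if not enis:
        print(f"  No ENI found for {dest_ip}; skipping Reachability Analyzer")
        return
    analysis = ec2.create_network_insights_path(
        Source=source_id,
        Destination=enis[0]['NetworkInterfaceId'],
        DestinationPort=dest_port,
        Protocol='tcp'
    )
    print(f"  Created path: {analysis['NetworkInsightsPath']['NetworkInsightsPathId']}")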
    print("  Run: ec2.start_network_insights_analysis(NetworkInsightsPathId=...)")
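The NACL step in the checklist depends on "lowest rule number first, first match wins". The sketch below simulates that ordering in plain Python. It is deliberately simplified: `NaclRule` and `evaluate_nacl` are hypothetical names, and the model matches on port only, ignoring CIDR and protocol.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NaclRule:
    rule_number: int
    action: str      # "allow" or "deny"
    port_from: int
    port_to: int


def evaluate_nacl(rules: List[NaclRule], port: int) -> str:
    """Return the action of the lowest-numbered rule matching the port."""
    for rule in sorted(rules, key=lambda r: r.rule_number):
        if rule.port_from <= port <= rule.port_to:
            return rule.action
    return "deny"  # implicit catch-all rule at the end of every NACL


# Outbound rules on the destination subnet that forget ephemeral ports
outbound = [NaclRule(100, "allow", 443, 443)]
print(evaluate_nacl(outbound, 443))    # allow
print(evaluate_nacl(outbound, 50000))  # deny

# Adding an ephemeral-port rule lets return traffic through
outbound.append(NaclRule(110, "allow", 1024, 65535))
print(evaluate_nacl(outbound, 50000))  # allow
```

The second call shows the classic failure: the request arrives, but the response to the client's ephemeral port is dropped on the way out.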
Reachability Analyzer can also be driven from Java (AWS SDK for Java 2.x):

// Using VPC Reachability Analyzer programmatically
import software.amazon.awssdk.services.ec2.Ec2Client;
import software.amazon.awssdk.services.ec2.model.*;

public class NetworkDebugger {

    private final Ec2Client ec2 = Ec2Client.create();

    public void analyzeReachability(String sourceId, String destId, int port) {
        // Create analysis path
        CreateNetworkInsightsPathResponse pathResponse = ec2.createNetworkInsightsPath(
            CreateNetworkInsightsPathRequest.builder()
                .source(sourceId)       // ENI, instance, or gateway ID
                .destination(destId)
                .destinationPort(port)
                .protocol(Protocol.TCP)
                .build());

        String pathId = pathResponse.networkInsightsPath().networkInsightsPathId();

        // Run the analysis
        StartNetworkInsightsAnalysisResponse analysisResponse =
            ec2.startNetworkInsightsAnalysis(
                StartNetworkInsightsAnalysisRequest.builder()
                    .networkInsightsPathId(pathId)
                    .build());

        String analysisId = analysisResponse.networkInsightsAnalysis()
            .networkInsightsAnalysisId();

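        // Poll until the analysis leaves RUNNING (in production, use async calls and a timeout)
        NetworkInsightsAnalysis analysis;
        do {
            try {
                Thread.sleep(2000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for analysis", e);
            }
            analysis = ec2.describeNetworkInsightsAnalyses(
                DescribeNetworkInsightsAnalysesRequest.builder()
                    .networkInsightsAnalysisIds(analysisId)
                    .build()).networkInsightsAnalyses().get(0);
        } while (analysis.status() == AnalysisStatus.RUNNING);

        System.out.println("Reachable: " + analysis.networkPathFound());

        // networkPathFound() is a nullable Boolean, so avoid unboxing it directly
        if (Boolean.FALSE.equals(analysis.networkPathFound())) {
            System.out.println("Explanations:");
            for (var explanation : analysis.explanations()) {
                // component() is a structure (AnalysisComponent), so print its ID
                String componentId = explanation.component() != null
                    ? explanation.component().id() : "unknown";
                System.out.printf("  Component: %s → %s%n",
                    componentId, explanation.explanationCode());
            }
        }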
    }
}

Pro tip: When all else fails, check if the source has Source/Destination Check enabled (default). This setting drops traffic where the source or destination IP doesn’t match the ENI’s IP — which breaks NAT instances, network appliances, and container networking that use IP forwarding. Disable it for any instance acting as a router.
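Disabling the check from code might look like the sketch below. It takes an EC2 client as a parameter so it can be exercised without AWS access; `disable_source_dest_check` is a hypothetical wrapper name, and the instance ID in the usage comment is a placeholder.

```python
def disable_source_dest_check(ec2_client, instance_id: str) -> None:
    """Turn off Source/Destination Check for an instance that forwards traffic."""
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        SourceDestCheck={"Value": False},
    )


# Usage (assumes AWS credentials and a real instance ID):
#   import boto3
#   disable_source_dest_check(boto3.client("ec2"), "i-0123456789abcdef0")
```

For instances with several network interfaces, the setting may need to be changed per ENI rather than per instance.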