adaptive distributed systems intent-based dynamic consistency in java 21

Validation via Chaos Engineering

Chapter 23 of 25
Summary

Toxiproxy and Testcontainers enable robust chaos engineering for distributed systems, ensuring resilience through fault injection and behavioral observation.

Introduction to Chaos Engineering

Chaos engineering is a disciplined approach to finding weaknesses before they become outages: failures are injected into a system deliberately, and its behavior is observed. The approach matters most in distributed systems, where interactions between components can fail in ways that are hard to foresee. This section shows how Testcontainers and Toxiproxy can be used to validate a system's consistency and availability under such failures.

Testcontainers and Toxiproxy Overview

Testcontainers is a Java library that supports JUnit tests by providing lightweight, throwaway instances of common databases, Selenium web browsers, or anything else that can run in a Docker container. Toxiproxy is a framework for simulating network conditions, built for testing, CI, and development environments. It works as a TCP proxy that tampers with the connections routed through it, for example by adding latency or cutting them off. Combining the two lets developers write repeatable tests that simulate real-world network failures and check how the system behaves under them.

Applying Chaos Engineering with Testcontainers and Toxiproxy

Applying chaos engineering with Testcontainers and Toxiproxy takes a few steps. First, start a container for the service under test, such as a PostgreSQL database. Next, start a Toxiproxy container on the same Docker network and point the application at the proxy's address instead of the database's, so that all traffic between the service and its clients passes through the proxy. Then use Toxiproxy to inject network failures, such as added latency or a cut connection. With a fault active, developers can observe how the system behaves and validate its consistency and availability.
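
The steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive setup: it assumes the Testcontainers `postgresql` and `toxiproxy` modules, the `toxiproxy-java` client, and the PostgreSQL JDBC driver are on the classpath, and that Docker is available. The image tags, the proxy name `postgres`, the toxic names, and the 500 ms latency are illustrative choices.

```java
import eu.rekawek.toxiproxy.Proxy;
import eu.rekawek.toxiproxy.ToxiproxyClient;
import eu.rekawek.toxiproxy.model.ToxicDirection;
import org.testcontainers.containers.Network;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.containers.ToxiproxyContainer;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

class ToxiproxySetupSketch {

    public static void main(String[] args) throws Exception {
        try (Network network = Network.newNetwork();
             PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:16-alpine")
                     .withNetwork(network)
                     .withNetworkAliases("postgres");
             ToxiproxyContainer toxiproxy = new ToxiproxyContainer("ghcr.io/shopify/toxiproxy:2.5.0")
                     .withNetwork(network)) {

            // Step 1: start the service under test and the proxy on a shared network.
            postgres.start();
            toxiproxy.start();

            // Step 2: create a proxy that listens on 8666 inside the Toxiproxy
            // container and forwards to Postgres via its network alias.
            ToxiproxyClient client =
                    new ToxiproxyClient(toxiproxy.getHost(), toxiproxy.getControlPort());
            Proxy proxy = client.createProxy("postgres", "0.0.0.0:8666", "postgres:5432");

            // The application must connect through the proxy, not to Postgres directly.
            String jdbcUrl = "jdbc:postgresql://" + toxiproxy.getHost() + ":"
                    + toxiproxy.getMappedPort(8666) + "/" + postgres.getDatabaseName();

            try (Connection conn = DriverManager.getConnection(
                         jdbcUrl, postgres.getUsername(), postgres.getPassword());
                 Statement stmt = conn.createStatement()) {

                // Step 3a: add latency to responses, then time a trivial query.
                proxy.toxics().latency("latency", ToxicDirection.DOWNSTREAM, 500);
                long start = System.nanoTime();
                stmt.execute("SELECT 1");
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.out.println("SELECT 1 with latency toxic took " + elapsedMs + " ms");
                proxy.toxics().get("latency").remove();
            }

            // Step 3b: simulate a partition by reducing bandwidth to zero both ways.
            // While these toxics are active, traffic through the proxy stalls, so the
            // client under test needs its own timeout to notice the failure.
            proxy.toxics().bandwidth("cut-downstream", ToxicDirection.DOWNSTREAM, 0);
            proxy.toxics().bandwidth("cut-upstream", ToxicDirection.UPSTREAM, 0);

            // ... exercise the system under test here ...

            // Heal the partition by removing the toxics.
            proxy.toxics().get("cut-downstream").remove();
            proxy.toxics().get("cut-upstream").remove();
        }
    }
}
```

In a real test suite the containers would usually be managed by the JUnit integration rather than a `main` method; the explicit lifecycle here only keeps the sketch short.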

Example Use Case: Validating Consistency Fallback

One example use case for chaos engineering with Testcontainers and Toxiproxy is validating the consistency fallback mechanism in a distributed system. In this scenario, the system is designed to switch from a synchronous consistency model to an asynchronous one when a network partition occurs. To validate this behavior, developers can use Toxiproxy to simulate a network partition and then verify that the system correctly switches to the asynchronous consistency model.

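import static org.assertj.core.api.Assertions.assertThat;
import static org.awaitility.Awaitility.await;

import java.time.Duration;
import org.junit.jupiter.api.Test;

// This method belongs in a JUnit 5 test class. Assumed to be defined there:
// toxiproxy (a ToxiproxyContainer), postgresContainer, intentService,
// reconciliationService, and highCriticalIntent. The service must reach
// Postgres through the proxy's host and port, or the fault has no effect.
@Test
void highCriticalityIntent_shouldFallbackToAsync_duringNetworkPartition() {
    // Set up the proxy in front of Postgres. getProxy(container, port) is the
    // older ContainerProxy API, deprecated in recent Testcontainers releases.
    var proxy = toxiproxy.getProxy(postgresContainer, 5432);

    // 1. Assert normal operation (good health): the write is synchronous.
    var result = intentService.process(highCriticalIntent);
    assertThat(result.strategy()).isEqualTo(ExecutionStrategy.SYNC);

    // 2. Introduce a network partition by cutting the proxied connection.
    proxy.setConnectionCut(true);

    // 3. Assert degradation: the service falls back to ASYNC and answers 202 Accepted.
    var degradedResult = intentService.process(highCriticalIntent);
    assertThat(degradedResult.strategy()).isEqualTo(ExecutionStrategy.ASYNC);
    assertThat(degradedResult.status()).isEqualTo(202);

    // 4. Restore the connection and wait for reconciliation to catch up.
    proxy.setConnectionCut(false);
    await().atMost(Duration.ofSeconds(5)).until(() ->
        reconciliationService.isConsistent(highCriticalIntent)
    );
}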
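
The test above treats `intentService` as a black box. To make the behavior it expects concrete, here is a minimal in-memory sketch of a synchronous-to-asynchronous fallback. Everything in it is hypothetical and only illustrates the pattern: `FlakyStore` stands in for the database behind the proxy, its `reachable` flag stands in for the connection Toxiproxy would cut, and the 200 status for the synchronous path is an assumption (the source only specifies 202 for the asynchronous path).

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

enum ExecutionStrategy { SYNC, ASYNC, OPTIMISTIC }

record IntentResult(ExecutionStrategy strategy, int status) {}

/** Hypothetical store; 'reachable' simulates the network link to it. */
class FlakyStore {
    private final Map<String, String> rows = new HashMap<>();
    volatile boolean reachable = true;

    void write(String key, String value) {
        if (!reachable) {
            throw new IllegalStateException("store unreachable");
        }
        rows.put(key, value);
    }

    /** Inspects state directly, bypassing the simulated link (for assertions only). */
    boolean contains(String key) {
        return rows.containsKey(key);
    }
}

/** Hypothetical service: writes synchronously, queues the write if the store is down. */
class FallbackIntentService {
    private final FlakyStore store;
    private final Queue<String[]> pending = new ArrayDeque<>();

    FallbackIntentService(FlakyStore store) {
        this.store = store;
    }

    IntentResult process(String key, String value) {
        try {
            store.write(key, value);
            return new IntentResult(ExecutionStrategy.SYNC, 200); // 200 is an assumption
        } catch (IllegalStateException e) {
            pending.add(new String[] {key, value});
            return new IntentResult(ExecutionStrategy.ASYNC, 202);
        }
    }

    /** Replays queued writes while the store is reachable; returns how many were applied. */
    int reconcile() {
        int applied = 0;
        while (!pending.isEmpty() && store.reachable) {
            String[] write = pending.peek();
            store.write(write[0], write[1]);
            pending.remove();
            applied++;
        }
        return applied;
    }
}
```

Setting `reachable` to `false` plays the role of `proxy.setConnectionCut(true)`, and calling `reconcile()` after restoring it plays the role of the reconciliation the test waits for. A real service would also need timeouts and a durable queue, which this sketch leaves out.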

Decision Matrix for Expected System Behavior

The following decision matrix outlines the expected system behavior during chaos testing:

| System Health | Intent Criticality | Expected Strategy | Verification Metric |
| --- | --- | --- | --- |
| Good | High | SYNC | p99 latency < 50 ms |
| Partitioned | High | ASYNC | HTTP 202 status |
| Good | Low | OPTIMISTIC | Version increment |
| Partitioned | Low | ASYNC | Kafka consumer lag |
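
One way to keep chaos tests aligned with this matrix is to encode it as a small lookup that test assertions can consult. The sketch below is an illustration only: the type and method names (`SystemHealth`, `Criticality`, `Expectation`, `DecisionMatrix.expected`) are hypothetical, and the metric is carried as a plain description rather than a measured value.

```java
enum SystemHealth { GOOD, PARTITIONED }

enum Criticality { HIGH, LOW }

enum ExecutionStrategy { SYNC, ASYNC, OPTIMISTIC }

/** Expected strategy plus a description of the metric used to verify it. */
record Expectation(ExecutionStrategy strategy, String verificationMetric) {}

final class DecisionMatrix {
    private DecisionMatrix() {}

    /** Returns the matrix row for the given health and criticality. */
    static Expectation expected(SystemHealth health, Criticality criticality) {
        return switch (health) {
            case GOOD -> switch (criticality) {
                case HIGH -> new Expectation(ExecutionStrategy.SYNC, "p99 latency < 50 ms");
                case LOW -> new Expectation(ExecutionStrategy.OPTIMISTIC, "version increment");
            };
            case PARTITIONED -> switch (criticality) {
                case HIGH -> new Expectation(ExecutionStrategy.ASYNC, "HTTP 202 status");
                case LOW -> new Expectation(ExecutionStrategy.ASYNC, "Kafka consumer lag");
            };
        };
    }
}
```

Because the switches are exhaustive over the enums, adding a new health state or criticality level produces a compile error until the matrix is extended, which keeps the tests and the documented expectations from drifting apart.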

Conclusion

Chaos engineering with Testcontainers and Toxiproxy offers a practical way to validate the consistency and availability of a distributed system. By simulating realistic network failures inside an ordinary test suite and observing how the system responds, developers can confirm that fallback and reconciliation paths work before a real outage exercises them. The value of this practice grows as distributed systems become more complex.
