Today’s advanced distributed software systems must be tested for potential weaknesses and faults. Chaos engineering is the process of testing a distributed computing system to ensure that it can tolerate unexpected disruptions. It relies on concepts underlying chaos theory, which focuses on random and unpredictable behavior. If you are interested in knowing more about chaos engineering and its history, please refer to this article from Gremlin.
In this article, we will discuss various categories of attacks and some use cases.
Resource attacks generate load across CPU, memory, and storage devices.
They help in preparing for sudden load changes, validating auto scaling, and testing monitoring and alerting configuration. It's like preparing our system for a Black Friday sale in advance.
A CPU attack puts heavy load on the system's CPU, which helps identify stability and performance problems under stress. We can also validate auto scaling and alerting mechanisms.
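As a rough illustration of what a CPU attack does, here is a minimal sketch that keeps a single core busy for a fixed duration. The function name `burn_cpu` is hypothetical and is not part of any chaos engineering tool. A real attack would typically run one such worker per core and target a specific utilization percentage.

```python
import time


def burn_cpu(duration_s: float) -> int:
    """Busy-loop on the current core for roughly duration_s seconds.

    Hypothetical helper for illustration only. Returns the number of loop
    iterations completed, which gives a rough throughput signal.
    """
    deadline = time.perf_counter() + duration_s
    iterations = 0
    while time.perf_counter() < deadline:
        iterations += 1  # pure CPU work: no sleeping and no I/O
    return iterations
```

Calling `burn_cpu(30.0)` while watching the dashboards is a simple way to check whether CPU alerts fire and whether auto scaling reacts within the expected window.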
Memory leaks are one of the top reasons for "Out Of Memory" errors in production. A memory leak happens when an application keeps allocating more memory than it releases. A memory attack helps to validate hypotheses about memory-intensive workloads such as in-memory caches and machine learning models. It can also help in cloud migration by exercising the auto-scaling configuration.
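To make memory pressure concrete, the sketch below allocates memory in fixed-size chunks until a target is reached. The name `consume_memory` and its parameters are assumptions made for this example, and a real attack tool would also cap the allocation to protect the host.

```python
def consume_memory(target_mb: int, chunk_mb: int = 10) -> list:
    """Allocate roughly target_mb MiB in chunk_mb steps and return the chunks.

    Hypothetical helper for illustration only. The caller keeps the returned
    list alive for as long as the pressure should last; dropping the
    reference lets the memory be reclaimed.
    """
    mib = 1024 * 1024
    chunks = []
    allocated = 0
    while allocated < target_mb:
        size = min(chunk_mb, target_mb - allocated)
        chunks.append(bytearray(size * mib))  # zero-filled buffer of `size` MiB
        allocated += size
    return chunks
```

Holding the result of `consume_memory(512)` for a few minutes and then deleting it mimics a workload that grows and later releases memory, which is a reasonable first hypothesis to test against memory alerts.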
Disk attacks are often used to simulate reading or writing a large data set, such as a restored backup or a replicated database. They can also help in identifying gaps in the automatic disk cleanup process.
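A disk attack can be approximated by writing a large file into the volume under test. The sketch below assumes a hypothetical helper called `fill_disk`; the file prefix and block size are arbitrary choices for this example.

```python
import os
import tempfile


def fill_disk(directory: str, size_mb: int, block_mb: int = 1) -> str:
    """Write size_mb MiB of zero bytes to a new file in directory.

    Hypothetical helper for illustration only. Returns the file path so the
    caller can remove it once the experiment is over.
    """
    mib = 1024 * 1024
    block = b"\0" * (block_mb * mib)
    fd, path = tempfile.mkstemp(prefix="disk-attack-", dir=directory)
    with os.fdopen(fd, "wb") as f:
        remaining = size_mb
        while remaining > 0:
            step = min(block_mb, remaining)
            f.write(block[: step * mib])
            remaining -= step
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device
    return path
```

If an automatic cleanup job is expected to reclaim space, leaving the file in place and checking whether the job removes it is a direct test of that process. Otherwise, delete the file with `os.remove(path)` when done.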
An IO attack can help you prepare for slower storage solutions by simulating their performance. This attack helps to validate disk-heavy workloads (batch processes that read from and write to disk) and the effectiveness of an in-memory cache.
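One way to reason about cache effectiveness under slow storage is to put an artificial delay in front of reads and count how many of them actually reach storage. The sketch below does this entirely in-process. `make_slow_reader` and `cached` are hypothetical names, and the sleep is only a stand-in for a slow disk.

```python
import time
from functools import lru_cache


def make_slow_reader(delay_s: float):
    """Build a reader that sleeps delay_s per call to mimic slow storage.

    Hypothetical helper for illustration only. Returns the reader and a
    stats dict that counts how many reads reached "storage".
    """
    stats = {"storage_reads": 0}

    def read_block(key: str) -> bytes:
        stats["storage_reads"] += 1
        time.sleep(delay_s)  # injected delay, standing in for a slow disk
        return key.encode()

    return read_block, stats


def cached(reader, maxsize: int = 128):
    """Wrap a reader in an in-memory LRU cache."""
    return lru_cache(maxsize=maxsize)(reader)
```

Comparing `stats["storage_reads"]` with the total number of requests gives the cache hit ratio, which is the number to watch when the storage layer slows down.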
State attacks change the state of your environment by terminating processes, shutting down or restarting hosts, and changing the system clock. This lets you prepare your systems for unexpected changes in your environment such as power outages, node failures, clock drift, or application crashes.
Process Killer Attack
Process killer attacks allow teams to terminate a specific process or set of processes. This verifies watchdog effectiveness for application or service restarts and tests leader re-election in clustered workloads.
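The watchdog idea can be sketched with a child process and a supervisor function that replaces it when it dies. Everything here is an assumption for illustration: the worker is just a sleeping Python process, and `start_worker` and `watchdog_restart` are hypothetical names.

```python
import subprocess
import sys


def start_worker() -> subprocess.Popen:
    """Start a stand-in worker process (here: a Python process that sleeps)."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def watchdog_restart(worker: subprocess.Popen) -> subprocess.Popen:
    """Return a running worker: the same one if alive, else a replacement.

    Hypothetical watchdog step for illustration only. poll() returns None
    while the process is still running.
    """
    if worker.poll() is None:
        return worker
    return start_worker()
```

In an experiment, we would call `worker.kill()` followed by `worker.wait()` to play the attacker, then run `watchdog_restart(worker)` and confirm that a new process is serving again within the expected time.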
A shutdown attack is similar to Chaos Monkey: an entire host is shut down, which pushes teams to build highly resilient systems. It helps to validate disaster recovery (DR) scenarios such as automatic workload migration, replication, and high availability of clustered workloads.
Time Travel Attack
Time travel attacks allow you to change the system clock. This lets you prepare for scenarios such as Daylight Saving Time (DST), clock drift between hosts, and expiring SSL/TLS certificates.
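Changing the real system clock needs elevated privileges, so a lighter way to explore the same scenarios in code is to inject the clock as a function. The sketch below checks certificate expiry against a shifted clock. `shifted_clock` and `days_until_expiry` are hypothetical names, and the expiry date would normally come from the certificate's `notAfter` field.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable

Clock = Callable[[], datetime]


def shifted_clock(offset: timedelta) -> Clock:
    """Return a clock that reports the current UTC time plus offset."""
    def now() -> datetime:
        return datetime.now(timezone.utc) + offset
    return now


def days_until_expiry(not_after: datetime, now: Clock) -> int:
    """Whole days left before not_after; negative once it has passed.

    Hypothetical helper for illustration only.
    """
    return (not_after - now()).days
```

Running the same check with `shifted_clock(timedelta(days=31))` shows how expiry alerts would behave a month from now, without waiting for the date to arrive.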
Network attacks let you simulate unhealthy network conditions including dropped connections, high latency, packet loss, and DNS outages. This lets you build applications that are resilient to unreliable network conditions.
Blackhole attacks help you simulate outages by dropping network traffic between services. This lets you uncover hard dependencies, test fallback and failover mechanisms, and prepare your applications for unreliable networks. We can also validate the monitoring and alerting mechanisms for the cluster.
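From the client's point of view, a blackholed dependency usually shows up as a connection attempt that never completes, so an explicit timeout and a fallback endpoint are the pieces to test. The sketch below is a minimal version of that logic using plain TCP connects. `is_reachable` and `pick_endpoint` are hypothetical names, and real services would usually apply this at the HTTP or RPC client level.

```python
import socket
from typing import Optional, Tuple

Endpoint = Tuple[str, int]


def is_reachable(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers timeouts, refused connections, unreachable hosts
        return False


def pick_endpoint(primary: Endpoint, fallback: Endpoint,
                  timeout_s: float = 2.0) -> Optional[Endpoint]:
    """Return the first endpoint that accepts a connection, or None.

    Hypothetical failover step for illustration only.
    """
    for host, port in (primary, fallback):
        if is_reachable(host, port, timeout_s):
            return (host, port)
    return None
```

Without the timeout, a blackholed primary could leave the caller waiting for the operating system's default connect timeout, which is exactly the kind of hidden hard dependency this attack is meant to expose.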
Latency is the amount of time taken for a network request to travel from one network endpoint to another. The latency attack injects a delay into outbound network traffic, letting you validate your system’s responsiveness under slow network conditions. It also helps in tuning the retry and timeout thresholds of a circuit breaker.
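To see how injected latency interacts with a circuit breaker threshold, here is a deliberately simplified sketch. It is not the API of any particular resilience library: `with_latency` and `CircuitBreaker` are hypothetical, the breaker only counts slow calls rather than aborting them, and it has no half-open or reset state.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_latency(func: Callable[[], T], delay_s: float) -> Callable[[], T]:
    """Wrap func so every call is delayed by delay_s (the injected latency)."""
    def delayed() -> T:
        time.sleep(delay_s)
        return func()
    return delayed


class CircuitBreaker:
    """Simplified breaker: opens after max_failures consecutive slow calls."""

    def __init__(self, timeout_s: float, max_failures: int) -> None:
        self.timeout_s = timeout_s
        self.max_failures = max_failures
        self.failures = 0

    def call(self, func: Callable[[], T]) -> T:
        if self.failures >= self.max_failures:
            raise RuntimeError("circuit open")  # fail fast, skip the dependency
        start = time.perf_counter()
        result = func()
        if time.perf_counter() - start > self.timeout_s:
            self.failures += 1  # slow call counts as a failure
        else:
            self.failures = 0   # a fast call resets the streak
        return result
```

Increasing `delay_s` step by step until the breaker opens gives a feel for how much headroom the chosen timeout leaves over normal response times.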
Recently we saw an Akamai DNS failure that caused many popular websites to become unreachable. The DNS attack simulates a DNS outage by blocking network access to DNS servers. This lets you prepare for DNS outages, test your fallback DNS servers, and validate DNS resolver configurations.
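The fallback behaviour we want to validate can be sketched as a chain of resolvers that are tried in order. In the example below, `resolve_with_fallback` is a hypothetical helper, and each resolver is simply a function from hostname to IP address. `system_resolver` uses the standard `socket.gethostbyname` call, while a secondary resolver would be supplied by the application.

```python
import socket
from typing import Callable, Optional, Sequence

Resolver = Callable[[str], str]


def system_resolver(hostname: str) -> str:
    """Resolve through the operating system's configured DNS."""
    return socket.gethostbyname(hostname)


def resolve_with_fallback(hostname: str, resolvers: Sequence[Resolver]) -> str:
    """Try each resolver in order and return the first answer.

    Hypothetical helper for illustration only. Lookup failures such as
    socket.gaierror are subclasses of OSError.
    """
    last_error: Optional[OSError] = None
    for resolve in resolvers:
        try:
            return resolve(hostname)
        except OSError as exc:
            last_error = exc  # remember the failure and try the next resolver
    raise RuntimeError(f"all resolvers failed for {hostname}") from last_error
```

During a DNS attack, the first resolver should start failing, and the experiment passes if the application keeps resolving names through the next one in the list.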
Packet Loss Attack
This attack is very helpful for streaming services, such as live video or multiplayer gaming, which rely on a high throughput of data. When there is network congestion, many packets are queued, and some packets may be lost once the queue capacity threshold of your hardware is reached. Packet loss attacks let you replicate this condition, simulate the end-user experience, and tune the configuration of the replay mechanism for a better user experience.
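The effect of packet loss on a replay mechanism can be explored with a small simulation before running the attack on a real network. The sketch below drops each send attempt with a given probability and retries up to a limit. `send_with_retransmit` is a hypothetical function, and the loss model (independent random drops) is a simplifying assumption.

```python
import random
from typing import List, Sequence, Tuple


def send_with_retransmit(packets: Sequence[int], loss_rate: float,
                         max_attempts: int,
                         rng: random.Random) -> Tuple[List[int], int]:
    """Simulate sending packets over a lossy link with retransmission.

    Hypothetical simulation for illustration only. Each attempt is dropped
    with probability loss_rate. Returns the delivered packets and the total
    number of send attempts.
    """
    delivered: List[int] = []
    total_attempts = 0
    for packet in packets:
        for _ in range(max_attempts):
            total_attempts += 1
            if rng.random() >= loss_rate:  # attempt survived the lossy link
                delivered.append(packet)
                break
    return delivered, total_attempts
```

Comparing the number of delivered packets and total attempts across loss rates such as 0.01, 0.05, and 0.2 shows how quickly the retry budget is used up, which helps in choosing `max_attempts` for the real system.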
The article provides a comprehensive overview of different types of chaos engineering attacks. It explains how chaos engineering can help identify and mitigate failures in complex systems. The article dives into various types of attacks, such as CPU, memory, and network attacks, and how they can impact the system's behavior. It also discusses how to conduct chaos engineering experiments and explains the importance of using it to ensure system reliability and resilience. Overall, the article provides valuable insights into the world of chaos engineering and highlights the importance of implementing it in modern software systems.
In the next article, we will discuss other Chaos Engineering concepts.