CoFI: Consistency-Guided Fault Injection for Cloud Systems
CoFI (Consistency-guided Fault Injection, pronounced as “coffee”) is a tool for injecting network partitions when testing cloud systems.
We observe that network partitions can leave cloud systems in inconsistent states, where partition bugs (bugs triggered by network partitions) are more likely to occur. Based on this observation, CoFI first infers invariants (i.e., consistent states) among different nodes in a cloud system. When it detects a violation of an inferred invariant (i.e., an inconsistent state) while the cloud system is running, CoFI injects network partitions to prevent the system from recovering to a consistent state, and then thoroughly tests whether the system still behaves correctly in the inconsistent state.
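The idea can be illustrated with a toy model. The sketch below is not CoFI's implementation; the `Cluster` class, the `(node_a, node_b, variable)` invariant shape, and the function names are all hypothetical, chosen only to show how a detected violation can be "frozen" by cutting the link between two nodes.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Toy cluster: each node holds a local copy of named variables."""
    state: dict = field(default_factory=dict)        # node -> {variable: value}
    partitioned: set = field(default_factory=set)    # set of frozenset({a, b})

    def send_update(self, src, dst, var):
        """Replicate src's value of var to dst unless the link is cut."""
        if frozenset((src, dst)) in self.partitioned:
            return False  # message dropped by the injected partition
        self.state[dst][var] = self.state[src][var]
        return True


def violated(cluster, invariant):
    """Hypothetical invariant (node_a, node_b, var): both copies must be equal."""
    a, b, var = invariant
    return cluster.state[a].get(var) != cluster.state[b].get(var)


def maybe_inject(cluster, invariant):
    """Cut the link between the two nodes once the invariant is violated."""
    if not violated(cluster, invariant):
        return False
    a, b, _ = invariant
    cluster.partitioned.add(frozenset((a, b)))
    return True


if __name__ == "__main__":
    cluster = Cluster(state={"n1": {"tokens": 1}, "n2": {"tokens": 1}})
    inv = ("n1", "n2", "tokens")
    cluster.state["n1"]["tokens"] = 2            # n1 changes first: states diverge
    maybe_inject(cluster, inv)                   # violation seen, partition injected
    cluster.send_update("n1", "n2", "tokens")    # dropped, so the divergence persists
    print("still inconsistent:", violated(cluster, inv))
```

In this model the partition is injected only while the nodes disagree, which is the window in which the system is exercised in an inconsistent state.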
Bugs Detected by CoFI
We applied CoFI to five versions of three widely used cloud systems: Cassandra-3.11.5, HDFS-3.3.0, HDFS-2.10.0, YARN-3.3.0, and YARN-2.10.0. The following table shows the bugs detected by CoFI in these systems.
Bug ID | Failure Symptom |
---|---|
CASSANDRA-15758 | Thread crashes |
CASSANDRA-15548 | A created keyspace can’t be found |
CASSANDRA-15546 | Data read failure |
CASSANDRA-15437 | Decommission failure |
CASSANDRA-11804 | Data access failure |
HDFS-15367 | File metadata inaccessible |
HDFS-15235 | NameNode crashes |
YARN-10301 | Fail to stop a YARN service |
YARN-10294 | Misleading error message |
YARN-10288 | Invalid application state transition |
YARN-10232 | Invalid application state transition |
YARN-10231 | Misleading error message |
Getting Started
The easiest way to build and run CoFI is with Docker. This tutorial walks you through building the Docker images from scratch and applying CoFI to test Cassandra.
Building CoFI’s Docker Image
Run the following command at the repository’s root directory to build CoFI’s docker image.
$ make build-image
This will create a Docker image called hanseychen/cofi:0.1.
Building Cassandra’s Docker Image
To build the Docker image for Cassandra, run the following command in the repository’s root.
$ cd examples/cassandra-3.11.5 && make build-image
This will create a Docker image called cofi:cassandra-3. The Cassandra image is built on top of the CoFI image, so we can apply CoFI to Cassandra from within the Cassandra image. While building Cassandra, CoFI instruments Cassandra’s code so that the runtime values of potentially interesting variables can be accessed. Next, we’ll start a Docker container from this image and run CoFI to test Cassandra.
Starting the Docker Container
Inside Cassandra’s directory, run the following command to start a Docker container called ca using the Cassandra image.
$ make run-container image_name=cofi:cassandra-3 container_name=ca
Mining Invariants
CoFI runs in two stages: an invariant mining stage and a fault injection stage.
To mine invariants of Cassandra, run the following command under the /app
directory inside the container.
$ /cofi/bin/mine-invariants.sh interesting-variables.txt test-case/data-test.sh test-case/cleanup.sh
This command will run the data-test.sh test case and log the runtime values of the interesting variables listed in interesting-variables.txt. Afterwards, CoFI’s invariant mining engine will process the logged values and generate interesting invariants to guide the later fault injection testing. The generated invariants are stored in /cofi/selected-invariants.txt.
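To give a feel for what "mining invariants from logged values" can mean, here is a minimal sketch that looks for cross-node equality relations in a series of logged snapshots. It is an illustration under stated assumptions, not CoFI's mining algorithm: the snapshot structure (a dict keyed by `(node, variable)`), the `min_support` threshold, and the function name are hypothetical, and the sketch does not read or write CoFI's actual file formats.

```python
from itertools import combinations


def mine_equality_invariants(snapshots, min_support=1.0):
    """Return pairs of (node, variable) keys whose logged values agree often enough.

    snapshots:   list of dicts mapping (node, variable) -> logged value
                 (an assumed in-memory form, not CoFI's log format).
    min_support: fraction of co-observed snapshots in which both values must match.
    """
    keys = sorted({key for snap in snapshots for key in snap})
    invariants = []
    for a, b in combinations(keys, 2):
        if a[0] == b[0]:
            continue  # only relate variables that live on different nodes
        both = [snap for snap in snapshots if a in snap and b in snap]
        if not both:
            continue  # never observed together, so nothing can be inferred
        equal = sum(1 for snap in both if snap[a] == snap[b])
        if equal / len(both) >= min_support:
            invariants.append((a, b))
    return invariants


if __name__ == "__main__":
    logged = [
        {("n1", "ring"): 1, ("n2", "ring"): 1, ("n2", "load"): 5},
        {("n1", "ring"): 2, ("n2", "ring"): 2, ("n2", "load"): 7},
    ]
    for left, right in mine_equality_invariants(logged):
        print(left, "==", right)
```

Lowering `min_support` below 1.0 keeps relations that hold most of the time but are briefly violated during the run.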
Testing Cassandra
To run CoFI’s fault injection engine, run the following command under the /app
directory inside the container.
$ /cofi/bin/test-invariants.py test-case/data-test.sh test-case/cleanup.sh
This command will iterate over the interesting invariants in /cofi/selected-invariants.txt and use them to guide fault injection tests on Cassandra.
Understanding the Failure Report
If a test fails during the fault injection tests, a failure report is generated under the /app directory. The file name follows the format failure-plan-<timestamp>.txt. Inside the failure plan, you can find the invariant used to trigger the failure, the messages that were failed, and the output generated by the test case.
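After a long test run it can be convenient to collect the reports programmatically. The helper below is a small sketch that relies only on the documented location (/app) and file name pattern (failure-plan-<timestamp>.txt); the function name is hypothetical, and ordering by file modification time is an assumption made here because the timestamp format is not specified.

```python
from pathlib import Path


def list_failure_plans(directory="/app"):
    """Return failure-plan-*.txt files in directory, oldest first.

    Files are ordered by modification time, with the file name as a tie-breaker.
    """
    plans = Path(directory).glob("failure-plan-*.txt")
    return sorted(plans, key=lambda p: (p.stat().st_mtime, p.name))


if __name__ == "__main__":
    for plan in list_failure_plans():
        print(plan)
```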
Publication
If you are interested in how network partitions affect cloud systems and how CoFI effectively and efficiently detects partition bugs, you can find more details in our paper listed below. If you use our tool, please cite our paper.
CoFI: Consistency-Guided Fault Injection for Cloud Systems [preprint]
Haicheng Chen,
Wensheng Dou,
Dong Wang,
Feng Qin
35th IEEE/ACM International Conference on Automated Software Engineering (ASE 2020)