Pragmatic testing (P2: Good tests, Data-oriented testing, Fake tests)
In the last post, we discussed the reasons for writing tests, testing principles, and how to write code that is easy to test. In this post, we'll discuss how to write good tests.
First, let me repeat the important thing: testing is not a silver bullet. Adding a few tests won't improve code quality that much on its own. If you haven't read part 1 yet, I highly recommend reading it first.
1. Good tests
We are finally here! Let's talk about writing tests! Making code easy to test is only the first step; to write good tests, we need a few more skills.
1.1. Choose the right way to test
We know that there are various types of tests. And here are some examples:
- Unit test: Test a class or a small module by talking to them directly, e.g. calling public member functions.
- Module/Service test: Test a module or service by simulating its upstream and downstream services.
- End-To-End test (E2E test): Test a complete system by simulating the upstream and downstream services of the whole system.
- Integration test: Integrate all required services, possibly across multiple systems, and test them together.
- In-Service/Probe test: Test the service from within itself, or via a dedicated probe service, at regular intervals.
Each of these tests has its own advantages and disadvantages, roughly as follows.
Test type | Implementation difficulty | Local testing? | Test speed | Time to find problems | Complexity of finding problems | Difficulty of debugging problems |
---|---|---|---|---|---|---|
Unit Testing | Easy | Yes | Fast | Early | Simple | Easy |
Module/Service Testing | Relatively Easy | Doable | Relatively Fast | Relatively Early | Normal | Relatively Simple (if local debugging is possible) |
End-to-End Testing | Normal | Doable | Normal | Normal | Relatively Complex | Relatively Simple (if local debugging is possible) |
Integration Testing | Hard | Usually No | Slow | Late | Complex | Hard |
Probe Testing | Normal | Usually No | Relatively Fast | Late | Normal | Relatively Hard |
Therefore, when writing tests, we need to think about what we want to test and choose the appropriate way to test it.
When choosing, we must pay attention to the shortest-distance principle. For example, we may feel that integration tests have the best coverage, so we should test everything with integration tests. But precisely because of that coverage, we may spend a tremendous amount of effort building these tests, and even more debugging them when a test fails. Imagine an exception thrown in a very strange place after dozens of services have called each other hundreds of times, and now we need to debug this and find out why… (Good luck with that :D)
1.2. Reasonable and informative messages
“Transparency is a passive quality. A program is transparent when it is possible to form a simple mental model of its behavior that is actually predictive for all or most cases, because you can see through the machinery to what is actually going on.”
- Eric Steven Raymond, from “The Art of Unix Programming”
While tests can tell us that something is wrong, many people miss a more important point: a test should also tell us what is wrong and what to check.
1.2.1. Clarify the test scenarios
The first and most overlooked thing is the names of the tests. Here are some examples that I’ve (often) seen in real projects:
```cpp
Test
```
These test names don't help us understand their purpose at all, so when something goes wrong we have no idea where to look. A better approach is to describe the test scenario well and use an assertive test name.
```cpp
ScopeHandle_AfterDtor_UnderlyingHandleShouldBeClosed
```
Don’t worry about long function names. Being descriptive is never wrong.
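For instance, a test behind a name like this might look roughly like the sketch below; `ScopeHandle` is a hypothetical RAII wrapper used purely for illustration, not code from this post.

```cpp
TEST_METHOD(ScopeHandle_AfterDtor_UnderlyingHandleShouldBeClosed)
{
    HANDLE rawHandle = nullptr;

    {
        // ScopeHandle is a hypothetical RAII wrapper that owns the event handle.
        ScopeHandle scopedHandle(::CreateEventW(nullptr, TRUE, FALSE, nullptr));
        rawHandle = scopedHandle.Get();
        Assert::IsTrue(rawHandle != nullptr);
    } // The destructor runs here and should close the underlying handle.

    // A closed handle can no longer be queried, so this call is expected to fail.
    DWORD handleFlags = 0;
    Assert::IsTrue(::GetHandleInformation(rawHandle, &handleFlags) == FALSE);
}
```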
1.2.2. Actionable error messages
When a test fails, the error message must be clear and actionable.
For example, when writing assertions, don't simply write something like `Assert(a != 0);`. It is better to give some advice:
```cpp
// The message below is illustrative; the point is to say what is wrong and what to check.
Assert(dataIterator != dataMap.end(),
    L"Data for the requested key is missing from dataMap. "
    L"Please check whether the data loading step completed successfully.");
```
Debugging such an error will be much easier, because the message is clear and specific.
1.2.3. Make behavior changes obvious
As we mentioned before, one of the purposes of testing is to identify behavior changes. So, to ensure high observability, these behavior changes must be as clear and obvious as possible. They include changes in our service's internal state, in data reporting, and even in the behavior of debugging-related tools (e.g., APIs for fetching certain states of the service).
This idea is great, but it also leads to a problem: if we want to see all the behavior changes, we have to write code to test every behavior of our program. This makes the tests extremely hard to maintain!
Here's a simple example: testing a microservice by simulating its upstream and downstream services is a very common practice, so I'm sure you've seen countless tests that look like this:
```cpp
TEST_METHOD(MyService_WhenReceivingValidRequest_DownstreamServicesShouldBeProgrammed)
```
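Roughly, such a hand-written test tends to look like the following sketch, where every mock, helper and type name is an assumption made up for illustration:

```cpp
TEST_METHOD(MyService_WhenReceivingValidRequest_DownstreamServicesShouldBeProgrammed)
{
    // Arrange: build the service under test with mocked upstream and downstream endpoints.
    MockUpstreamClient upstream;
    MockDownstreamService downstreamA;
    MockDownstreamService downstreamB;
    MyService service(MakeTestConfig(), { &downstreamA, &downstreamB });

    // Act: push a hand-crafted request through the mocked upstream.
    upstream.SendRequest(service, MakeValidUpdateRequest(/*objectId*/ 42));

    // Assert: check the internal state field by field ...
    Assert::IsTrue(service.GetState() == ServiceState::Ready);
    Assert::AreEqual(size_t(1), service.GetPendingObjectCount());

    // ... then check every request sent to every downstream service.
    Assert::AreEqual(size_t(1), downstreamA.GetReceivedRequests().size());
    Assert::AreEqual(42, downstreamA.GetReceivedRequests()[0].ObjectId);
    Assert::AreEqual(size_t(1), downstreamB.GetReceivedRequests().size());
    Assert::AreEqual(42, downstreamB.GetReceivedRequests()[0].ObjectId);

    // ... and many more blocks like these, one per behavior we care about.
}
```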
As we can see, even with the help of a lot of helper functions to simplify the code, such tests are still very tedious to write. And once any behavior change is made, all of these tests have to change along with it. This leads to an extremely high maintenance workload, which makes everyone tired of adding more tests. So, is there a way to have the best of both worlds? Gladly, yes! Later, we will introduce metadata-based testing to help us solve this problem.
1.3. Focus on requirement and defect coverage
Nowadays, many testing tools can report code coverage, and high code coverage is enforced in many projects. While this indeed helps, it can also mislead people into overvaluing the "100% code coverage" number and forgetting the real intention.
So, why is code coverage misleading? Because 100% code coverage doesn't mean that all cases have been tested. Wait, what?? Yes, this may sound strange, so let's look at the following code:
```cpp
Access GetUserAccess(UserRole role) {
    // ...
}
```
The bug in this code is obvious, yet the following test achieves 100% code coverage and still cannot find it!
```cpp
Assert::AreEqual(Access::Read | Access::Write, GetUserAccess(UserRole::Writer));
```
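To make the point concrete, here is a small self-contained sketch of my own (not the original code) showing how a single passing test can reach 100% line coverage while an obvious bug sails through:

```cpp
enum class UserRole { Reader, Writer };
enum class Access { None = 0, Read = 1, Write = 2 };

inline Access operator|(Access lhs, Access rhs)
{
    return static_cast<Access>(static_cast<int>(lhs) | static_cast<int>(rhs));
}

Access GetUserAccess(UserRole role)
{
    // Bug: the role is ignored, so a Reader silently gets write access too.
    return Access::Read | Access::Write;
}

TEST_CLASS(UserAccessTests)
{
public:
    TEST_METHOD(GetUserAccess_Writer_ShouldHaveReadAndWriteAccess)
    {
        // This one call executes every line of GetUserAccess, so line coverage
        // is 100% and the test passes. UserRole::Reader is never exercised,
        // so the obvious bug above is never caught.
        Assert::IsTrue((Access::Read | Access::Write) == GetUserAccess(UserRole::Writer));
    }
};
```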
So pursuing code coverage alone is no different from putting the cart before the horse. Then, what should we really go after? The answer always goes back to what our program really wants to do, in other words, the requirements. A requirement can be as small as the purpose of a single class or as large as a customer scenario. Tests are written to ensure that our requirements are implemented correctly and do not regress; this is called requirement coverage. Of course, it is hard to achieve high requirement coverage, because it requires a good understanding of both our code and our product to foresee what needs to be covered.
Besides foresight, hindsight is equally important: defect coverage. For mistakes that were made before, we should add tests to ensure that the same mistakes never happen again.
These two types of coverage are what we should really go after.
1.4. Create good scaffolding
Some tests are hard to write directly. For example, although microservices are usually small enough to be easily covered by service tests, building such tests is still not easy. This is where a test framework or utility, i.e. scaffolding, can help. Just like scaffolding in real life, it gives us an easy way to do what we want to do (test what we want to test).
In this example of testing microservices, we can use mocks to simulate the communication layer of the service and provide generic mocks for its upstream and downstream services. With this, testing our service becomes much simpler: we can simulate requests via the mock upstream service and check whether the state of our service and the requests sent to the downstream services all look good.
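As a rough illustration, such a generic downstream mock can be as simple as something that records every request it receives; the request type and interface below are assumptions made up for this sketch:

```cpp
#include <mutex>
#include <string>
#include <vector>

// Hypothetical request type and downstream interface, assumed for illustration.
struct DownstreamRequest { int objectId; std::string payload; };

struct IDownstreamService {
    virtual ~IDownstreamService() = default;
    virtual void Send(const DownstreamRequest& request) = 0;
};

// A generic mock downstream service: it simply records every request it
// receives, so the test can later assert on what the service under test sent.
class MockDownstreamService : public IDownstreamService {
public:
    void Send(const DownstreamRequest& request) override {
        std::lock_guard<std::mutex> lock(m_mutex);
        m_receivedRequests.push_back(request);
    }

    std::vector<DownstreamRequest> GetReceivedRequests() const {
        std::lock_guard<std::mutex> lock(m_mutex);
        return m_receivedRequests;
    }

private:
    mutable std::mutex m_mutex;
    std::vector<DownstreamRequest> m_receivedRequests;
};
```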
Scaffolding is also frequently used in large regression tests, integration tests and end-to-end tests. These tests usually require a certain environment to be built before testing, such as creating all the services, then simulating customer requests and verifying the whole system. This work is usually very tedious, and having a unified, easy-to-use scaffolding can make the whole team more efficient.
If you are planning to create scaffolding, please treat it as a product! And here our customers are our internal developers. This means that we need to understand the requirements before implementation, and we need to collect user feedback to help us improve and iterate from time to time. Here are some principles on how to create good scaffolding. Hope they help:
Principle #1: The cost of using scaffolding must be lower than the cost of implementing our main logic for testing.
- Scaffolding is used to help us simplify testing.
- If, after using the scaffolding, the majority of the testing code is still there to build the test environment, the scaffolding is definitely a failure.
Principle #2: A scaffolding is a framework. This means its responsibility is not only to simplify testing for everyone, but also to help people avoid making mistakes.
- A good framework is a great helper as well as a constraint. The creator of any framework must be forward-thinking and help (or even force) everyone to use the right approach.
- For example, if we use an actor model framework, it will be difficult to get data directly from one actor to another; instead, we have to send messages. This might sound annoying, but it is also one of the cornerstones of the actor model that makes it hard for people to make mistakes. (Extended reading: Go Proverbs: "Don't communicate by sharing memory, share memory by communicating.")
2. Metadata-based testing
“Put Abstractions in Code, Details in Metadata” - Andy Hunt, from “The Pragmatic Programmer: Your Journey to Mastery”
In the “Make behavior changes obvious” section above, we encountered a problem: the more tests we create, the harder they are to maintain. This ended up discouraging us from writing tests. This is where metadata-based testing can really help.
2.1. Extract metadata
To simplify our tests, I recommend applying the idea of separating application and metadata when writing tests: abstract the test logic as much as possible into a unified scaffolding, then extract the details as metadata and store them as text. In this way, if we need to write a new test case, we only need to add a new set of test metadata; the core testing logic doesn't need to change at all! Even better, the textualized metadata can be easily managed by any source control system, and all behavior changes can be revealed at a glance by simply checking the diff.
For example, the tedious code above can be simplified as follows:
```cpp
TEST_METHOD(MyService_WhenReceivingValidRequest_DownstreamServicesShouldBeProgrammed)
```
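Roughly, the simplified test might look like the sketch below, which just points the shared scaffolding function (shown later in section 2.2) at the metadata files; the file paths here are placeholders:

```cpp
TEST_METHOD(MyService_WhenReceivingValidRequest_DownstreamServicesShouldBeProgrammed)
{
    // All the details now live in metadata files; the test only tells the
    // shared scaffolding where to find them. (File names are placeholders.)
    RunMyServiceStateHandlingTest(
        L"TestData\\ValidRequest\\Input.json",
        L"TestData\\ValidRequest\\ExpectedStatesAfterUpdate.json");
}
```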
After this change, all the details are extracted into metadata and saved in a json file. (If you love yaml like me, you could use yaml or other formats as well. But please remember - the data must be in a human-readable form.)
```json
{
    ...
}
```
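The exact schema is up to your scaffolding; as one possible shape (the field names below are assumptions, not a prescribed format), the metadata file might contain something like:

```json
{
    "input": {
        "request": { "type": "Update", "objectId": 42, "payload": "..." }
    },
    "expectedStatesAfterUpdate": {
        "serviceState": "Ready",
        "downstreamRequests": [
            { "service": "DownstreamA", "objectId": 42 },
            { "service": "DownstreamB", "objectId": 42 }
        ]
    }
}
```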
This not only makes the code shorter and easier to read, it also makes adding tests extremely easy. And if we make a behavior change to our service, the test code doesn't need to change at all!
2.2. Baseline generation
Here, you may wonder: the workload is not really reduced at all, it's just moved to changing the metadata file. So, what's the difference? Don't worry, because of this small change, a sea change is about to begin!
So, let's look at our metadata again. Do we really need to change it ourselves? Not at all! To compare the states, we have to fetch the actual states as well as the expected states. So if we save the actual states as the expected states, isn't that exactly the metadata we want for the future, i.e. a new baseline? We only need a very small change to make it work:
```cpp
void RunMyServiceStateHandlingTest(_In_ const std::wstring& inputFilePath, _In_ const std::wstring& expectedStatesAfterUpdateFilePath)
```
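Conceptually, the change can be sketched like this; the switch and helper names (IsBaselineGenerationEnabled, LoadInput, LoadStates, SaveStates, FetchActualStatesAfterUpdate, ServiceStates) are assumptions for illustration:

```cpp
void RunMyServiceStateHandlingTest(
    _In_ const std::wstring& inputFilePath,
    _In_ const std::wstring& expectedStatesAfterUpdateFilePath)
{
    // Drive the service with the input described by the metadata file and
    // collect its actual states afterwards. (Hypothetical helpers.)
    ServiceStates actualStates = FetchActualStatesAfterUpdate(LoadInput(inputFilePath));

    if (IsBaselineGenerationEnabled())
    {
        // Baseline generation mode: save the actual states as the new
        // expected states instead of comparing them.
        SaveStates(expectedStatesAfterUpdateFilePath, actualStates);
        return;
    }

    // Normal mode: compare the actual states against the stored baseline.
    ServiceStates expectedStates = LoadStates(expectedStatesAfterUpdateFilePath);
    Assert::IsTrue(expectedStates == actualStates,
        L"Actual states do not match the baseline metadata file.");
}
```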
There you go! With only a few lines of code changed, we now have the ability to generate baselines. After we change the behavior of the service, we just need to turn on the baseline-generation switch and run all the tests again, and all the metadata will be updated without changing a single line of testing code.
2.3. Generating reference data for test failures
That's not all. Another benefit this brings is that debugging becomes unbelievably easy! I don't know if you've ever had the experience of debugging a test failure caused by comparing two insanely complex objects (e.g. long or deeply nested lists, structs/classes, as below)… So what failed the check? What else is different in the list besides the item that failed the check? Who am I? Where am I? What am I doing here?
```cpp
Assert::AreEqual(longListWithDeeplyNestedStructs1, longListWithDeeplyNestedStructs2);
```
None of these issues are problems for metadata-based testing, because when a test fails, we can generate reference data as well!
```cpp
template <class T>
```
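As a sketch of the idea (the helper names and the log-folder layout are assumptions), such a comparison helper can dump the actual data whenever the check is about to fail:

```cpp
template <class T>
void AssertEqualsBaseline(
    _In_ const std::wstring& baselineFilePath,
    _In_ const T& actualData)
{
    // Load the expected states from the baseline metadata file. (Hypothetical helper.)
    T expectedData = LoadFromJsonFile<T>(baselineFilePath);

    if (!(expectedData == actualData))
    {
        // Generate reference data for the failed test, e.g.
        // <TestLogFolder>\<BaselineFileName>.actual.json, so it can be
        // diffed against the baseline afterwards.
        std::wstring referenceFilePath =
            GetTestLogFolder() + L"\\" + GetFileName(baselineFilePath) + L".actual.json";
        SaveToJsonFile(referenceFilePath, actualData);
    }

    Assert::IsTrue(expectedData == actualData,
        L"Actual data does not match the baseline. "
        L"See the generated reference data in the test log folder for a full diff.");
}
```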
After this small change, when the test fails, we will get the reference data in our log folder. And we can simply diff it against our baseline to see exactly what has changed. This will not just tell us about the first failure, but give us a full picture of the state change! It might be an ordering change, or a set of changes with a very obvious pattern. Knowing the full picture greatly helps us figure out what exactly might have gone wrong.
Maybe you are wondering: can't we always generate the baseline and check the diff during local development? Locally, this is indeed possible, but some errors only occur on our build servers, and that is when the reference data becomes quite useful. It greatly increases the observability of test failures.
2.4. Good enough? More to come!
I started experimenting with this test approach a few years ago and our team is currently using it. It has proved very effective in reducing the development and maintenance costs of tests. Hopefully, by this point, you are also getting interested in metadata-based testing and willing to give it a try. However, its power doesn't stop there. But this post is getting too long again, so we'll stop here for now. In the next (and supposedly final) post, we'll discuss advanced uses of metadata-based testing and demonstrate how it can better help us in our daily development.
3. “Fake” tests
When writing tests, we have to be especially careful about the several types of tests below. Even when they exist, they don't help us at all. In the best case, they serve as a placebo. But most of the time, these tests are genuinely harmful to our daily development.
3.1. Flaky tests
First and worst: flaky tests. We mentioned them in "Avoid unstable code" in the previous post. For this kind of test, we should disable it as soon as possible and treat fixing it as a high-priority development task. As long as the test is not fixed, development of new features must stop. The reason is simple: if the test cannot even pass, how do we know that the issue won't cause problems for our customers?
The way to find this problem is simple and brutal, because the only way is to run the tests many times. So besides hearing multiple developers complaining, we can either look at the test history (which will be covered in the next post) or trigger a dedicated test-stability run every so often, running each test many times to find the unstable ones.
I was once asked to fix such a test. Someone came to me and said: "This test had about a 20% chance of failing a week ago, but this week it feels like it went up to 30%, can you see what's going on?". To be honest, I had no idea how to fix this. First of all, I didn't know whether it was just mercury retrograde giving them a bad week of plain bad luck. Then, I didn't know whether the failure I was hitting came from the 20% that was "ignorable" and already existed, or from the 10% that we needed to fix. So, in the end, here is what I did:
1. Disable the test completely, since whoever encounters its failure will just keep retrying anyway; it is a total waste of our build and test resources.
2. Read the test and try to fully understand what it is actually trying to test.
3. Create a new test that can steadily reproduce the failure.
4. Fix the failure found by the new test and commit it, then repeat steps 3-4 until all the problems are fixed.
Finally, since the original test was written in the wrong way, I created a new test for the scenario. After submitting it, I deleted the original test from our code base.
3.2. Slow tests
Tests are supposed to help us find problems early, which means we should test as early and as often as possible. But if the tests are slow, we won't be able to do this. Even worse, over time no one will run these tests anymore. Take the tests we mentioned in the previous post that blindly sleep for 30 seconds everywhere: no one runs them in our project. When any of them fails, people simply retry or even comment them out! … So please do not ignore the performance of tests.
Of course, we should not blindly pursue test speed (TAOUP: Rule of Economy); we simply have different speed expectations for different types of tests. For example, large integration tests or end-to-end tests may take hours to run, but we might only run them once a day to measure the quality of our daily build. So, even if such a run takes an hour, it doesn't matter that much. However, taking several days is a different story. Say you need to create a hotfix release for an urgent online issue, but after the build is done, the tests need three days to tell us whether the release is good or bad. I believe the customer will probably go crazy when hearing this.
An exception here is performance tests. A performance test usually requires us to repeatedly run a scenario many times, so it's natural that it takes time. But again, this leads to the same problem: they won't be run as often as other tests. So, to ensure that the regular tests are not impacted, we can isolate performance tests by putting them into a separate test class or test module.
3.3. Shallow Probes
Probes are very helpful for building highly available services. They provide the ability to check whether a service is healthy or not. And if something goes wrong, they can automatically trigger failover, fixes, rollback, or in the worst case, alerts. This is critical for automation; otherwise every service issue must be checked and fixed manually. In large-scale services, errors will happen, and without automation our work becomes extremely inefficient.
Many service governance frameworks provide probe support, for example, Kubernetes liveness, readiness and startup probes. These probes provide convenient configuration and support a variety of implementations, such as making a TCP connection or an HTTP request. They are intended to provide a convenient and uniform mechanism to check the health of a service, but this also leads to a very common problem: the things checked in the probe are way too simple!
For example, a service is often considered healthy as long as it has "started" successfully, e.g. the process is running, a certain port is open, or a certain core component is initialized. However, that is usually not enough for the service to properly serve requests, which hides the real problems and gives us the impression that everything works fine when it actually doesn't. And this brings up the real question: how healthy does a service have to be to be considered healthy? This is also the key to implementing the probe correctly.
But if we think about this question carefully, we quickly discover a scary fact: a service has to be completely problem-free to be considered healthy! This means everything we need has to be loaded correctly, and not a single piece of customer data can be wrong (and the list goes on…)! Only then can we say the service should be able to serve requests correctly. This also means that the probe should check whether everything in the service is in the right state. I call this type of probe a "Deep Probe".
However, a deep probe also causes another problem: it takes too long to execute within a single probe request, which leads to timeout failures. So the typical ways to implement deep probes are different from regular probes (a small sketch follows the list below):
- Timers and health reporting: The idea is to move the deep probe logic out of the probe request handling and into timers. In some service governance frameworks, such as Service Fabric, service health checking does not use a pull model (timed probes) but a push model (health reporting). When a problem is found, we can either report an unhealthy event or stop the heartbeat of the healthy event. When using a pull model, we can make our probe endpoint return failure when things go wrong.
- Dedicated deep probe service: If the amount of data to be checked is too large, we can also create a dedicated deep probe service. It talks to the other services, reads the relevant information, then runs checks and reports errors. This makes it easy to limit the overall resource usage and avoid impacting our key services. Of course, it also requires our services to have a good, unified service discovery and state discovery mechanism to support these operations.
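As a rough illustration of the timer-based flavor mentioned above, a deep probe can run its expensive checks on a background timer and cache the result, so the probe endpoint itself stays fast. Everything below is an assumption for the sketch, not a real framework API:

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// A hedged sketch of a timer-based deep probe: a background thread runs the
// expensive checks periodically and caches the result, while the probe
// endpoint only reads the cached flag and therefore answers quickly.
class DeepProbe {
public:
    void Start() {
        m_worker = std::thread([this] {
            while (m_running.load()) {
                bool healthy = CheckCoreComponentsInitialized()
                            && CheckAllCustomerDataLoaded()
                            && CheckDownstreamConnectivity();
                m_healthy.store(healthy);
                std::this_thread::sleep_for(std::chrono::seconds(30));
            }
        });
    }

    void Stop() {
        m_running.store(false);
        if (m_worker.joinable()) {
            m_worker.join();
        }
    }

    // The actual probe endpoint (e.g. an HTTP handler) only needs this flag.
    bool IsHealthy() const { return m_healthy.load(); }

private:
    // Placeholder deep checks; a real service would verify its actual state here.
    bool CheckCoreComponentsInitialized() { return true; }
    bool CheckAllCustomerDataLoaded() { return true; }
    bool CheckDownstreamConnectivity() { return true; }

    std::atomic<bool> m_running{true};
    std::atomic<bool> m_healthy{false};
    std::thread m_worker;
};
```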
So, depending on how our services are implemented, we need to choose the right way to implement the deep probe. For example, sometimes it might not be a good idea to put a deep probe into the Kubernetes readiness probe, because we might not want the traffic to that partition to be cut off; in that case, a dedicated deep-probe service might be a better idea.
4. Let’s take another break
Well, this post got a bit long again, so let's end here, call it part 2, and summarize everything we mentioned:
- First, we discussed how to write good tests:
  - Choosing a reasonable way to test our scenarios.
  - Giving reasonable hints about test failures, which includes clarifying the test scenario with assertive test names, making error messages actionable, and making behavior changes as obvious as possible.
  - Pursuing requirement coverage and defect coverage, rather than simple code coverage.
  - Building professional scaffolding to simplify the testing workload.
- Then, since making every behavior change observable makes tests extremely expensive to maintain, we introduced metadata-based testing and demonstrated how it greatly reduces the development and maintenance cost of tests, further improves the observability of test failures, and improves the experience of debugging them.
- Finally, we discussed several types of "fake" tests: flaky tests, slow tests and shallow probes. None of them helps; they only harm our daily development. We also discussed deep probes and how to implement them to solve the shallow probe problem.
In the next post, let's move on to discuss more about metadata-based testing and other test-related topics.