Property-based testing is a rare example of academic research that has made it to the mainstream in less than 30 years. Under the slogan "don't write tests, generate them" property-based testing has gained support from a diverse group of programming language communities. In fact, the Wikipedia page of the original property-basted testing Haskell library, QuickCheck, lists 57 reimplementations in other languages.
In this post I'd like to survey the most popular property-based testing implementations and compare them with what used to be the state-of-the-art fifteen years ago (2009). As the title already gives away, most of the libraries do not offer their users the most advanced property-based testing features. In order to best explain what's missing and why I think we ended up in this situation, let me start by telling the brief history of property-based testing.
In Gothenburg, Sweden's second most populated city, there's a university called Chalmers. At the computer science department of Chalmers there are several research groups, two of which are particularly relevant to our story -- the Functional Programming group and Programming Logic group. I'll let you guess what the former group's main interest is. The latter group's mostly concerned with a branch of functional programming where the type system is sufficiently expressive that it allows for formal specifications of programs, sometimes called dependently typed programming or type theory. Agda is an example of a Haskell-like dependently typed programming language, that also happens to be mainly developed by the Programming Logic group. Given the overlap of interest and proximity, researchers at the department are sometimes part of both groups or at least visit each others research seminars from time to time.
John Hughes is a long-time member of the Functional Programming group, who's also well aware of the research on dependently typed programming going on in the Programming Logic group. One day in the late nineties, after having worked hard on finishing something important on time, John found himself having a week "off". So, just for fun, he started experimenting with the idea of testing if a program respects a formal specification.
Typically in dependently typed programming you use the types to write
the specification and then the program that implements that type is the
formal proof that the program is correct. For example, let's say you've
implemented a list sorting function, the specification typically then is
that the output of the sorting function is ordered, i.e. for any index
While John was working on this idea, Koen Claessen, another member of the Functional Programming group, stuck his head into John's office and asked what he was doing. Koen got excited as well and came back the next day with his improved version of John's code. There was some things that Koen hadn't thought about, so John iterated on his code and so it went back and forth for a week until the first implementation of property-based testing was written and not long afterwards they published the paper QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs (ICFP 2000). I think it's worth stressing the lightweight tool part from the paper's title, the complete source code for the first version of the library is included in the appendix of the paper and it's about 300 lines of code.
Haskell and dependently typed programming languages, like Agda, are pure functional programming languages, meaning that it's possible at the type-level to distinguish whether a function has side-effects or not. Proofs about functions in Agda, and similar languages, are almost always only dealing with pure functions. Probably as a result of this, the first version of QuickCheck can only test pure functions. This shortcoming was rectified in the follow up paper Testing monadic code with QuickCheck (2002) by the same authors. It's an important extension as it allows us to reason about functions that use mutable state, file I/O and networking, etc. It also lays the foundation for being able to test concurrent programs, as we shall see below.
Around the same time as the second paper was published (2002), John was applying for a major grant at the Swedish Strategic Research Foundation. A part of the application process involved pitching in front of a panel of people from industry. Some person from Ericsson was on the panel and they were interested in QuickCheck. There was also a serial entrepreneur on the panel and she encouraged John to start a company, and the Ericsson person agreed to be a first customer, and so Quviq AB was founded in 20061 by John and Thomas Arts (perhaps somewhat surprisingly, Koen was never involved in the company).
The first project at Ericsson that Quviq helped out testing was written in Erlang. Unlike Haskell, Erlang is not a pure functional programming language and on top of that there's concurrency everywhere. So even the second, monadic, version of QuickCheck didn't turn out to be ergonomic enough for the job. This is what motivated the closed source Quviq QuickCheck version written in Erlang, first mentioned in the paper Testing telecoms software with Quviq QuickCheck (2006). The main features of the closed source version that, as we shall see, are still not found in many open source versions are:
- Sequential stateful property-based testing using a state machine model;
- Parallel testing with race condition detection by reusing the sequential state machine model.
We shall describe how these features work in detail later. For now let's just note that stateful testing in its current form was first mentioned in QuickCheck testing for fun and profit (2007). This paper also mentions that it took John four iterations to get the stateful testing design right, so while the 2006 paper already does mention stateful testing it's likely containing one of those earlier iteration of it.
While the 2007 paper also mentions parallel testing via traces and interleavings, it's vague on details. It's only later in Finding Race Conditions in Erlang with QuickCheck and PULSE (ICFP 2009) that parallel testing is described in detail including a reference to Linearizability: a correctness condition for concurrent objects by Herlihy and Wing (1990) which is the main technique behind it.
I'd like to stress that no Quviq QuickCheck library code is ever shared in any of these papers, they only contain the library APIs (which are public) and test examples implemented using said APIs.
After that most papers are experience reports of applying Quviq QuickCheck at different companies, e.g. Testing A Database for Race Conditions with QuickCheck (2011), Testing the hard stuff and staying sane (2014), Testing AUTOSAR software with QuickCheck (2015), Mysteries of Dropbox: Property-Based Testing of a Distributed Synchronization Service (2016).
Sometimes various minor extensions to stateful and parallel testings are needed in order to test some particular piece of software, e.g. C FFI bindings in the case of AUTOSAR or eventual consistency in the case of Dropbox, but by and large the stateful and parallel testing features remain the same.
As we've seen above, the current state-of-the-art when it comes to property-based testing is stateful testing via a state machine model and reusing the same sequential state machine model combined with linearisability to achieve parallel testing.
Next let's survey the most commonly used property-based testing libraries to see how well supported these two testing features are. Let me be clear up front that I've not used all of these libraries. My understanding comes from reading the documentation, issue tracker and sometimes source code.
To my best knowledge, as of July 2024, the following table summarises the situation. Please open an issue, PR, or get in touch if you see a mistake or an important omission.
Library | Language | Stateful | Parallel | Notes |
---|---|---|---|---|
CsCheck | C# | ☒ | ☒ | |
Eris | PHP | ☐ | ☐ | |
FsCheck | F# | ☒ | ☐ | Has experimental stateful testing. An issue to add parallel support has been open since 2016. |
Gopter | Go | ☒ | ☐ | The README says "No parallel commands ... yet?" and there's an open issue from 2017. |
Hedgehog | Haskell | ☒ | ☒ | Has parallel support, but the implementation has issues. |
Hypothesis | Python | ☒ | ☐ | |
PropEr | Erlang | ☒ | ☒ | First open source library to support both? |
QuickCheck | Haskell | ☐ | ☐ | There's an open issue to add stateful testing since 2016. |
QuickTheories | Java | ☒ | ☐ | Has experimental for stateful testing, there's also some parallel testing, but it's inefficient and restrictive compared to QuviQ's Erlang version of QuickCheck. From the source code: "Supplied commands will first be run in sequence and compared against the model, then run concurrently. All possible valid end states of the system will be calculated, then the actual end state compared to this. As the number of possible end states increases rapidly with the number of commands, command lists should usually be constrained to 10 or less." |
Rapid | Go | ☒ | ☐ | |
RapidCheck | C++ | ☒ | ☐ | There's an open issue to add parallel support from 2015. |
ScalaCheck | Scala | ☒ | ☐ | Has some support for parallel testing, but it's limited as can be witnessed by the fact that the two examples of testing LevelDB and Redis both are sequential (threadCount = 1 ). |
Supposition | Julia | ☒ | ☐ | The stateful testing lacks a proper interface though. |
SwiftCheck | Swift | ☐ | ☐ | There's an open issue to add stateful testing from 2016. |
fast-check | TypeScript | ☒ | ☐ | Has some support for race condition checking, but it seems different from Quviq QuickCheck's parallel testing. In particular it doesn't seem to reuse the sequential state machine model nor use linearisability. |
jetCheck | Java | ☒ | ☐ | From the source code "Represents an action with potential side effects, for single-threaded property-based testing of stateful systems.". |
jqwik | Java | ☒ | ☐ | |
jsverify | JavaScript | ☐ | ☐ | There's an open issue to add stateful testing from 2015. |
lua-quickcheck | Lua | ☒ | ☐ | |
propcheck | Elixir | ☒ | ☐ | There's an open issue to add parallel testing from 2020. |
proptest | Rust | ☐ | ☐ | See proptest-state-machine. |
proptest-state-machine | Rust | ☒ | ☐ | Documentation says "Currently, only sequential strategy is supported, but a concurrent strategy is planned to be added at later point.". |
qcheck-stm | OCaml | ☒ | ☒ | |
quickcheck | Prolog | ☐ | ☐ | |
quickcheck | Rust | ☐ | ☐ | Issue to add stateful testing has been closed. |
quickcheck-state-machine | Haskell | ☒ | ☒ | Second open source library with parallel testing support? (I was involved in the development.) |
rackcheck | Racket | ☐ | ☐ | |
rantly | Ruby | ☐ | ☐ | |
stateful-check | Clojure | ☒ | ☒ | |
test.check | Clojure | ☐ | ☐ | Someone has implemented stateful testing in a blog post though. |
test.contract | Clojure | ☒ | ☐ | |
theft | C | ☐ | ☐ |
By now I hope that I've managed to convince you that most property-based testing libraries do not implement what used to be the state-of-the-art fifteen years ago.
Many libraries lack stateful testing via state machines and most lack parallel testing support. Often users of the libraries have opened tickets asking for these features, but the tickets have stayed open for years without any progress. Furthermore it's not clear to me whether all libraries that support stateful testing can be generalised to parallel testing without a substantial redesign of their APIs. I don't think there's a single example of a library to which parallel testing was added later, rather than designed for from the start.
Here are three reasons I've heard from John:
-
The stateful and parallel testing features are not as useful as testing pure functions. This is what John told me when I asked him why these features haven't taken off in the context of Haskell (BobKonf 2017);
-
The state machine models that one needs to write for the stateful and parallel testing require a different way of thinking compared to normal testing. One can't merely give these tools to new users without also giving them proper training, John said in an interview;
-
Open source didn't work, a closed source product and associated services helps adoption:
"Thomas Arts and I have founded a start-up, Quviq AB, to develop and market Quviq QuickCheck. Interestingly, this is the second implementation of QuickCheck for Erlang. The first was presented at the Erlang User Conference in 2003, and made available on the web. Despite enthusiasm at the conference, it was never adopted in industry. We tried to give away the technology, and it didn’t work! So now we are selling it, with considerably more success. Of course, Quviq QuickCheck is no longer the same product that was offered in 2003—it has been improved in many ways, adapted in the light of customers’ experience, extended to be simpler to apply to customers’ problems, and is available together with training courses and consultancy. That is, we are putting a great deal of work into helping customers adopt the technology. It was naive to expect that simply putting source code on the web would suffice to make that happen, and it would also be unreasonable to expect funding agencies to pay for all the work involved. In that light, starting a company is a natural way for a researcher to make an impact on industrial practice—and so far, at least, it seems to be succeeding."
A cynic might argue that there's a conflict of interest between doing research and education on one hand and running a company that sells licenses, training and consulting on the other.
Let me be clear that I've the utmost respect for John, and I believe what he says to be true and I believe he acts with the best intentions. Having said that let me try to address John's points.
I think many people will agree that separating pure from side-effectful code is good practice in any programming language, and I do agree with John that you can get far by merely property-based testing those pure fragments.
However, I also think that stateful and parallel testing is almost equally important for many non-trivial software systems. Most systems in industry will have some database, stateful protocol or use concurrent data structures, which all benefit from the stateful and parallel testing features.
Regarding formal specification requiring a special way of thinking and therefor training, I believe this is a correct assessment. However I also believe that this is already true for property-based testing of pure functions. A non-trained user of pure property-based testing will likely test less interesting properties than someone who's trained.
Given that John has written papers and given talks on the topic of making property-based testing of pure functions more accessible to programmers, one might wonder why we cannot do the same for stateful and parallel testing?
The experience reports, that we've mentioned above, usually contain some novelty (which warrants publishing a new paper) rather than general advice which can be done with the vanilla stateful and parallel testing features. Furthermore they require buying a Quviq license in order to reproduce the results, a show stopper for many people.
I think it's also worth stressing that stateful specifications are not necessarily always more difficult than specifications for pure functions. For example, to model a key-value store one can get quite far with the model being a list of key-value pairs. In fact a simple model like that managed to find a 17 step (shrunk) counterexample in LevelDB to a known issue, within mere minutes. It took weeks for Google to provide a fix, and then after running the property again on the fixed code a new 31 step counterexample was found within minutes. Turns out there was a bug in the background compaction process. The compaction process improves read performance and reclaims disk space, which is important for a key-value store, but interestingly it's not explicitly part of the model. Joseph W Norton gave a talk at LambdaJam 2013 about it.
Regarding keeping the source closed helping with adoption, I think this is perhaps the most controversial point that John makes.
If we try to see it from John's perspective, how else would an academic get funding to work on tooling (which typically isn't recognised as doing research), or feedback from industry? Surely, one cannot expect research funding agencies to pay for this?
On the other hand one could ask why there isn't a requirement that published research should be reproducible using open source tools (or at least tools that are freely available to the public and other researchers)? Trying to replicate the results from the Quviq QuickCheck papers (from 2006 and onward) without buying a Quviq QuickCheck license, is almost impossible without a lot of reverse engineering work.
I suppose one could argue that one could have built a business around an open source tool, only charging for the training and consulting, but given how broken open source is today, unless you are a big company (which takes more than it gives back), it's definitely not clear that it would have worked (and it was probably even worse back in 2006).
Even if John is right and that keeping it closed source has helped adoption in industry, I think it's by now fair to say it has not helped open source adoption. Or perhaps another way to look at it, it's unlikely that a company that pays for a license in Erlang would then go and port the library in another language.
I think there's at least two things worth trying.
-
Provide a short open source implementation of stateful and parallel property-based testing, analogous to the original ~300 lines of code QuickCheck implementation.
Perhaps part of the original QuickCheck library's success in spreading to so many other languages can be attributed to the fact that its small implementation and that it is part of the original paper?
-
Try to make the formal specification part easier, so that we don't need to train developers (as much).
Perhaps we can avoid state machines as basis for specifications and instead reuse concepts that programmers are already familiar with from their current testing activities, e.g. mocking and test doubles more generally?
In order to test the above hypothesis, I'd like to spend the rest of this post as follows:
-
Show how one can implement stateful and parallel property-based testing in about 400 lines of code (similar to the size of the original QuickCheck implementation);
-
Make specifications simpler by using in-memory reference implementations similar to mocks, more accurately called fakes, rather than state machines.
Before we get started with stateful testing, let's first recap how property-based testing of pure functions works.
It's considered good practice to test new functions or functionality, to
make sure it does what we want. For example, imagine we've written a
linked-list reversal function called reverse
, then it might be
sensible to test it against a couple of lists such as the empty list
and, say, the three element list [1, 2, 3]
.
How does one choose which example inputs to test against though? Typically one wants to choose corner cases, such as the empty list, that perhaps were overlooked during the implementation. It's difficult to think of corner cases that you might have overlooked (because if you can then you probably wouldn't have overlooked them in the first place)! This is where generating random inputs, a key feature of property-based testing, comes in. The idea being that random inputs will eventually hit corner cases.
When we manually pick inputs for our tests, like [1, 2, 3]
, we know
what the output should be and so we can make the appropriate assertion,
i.e. reverse [1, 2, 3] == [3, 2, 1]
. When we generate random inputs we
don't always know what the output should be. This is where writing
properties that relate the output to the input somehow comes in. For
example, while we don't know what the output of reversing an arbitrary
list is, we do know that reversing it twice will give back the input.
This is how we can express this property in QuickCheck:
>>> quickCheck (\(xs :: [Int]) -> reverse (reverse xs) == xs)
+++ OK, passed 100 tests.
By default 100 tests get generated, but that can be adjusted:
>>> quickCheck (withMaxSuccess 5 (\(xs :: [Int]) -> reverse (reverse xs) == xs))
+++ OK, passed 5 tests.
We can see what test get generated using verboseCheck
:
>>> verboseCheck (withMaxSuccess 5 (\(xs :: [Int]) -> reverse (reverse xs) == xs))
Passed:
[]
Passed:
[1]
Passed:
[-2]
Passed:
[2]
Passed:
[-4,-2,-2,3]
+++ OK, passed 5 tests.
Or by using sample
on the appropriate Gen
erator. In this case we are
generating lists of integers, hence the Gen [Int]
type annotation:
>>> sample (arbitrary :: Gen [Int])
[]
[]
[-2,2,3,-2]
[-4]
[4,6,-2,-6,-1]
[1,7,5,-8,1]
[-11,4]
[3,-1,11]
[]
[-3,17,-14,-1,17,18,-8,-9,-13,-7]
[6,19,6,9,-15,-6,-19]
The list and integer generators are provided by the library and I hope you agree that these seem like sensible arbitrary lists to use in our tests.
Next let's have a look at when a property fails. For example this is what happens if we try to test that the output of reversing a list is the input list:
>>> quickCheck (\(xs :: [Int]) -> reverse xs == xs)
*** Failed! Falsified (after 3 tests and 2 shrinks):
[0,1]
We see that after 3 tests a test case was generated that failed, the
input got shrunk twice and the minimal counterexample [0, 1]
is
presented. Notice that we do need a list that is at least of length two,
because any shorter list will reverse to itself.
As pointed out earlier, coming up with these properties is not easy.
There are however a few patterns that come up over and over again. With
reverse
we saw an example of an involutory function, i.e.
f (f x) == x
, here are a few other examples:
- Inverses, e.g.
\(i :: Input) -> deserialise (serialise i) == i
; - Idempotency, e.g.
\(xs :: [Int]) -> sort (sort xs) == sort xs
; - Associativity, e.g.
\(i j k :: Int) -> (i + j) + k == i + (j + k)
; - Axioms of abstract data types, e.g.
\(x :: Int)(xs :: [Int]) -> member x (insert x xs) && not (member x (remove x xs))
; - Metamorphic properties, e.g.
\(g :: Graph)(m n :: Node) -> shortestPath g m n == shortestPath g n m
.
Readers familiar with discrete math might recognise some of the above.
In the pure property-based testing case, that we just looked at, the picture of the test setup looks a bit like this:
+-----+
i | | o
-----> f +---->
| |
+-----+
Where i
is the input we generate, f
is the function we are applying
the generated input to produce the output o
. In the case of the
reverse
example, from before, i
and o
are of type list of integers
([Int]
), f
is reverse . reverse
and the property that we check for
every generated input is that input is equal to the output.
Next let's contrast this picture with how the test setup looks when we
are testing a stateful component. A simple example of a stateful
component is a counter with an incr
ement operation which increment the
counter and returns the old count.
Unlike in the pure case, the same input will not give the same output.
For example the first time we do incr
we get back 0
(if we start
counting from zero) while the second time we do incr
we get 1
. A
database or a file system are two other examples of stateful components,
where the history of previous inputs affects the output of the next
input.
In the stateful case, the picture looks more like this:
+------+ +------+ +------+
| | i1 | | i2 | |
| s0 +-----> s1 +-----> s2 | ...
| | | | | |
+------+ +--+---+ +--+---+
| |
| o1 | o2
v v
---------------------------------> time
Where s
is the state, i
is an input (e.g. incr
) and o
is an
output. Notice how the state evolves over time and depends on the
history of inputs.
In the pure case each test case is a single input, in the stateful case
we need a sequence of inputs in order to test how the system changes
over time. In the pure case our properties were relations on the input
and output, i.e. R : i -> o -> Bool
. In the stateful case our
properties would need to be generalised to R' : [i] -> [o] -> Bool
to
account for how the state changes over time. Writing such properties is
cumbersome, an alternative is to account for the state explicitly by
means of some kind of model.
This is where our in-memory reference implementation, or fake, comes in.
We'll use a function of type m -> i -> (m, o)
, i.e. from the old model
and an input compute the next model and the output. From this we can
derive a property that for each input checks if the outputs of the
stateful component agrees with the output of the fake:
+------+ +------+ +------+
| | i1 | | i2 | |
real: | s0 +-----> s1 +-----> s2 | ...
| | | | | |
+------+ +--++--+ +--++--+
|| ||
||o1 || o2
|| ||
+------+ +--++--+ +--++--+
| | i1 | | i2 | |
fake: | m0 +-----> m1 +-----> m2 | ...
| | | | | |
+------+ +------+ +------+
---------------------------------> time
In case the outputs disagree we shrink the sequence of inputs and try to present the smallest counterexample, as in the pure case.
Let's make things more concrete with some actual code that we can run.
All examples, in the rest of this post, will have three parts:
- The software under test;
- The model that the software under test gets tested against;
- The generated tests and output from running them.
The first part is independent of the stateful testing library we are building. The second part is hooking up the first part to the library by implementing an interface (type class). We'll look at the definition of the type class after the example. The final part is how to write the actual property and interpret the output from running them.
This is how you can implement a counter using a global mutable variable in Haskell:
gLOBAL_COUNTER :: IORef Int
gLOBAL_COUNTER = unsafePerformIO (newIORef 0)
{-# NOINLINE gLOBAL_COUNTER #-}
incr :: IO ()
incr = do
n <- readIORef gLOBAL_COUNTER
writeIORef gLOBAL_COUNTER (n + 1)
get :: IO Int
get = readIORef gLOBAL_COUNTER
Notice that here incr
doesn't return the old value, like above, and
instead we have a separate operation get
which returns the current
value of the counter.
To model our counter we'll use an integer.
newtype Counter = Counter Int
deriving (Eq, Show)
-- We'll come back to the definition of the `StateModel` type class after this
-- example.
instance StateModel Counter where
-- We start counting from zero.
initialState :: Counter
initialState = Counter 0
-- The commands correspond to the names of the functions that operate on the
-- global counter.
data Command Counter r
= Incr
| Get
deriving (Show, Functor)
-- The responses correspond to the return types of each function. By
-- convention we'll add a underscore suffix to a response of the corresponding
-- command.
data Response Counter r
= Incr_ ()
| Get_ Int
deriving (Eq, Show, Functor, Foldable)
-- The state machine takes a command and the model of the counter and returns
-- a new model and a response. We'll come back to the role of the `Either` later.
runFake :: Command Counter r -> Counter -> Either Void (Counter, Response Counter r)
runFake Incr (Counter n) = return (Counter (n + 1), Incr_ ())
runFake Get m@(Counter n) = return (m, Get_ n)
-- We also need to explain which part of the counter API each command
-- corresponds to.
runReal :: Command Counter r -> IO (Response Counter r)
runReal Get = Get_ <$> get
runReal Incr = Incr_ <$> incr
-- We'll generate increments and reads of the counter with equal probability.
-- Notice that we only need to explain how to generate a single command, the
-- library will use this to generate sequences of commands as we'll see later.
generateCommand :: Counter -> Gen (Command Counter r)
generateCommand _s = elements [Incr, Get]
A common complaint is that the model (Counter
and runFake
) is as big
as the implementation itself. This is true, because it's an example. In
reality the model will often be many orders of magnitude smaller. This
is due to the fact that the model, unlike the real implementation,
doesn't need to persisting to disk, communicating over the network, or
perform various time or space optimisations. Recall the LevelDB example
from above.
The tests, or property, can now be written as follows.
prop_counter :: Commands Counter -> Property
prop_counter cmds = monadicIO $ do
runCommands cmds
run reset
assert True
reset :: IO ()
reset = writeIORef gLOBAL_COUNTER 0
To run them, we can load the module and type quickCheck prop_counter
in the REPL, which gives us an output like:
+++ OK, passed 100 tests:
89% Get
85% Incr
Commands (2151 in total):
52.02% Get
47.98% Incr
Where the first group of percentages tell us the proportion of tests that contained the get and increment command respectively, and the second group of percentages tell us the proportion of get and increment commands out of all commands generated. Note that the first group doesn't add up to 100%, because most tests will contain both commands, whereas the second group does. The reason the second group is almost 50-50 is because in the generator we generate both commands with equal probability.
Another thing to note is that we need to reset
the counter between
tests, otherwise the global counter will have the value from the last
test while the model always starts from zero and we get a mismatch.
To make things a bit more interesting, let's introduce a bug into our counter and see if the tests can find it. Let's make it so that if the counter has the value of 42, then it won't increment properly.
incr42Bug :: IO ()
incr42Bug = do
n <- readIORef gLOBAL_COUNTER
let n' = if n == 42
then n -- BUG
else n + 1
writeIORef gLOBAL_COUNTER n'
We also need to change the runReal
function to use our buggy increment
as follows.
- runReal Incr = Incr_ <$> incr
+ runReal Incr = Incr_ <$> incr42Bug
When we run the property now, we'll see something like the following output.
*** Failed! Assertion failed (after 66 tests and 29 shrinks):
Commands [Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Incr,Get]
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Incr --> Incr_ ()
Get --> Get_ 42
Expected: Get_ 43
Got: Get_ 42
Notice that this is indeed the smallest counterexample and how it took 66 randomly generated test cases to find the sequence of inputs that triggered the bug and then 29 shrink steps for QuickCheck to minimise it.
In the example above we implemented the StateModel
interface (or type
class), next we'll have a look at the definition of this interface and
the testing functionality we can derive by programming against the
interface.
Let me give you the full definition of the interface and then I'll explain it in words afterwards.
class ( ...
) => StateModel state where
-- If we think of the system under test as a black box, then commands are the
-- inputs and responses the outputs to the black box.
data Command state :: Type -> Type
data Response state :: Type -> Type
-- Sometimes a command needs to refer to a previous response, e.g. when a file
-- is opened we get a handle which is later refered to when writing or reading
-- form the file. File handles, and similar constructs, are called references
-- and can be part of commands and responses.
type Reference state :: Type
type Reference state = Void
-- Not all commands are valid in all states. Pre-conditions allow the user to
-- specify when a command is safe to execute, for example we cannot write or
-- read to or from an unopened file. The `PreconditionFailure` data type
-- allows the user to create custom pre-condition failures. By default now
-- pre-condition failures are allowed, thus the `Void` (empty) type.
type PreconditionFailure state :: Type
type PreconditionFailure state = Void
generateCommand :: state -> Gen (Command state (Var (Reference state)))
shrinkCommand :: state -> Command state (Var (Reference state))
-> [Command state (Var (Reference state))]
shrinkCommand _state _cmd = []
initialState :: state
runFake :: Command state (Var (Reference state)) -> state
-> Either (PreconditionFailure state)
(state, Response state (Var (Reference state)))
runReal :: Command state (Reference state)
-> CommandMonad state (Response state (Reference state))
monitoring :: (state, state)
-> Command state (Reference state)
-> Response state (Reference state)
-> Property -> Property
monitoring _states _cmd _resp = id
commandName :: (Show (Command state ref), Show ref)
=> Command state ref -> String
commandName = head . words . show
-- Most often the result of executing a command against the system under test
-- will live in the IO monad, but sometimes it can be useful to be able a
-- choose another monad.
type CommandMonad state :: Type -> Type
type CommandMonad state = IO
The interface is parametrised by a state
type that the user needs to
define before instantiating the interface. In the counter example
state
is newtype Counter = Counter Int
. The user needs to provide an
initialState :: state
from which we'll start generating commands and
executing the model, in the counter case this is Counter 0
.
As part of the instantiating the user also needs to specify a type of
Command
s and Response
s, these were the Incr
and Get
operations
of the counter and their response types respectively.
In addition there's also three optional types, that we've not needed in
the counter example. The first is references, these are used to refer to
previously created resources. For example if we open a file handle on a
POSIX-like file system, then later commands need to be able to refer to
that file handle when wanting to write or read from it. The second
datatype is PreconditionFailure
, which is used to give a nice error
message when a command is executed in a disallowed state. For example if
we try to read from a file handle that has been closed. The third data
type is CommandMonad
which let's us use a different monad than IO
for executing our commands in. After we've finished with the interface
definition we'll come back to more examples where we'll use these
optional types, hopefully these examples will help make things more
concrete.
We've already seen that the user needs to provide a way to generate
single command, the only thing worth mentioning is that in case our
commands contain references then during generation we only deal with
Var
s of references, where data Var a = Var Int
. The reason for this
is that we cannot generate, for example, real file handles (only the
operating system can), so instead we generate "symbolic" references
which are just Int
s. Think of these as placeholders for which real
references will be substituted in once the real references are created
during execution.
Shrinking of individual commands is optional and disabled by default, but as we've seen this doesn't exclude the sequence of commands to be shrunk. We'll shall see shortly how that is done in detail.
Next up we got runFake
and runReal
which executes a command against
the state
model and the real system respectively. Notice how runFake
can fail with a PreconditionFailure
, whereas runReal
is always
expected to succeed (because if a command fails the precondition check,
then it won't get generated and hence never reach runReal
). Another
difference is that runFake
uses symbolic references, while runReal
deals with real references. We'll shortly see how this substitution of
references works.
Lastly we have two optional functions related to keeping statistics of generated test cases, which is useful for coverage reporting among other things. We'll come back to how these can be used as we look at more examples after we've defined our stateful property-based testing library.
Once we have our interface we can start writing functions against the
interface. These functions are what the user gets once they implement
the interface. In this section we'll have a look at generation of
sequences of Commands
, which will be the inputs for our tests, and how
to shrink said inputs to produce a minimal counterexample.
Let's start by defining Commands
, notice that they use symbolic
references (i.e. Var (Reference state)
):
newtype Commands state = Commands
{ unCommands :: [Command state (Var (Reference state))] }
As mentioned above, when we generate commands we cannot generate real
references, e.g. file handles, thus Var (Reference state)
is used
which is isomorphic to just an Int
.
Sometimes it's convenient to split up runFake
into two parts, the
first checks if the command is allowed in the current state, i.e. the
precondition holds:
precondition :: StateModel state
=> state -> Command state (Var (Reference state)) -> Bool
precondition s cmd = case runFake cmd s of
Left _ -> False
Right _ -> True
And the second part advances the state:
nextState :: StateModel state
=> state -> Command state (Var (Reference state)) -> state
nextState s cmd = case runFake cmd s of
Right (s', _) -> s'
Left _err -> error "nextState: impossible, we checked for success in precondition"
We assume that we'll only ever look at the nextState
when the
precondition
holds.
Using these two functions we can implement QuickCheck's Arbitrary
type
class for Commands
which let's us generate and shrink Commands
:
instance StateModel state => Arbitrary (Commands state) where
arbitrary :: Gen (Commands state)
arbitrary = Commands <$> genCommands initialState
where
genCommands :: StateModel state
=> state -> Gen [Command state (Var (Reference state))]
genCommands s = sized $ \n ->
let
w = n `div` 2 + 1
in
frequency
[ (1, return [])
, (w, do mcmd <- generateCommand s `suchThatMaybe` precondition s
case mcmd of
Nothing -> return []
Just cmd -> (cmd :) <$> genCommands (nextState s cmd))
]
shrink :: Commands state -> [Commands state]
shrink = pruneShrinks . possibleShrinks
where
possibleShrinks :: Commands state -> [Commands state]
possibleShrinks = map (Commands . map fst) . shrinkList shrinker
. withStates initialState . unCommands
where
shrinker (cmd, s) = [ (cmd', s) | cmd' <- shrinkCommand s cmd ]
withStates :: StateModel state
=> state -> [Command state (Var (Reference state))]
-> [(Command state (Var (Reference state)), state)]
withStates s0 = go s0 []
where
go _s acc [] = reverse acc
go s acc (cmd : cmds) = go (nextState s cmd) ((cmd, s) : acc) cmds
pruneShrinks :: [Commands state] -> [Commands state]
pruneShrinks = coerce . filter (not . null)
. map (go initialState Set.empty [] . unCommands)
where
go _s _vars acc [] = reverse acc
go s vars acc (cmd : cmds)
| not (scopeCheck vars cmd) = go s vars acc cmds
| otherwise = case runFake cmd s of
Left _preconditionFailure -> go s vars acc cmds
Right (s', resp) ->
let
returnedVars = Set.fromList (toList resp)
vars' = returnedVars `Set.union` vars
in
go s' vars' (cmd : acc) cmds
scopeCheck :: Foldable (Command state)
=> Set (Var a) -> Command state (Var a) -> Bool
scopeCheck varsInScope cmd = usedVars `Set.isSubsetOf` varsInScope
where
usedVars = Set.fromList (toList cmd)
Notice how after shrinking we prune away all commands that don't pass the precondition or that are out of scope with respect to symbolic references.
The intuition here is that as we remove commands from the originally
generated Commands
(which all pass their preconditions), we might have
broken some preconditions and pruning simply removes the commands which
we made invalid in the process of shrinking. Similarly we can have a
command that creates a reference that later commands then depend on, if
we during shrinking remove the command that created the reference then
we must also remove the commands that depend on the reference.
Once we've generated Commands
we need to execute them against the
model and the real system using runFake
and runReal
. In the process
of doing so runReal
will produce Reference
s that later commands
might use, so we also need to substitute symbolic references for real
references. This, together with coverage statistics bookkeeping, is done
in the runCommands
function:
runCommands :: forall state. StateModel state
=> Commands state -> PropertyM (CommandMonad state) ()
runCommands (Commands cmds0) = go initialState emptyEnv cmds0
where
go :: state -> Env state -> [Command state (Var (Reference state))]
-> PropertyM (CommandMonad state) ()
go _state _env [] = return ()
go state env (cmd : cmds) = do
case runFake cmd state of
Left err -> do
monitor (counterexample ("Preconditon failed: " ++ show err))
assert False
Right (state', resp) -> do
let name = commandName cmd
monitor (tabulate "Commands" [name] . classify True name)
-- Here we substitute all symbolic references for real ones:
let ccmd = fmap (lookupEnv env) cmd
cresp <- run (runReal ccmd)
monitor (counterexample (show cmd ++ " --> " ++ show cresp))
monitor (monitoring (state, state') ccmd cresp)
-- Here we collect all references from the response and store it in
-- our environment, so that subsequence commands can be substituted.
let refs = toList cresp
env' = extendEnv env (zip [sizeEnv env..] refs)
cresp' = fmap (lookupEnv env') resp
ok = cresp == cresp'
unless ok $
monitor (counterexample ("Expected: " ++ show cresp' ++ "\nGot: " ++ show cresp))
-- And finally here's where we assert that the model and the real
-- implementation agree.
assert ok
go state' env' cmds
Where Env
is defined as follows.
newtype Env state = Env { unEnv :: IntMap (Reference state) }
That's all the pieces we need to implement that Counter
example that
we saw above, plus some new constructs to deal with precondition
failures and references.
Next let's have a look at an example where we need preconditions and references.
This example is taken from John's paper Testing the hard stuff and staying sane (2014).
The implementation is written in C and uses two indices which keep track of the front and back of the queue, this allows us to implement the queue in a circular fashion. I've copied the C code straight from the paper. In order to test it from Haskell, we'll use Haskell's foreign function interface.
typedef struct queue {
int *buf;
int inp, outp, size;
} Queue;
Queue *new(int n) {
int *buff = malloc(n*sizeof(int));
Queue q = {buff,0,0,n};
Queue *qptr = malloc(sizeof(Queue));
*qptr = q;
return qptr;
}
void put(Queue *q, int n) {
q->buf[q->inp] = n;
q->inp = (q->inp + 1) % q->size;
}
int get(Queue *q) {
int ans = q->buf[q->outp];
q->outp = (q->outp + 1) % q->size;
return ans;
}
int size(Queue *q) {
return (q->inp - q->outp) % q->size;
}
Notice that the C code doesn't do any error checking, e.g. if we get
from an empty queue then we'll get back uninitialised memory.
The circular buffer implementation is efficient, because it reuses the allocated memory as we go around in circles, but it's not obviously correct.
To model queues we'll use a more straight forward non-circular implementation. This is less efficient (doesn't matter as it's merely used during testing), but hopefully more obviously correct.
data FQueue = FQueue
{ fqElems :: [Int]
, fqSize :: Int
}
In the Counter
example above we only had one counter, so the model was
merely a single integer. In this example, because of new
returning a
queue, we need to be able to model arbitrary many queues. We can do this
using symbolic references (data Var a = Var Int
) as follows:
type State = Map (Var Queue) FQueue
emptyState :: State
emptyState = Map.empty
Where Queue
is the Haskell data type that corresponds to the C Queue
and the Var a
data type is provided by the library and is a symbolic
reference to a
(just an Int
eger). The idea being that in the model
we don't have access to real Queue
s, merely symbolic references to
them. This might seem a bit strange, but I hope that it will become more
clear when we model new
.
type FakeOp a = State -> Either Err (State, a)
fNew :: Int -> FakeOp (Var Queue)
fNew sz s =
let
v = Var (Map.size s)
s' = Map.insert v (FQueue [] sz) s
in
return (s', v)
As we have access to the state when defining our model, we can create
new unique symbolic references by simply counting how many symbolic
references we've created previously (using Map.size
)2.
As we said before, in the C code we don't do any error checking. In the
model we do check that, for example, the queue is non-empty before we
fGet
an item. These are our preconditions.
data Err
= QueueDoesNotExist
| QueueIsEmpty
deriving (Eq, Show)
fPut :: Var Queue -> Int -> FakeOp ()
fPut q i s
| q `Map.notMember` s = Left QueueDoesNotExist
| otherwise = return (Map.adjust (\fq -> fq { fqElems = fqElems fq ++ [i] }) q s, ())
fGet :: Var Queue -> FakeOp Int
fGet q s
| q `Map.notMember` s = Left QueueDoesNotExist
| otherwise = case fqElems (s Map.! q) of
[] -> Left QueueIsEmpty
i : is -> return (Map.adjust (\fq -> fq { fqElems = is }) q s, i)
fSize :: Var Queue -> FakeOp Int
fSize q s
| q `Map.notMember` s = Left QueueDoesNotExist
| otherwise = return (s, length (fqElems (s Map.! q)))
Recall that we won't generate a get operation unless the precondition
holds in the state that we are currently in, i.e. we will never generate
gets if the queue is empty and thus we'll never execute the C code for
get
which gives back uninitialised memory.
Having defined our model the interface implementation is almost mechanical.
instance StateModel State where
initialState = Map.empty
type Reference State = Queue
type PreconditionFailure State = Err
data Command State q
= New Int
| Put q Int
| Get q
| Size q
deriving (Show, Functor, Foldable)
data Response State q
= New_ q
| Put_ ()
| Get_ Int
| Size_ Int
deriving (Eq, Show, Functor, Foldable)
generateCommand s
| Map.null s = New . getPositive <$> arbitrary
| otherwise = oneof
[ New . getPositive <$> arbitrary
, Put <$> arbitraryQueue <*> arbitrary
, Get <$> arbitraryQueue
]
where
arbitraryQueue :: Gen (Var Queue)
arbitraryQueue = elements (Map.keys s)
shrinkCommand _s (New i) = [ New i' | Positive i' <- shrink (Positive i) ]
shrinkCommand _s (Put q i) = [ Put q i' | i' <- shrink i ]
shrinkCommand _s _cmd = []
runFake (New sz) s = fmap New_ <$> fNew sz s
runFake (Put q i) s = fmap Put_ <$> fPut q i s
runFake (Get q) s = fmap Get_ <$> fGet q s
runFake (Size q) s = fmap Size_ <$> fSize q s
-- Here `new`, `put`, `get` and `size` are FFI wrappers for their respective C
-- functions.
runReal (New sz) = New_ <$> new sz
runReal (Put q i) = Put_ <$> put q i
runReal (Get q) = Get_ <$> get q
runReal (Size q) = Size_ <$> size q
The only new thing worth paying attention to is the q
in Command
and
Response
, which is parametrised so that it works for both symbolic and
real references. The Functor
instance lets us do substitution, while
Foldable
lets us extract all new references from a response, so that
we can substitute them in later Command
s.
Having implemented the interface, we can write our property as follows.
prop_queue :: Commands State -> Property
prop_queue cmds = monadicIO $ do
runCommands cmds
assert True
When we run it, using quickCheck prop_queue
, we get the following
error.
*** Failed! Assertion failed (after 7 tests and 5 shrinks):
Commands {unCommands = [New 1,Put (Var 0) 0,Put (Var 0) 1,Get (Var 0)]}
New 1 --> New_ (Queue 0x00000000016e9010)
Put (Var 0) 0 --> Put_ ()
Put (Var 0) 1 --> Put_ ()
Get (Var 0) --> Get_ 1
Expected: Get_ 0
Got: Get_ 1
So we create a new queue of size 1
, put two items (0
and 1
) into
it, and finally we read a value from the queue and this is where the
assertion fails. Or model returns 0
, because it's a FIFO queue, but
the C code returns 1
. The reason for this is that in the C code
there's no error checking, so writing a value to a full queue simply
overwrites the oldest value. So there's actually nothing wrong with the
implementation, but rather the model is wrong. We've forgotten a
precondition:
data Err
= QueueDoesNotExist
| QueueIsEmpty
+ | QueueIsFull
fPut :: Var Queue -> Int -> State -> Either Err (State, ())
fPut q i s
| q `Map.notMember` s = Left QueueDoesNotExist
+ | length (fqElems (s Map.! q)) >= fqSize (s Map.! q) = Left QueueIsFull
| otherwise = return (Map.adjust (\fq -> fq { fqElems = fqElems fq ++ [i] }) q s, ())
We can add the counterexample that we got as a regression test to our test suite as follows:
unit_queueFull :: IO ()
unit_queueFull = quickCheck (withMaxSuccess 1 (expectFailure (prop_queue cmds)))
where
cmds = Commands
[ New 1
, Put (Var 0) 1
, Put (Var 0) 0
, Get (Var 0)
]
Notice that we can basically copy-paste cmds
from QuickCheck's output,
but I've done some formatting here to make it more readable.
After fixing the precondition for fPut
, unit_queueFull
fails as
follows:
+++ OK, failed as expected. Assertion failed (after 1 test):
New 1 --> New_ (Queue 0x00000000006f6d20)
Put (Var 0) 1 --> Put_ ()
Preconditon failed: QueueIsFull
When we rerun quickCheck prop_queue
we will not generate this example
again, because all preconditions need to hold, and the property passes:
>>> quickCheck prop_queue
+++ OK, passed 100 tests:
95% New
86% Put
67% Get
Commands (2497 in total):
44.13% New
41.25% Put
14.62% Get
However as we can see in the output there's no coverage for Size
! The
reason for this is because we've forgot to add it to our generator:
generateCommand s
| Map.null s = New . getPositive <$> arbitrary
| otherwise = oneof
[ New . getPositive <$> arbitrary
, Put <$> arbitraryQueue <*> arbitrary
, Get <$> arbitraryQueue
+ , Size <$> arbitraryQueue
]
After adding it and rerunning the property, we get the following error:
>>> quickCheck prop_queue
*** Failed! Assertion failed (after 25 tests and 8 shrinks):
Commands {unCommands = [New 1,Put (Var 0) 0,Size (Var 0)]}
New 1 --> New_ (Queue 0x0000000001444220)
Put (Var 0) 0 --> Put_ ()
Size (Var 0) --> Size_ 0
Expected: Size_ 1
Got: Size_ 0
Size should return how many items are in the queue, so after we put one
item into a queue we expect it to return 1
, but in the above
counterexample it returns 0
.
To understand why this happens we have to look at how put
and size
are implemented:
void put(Queue *q, int n) {
q->buf[q->inp] = n;
q->inp = (q->inp + 1) % q->size;
}
int size(Queue *q) {
return (q->inp - q->outp) % q->size;
}
In put
when we do q->inp = (q->inp + 1) % q->size
we get
q->inp = (0 + 1) % 1 == 0
and then when we calculate the size
we get
(0 - 0) % 1 == 0
. One way to fix this is to make q->size
be n + 1
rather than n
where n
is the size parameter of new
, that way put
will do q->inp = (0 + 1) % 2 == 1
instead and size will be
1 - 0 % 2 == 1
which is correct. Here's the diff:
Queue *new(int n) {
- int *buff = malloc(n*sizeof(int));
- Queue q = {buff,0,0,n};
+ int *buff = malloc((n + 1)*sizeof(int));
+ Queue q = {buff,0,0,n + 1};
Queue *qptr = malloc(sizeof(Queue));
*qptr = q;
return qptr;
}
As before, we can add a regression test for the size issue as follows:
unit_queueSize :: IO ()
unit_queueSize = quickCheck (withMaxSuccess 1 (prop_queue cmds))
where
cmds = Commands
[ New 1
, Put (Var 0) 0
, Size (Var 0)
]
After the change to new
this test passes, but if we rerun the property
we get the following error:
*** Failed! Assertion failed (after 38 tests and 12 shrinks):
Commands {unCommands = [New 1,Put (Var 0) 0,Get (Var 0),Put (Var 0) 0,Size (Var 0)]}
New 1 --> New_ (Queue 0x00007fd47c00a920)
Put (Var 0) 0 --> Put_ ()
Get (Var 0) --> Get_ 0
Put (Var 0) 0 --> Put_ ()
Size (Var 0) --> Size_ (-1)
Expected: Size_ 1
Got: Size_ (-1)
After the second put
we'll have q->inp = (1 + 1) % 2 == 0
while
q->outp = 1
due to the get
and so when we call size
we get
0 - 1 % 2 == -1
. Taking the absolute value:
int size(Queue *q) {
- return (q->inp - q->outp) % q->size;
+ return abs(q->inp - q->outp) % q->size;
}
Makes this test case pass, and in fact it also makes the property pass:
>>> quickCheck prop_queue
+++ OK, passed 100 tests:
93% New
79% Put
74% Size
59% Get
Commands (2340 in total):
32.09% New
29.06% Size
28.25% Put
10.60% Get
John says that at this point most programmers would probably be happy and believe that their implementation works, but if we rerun it again (or increase the amount of tests generated), we get:
>>> quickCheck prop_queue
*** Failed! Assertion failed (after 56 tests and 19 shrinks):
Commands {unCommands = [New 2,Put (Var 0) 0,Put (Var 0) 0,Get (Var 0),Put (Var 0) 0,Size (Var 0)]}
New 2 --> New_ (Queue 0x00007fbf4c006490)
Put (Var 0) 0 --> Put_ ()
Put (Var 0) 0 --> Put_ ()
Get (Var 0) --> Get_ 0
Put (Var 0) 0 --> Put_ ()
Size (Var 0) --> Size_ 1
Expected: Size_ 2
Got: Size_ 1
We can see that all queues of size 1
now work, because this test
starts by creating a queue of size 2
, so we've made progress. But
taking the absolute value isn't the correct way to calculate the size
(even though it works for queues of size 1
), the following is the
correct way to do it:
int size(Queue *q) {
- return abs(q->inp - q->outp) % q->size;
+ return (q->inp - q->outp + q->size) % q->size;
}
With this final tweak, the property passes. I hope that this somewhat long example gives you a feel for how property-based testing drives the development and debugging of the code.
In the movie Die Hard 3 there's a scene where Bruce Willis and Samuel L. Jackson have to solve a puzzle in order to stop a bomb from going off. The puzzle is: given a 3L and a 5L jug, how can you measure exactly 4L?
I first saw this example solved using TLA+ and I wanted to include it here because it shows that we don't necessarily need a real implementation, merely running the model/fake can be useful.
The main idea is to model the two jugs and all actions we can do with them and then throw an exception when the big jug contains 4L. This will fail the test and output the shrunk sequence of actions that resulted in the failure, giving us the solution to the puzzle.
data Model = Model
{ bigJug :: Int
, smallJug :: Int
}
deriving (Eq, Show)
instance StateModel Model where
initialState = Model 0 0
type Reference Model = Void
data Command Model r
= FillBig
| FillSmall
| EmptyBig
| EmptySmall
| SmallIntoBig
| BigIntoSmall
deriving (Show, Enum, Bounded, Functor, Foldable)
data Response Model r = Done | BigJugIs4
deriving (Eq, Show, Functor, Foldable)
generateCommand :: Model -> Gen (Command Model r)
generateCommand _s = elements [minBound ..]
runFake :: Command Model r -> Model -> Either void (Model, Response Model r)
runFake FillBig s = done s { bigJug = 5 }
runFake FillSmall s = done s { smallJug = 3 }
runFake EmptyBig s = done s { bigJug = 0 }
runFake EmptySmall s = done s { smallJug = 0 }
runFake SmallIntoBig (Model big small) =
let big' = min 5 (big + small) in
done (Model { bigJug = big'
, smallJug = small - (big' - big) })
runFake BigIntoSmall (Model big small) =
let small' = min 3 (big + small) in
done (Model { bigJug = big - (small' - small)
, smallJug = small'
})
runReal :: Command Model Void -> IO (Response Model (Reference Model))
runReal _cmd = return Done
monitoring :: (Model, Model) -> Command Model Void -> Response Model Void
-> Property -> Property
monitoring (_s, s') _cmd _resp =
counterexample $ "\n State: " ++ show s' ++ "\n"
done :: Model -> Either void (Model, Response Model ref)
done s' | bigJug s' == 4 = return (s', BigJugIs4)
| otherwise = return (s', Done)
prop_dieHard :: Commands Model -> Property
prop_dieHard cmds = withMaxSuccess 10000 $ monadicIO $ do
runCommands cmds
assert True
When we run quickcheck prop_dieHard
we get the following output:
+++ OK, failed as expected. Assertion failed (after 199 tests and 11 shrinks):
Commands [FillBig,BigIntoSmall,EmptySmall,BigIntoSmall,FillBig,BigIntoSmall]
FillBig --> Done
State: Model {bigJug = 5, smallJug = 0}
BigIntoSmall --> Done
State: Model {bigJug = 2, smallJug = 3}
EmptySmall --> Done
State: Model {bigJug = 2, smallJug = 0}
BigIntoSmall --> Done
State: Model {bigJug = 0, smallJug = 2}
FillBig --> Done
State: Model {bigJug = 5, smallJug = 2}
BigIntoSmall --> Done
State: Model {bigJug = 4, smallJug = 3}
Expected: BigJugIs4
Got: Done
Notice how the trace shows the intermediate states, making it easy to verify that it's indeed a correct solution to the puzzle3.
Let's now turn our focus to parallel property-based testing.
Debugging buggy concurrent code is not fun. The main reason for this is that the threads interleave in different ways between executions, making it hard to reproduce the bug and hard to verify that a bug fix actually worked.
Ideally we'd like to make working with concurrent code as pleasant as the sequential stateful case and without the user having to write any additional test code. In order to explain how we can achieve this, we need to first understand how we can test concurrent code in a reproducible way.
Recall our Counter
that we looked at in the sequential testing case.
Here we'll be using a slight generalisation where the incr
takes an
integer parameter which specifies by how much we want to increment
(instead of always incrementing by 1
).
When we interact with the counter sequentially, i.e. one command at the time, then it appears to count correctly:
>>> incr 1
>>> incr 2
>>> get
3
If we instead concurrently issue the incr
ements, we see something
strange:
> forM_ [0..100000] $ \i -> do
> c <- newCounter
> concurrently_ (incr c 1) (incr c 2)
> x <- get c
> if x == 3 then return () else error ("i = " ++ show i ++ ", x = " ++ show x)
*** Exception: i = 29768, x = 1
After 29768 iterations we get back 1
rather than the expected 3
! The
reason for this is because there's a race condition in the
implementation of incr
:
incr i = do
j <- readIORef gLOBAL_COUNTER
writeIORef gLOBAL_COUNTER (i + j)
Because we first read the old value and then write the new incremented value in an non-atomic way, it's possible that if two threads do this at the same time they overwrite each others increment. For example, consider the interleaving:
thread 1, incr 1 | thread 2, incr 2
---------------------+------------------
0 <- readIORef |
| 0 <- readIORef
| writeIORef (2 + 0)
writeIORef (1 + 0) |
|
v
time
If we read from the counter after the two increments are done we get 1
instead of the expected 3
. The fix to this problem is to do an atomic
update using atomicModifyIORef'
, instead of first reading and then
writing to the IORef
.
The concurrent test that we just wrote is not only specific to the
counter example but also only uses three fixed commands, the two
concurrent incr
ements followed by a get
. While it was enough to find
this race condition, in general we'd like to try arbitrary combinations
of commands and possibly involving more than two threads.
The key concept we need in order to accomplish that is that of concurrent history, which is perhaps easiest to explain in terms of a more familiar concept: a sequence diagram.
Consider the following sequence diagram:
Here we see that the first and second thread concurrently increment, the first thread then reads the counter concurrently with the second thread's increment that's still going on. The second thread's increment finishes and a third thread does a read which is concurrent with the first thread's read.
We can abstract away the arrows and merely focus on the intervals of the commands:
If we rotate the intervals we get the concurrent history:
Note that the execution of some commands overlap in time, this is what's meant by concurrent and arguably it's easier to see the overlap here than in the original sequence diagram.
We've also abstracted away the counter, it's a black box from the perspective of the threads. The only thing we know for sure is when we invoked the operation and when it returned, which is what our interval captures. We also know that the effect of the operation must have happened sometime within that interval.
One such concurrent history can have different interleavings, depending on when exactly the effect of the commands happen. Here are two possible interleavings, where the red cross symbolises when the effect happened (i.e. when exactly the counter update its state).
The first corresponds to the sequential history
< incr 1, get, incr 2, get >
:
and the other interleaving corresponds to the sequential history
< incr 1, incr 2, get, get >
:
One last thing we've left out from the concurrent history so far is the
responses. In this example, the only interesting responses are those of
the get
s.
Let's say that the get
s returned 1
and 3
respectively. Is this a
correct concurrent outcome? Yes, according to linearisability it's
enough to find a single interleaving for which the sequential state
machine model can explain the outcome and in this case the first
interleaving above < incr 1, get, incr 2, get >
does that.
What if the get
s both returned 3
? That's also correct and witnessed
by the second interleaving < incr 1, incr 2, get, get >
. When we can
find a sequential interleaving that supports the outcome of a concurrent
execution we say that the concurrent history linearises.
If the get
on the third thread returned 1
or 2
however, then it
would be a non-linearisable outcome. We can see visually that that get
happens after both incr
, so no matter where we choose to place the red
crosses on the incr
s the effects will happen before that get
so it
must return 3
. Is it even possible that 1
or 2
are returned? It's,
imagine if incr
is implemented by first reading the current value then
storing the incremented value, in that case there can be a race where
the incr
s overwrite each other.
So to summarise, we execute commands concurrently using several threads and gather a concurrent history of the execution. We then try to find a sequential interleaving (a choice of where the red crosses in the diagrams should be) which respects the a sequential state machine model specification. If we find a single one that does, then we say that the history linearises and that the concurrent execution is correct, if we cannot find a sequential interleaving that respects the model then the history doesn't linearise and we have found a problem.
Let's try to implement the above. We'll split up the implementation in three parts. First, we'll show how to generate and shrink parallel commands, these will be different than the sequential commands as we have more than one thread that does the execution. Second, we'll have a look at how to execute the generated parallel commands to produce a concurrent history. Finally, we'll implement linearisability checking by trying to find an interleaving of the concurrent history which respects the sequential model.
First we need define what a parallel program is:
newtype ParallelCommands state = ParallelCommands [Fork state]
newtype Fork state = Fork [Command state (Var (Reference state))]
The idea is that the commands inside Fork
s get executed in parallel,
this list will only be between one and three commands long, i.e.
capturing single, double or triple threaded execution. The amount of
Fork
s themselves vary with the size of the test case, just like when
we were doing the sequential testing.
Depending on the order in which the commands in the Fork
s get
executed, we can potentially get different models. For example
Fork [Write "a" "foo", Write "a" "bar"]
, depending on which branch of
the Fork
gets executed first we might end up with either "foo"
or
"bar"
being written to "a"
.
Because of this, we have generalised generation and shrinking to work on a set of states rather than just a single state:
class (StateModel state, Ord state) => ParallelModel state where
generateCommandParallel :: [state] -> Gen (Command state (Var (Reference state)))
generateCommandParallel ss = do
s <- elements ss
generateCommand s
shrinkCommandParallel :: [state] -> Command state (Var (Reference state))
-> [Command state (Var (Reference state))]
shrinkCommandParallel ss cmd = shrinkCommand (maximum ss) cmd
Notice that the default implementation for generation, which should be
good enough for most examples, picks an arbitrary state and reuses the
generation function from the sequential case. Similar shrinking picks
the biggest state (determined by the Ord
instance) as the default
implementation. The user is able to override these defaults, in case
generation or shrinking depends on some more specific state.
We can now write a generator for parallel programs.
arbitrary :: Gen (ParallelCommands state)
arbitrary = ParallelCommands <$> go [initialState]
where
go :: [state] -> Gen [Fork state]
go ss = sized $ \n ->
let
w = n `div` 2 + 1
in
frequency
[ (1, return [])
, (w, do k <- frequency [ (50, return 1) -- 50% single threaded
, (30, return 2) -- 30% double threaded
, (20, return 3) -- 20% triple threaded
]
mCmds <- vectorOf k (generateCommandParallel ss)
`suchThatMaybe` (parallelSafe ss . Fork)
case mCmds of
Nothing -> return []
Just cmds ->
(Fork cmds :) <$> go (nextStates ss cmds))
]
Where nextStates
gives all potential next states and is defined as
follows.
nextStates :: (StateModel state, Ord state)
=> [state] -> [Command state (Var (Reference state))] -> [state]
nextStates ss cmds = nubOrd [ foldl' nextState s cmds | s <- ss ]
The other helper function that we need for generation is parallelSafe
,
which requires a bit of background.
In the sequential case a precondition is a contract that needs to be
fulfilled by the client before the command is issued. In the parallel
case there are multiple clients, so it could be the case that one client
unknowingly breaks another clients precondition. For example
Fork [Write "a" "foo", Delete "a"]
, where the precondition for both
commands is that "a"
exists. If Delete
gets executed first then it
would break Write
's precondition.
The solution to the precondition problem is to check that they hold in
all possible interleavings of a Fork
, which is what parallelSafe
does:
parallelSafe :: ParallelModel state => [state] -> Fork state -> Bool
parallelSafe ss (Fork cmds0) = and
[ preconditionsHold s cmds | s <- toList ss, cmds <- permutations cmds0 ]
where
preconditionsHold s0 = all (go s0) . permutations
where
go _s [] = True
go s (cmd : cmds)
| precondition s cmd = go (nextState s cmd) cmds
| otherwise = False
While shrinking we also use parallelSafe
:
shrink :: ParallelCommands state -> [ParallelCommands state]
shrink = pruneShrinks . possibleShrinks
where
possibleShrinks :: ParallelCommands state -> [ParallelCommands state]
possibleShrinks
= map (coerce . map (map fst))
. shrinkList (shrinkList shrinker) . withParStates . unParallelCommands
where
withParStates :: (StateModel state, Ord state)
=> [Fork state]
-> [[(Command state (Var (Reference state)), [state])]]
withParStates = go [initialState] [] . coerce
where
go _ss acc [] = reverse acc
go ss acc (cmds : cmdss) =
go (nextStates ss cmds) (map (\cmd -> (cmd, ss)) cmds : acc) cmdss
shrinker :: (Command state (Var (Reference state)), [state])
-> [(Command state (Var (Reference state)), [state])]
shrinker (cmd, ss) = [ (cmd', ss) | cmd' <- shrinkCommandParallel ss cmd ]
pruneShrinks :: [ParallelCommands state] -> [ParallelCommands state]
pruneShrinks = coerce . filter (not . null)
. map (go [initialState] Set.empty [] . unParallelCommands)
where
go :: [state] -> Set (Var (Reference state)) -> [Fork state] -> [Fork state] -> [Fork state]
go _ss _vars acc [] = reverse acc
go ss vars acc (fork@(Fork cmds) : forks)
| all (scopeCheck vars) cmds
, parallelSafe ss fork =
let
ss' = nextStates ss cmds
vars' = getReturnedVars (head ss) vars cmds -- NOTE: head is safe
in
go ss' vars' (fork : acc) forks
| otherwise = go ss vars acc forks
-- It doesn't matter which of the possible states we start in, as all
-- commands in a fork pass their preconditions in all states. It also
-- doesn't matter in which interleaving we gather the responses, as
-- all we do is collect the `Var`s that get returned into an unordered
-- `Set`.
getReturnedVars _s vars [] = vars
getReturnedVars s vars (cmd : cmds) = case runFake cmd s of
Left _preconditionFailed ->
error "getReturnedVars: impossible, parallelSafe checks that all preconditions hold"
Right (_s', resp) ->
getReturnedVars s (vars `Set.union` Set.fromList (toList resp)) cmds
In addition we also check that shrinking doesn't create any scoping issues, i.e. if we remove a command which creates a symbolic variable we also need to remove any fork that contains a command which uses said symbolic variable.
Another option is to skip the scope checking and instead require the user to explicitly require preconditions which ensure the scope4.
One final difference between the sequential and the parallel case is
that because of the use of threads to achieve parallel execution, and
the fact we can only spawn threads of things of type IO
, we also need
to be able to interpret our CommandMonad
into IO
, which is what
runCommandMonad
(which is also part of the ParallelModel
type class)
does.
-- If another command monad is used we need to provide a way run it inside the
-- IO monad. This is only needed for parallel testing, because IO is the only
-- monad we can execute on different threads.
runCommandMonad :: proxy state -> CommandMonad state a -> IO a
We can now implement parallel execution of commands as follows:
newtype History state = History [Event state]
deriving stock instance
(Show (Command state (Var (Reference state))),
Show (Response state (Reference state))) => Show (History state)
data Event state
= Invoke Pid (Command state (Var (Reference state)))
| Ok Pid (Response state (Reference state))
deriving stock instance
(Show (Command state (Var (Reference state))),
Show (Response state (Reference state))) => Show (Event state)
newtype Pid = Pid Int
deriving stock (Eq, Ord, Show)
deriving newtype Enum
runParallelCommands :: forall state. ParallelModel state
=> ParallelCommands state -> PropertyM IO ()
runParallelCommands cmds0@(ParallelCommands forks0) = do
forM_ (parallelCommands cmds0) $ \cmd -> do
let name = commandName cmd
monitor (tabulate "Commands" [name] . classify True name)
monitor (tabulate "Concurrency" (map (show . length . unFork) forks0))
q <- liftIO newTQueueIO
c <- liftIO newAtomicCounter
env <- liftIO (runForks q c emptyEnv forks0)
hist <- History <$> liftIO (atomically (flushTQueue q))
let ok = linearisable env (interleavings hist)
unless ok (monitor (counterexample (show hist)))
assert ok
where
runForks :: TQueue (Event state) -> AtomicCounter -> Env state -> [Fork state]
-> IO (Env state)
runForks _q _c env [] = return env
runForks q c env (Fork cmds : forks) = do
envs <- liftIO $
mapConcurrently (runParallelReal q c env) (zip [Pid 0..] cmds)
let env' = combineEnvs (env : envs)
runForks q c env' forks
runParallelReal :: TQueue (Event state) -> AtomicCounter -> Env state
-> (Pid, Command state (Var (Reference state))) -> IO (Env state)
runParallelReal q c env (pid, cmd) = do
atomically (writeTQueue q (Invoke pid cmd))
eResp <- try (runCommandMonad (Proxy :: Proxy state) (runReal (fmap (lookupEnv env) cmd)))
case eResp of
Left (err :: SomeException) ->
error ("runParallelReal: " ++ displayException err)
Right resp -> do
-- NOTE: It's important that we extend the environment before writing `Ok`
-- to the history, otherwise we might get scope issues.
env' <- extendEnvParallel env c (toList resp)
atomically (writeTQueue q (Ok pid resp))
return env'
Extending the environment in the parallel case requires an atomic counter in order to avoid more than one thread adding the same variable:
newtype AtomicCounter = AtomicCounter (IORef Int)
newAtomicCounter :: IO AtomicCounter
newAtomicCounter = AtomicCounter <$> newIORef 0
-- Returns old value.
incrAtomicCounter :: AtomicCounter -> Int -> IO Int
incrAtomicCounter (AtomicCounter ioRef) n =
atomicModifyIORef' ioRef (\old -> (old + n, old))
extendEnvParallel :: Env state -> AtomicCounter -> [Reference state] -> IO (Env state)
extendEnvParallel env c refs = do
i <- incrAtomicCounter c (length refs)
return (extendEnv env (zip [i..] refs))
combineEnvs :: [Env state] -> Env state
combineEnvs = Env . IntMap.unions . map unEnv
Hopefully the execution part is clear, next let's have a look at how we check the result of an execution.
Recall from our parallel counter example in the introduction to parallel
testing that it's enough to find any possible interleaving which
respects the sequential model. So let's start by enumerating all
possible interleavings using a Rose
tree
data structure:
data Op state = Op (Command state (Var (Reference state)))
(Response state (Reference state))
interleavings :: History state -> Forest (Op state)
interleavings (History []) = []
interleavings (History evs0) =
[ Node (Op cmd resp) (interleavings (History evs'))
| (tid, cmd) <- takeInvocations evs0
, (resp, evs') <- findResponse tid
(filter1 (not . matchInvocation tid) evs0)
]
where
takeInvocations :: [Event state] -> [(Pid, Command state (Var (Reference state)))]
takeInvocations [] = []
takeInvocations ((Invoke pid cmd) : evs) = (pid, cmd) : takeInvocations evs
takeInvocations ((Ok _pid _resp) : _) = []
findResponse :: Pid -> [Event state] -> [(Response state (Reference state), [Event state])]
findResponse _pid [] = []
findResponse pid ((Ok pid' resp) : evs) | pid == pid' = [(resp, evs)]
findResponse pid (ev : evs) =
[ (resp, ev : evs') | (resp, evs') <- findResponse pid evs ]
matchInvocation :: Pid -> Event state -> Bool
matchInvocation pid (Invoke pid' _cmd) = pid == pid'
matchInvocation _ _ = False
filter1 :: (a -> Bool) -> [a] -> [a]
filter1 _ [] = []
filter1 p (x : xs) | p x = x : filter1 p xs
| otherwise = xs
We can then check if there is a path through this rose tree which agrees with the sequential model:
linearisable :: forall state. StateModel state
=> Env state -> Forest (Op state) -> Bool
linearisable env = any' (go initialState)
where
go :: state -> Tree (Op state) -> Bool
go s (Node (Op cmd cresp) ts) =
case runFake cmd s of
Left _preconditionFailure ->
error "linearisable: impossible, all precondtions are satisifed during generation"
Right (s', resp) ->
cresp == fmap (lookupEnv env) resp && any' (go s') ts
any' :: (a -> Bool) -> [a] -> Bool
any' _p [] = True
any' p xs = any p xs
Having defined the ParallelModel
interface (which depends on the
StateModel
interface from the sequential testing) and programmed our
parallel generation, shrinking and parallel execution and
linearisability checking against this interface, we basically get
parallel testing for free by reusing the sequential model.
This is the only new code we need to add to enable parallel testing of
our Counter
example5 from before:
instance ParallelModel Counter where
-- The command monad is IO, so we don't need to do anything here.
runCommandMonad _ = id
prop_parallelCounter :: ParallelCommands Counter -> Property
prop_parallelCounter cmds = monadicIO $ do
replicateM_ 10 $ do
run reset
runParallelCommands cmds
assert True
If we run the above property with runReal
runReal Incr = Incr_ <$> incrRaceCondition
being implemented using an increment with a race condition:
incrRaceCondition :: IO ()
incrRaceCondition = do
n <- readIORef gLOBAL_COUNTER
writeIORef gLOBAL_COUNTER (n + 1)
then a failure is found:
Assertion failed (after 36 tests and 1 shrink):
ParallelCommands [Fork [Incr,Incr],Fork [Incr],Fork [Get,Get,Get],
Fork [Incr],Fork [Get,Get],Fork [Get,Get],
Fork [Incr],Fork [Get,Get,Incr],Fork [Incr],
Fork [Get,Get,Get],Fork [Incr,Incr],
Fork [Incr,Get],Fork [Incr,Incr],Fork [Get],
Fork [Incr]]
However, shrinking didn't work well. The reason for this is that QuickCheck tries a smaller test case (which still has the race condition), but because of a different interleaving of threads the race doesn't get triggered and so QuickCheck thinks it found the minimal test case (because the smaller test case, that the shrinker picked, passes).
The proper solution to this problem is to use a deterministic thread scheduler, this is what they do the parallel testing paper6. A simpler workaround is to introduce a small sleep after each read or write to shared memory, this will make it more likely that the same interleaving happens when we shrink the test:
incrRaceCondition :: IO ()
incrRaceCondition = do
n <- readIORef gLOBAL_COUNTER
threadDelay 100
writeIORef gLOBAL_COUNTER (n + 1)
threadDelay 100
With this change we get the minimal test case that triggers the race condition:
Assertion failed (after 6 tests and 4 shrinks):
ParallelCommands [Fork [Incr,Incr],Fork [Get]]
We can avoid having to sprinkle sleeps around our interaction with shared state by creating a module with the same operations as on shared memory where the sleep is already included:
module SleepyIORef (module SleepyIORef, IORef) where
import Control.Concurrent (threadDelay)
import Data.IORef (IORef)
import qualified Data.IORef as IORef
readIORef :: IORef a -> IO a
readIORef ref = do
threadDelay 1000
IORef.readIORef ref
That way if we find a race, we can change the import from
import Data.IORef
to import SleepyIORef
and rerun the tests and get
better shrinking. This situation is not ideal, but save us the trouble
of having to re-implement a scheduler. It's worth stressing that the
race is found in the unmodified code and the introduction of sleep is
only needed to make the counterexample smaller.
For a slightly more complicated example containing race conditions, let's have a look at an implementation of the Erlang process registry7.
The idea behind Erlang's process registry is that you can spawn threads,
register the ThreadId
to some name of type string, and then lookup the
thread by name rather than its thread id. Threads can also be
unregistered and killed.
This is useful if threads die and get restarted and register the same name, then other threads can easily find the thread id of the new thread using the registry.
{-# NOINLINE registry #-}
registry :: IORef [(String,ThreadId)]
registry = unsafePerformIO (newIORef [])
alive :: ThreadId -> IO Bool
alive tid = do
s <- threadStatus tid
return $ s /= ThreadFinished && s /= ThreadDied
spawn :: IO ThreadId
spawn = forkIO (threadDelay 100000000)
whereis :: String -> IO (Maybe ThreadId)
whereis name = do
reg <- readRegistry
return $ lookup name reg
register :: String -> ThreadId -> IO ()
register name tid = do
ok <- alive tid
reg <- readRegistry
if ok && name `notElem` map fst reg && tid `notElem` map snd reg
then do
atomicModifyIORef registry $ \reg' ->
if name `notElem` map fst reg' && tid `notElem` map snd reg'
then ((name,tid):reg',())
else (reg',badarg)
else badarg
unregister :: String -> IO ()
unregister name = do
reg <- readRegistry
if name `elem` map fst reg
then atomicModifyIORef registry $ \reg' ->
(filter ((/=name).fst) reg',
())
else badarg
readRegistry :: IO [(String, ThreadId)]
readRegistry = do
reg <- readIORef registry
garbage <- filterM (fmap not.alive) (map snd reg)
atomicModifyIORef' registry $ \reg' ->
let reg'' = filter ((`notElem` garbage).snd) reg' in (reg'',reg'')
badarg :: a
badarg = error "bad argument"
kill :: ThreadId -> IO ()
kill tid = do
killThread tid
waitUntilDead 1000
where
waitUntilDead :: Int -> IO ()
waitUntilDead 0 = error "kill: thread didn't die"
waitUntilDead n = do
b <- alive tid
if b
then do
threadDelay 1000
waitUntilDead (n - 1)
else return ()
The model of the process registry contains three things that we haven't
seen before. The first thing to note is that Register
and Unregister
might fail, so we use an Either ErrorCall
in their respective
responses and a function called abstractError
to make the error from
the model and the error from the real implementation match up. It's
called abstract, because it abstracts away details from the real
implementation, and it's a useful technique to know as it can be applied
to other settings.
The second thing to notice is the use of monitoring
to keep statistics
of how often Register
and Unregister
actually do fail, this is
useful coverage information that ensures that our generators and or
pre-conditions are not too restrictive.
data RegState = RegState
{ tids :: [Var ThreadId]
, regs :: [(String, Var (ThreadId))]
, killed :: [Var ThreadId]
}
deriving (Eq, Ord, Show)
instance StateModel RegState where
initialState :: RegState
initialState = RegState [] [] []
type Reference RegState = ThreadId
data Command RegState tid
= Spawn
| WhereIs String
| Register String tid
| Unregister String
| Kill tid
deriving (Show, Functor, Foldable)
data Response RegState tid
= Spawn_ tid
| WhereIs_ (NonFoldable (Maybe tid))
| Register_ (Either ErrorCall ())
| Unregister_ (Either ErrorCall ())
| Kill_ ()
deriving (Eq, Show, Functor, Foldable)
generateCommand :: RegState -> Gen (Command RegState (Var ThreadId))
generateCommand s = oneof $
[ return Spawn ] ++
[ Register <$> arbitraryName <*> elements (tids s) | not (null (tids s)) ] ++
[ Unregister <$> arbitraryName
, WhereIs <$> arbitraryName
] ++
[ Kill <$> elements (tids s) | not (null (tids s)) ]
where
arbitraryName :: Gen String
arbitraryName = elements allNames
runFake :: Command RegState (Var ThreadId)-> RegState
-> Either void (RegState, Response RegState (Var ThreadId))
runFake Spawn s = let tid = Var (length (tids s)) in
return (s { tids = tids s ++ [tid] }, Spawn_ tid)
runFake (WhereIs name) s = return (s, WhereIs_ (NonFoldable (lookup name (regs s))))
runFake (Register name tid) s
| tid `elem` tids s
, name `notElem` map fst (regs s)
, tid `notElem` map snd (regs s)
, tid `notElem` killed s
= return (s { regs = (name, tid) : regs s }, Register_ (Right ()))
| otherwise
= return (s, Register_ (Left (ErrorCall "bad argument")))
runFake (Unregister name) s
| name `elem` map fst (regs s) =
return (s { regs = remove name (regs s) }, Unregister_ (Right ()))
| otherwise = return (s, Unregister_ (Left (ErrorCall "bad argument")))
where
remove x = filter ((/= x) . fst)
runFake (Kill tid) s = return (s { killed = tid : killed s
, regs = remove tid (regs s)}, Kill_ ())
where
remove x = filter ((/= x) . snd)
runReal :: Command RegState ThreadId -> IO (Response RegState ThreadId)
runReal Spawn = Spawn_ <$> spawn
runReal (WhereIs name) = WhereIs_ . NonFoldable <$> whereis name
runReal (Register name tid) = Register_ <$> fmap (left abstractError) (try (register name tid))
runReal (Unregister name) = Unregister_ <$> fmap (left abstractError) (try (unregister name))
runReal (Kill tid) = Kill_ <$> kill tid
monitoring :: (RegState, RegState) -> Command RegState ThreadId
-> Response RegState ThreadId -> Property -> Property
monitoring (_s, s') cmd resp =
let
aux tag = classify True (show tag)
. counterexample ("\n State: " ++ show s' ++ "\n")
in
case (cmd, resp) of
(Register {}, Register_ (Left _)) -> aux RegisterFailed
(Register {}, Register_ (Right _)) -> aux RegisterSucceeded
(Unregister {}, Unregister_ (Left _)) -> aux UnregisterFailed
(Unregister {}, Unregister_ (Right _)) -> aux UnregisterSucceeded
_otherwise -> counterexample $ "\n State: " ++ show s' ++ "\n"
-- Throws away the location information from the error, so that it matches up
-- with the fake.
abstractError :: ErrorCall -> ErrorCall
abstractError (ErrorCallWithLocation msg _loc) = ErrorCall msg
allNames :: [String]
allNames = ["a", "b", "c", "d", "e"]
data Tag = RegisterFailed | RegisterSucceeded | UnregisterFailed | UnregisterSucceeded
deriving Show
The last new thing to note here is that WhereIs_
returns the thread id
that we wanted to look up, but thread ids also happen to be references.
The way we implemented extending the environment with new references is
that we call Data.Foldable.toList
on all responses, which gives us all
references from the responses. In the Spawn_
case this does the right
thing, since spawn returns a reference to the newly spawned thread id,
but in this case the thread id from WhereIs_
is not a new reference
(it's merely a reference to the thread id we wanted to look up), so we
shouldn't extend the environment with the reference that WhereIs_
returns. We solve this problem with wrapping the response of WhereIs_
in NonFoldable
which has a toList
which doesn't return anything.
newtype NonFoldable a = NonFoldable a
deriving stock (Eq, Show)
instance Functor NonFoldable where
fmap f (NonFoldable x) = NonFoldable (f x)
instance Foldable NonFoldable where
foldMap _f (NonFoldable _x) = mempty
We can now write our sequential testing property as we've done earlier for the other examples.
prop_registry :: Commands RegState -> Property
prop_registry cmds = monadicIO $ do
void (run cleanUp)
runCommands cmds
assert True
cleanUp :: IO [Either ErrorCall ()]
cleanUp = sequence
[ try (unregister name) :: IO (Either ErrorCall ())
| name <- allNames
]
This property passes and we can see, thanks to our monitoring
, that we
got good coverage of failing commands as well:
+++ OK, passed 100 tests:
83% Spawn
82% WhereIs
79% Unregister
78% UnregisterFailed
70% Kill
70% Register
62% RegisterFailed
59% RegisterSucceeded
29% UnregisterSucceeded
To make sure everything works as expected, let's introduce a bug on purpose:
register :: String -> ThreadId -> IO ()
register name tid = do
ok <- alive tid
reg <- readRegistry
if ok && name `notElem` map fst reg && tid `notElem` map snd reg
then atomicModifyIORef' registry $ \reg' ->
if name `notElem` map fst reg' && tid `notElem` map snd reg'
- then ((name,tid):reg',())
+ then ([(name,tid)],())
else (reg',badarg)
else badarg
If we rerun the tests with this bug in place, we get test failures like the following:
*** Failed! Assertion failed (after 30 tests and 7 shrinks):
Commands {unCommands = [Spawn,Spawn,Register "e" (Var 1),Register "d" (Var 0),Unregister "e"]}
Spawn --> Spawn_ (ThreadId 154)
State: RegState {tids = [Var 0], regs = [], killed = []}
Spawn --> Spawn_ (ThreadId 155)
State: RegState {tids = [Var 0,Var 1], regs = [], killed = []}
Register "e" (Var 1) --> Register_ (Right ())
State: RegState {tids = [Var 0,Var 1], regs = [("e",Var 1)], killed = []}
Register "d" (Var 0) --> Register_ (Right ())
State: RegState {tids = [Var 0,Var 1], regs = [("d",Var 0),("e",Var 1)], killed = []}
Unregister "e" --> Unregister_ (Left bad argument)
State: RegState {tids = [Var 0,Var 1], regs = [("d",Var 0)], killed = []}
Expected: Unregister_ (Right ())
Got: Unregister_ (Left bad argument)
As we can see, unregister fails when it in fact should succeed. We've
registered "e"
so we should be allowed to unregister it, but the real
implementation has, due to the bug, forgot that the registration
happened.
Let's move on to the parallel tests, all we need to add is:
instance ParallelModel RegState where
runCommandMonad _ = id
prop_parallelRegistry :: ParallelCommands RegState -> Property
prop_parallelRegistry cmds = monadicIO $ do
replicateM_ 10 $ do
void (run cleanUp)
runParallelCommands cmds
assert True
When we run the tests we get rather long counterexamples:
*** Failed! (after 24 tests and 7 shrinks):
Exception:
bad argument
CallStack (from HasCallStack):
error, called at src/Example/Registry/Real.hs:69:10 in stateful-pbt-with-fakes-0.0.0-inplace:Example.Registry.Real
ParallelCommands [Fork [Spawn,WhereIs "a"],Fork [Spawn],
Fork [Register "c" (Var 1),Spawn],Fork [Register "e" (Var 2),Register "a" (Var 2)]]
If we replace our shared memory operations with version that do a bit of sleep beforehand:
- import Data.IORef
+ import SleepyIORef
We get better shrinking results:
*** Failed! (after 5 tests and 5 shrinks):
Exception:
bad argument
CallStack (from HasCallStack):
error, called at src/Example/Registry/Real.hs:69:10 in stateful-pbt-with-fakes-0.0.0-inplace:Example.Registry.Real
ParallelCommands [Fork [Spawn],Fork [Register "b" (Var 0),Register "c" (Var 0)]]
Here we see clearly that there's some problem in Register
, as that's
the only thing that happens in parallel. If we look at the
implementation of register
it's obvious where the race condition is,
for example we are using atomicModifyIORef
to update the registry. The
problem is that we call readRegistry
to check if a name has already
been registered and then call atomicModifyIORef
, so the race can be if
another thread sneaks in between those two calls.
We can fix this problem by adding a global lock around register
:
{-# NOINLINE lock #-}
lock :: MVar ()
lock = unsafePerformIO (newMVar ())
registerNoRace :: String -> ThreadId -> IO ()
registerNoRace name tid = withMVar lock $ \_ -> register name tid
When rerunning the tests with this fixed version of registry
, we get:
*** Failed! Assertion failed (after 30 tests and 13 shrinks):
ParallelCommands [Fork [Spawn],Fork [Spawn],Fork [Spawn],
Fork [Register "d" (Var 2)],Fork [Unregister "d",Unregister "d"]]
Which seems to suggest that we have a similar problem with unregister
,
which is indeed the case. After applying the same fix to unregister
,
we get:
*** Failed! Assertion failed (after 15 tests and 4 shrinks):
ParallelCommands [Fork [Spawn],Fork [Register "d" (Var 0)],
Fork [Kill (Var 0),Register "e" (Var 0)]]
Killing a thread will unregister it, so we get a similar problem again.
If we take the lock before calling kill
, then the parallel tests
finally pass.
These race conditions are essentially variants on the parallel counter bug, but I hope you agree that they're not as obvious in the process registry case. I also hope that by now it's clear that as a user we get these parallel tests basically without doing any extra work. All the heavy lifting is done by the library by reusing the sequential model, and this code can be written once and then reused for all our parallel testing examples!
Throughout this post we've used in-memory models, or fakes, as reference implementations to test against. The use of fakes diverges from the original work on Quviq's Erlang QuickCheck, where a more traditional state machine specification is used with post-conditions.
As far as I know, Edsko de Vries' post (2019) was the first to propose the use of fakes instead of state machine specifications with post-conditions. Edsko also showed how one can implement fake-based specifications on top of a library that uses state machine specifications8.
Using fakes instead of state machine specifications with post-conditions is easier for programmers unfamiliar with formal specifications, because most programmers are already familiar with mocks and fakes are similar to mocks.
There are other advantages to using fakes, for example we can use this fake in integration tests with components that depend on the software that we tested with the fake.
Imagine a situation where we have two components with the first depending on the second:
+--------+ +---------+
| | depends on | |
| Real A +------------->| Real B |
| | | |
+--------+ +---------+
If we want to integration test component
With fakes we can get fast and deterministic integration tests by depending on the fake of the dependency instead of the real dependency:
+--------+ +---------+
| | depends on | |
| Real A +------------->| Fake B |
| | | |
+--------+ +---------+
One of the problems with integration testing against fakes is that the fake can be wrong. The standard solution to solve that problem is to contract test the fake to make sure that it is faithful to the software it's supposed to be a fake of. We don't have this problem, because our stateful and parallel property-based tests assure that the fake is faithful:
+--------+ +---------+
| | tested against | |
| Real B +----------------->| Fake B |
| | (contract test) | |
+--------+ +---------+
In this, final, section we'll look at examples of how integration testing against fakes works. Hopefully this shows how the testing methodology that we've explored in this post can be scaled to a bigger system of components.
As our first example of integration testing, let's recall our queue example from the section on stateful testing. We can introduce an interface for it as follows:
data IQueue q = IQueue
{ iNew :: Int -> IO q
, iPut :: q -> Int -> IO ()
, iGet :: q -> IO Int
, iSize :: q -> IO Int
}
The real implementation can instantiate this interface in a straightforward way:
real :: IQueue Queue
real = IQueue
{ iNew = new
, iPut = put
, iGet = get
, iSize = size
}
The interesting part is that our fake can also instantiate the same
interface by storing the state in a mutable reference (IORef
) as
follows.
fake :: IO (IQueue (Var Queue))
fake = do
ref <- newIORef emptyState
return IQueue
{ iNew = \n -> updateIORef ref (fNew n)
, iPut = \q i -> updateIORef ref (fPut q i)
, iGet = \q -> updateIORef ref (fGet q)
, iSize = \q -> updateIORef ref (fSize q)
}
where
updateIORef :: IORef State -> FakeOp a -> IO a
updateIORef ref op =
atomicModifyIORef' ref (\fs -> assoc fs (op fs)) >>= \case
Left err -> throwIO err
Right x -> return x
where
assoc fs (Left err) = (fs, Left err)
assoc _fs (Right (fs', x)) = (fs', Right x)
We can now write components or services against this interface:
prog :: IQueue q -> IO ()
prog iq = do
q <- iNew iq 3
iPut iq q 0
iPut iq q 1
iPut iq q 2
x <- iGet iq q
assert (x == 0) (return ())
sz <- iSize iq q
assert (sz == 2) (return ())
test :: IO ()
test = prog =<< fake
deploy :: IO ()
deploy = prog real
When we integration test our new component we can use the fake
instance to make the tests fast and deterministic, while when we deploy
we use the real
instance and because of our stateful property-based
tests we know that the fake is faithful to the real implementation.
The next example is a file system, first used by Edsko in the post that also introduced using fakes as models.
The interface is parametrised by a file handle. We can create directories, open files to get a hold of a file handle, file handles can then be read from and written to, and finally closed:
data IFileSystem h = IFileSystem
{ iMkDir :: Dir -> IO ()
, iOpen :: File -> IO h
, iWrite :: h -> String -> IO ()
, iClose :: h -> IO ()
, iRead :: File -> IO String
}
The real implementation of this interface uses the real file system. In
order to isolate the tests all operations will be relative to some
root
directory:
root :: FilePath
root = "/tmp/qc-test"
rMkDir :: Dir -> IO ()
rMkDir d = createDirectory (dirFP root d)
rOpen :: File -> IO Handle
rOpen f = openFile (fileFP root f) AppendMode
rWrite :: Handle -> String -> IO ()
rWrite h s = hPutStr h s
rClose :: Handle -> IO ()
rClose h = hClose h
rRead :: File -> IO String
rRead f = readFile (fileFP root f)
real :: IFileSystem Handle
real = IFileSystem
{ iMkDir = rMkDir
, iOpen = rOpen
, iWrite = rWrite
, iClose = rClose
, iRead = rRead
}
The fake implementation of the interface is, as usual, implemented using an in-memory data structure:
type FHandle = Var Handle
data FakeFS = F {
dirs :: Set Dir
, files :: Map File String
, open :: Map FHandle File
, next :: FHandle
}
deriving Show
emptyFakeFS :: FakeFS
emptyFakeFS = F (Set.singleton (Dir [])) Map.empty Map.empty (Var 0)
type FakeOp a = FakeFS -> (Either PrecondFail a, FakeFS)
fMkDir :: Dir -> FakeOp ()
fMkDir d m@(F ds fs hs n)
| d `Set.member` ds = (Left AlreadyExists, m)
| parent d `Set.notMember` ds = (Left DoesNotExist, m)
| otherwise = (Right (), F (Set.insert d ds) fs hs n)
fOpen :: File -> FakeOp FHandle
fOpen f m@(F ds fs hs n@(Var n_))
| alreadyOpen = (Left Busy, m)
| not dirExists = (Left DoesNotExist, m)
| fileExists = (Right n, F ds fs hs' n')
| otherwise = (Right n, F ds (Map.insert f "" fs) hs' n')
where
hs' = Map.insert n f hs
n' = Var (succ n_)
fileExists = f `Map.member` fs
dirExists = fileDir f `Set.member` ds
alreadyOpen = f `List.elem` Map.elems hs
fWrite :: FHandle -> String -> FakeOp ()
fWrite h s m@(F ds fs hs n)
| Just f <- Map.lookup h hs = (Right (), F ds (Map.adjust (++ s) f fs) hs n)
| otherwise = (Left HandleClosed, m)
fClose :: FHandle -> FakeOp ()
fClose h (F ds fs hs n) = (Right (), F ds fs (Map.delete h hs) n)
fRead :: File -> FakeOp String
fRead f m@(F _ fs hs _)
| alreadyOpen = (Left Busy , m)
| Just s <- Map.lookup f fs = (Right s , m)
| otherwise = (Left DoesNotExist , m)
where
alreadyOpen = f `List.elem` Map.elems hs
fake :: IO (IFileSystem FHandle)
fake = do
ref <- newIORef emptyFakeFS
return IFileSystem
{ iMkDir = \d -> updateIORef ref (fMkDir d)
, iOpen = \f -> updateIORef ref (fOpen f)
, iWrite = \h s -> updateIORef ref (fWrite h s)
, iClose = \h -> updateIORef ref (fClose h)
, iRead = \f -> updateIORef ref (fRead f)
}
where
updateIORef :: IORef FakeFS -> FakeOp a -> IO a
updateIORef ref op =
atomicModifyIORef' ref (\fs -> swap (op fs)) >>= \case
Left err -> throwIO err
Right x -> return x
where
swap (x, y) = (y, x)
Assuming we've tested that the fake file system is faithful to the real one, we can depend on the interface in all components of our system that need the file system:
prog :: IFileSystem h -> IO ()
prog ifs = do
let d = Dir ["foo"]
iMkDir ifs d
let f = File d "bar"
h <- iOpen ifs f
iWrite ifs h "baz"
iClose ifs h
putStrLn =<< iRead ifs f
We can then use the fake file system when we integration test and thus get fast and deterministic tests, and then use the real file system when we deploy.
test :: IO ()
test = prog =<< fake
deploy :: IO ()
deploy = prog real
Because of the fact that we know that the fake is faithful to the real file system implementation, we can be relatively sure that swapping in the real file system instead of the fake one when deploying will not introduce bugs. If it does introduce a bug then we have a mismatch between the fake and the real implementation and we need to investigate how it slipped through our stateful property-based tests.
Note that prog
is just a silly example, in a real system the component
that uses the file system can be more complex, for example in the system
that Edsko worked on the component that depended on the file system is a
database. In such cases it makes sense to write a whole new stateful and
parallel property-based test suite using database commands and
responses, it's those tests that do the integration testing between the
database and the fake file system, while the stateful and parallel
property-based tests of the file system alone do the contract tests that
ensure that the file system fake is faithful to the real file system.
The picture looks like this, where "DB" is the database and "FS" is the
file system:
+---------+ +----------+
| | depends on | |
| Real DB +------------->| Fake FS |
| | | |
+----+----+ +----------+
|
| tested against
v
+---------+
| |
| Fake DB |
| |
+---------+
The examples given above, a queue and a file system, might not seems necessary to fake9 so to finish of let's sketch how the same technique scales to a bigger system of components or services.
Imagine we have three components or services, where component A depends on component B which depends on component C:
+---+ +---+ +---+
| | | | | |
| A +----->| B +----->| C |
| | | | | |
+---+ +---+ +---+
Following the pattern that we did for the queue and file system example, we'd define three interfaces:
data IA = ...
data IB = ...
data IC = ...
And the dependencies are made clear when we instantiate the interfaces:
iC :: IO IC -- C has no dependencies.
iB :: IC -> IO IB -- B depends on C.
iA :: IB -> IO IA -- A depends on B.
The testing strategy is then as follows:
- Stateful and parallel test C, this gives us a fake of C which is contract tested;
- Use C fake when integration testing B;
- Use B fake (which uses the C fake) when testing A.
Hopefully it should be clear that this strategy scales to more components or services10.
We've seen how stateful and parallel property-based testing can be implemented in about 400 lines of code, which is comparable to the 300 lines of code of the first version of QuickCheck (which didn't have shrinking). We've also had a look at several examples of how we can use fakes as models and how to test bigger systems in a compositional manner by reusing the fakes.
I hope that this is enough material to get people curious and experimenting in their favorite programming languages11. Feel free to get in touch or open an issue in case there's anything I can help with. Who knows, together, we might even be able to slightly improve the state of property-based testing libraries!
I'd like to thank Daniel Gustafsson for helping implement the
quickcheck-state-machine
library with me seven years ago, discussing a
fix for parallel commands generation
issue
that I found while writing this post, and for proofreading.
I also want to thank Larry Diehl for proofreading and making several suggestions on how to improve the text and make it more readable.
The post generated comments in various places:
- HackerNews comments;
- Lobsters comments;
- r/programming comments;
- r/haskell comments;
- Haskell discourse comments.
It also seem to have inspired some follow-up posts:
- Alex "matklad" Kladov expanded his Lobsters comments into a full post about how to do deterministic scheduling of threads in Rust;
- Anthony Lloyd wrote about the happy state of property-based testing in C# in general, and about some innovative functionality in CsCheck in particular;
- Pierre "ibizaman" Penninckx commented that the post made him write a follow up to one of his own earlier posts about stateful property-based testing in Go.
Footnotes
-
Is there a source for this story? I can't remember where I've heard it. This short biography gives some of the details:
"From 2002-2005 he led a major research project in software verification, funded by the Swedish Strategic Research Foundation. This led to the development of Quviq QuickCheck in Erlang."
-
There's some room for error here from the users side, e.g. the user could create non-unique references. In a proper library one might want to introduce a
genSym
construct which guarantees uniqueness. ↩ -
So stateful property-based testing with a trivial
runReal
can be seen as crude version of a random path exploring "model checker". One could perhaps implement something closer to TLC (the model checker for TLA+), which enumerates all paths up to some depth, by usingsmallcheck
rather thanQuickCheck
. If this topic interests you, you might also want to have a look at Gabriella Gonzalez's HasCal.I don't have an example for this, but I guess one can also think of stateful property-based testing with a trivial
runFake
as a crude version of a fuzzer (without coverage guidance). For more on this and how to add coverage guidance, see Coverage guided, property based testing (2019). ↩ -
Another idea might be to drop all preconditions in the parallel case and make all commands be able to fail gracefully instead of crashing, e.g.
Write_ (Either DoesntExist ())
.The problem with this approach is that some examples, such as the ticket dispenser, have initialisation commands such as
New
which create a ticket dispenser reference upon which the later commands depend on, so without preconditions forbidding more than oneNew
we can end up generating:Fork [New, New]
, which doesn't make sense. It should also be noted that makingNew
fail gracefully when aNew
has already been executed would need a global boolean flag, which is ugly. ↩ -
The parallel counter example is similar to the ticket dispenser example that appears in John's paper Testing the hard stuff and staying sane (2014). ↩
-
For how to implement a deterministic scheduler, see matklad's post. I've also started porting his ideas to Haskell over here. ↩
-
The sequential variant of the process registry example first appeared in the paper QuickCheck testing for fun and profit (2007) and is also part of John's Midlands Graduate School course (2019). The parallel tests were introduced in Finding Race Conditions in Erlang with QuickCheck and PULSE (2009). ↩
-
I believe the post-condition formulation is more general, as it allows a relational rather than a functional specification. So I don't think we can show the converse of what Edsko did, i.e. implement a post-condition style specification on top of our fake-based one. ↩
-
Unless we want to test what happens when failures, such as the disk being full etc. Research shows that "almost all (92%) of the catastrophic system failures are the result of incorrect handling of non-fatal errors explicitly signaled in software. [...] in 58% of the catastrophic failures, the underlying faults could easily have been detected through simple testing of error handling code.". Fakes make it easier to inject faults, but that's a story for another day. ↩
-
See the talk Integrated Tests Are A Scam by J.B. Rainsberger for a longer presentation of this idea. ↩
-
In the unlikely case that there's someone out there who wants to dig even deeper into this topic, then I've compiled a list of ideas for improvements and further exploration. ↩