Writing tests with hypothesis

Finally, we're going to go back to a topic we've already covered: unit tests. Unit tests are very important; they will give you peace of mind during development, since you really don't want to play whack-a-mole with your bugs.

Now, testing a data-heavy application is hard. Complex datasets and data structures expose us to dozens of rare, but possible, quirks and edge cases. Often, we don't even think of those possibilities, focusing instead on the datasets we have at hand. For example, any function that operates on a dataframe should deal (one way or another) with an empty dataframe, a dataframe of the wrong datatype, a NumPy array, a dataframe full of null values, and so on.

One approach to mitigating this problem is to use generated suites of test inputs that focus on the quirks and possible issues of specific data structures.
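This is exactly what the hypothesis package does: we declare a property that must hold for all inputs, and it generates the inputs for us. As a minimal, hypothetical sketch (the test and the property here are ours, not from the project), consider the following:

from hypothesis import given, strategies as st

# hypothesis will feed in empty strings, unicode oddities, very long
# strings, and other inputs we would rarely write by hand
@given(st.text())
def test_strip_is_idempotent(s):
    assert s.strip().strip() == s.strip()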

To illustrate this idea on our own dataset, let's put hypothesis to work, as follows:

  1. Let's play in the sandbox environment of a Jupyter Notebook. We'll start by importing all the necessary pieces (note that we'll need example and pandas later on):
import pandas as pd

from hypothesis import given, example, strategies as st
from hypothesis.extra.pandas import series
  2. Now, we will define a custom strategy (a sample generator). Consider the following code. Here, we synthetically create a series of strings that resemble the ones from the Wikipedia entry; they contain numbers and keywords to parse:
units = [
    ' men',
    ' guns',
    ' tanks',
    ' airplanes',
    ' captured'
]

def generate_text(values, r):
    # pair each generated number with a unit keyword, in shuffled order
    r.shuffle(units)
    result = ''
    for i, el in enumerate(values):
        result += str(el)
        result += (units[i] + ' ')

    # return the string itself; the series elements must be strings
    # for the parser to work on them
    return result.strip()

StrSintetic = st.builds(generate_text,
                        st.lists(st.integers(min_value=1, max_value=2000),
                                 min_size=1, max_size=5),
                        st.randoms())

SyntSeries = series(StrSintetic)
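Since we're in a notebook, we can sanity-check the strategy before using it in a test. The .example() method draws one sample (hypothesis intends it for interactive exploration only and will warn if it's called inside a test):

StrSintetic.example()  # something like '153 guns 1200 men 17 tanks'
SyntSeries.example()   # a pandas Series of such strings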
  3. Now, we can pass SyntSeries as a sample value for our tests:
@given(SyntSeries)
def test_parse_casualties_h(s):
    from wikiwwii.parse.casualties import _parse_casualties

    values = _parse_casualties(s)
    assert (values.sum(1) > 0).all(), values.to_string()
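By default, hypothesis runs 100 generated examples per test. If the parsing logic deserves a more thorough search, that budget can be raised with the settings decorator; the sketch below assumes 200 examples and reuses the same assertion:

from hypothesis import given, settings

@settings(max_examples=200)  # the default is 100
@given(SyntSeries)
def test_parse_casualties_thorough(s):
    from wikiwwii.parse.casualties import _parse_casualties

    values = _parse_casualties(s)
    assert (values.sum(1) > 0).all(), values.to_string()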

A new sample will be generated every time. It won't be completely random, however: strategies memorize previous examples (and failed tests) and will start with the values that failed on previous runs, generating new examples if everything prior passed. This particular test passes.
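Under the hood, those failed examples are cached in a local .hypothesis folder between runs. If you ever need to control where that cache lives (say, to share it between machines), the database can be configured explicitly; the path below is purely illustrative:

from hypothesis import settings
from hypothesis.database import DirectoryBasedExampleDatabase

# store failing examples in an explicit location instead of the default
@settings(database=DirectoryBasedExampleDatabase('.hypothesis/examples'))
@given(SyntSeries)
def test_parse_casualties_h(s):
    ...  # same test body as above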

  4. Just to illustrate, let's add an explicit example of an empty string; an assertion error will be raised, since parsing empty strings results in a zero value sum:
@given(SyntSeries)
@example(pd.Series(["", ""]))
def test_parse_casualties_h(s):
    from wikiwwii.parse.casualties import _parse_casualties

    values = _parse_casualties(s)
    assert (values.sum(1) > 0).all(), values.to_string()

The output after adding the explicit empty-string example is as follows:

>   assert (values.sum(1) > 0).all(), values.to_string()
E   AssertionError:    killed  wounded  captured  tanks  airplane  guns  ships  submarines
E                 0       0.0       0         0      0         0     0      0           0
E                 1       0.0       0         0      0         0     0      0           0
E   assert False

As we can see, this failed. Note that if you run the code a second time, it will fail faster: Hypothesis will run the same failed sample first. Those generators, called strategies, are the main superpower of the package. Thanks to them, Hypothesis ensures that the code behaves well not only on a few hand-picked cases but also in the wild, when fed with synthesized datasets. The test we added may not seem very useful (we tested that function before), but it will quickly catch the case where we break parsing by mistake, and it will start with the failed case on the next run to check whether the code was fixed. Hypothesis also ships with a set of smart strategies built for the most popular datatypes.
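For reference, here are a few of those built-in strategies (all are part of hypothesis.strategies; the bounds are arbitrary):

from hypothesis import strategies as st

st.integers(min_value=0, max_value=10)       # bounded integers
st.floats(allow_nan=False)                   # floats, excluding NaN
st.text()                                    # unicode strings
st.datetimes()                               # datetime objects
st.dictionaries(st.text(), st.integers())    # dictionaries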

Hypothesis is a great tool for data-driven testing, as it will automatically generate edge cases, including ones we haven't even thought about. Because of that, it proves to be a valuable asset for any data-heavy application.
