Quality assurance

I know we have spent a lot of time cleaning the data, but there is still one last task to perform – quality assurance. Proper quality assurance is a very important practice. In a nutshell, you define certain assumptions about the dataset (for example, minimum and maximum values, the acceptable number of missing values, standard deviations, medians, the number of unique values, and so on). The key is to start with something reasonably sensible, then run tests to check whether the data fits your assumptions. If it doesn't, investigate the specific data points to see whether your assumptions were incorrect (and update them) or whether there are still issues with the data. It gets a little trickier with multilevel columns. Consider the following code:

idx = pd.IndexSlice  # shorthand object for slicing multilevel columns

assumptions = {
    'killed': [0, 1_500_000],
    'wounded': [0, 1_000_000],
    'tanks': [0, 1_000],
    'airplane': [0, 1_000],
    'guns': [0, 10_000],
    ('start', 'end'): [pd.to_datetime(el)
                       for el in ('1939-01-01', '1945-12-31')]
}

def _check_assumptions(data, assumptions):
    for k, (min_, max_) in assumptions.items():
        # select column(s) named k on the lower level, across all top levels
        df = data.loc[:, idx[:, k]]
        for i in range(df.shape[1]):
            assert df.iloc[:, i].between(min_, max_).all(), \
                (df.iloc[:, i].name, df.iloc[:, i].describe())

_check_assumptions(data, assumptions)

Here, we use a dictionary to describe our assumptions: each key represents a column, and each value is a pair of minimum and maximum values. Using multilevel slicing, we can treat the key as the lowest-level column name, hence testing both the Allies' and the Axis casualties in the same pass. The describe() method returns a series of descriptive statistics for the column (in this case) or for the entire dataframe, including minimum and maximum values, the most frequent value, and more.
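To see the mechanics in isolation, here is a minimal sketch on a toy two-level frame; the column names and numbers are illustrative, not the book's dataset:

```python
import pandas as pd

idx = pd.IndexSlice

# Toy frame mimicking the layout: top level is the side,
# lower level is the metric.
data = pd.DataFrame({
    ('allies', 'killed'): [100, 2_000_000],
    ('axis', 'killed'): [50, 500],
})

# Slicing on the lower level grabs 'killed' for every side at once
subset = data.loc[:, idx[:, 'killed']]
print(subset.shape[1])  # 2

# between() checks a value range; the 2_000_000 entry violates it
ok = data.loc[:, ('allies', 'killed')].between(0, 1_500_000).all()
print(ok)  # False
```

The assertion message in _check_assumptions pairs the failing column's name with its describe() output, so a failed run points you straight at the offending statistics.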

Note that the preceding assumptions will not hold. Feel free to run them and investigate which battles go beyond your expectations and whether their values are correct. The QA process usually requires some back and forth on the first try, as you typically have to relax your requirements somewhat. This is a valuable process in its own right: even here, you usually learn something new about your data.

Finally, let's write our resulting clean dataset so that we can use it in our next section.
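As one possible sketch of that last step (the filename here is a placeholder, not the book's path): pickling preserves multilevel columns and dtypes exactly, whereas CSV flattens the two-level header unless you read it back with header=[0, 1]:

```python
import pandas as pd

data = pd.DataFrame({
    ('allies', 'killed'): [100, 200],
    ('axis', 'killed'): [50, 75],
})

# Pickle keeps the two-level column index and the dtypes intact;
# '_battles_clean.pkl' is an illustrative name.
data.to_pickle('_battles_clean.pkl')
restored = pd.read_pickle('_battles_clean.pkl')
print(restored.equals(data))  # True
```

Pickle is convenient for intermediate artifacts consumed by your own code; choose CSV or Parquet instead if other tools need to read the file.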
