Data validity

After finishing your completeness checks, check the validity of the data in the records that you do have. For each field in the measurements shelf, look for outliers well beyond any other data point, and for specific values that show up at a high frequency. An outlier could be either an error value or an indicator of an event other than a measurement. A high-frequency value could be a default that was intended to be overridden by the actual measurement. There can be multiple explanations for unusual values; the goal at this stage is to identify the values and their approximate frequency of occurrence.
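The same two checks can also be run outside the visualization tool. The following is a minimal sketch, assuming the field has been exported to a plain list of numbers; the sample readings and the function names are hypothetical and are not taken from the actual dataset:

```python
from collections import Counter

# Hypothetical sample of one measurement field (illustration only).
readings = [0.0, 0.1, 0.3, -9999.0, 0.2, 999.99, 0.0, -9999.0, 1.2, 0.4]


def value_frequencies(values):
    """Return (value, count, share) tuples, most frequent value first."""
    counts = Counter(values)
    total = len(values)
    return [(value, n, n / total) for value, n in counts.most_common()]


def extremes(values):
    """Return the smallest and largest values as candidate outliers to inspect."""
    return min(values), max(values)


freqs = value_frequencies(readings)
low, high = extremes(readings)

for value, n, share in freqs[:3]:
    print(f"{value:>10} appears {n} times ({share:.0%} of records)")
print(f"smallest value: {low}, largest value: {high}")
```

A value that sits at an extreme and also repeats often is a strong candidate for an intentional indicator rather than a real reading.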

Looking at the Qgag values on a scale, it is apparent that there is a common outlier value on the negative side (-9999) and another one on the positive side (999.990, which may be rounded to 1,000 in the view). Select points in each of the outlier areas to review individual records, and see whether the actual values are consistently the same or show some variation. Consistent values are likely serving as intentional indicators. Variation could point to a calculation or conversion error, such as a misplaced decimal point:

Qgag values on scale
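One way to test whether the outliers are consistent is to list the distinct values found in each outlier area. This is a rough sketch; the cut-off points (0 and 900) and the sample values are assumptions chosen for illustration:

```python
# Hypothetical Qgag sample (illustration only).
qgag = [0.1, -9999.0, -9999.0, 999.99, 999.99, 0.5, 2.4]


def distinct_outliers(values, low=0.0, high=900.0):
    """Return the distinct values below `low` and above `high`, each sorted."""
    below = sorted({v for v in values if v < low})
    above = sorted({v for v in values if v > high})
    return below, above


below, above = distinct_outliers(qgag)
print("distinct low outliers:", below)
print("distinct high outliers:", above)
```

A single distinct value on each side suggests an intentional indicator. Several slightly different values would be more consistent with a calculation or conversion problem.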

Check the same view on a station-by-station basis. All stations have at least one -9999 value, but not all have 999.990.

Qgag values by station
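The station-by-station check can be approximated by counting each extreme value per station. The sketch below assumes the records are available as (station, value) pairs; the station names and readings are hypothetical:

```python
from collections import defaultdict

LOW_INDICATOR = -9999.0
HIGH_INDICATOR = 999.99

# Hypothetical (station, Qgag) records (illustration only).
records = [
    ("STATION_A", 0.2), ("STATION_A", -9999.0), ("STATION_A", 999.99),
    ("STATION_B", 0.0), ("STATION_B", -9999.0), ("STATION_B", 1.1),
]


def indicator_counts_by_station(rows):
    """Count low and high indicator values for every station in `rows`."""
    counts = defaultdict(lambda: {"low": 0, "high": 0})
    for station, value in rows:
        entry = counts[station]  # registers the station even if it has no indicators
        # Exact comparison is fine for this sample; real data may need a tolerance.
        if value == LOW_INDICATOR:
            entry["low"] += 1
        elif value == HIGH_INDICATOR:
            entry["high"] += 1
    return dict(counts)


by_station = indicator_counts_by_station(records)
stations_without_high = sorted(s for s, c in by_station.items() if c["high"] == 0)
print(by_station)
print("stations with no high indicator:", stations_without_high)
```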

Since the extreme values do not appear to be actual readings, exclude both the high and the low by filtering to values between 0 and 900. Then review the results to get a feel for typical ranges across weather stations. The remaining values range between 0 and 2.4:

Filtered Qgag values by station
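The filter and the per-station range review can be sketched in a few lines. The 0 to 900 bounds come from the text above; the station names and readings are hypothetical:

```python
# Hypothetical (station, Qgag) records (illustration only).
records = [
    ("STATION_A", 0.0), ("STATION_A", 2.4), ("STATION_A", -9999.0),
    ("STATION_A", 999.99), ("STATION_B", 0.3), ("STATION_B", 1.1),
    ("STATION_B", -9999.0),
]


def filtered_range_by_station(rows, low=0.0, high=900.0):
    """Keep values within [low, high] and return {station: (min, max)}."""
    ranges = {}
    for station, value in rows:
        if not (low <= value <= high):
            continue  # drop the extreme indicator values
        if station not in ranges:
            ranges[station] = (value, value)
        else:
            current_min, current_max = ranges[station]
            ranges[station] = (min(current_min, value), max(current_max, value))
    return ranges


ranges = filtered_range_by_station(records)
for station, (smallest, largest) in sorted(ranges.items()):
    print(f"{station}: {smallest} to {largest}")
```

Note that a station whose records are all outside the bounds will not appear in the result at all, which is itself worth checking.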

Repeat the same process with the other measure in the dataset, Qpcp. It has the same extreme values (-9999 and 999.990), except that some stations do not have a -9999 record. Another difference is that where Qgag values fall along a continuous scale, Qpcp appears to be reported in increments of 0.1:

Filtered Qpcp values by station

Select some points and review individual data records to see if the Qpcp values are precise to the tenths digit or if there is some variation that is not visible on the chart. In this case, all values are precisely at the tenths digit.
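Rather than sampling points by hand, the increment check can be applied to every record. The following sketch flags any value that is not a multiple of 0.1, within a small floating-point tolerance. The sample deliberately includes two hypothetical off-increment values to show what a failure would look like; it does not reflect the actual dataset:

```python
# Hypothetical Qpcp sample (illustration only); 0.25 and 0.13 are planted on purpose.
qpcp = [0.0, 0.1, 0.3, 1.0, 2.4, 0.25, 0.13]


def off_increment(values, step=0.1, tol=1e-9):
    """Return the values that are not a whole multiple of `step`."""
    flagged = []
    for value in values:
        steps = value / step
        # A tolerance is needed because 0.1 has no exact binary representation.
        if abs(steps - round(steps)) > tol:
            flagged.append(value)
    return flagged


print("values off the 0.1 grid:", off_increment(qpcp))
```

An empty result would match the observation that all values sit precisely at the tenths digit.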

We have shown some ways to look into data validity, but do not limit yourself to what has been demonstrated. Continue exploring the data by slicing and dicing it in as many ways as you can think of. Talk with your design engineers about what range of measurement values should be expected, and compare your observations to that range. Refer back to your IoT device diagram in Chapter 1, Defining IoT Analytics and Challenges, and use it to look for inconsistencies or distortions in the data.
