Mismatched values

A mismatched value is any value that seems to be of a different data type than the type specified for the column. In the database we are analyzing at this point, there are several values that seem to deviate from the type attributed to the column. For example, in the right column is a cell that contains a dot; it is clear that this is an error in the phase of populating the database. It is equally clear that such a value can cause many problems during the analysis phase, which is why it must be appropriately dealt with.

As seen for missing values, mismatched values are also represented in the data quality bar at the top of each column. In the data quality bar, mismatched values are identified in red, as shown in the following screenshot:

To fix mismatched data, there are several options available:

  • Change the data type
  • Replace the values with constant values
  • Set the values with other columns' values
  • Transform the data with functions
  • Delete rows
  • Hide the column for now
  • Drop the column

In this case, since two of the selected lines contain mismatched data in other columns too, we will eliminate all three columns. To do this, simply click on the Add button in the Delete rows area of the suggestions panel.

Analyzing the dataset, we can see that mismatched data is still present. In fact, the data quality bar in the age column has an area in red. We try to fix this problem too. This time it is not appropriate to delete the entire row. In fact, by analyzing the first column, we can see that the NA value refers to a line whose name is clearly referring to a female (Olivia). So the most appropriate solution is to replace this value with a known value, in this case with 'F'.

To do this, we will write the following formula in the set item of the suggestions panel:

ifmismatched($col, ['Gender'], 'F')

A new step will be added to the Recipe panel. Once again, we can see a preview of the changes made to the dataset. In fact, we can see that the wrong value is no longer present; in its place, there is an 'F'.

We have so far adjusted several things, but at first glance, there is still something to be done. If we pay attention to the right1 column, which represents the percentage of correct answers provided, we notice that the range of values is as follows: -19 to 98. But -19 is obviously an incorrect value since for that variable, the permissible values are between 0 and 100 (this is a percentage). We can assume that a minus sign was added by mistake when creating the dataset. We can then modify this value, leaving only the value 19.

To do this, we perform the following steps:

  1. Open the Recipe panel. Just click on the icon (Recipe) at the top left of the Run Job button.
  2. Click on the New Step button. The Transform Builder is opened.
  3. In the Transformation drop-down menu, select Apply formula in the list of available transforms.
  4. Specify the Columns (right).
  5. Edit the following formula in the formula box:
  6. IF($col==-19,19,$col)
  7. Click on the Add button.
  8. A new step is added to the Recipe panel.

The operations we have added to the Recipe panel are five, as shown in this screenshot:

By analyzing the Recipe panel, we then have a summary of the actions we have planned on the dataset. Moreover, a visual analysis of the dataset preview with the changes made does not show any anomalies to be fixed. In the data quality bar at the top of each column, no red/black zones are highlighted, although it's too early to party yet! The data preparation work is far from done.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.136.17.105