Best practice 3 – maintaining the consistency of field values

Whether a dataset already exists or is collected from scratch, we often see different values that carry the same meaning. For example, American, US, and U.S.A may all appear in the country field, and male and M in the gender field. Such values need to be unified or standardized. For example, we can keep only M and F in the gender field and replace the other alternatives with them. Otherwise, algorithms in later stages will treat these values as distinct feature values even though they have the same meaning. It is also good practice to keep track of which values are mapped to the default value of a field.
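As a minimal sketch of this kind of standardization, the plain Python below maps known variants to a canonical value and records anything that falls back to a default. The mapping tables, the UNKNOWN default, and the names standardize and fallback_log are assumptions made for illustration; a real dataset would need its own list of variants.

```python
# Hypothetical mapping tables: keys are lowercased variants, values are canonical forms.
COUNTRY_MAP = {"american": "US", "us": "US", "u.s.a": "US", "usa": "US"}
GENDER_MAP = {"male": "M", "m": "M", "female": "F", "f": "F"}


def standardize(value, mapping, default="UNKNOWN", fallback_log=None):
    """Return the canonical form of value, or default if it is not recognized."""
    key = value.strip().lower()
    if key in mapping:
        return mapping[key]
    # Keep track of raw values that were mapped to the default, for later review.
    if fallback_log is not None:
        fallback_log.append(value)
    return default


unmapped = []
genders = [standardize(v, GENDER_MAP, fallback_log=unmapped)
           for v in ["male", "M", "F", "n/a"]]
countries = [standardize(v, COUNTRY_MAP, fallback_log=unmapped)
             for v in ["American", "US", "U.S.A"]]

print(genders)    # ['M', 'M', 'F', 'UNKNOWN']
print(countries)  # ['US', 'US', 'US']
print(unmapped)   # ['n/a']
```

Keeping the fallback log separate from the cleaned values makes it easy to review what was sent to the default and to extend the mapping tables when a legitimate variant shows up.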

In addition, the format of values in the same field should be consistent. For instance, the age field may contain true age values, such as 21 and 35, alongside incorrect ones, such as 1990 and 1978, which look like years rather than ages; the rating field may contain both digits and English number words, such as 1, 2, and 3, and one, two, and three. Such values should be transformed and reformatted to ensure data consistency.
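The reformatting described here could look roughly like the following sketch. It assumes that four-digit values in the age field are birth years and that the year of data collection is known; both are assumptions, not something the data itself guarantees. The names normalize_rating, normalize_age, reference_year, and max_age are hypothetical.

```python
# Hypothetical lookup for ratings written as English words.
WORD_TO_NUMBER = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def normalize_rating(value):
    """Convert a rating given as a digit or an English word into an int."""
    text = str(value).strip().lower()
    if text in WORD_TO_NUMBER:
        return WORD_TO_NUMBER[text]
    return int(text)  # raises ValueError for anything unrecognized


def normalize_age(value, reference_year, max_age=120):
    """Return an age in years.

    Assumption: a value that is too large to be an age but falls within
    max_age years before reference_year is treated as a birth year.
    """
    number = int(value)
    if 0 <= number <= max_age:
        return number
    if reference_year - max_age <= number <= reference_year:
        return reference_year - number
    raise ValueError(f"cannot interpret {value!r} as an age")


ratings = [normalize_rating(v) for v in [1, "2", "three", "One"]]
ages = [normalize_age(v, reference_year=2020) for v in [21, 35, 1990, 1978]]

print(ratings)  # [1, 2, 3, 1]
print(ages)     # [21, 35, 30, 42]
```

Raising an error for values that fit neither interpretation, rather than guessing, keeps questionable records visible so they can be inspected by hand.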
