Best practice 3 - maintain consistency of field values

In a dataset, whether it already exists or we collect it from scratch, we often see different values that represent the same meaning. For example, "American", "US", and "U.S.A" may all appear in the "Country" field, and "male" and "M" in the "Gender" field. It is necessary to unify such values within a field. For example, we can keep only "M" and "F" in the "Gender" field and replace all other alternatives. Otherwise, algorithms in later stages will be misled, because different feature values are treated as distinct even if they have the same meaning. It is also good practice to keep track of which values are mapped to the default value of a field.
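As a minimal sketch of this unification, the following maps raw "Gender" spellings to canonical values and records which raw values ended up under each canonical value, including the default. The names GENDER_MAP and unify, the sample data, and the "Unknown" default are all hypothetical and chosen only for illustration.

```python
# Hypothetical mapping from normalized raw spellings to canonical values.
GENDER_MAP = {
    "m": "M", "male": "M",
    "f": "F", "female": "F",
}


def unify(values, mapping, default=None):
    """Return the canonical values and a record of what was mapped where."""
    unified = []
    audit = {}  # canonical value -> set of raw values mapped to it
    for raw in values:
        key = raw.strip().lower()  # ignore case and surrounding spaces
        canonical = mapping.get(key, default)
        unified.append(canonical)
        audit.setdefault(canonical, set()).add(raw)
    return unified, audit


# Made-up sample values for illustration only.
genders = ["male", "M", "F", "Female", "m ", "unknown"]
unified, audit = unify(genders, GENDER_MAP, default="Unknown")
print(unified)
print(audit["Unknown"])  # raw values that fell back to the default
```

Keeping the audit record alongside the cleaned values makes it easy to review what was sent to the default and to extend the mapping when a new spelling turns up.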

In addition, the format of values in the same field should be consistent. For instance, the "Age" field may contain true age values such as 21 and 35 alongside mistaken year values such as 1990 and 1978; in the "Rating" field, both digits and English number words may be found, such as 1, 2, 3, and "one", "two", "three". Transformation and reformatting should be conducted to ensure data consistency.
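The reformatting described above might look like the following sketch. It rests on assumptions that are not part of the text: any "Age" value above a hypothetical max_age of 120 is treated as a birth year, the reference year 2020 is arbitrary, and the word table only covers ratings from one to five.

```python
# Assumed rating scale of 1 to 5; extend the table for other scales.
WORD_TO_NUMBER = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}


def normalize_rating(value):
    """Convert digits or English number words to an int, else None."""
    if isinstance(value, int):
        return value
    text = str(value).strip().lower()
    if text.isdigit():
        return int(text)
    return WORD_TO_NUMBER.get(text)  # None when the word is not recognized


def normalize_age(value, reference_year, max_age=120):
    """Treat values above max_age as birth years (an assumption)."""
    if value > max_age:
        return reference_year - value
    return value


# Made-up sample values for illustration only.
ratings = [1, "two", "3", "Three"]
ages = [21, 35, 1990, 1978]

clean_ratings = [normalize_rating(r) for r in ratings]
clean_ages = [normalize_age(a, reference_year=2020) for a in ages]
print(clean_ratings)
print(clean_ages)
```

Whether a year value should be converted to an age or flagged for manual review depends on the dataset; the conversion here is only one possible choice.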
