Exploratory data analysis

The second step, after loading the data, is to carry out exploratory data analysis (EDA). This is how we get to know the data we will be working with. Some of the insights we try to gather are listed below (a short code sketch illustrating several of these checks follows the list):

  • What kind of data do we actually have, and how should we treat different types?
  • What is the distribution of the variables?
    • Are there outliers in the data, and how can we treat them?
    • Are any transformations required? For example, some models work better with (or require) normally distributed variables, so we might want to apply a transformation such as the log transformation.
    • Does the distribution vary per group (for example, gender or education level)?
  • Do we have cases of missing data? How frequent are these, and in which variables?
  • Is there a linear relationship between some variables (correlation)?
  • Can we create new features using the existing set of variables? An example might be deriving hour/minute from a timestamp, day of the week from a date, and so on.
  • Are there any variables that we can remove as they are not relevant for the analysis? An example might be a randomly generated customer identifier.
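
A minimal sketch of how several of these checks could look in pandas is shown below; the DataFrame and its columns (timestamp, customer_id, income) are hypothetical and used purely for illustration:

    import numpy as np
    import pandas as pd

    # Toy DataFrame with hypothetical columns, purely for illustration
    df = pd.DataFrame({
        "timestamp": pd.to_datetime(["2023-01-01 10:30", "2023-01-02 14:45", None]),
        "customer_id": ["a1", "b2", "c3"],
        "income": [42_000.0, 55_000.0, np.nan],
    })

    # What kind of data do we actually have?
    print(df.dtypes)

    # Distribution of a numeric variable (also useful for spotting outliers)
    print(df["income"].describe())

    # A log transformation for a right-skewed, strictly positive variable
    df["log_income"] = np.log(df["income"])

    # How frequent is missing data, and in which variables?
    print(df.isna().sum())

    # Linear relationships (correlation) between numeric variables
    print(df.corr(numeric_only=True))

    # New features derived from a timestamp
    df["hour"] = df["timestamp"].dt.hour
    df["day_of_week"] = df["timestamp"].dt.dayofweek

    # Removing a variable that is not relevant for the analysis
    df = df.drop(columns=["customer_id"])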

EDA is extremely important in all data science projects, as it enables analysts to develop an understanding of the data, facilitates asking better questions, and makes it easier to pick modeling approaches suitable for the type of data being dealt with.

In real-life cases, it makes sense to carry out univariate analysis (one feature at a time) for all relevant features to get a good understanding of them, and then proceed to multivariate analysis (comparing distributions per group, correlations, and so on). For brevity, we only show some popular approaches on selected features, but a deeper analysis is highly encouraged.
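
As a rough sketch of this workflow (again using a hypothetical DataFrame with made-up columns such as education_level, income, and age), the univariate and multivariate steps could look as follows:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt

    # Toy data with hypothetical columns, purely for illustration
    df = pd.DataFrame({
        "education_level": ["high school", "bachelor", "bachelor", "master", "master"],
        "income": [30_000, 45_000, 52_000, 70_000, 65_000],
        "age": [25, 30, 35, 40, 45],
    })

    # Univariate analysis: one feature at a time
    print(df["income"].describe())
    sns.histplot(data=df, x="income")
    plt.show()

    # Multivariate analysis: comparing the distribution per group...
    print(df.groupby("education_level")["income"].describe())
    sns.boxplot(data=df, x="education_level", y="income")
    plt.show()

    # ...and correlations between the numeric features
    sns.heatmap(df.corr(numeric_only=True), annot=True)
    plt.show()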
