Working with data science and ML

ML is all about working with data. The quality of the training data and labels is crucial to the success of an ML model: high-quality data leads to a more accurate model and more reliable predictions. Real-world data often has multiple issues, such as missing values, noise, bias, and outliers. Part of data science is cleaning and preparing your data to get it ready for ML.
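As a rough illustration of handling two of the issues named above, missing values and outliers, the following sketch fills gaps with the median and clips extreme values using the 1.5 × IQR rule. The function name `clean_column`, the `ages` sample, and the choice of median filling and clipping are assumptions made for this example; other strategies (dropping rows, model-based imputation) may suit your data better.

```python
import statistics

def clean_column(values, iqr_factor=1.5):
    """Fill missing entries (None) with the median, then clip outliers.

    Outliers are values outside [Q1 - iqr_factor*IQR, Q3 + iqr_factor*IQR].
    They are clipped to the nearest fence rather than dropped.
    """
    present = [v for v in values if v is not None]
    median = statistics.median(present)
    filled = [median if v is None else v for v in values]

    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(filled, n=4)
    iqr = q3 - q1
    low, high = q1 - iqr_factor * iqr, q3 + iqr_factor * iqr
    return [min(max(v, low), high) for v in filled]

# Hypothetical sample: two missing ages and one implausible entry (250).
ages = [34, 29, None, 41, 38, 250, 33, None, 36]
cleaned = clean_column(ages)
```

Clipping keeps the row count unchanged, which is convenient when other columns must stay aligned, but it does alter the data, so it is worth recording which values were touched.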

The first step in data preparation is to understand the business problem. Data scientists are often eager to jump into the data directly, start coding, and start producing insights. However, without a clear understanding of the business problem, any insights you develop have a high chance of becoming a solution that does not address the actual problem. It makes much more sense to start with a clear user story and business objectives before getting lost in the data. After building a solid understanding of the business problem, you can begin to narrow down the ML problem categories and determine whether ML is suitable for solving your particular business problem.

Data science includes data collection, analysis, preprocessing, and feature engineering. Exploring the data provides you with necessary information such as data quality and cleanliness, interesting patterns in the data, and likely paths forward once you start modeling.

As shown in the following diagram, data preprocessing and learning to create an ML model are interconnected: your data preparation will heavily influence your model, while the model you choose heavily influences the type of data preparation you will do. Finding the correct balance is highly iterative and is very much an art (or trial and error):

ML workflow

As shown in the preceding diagram, the ML workflow includes the following phases:

Preprocessing: In this phase, the data scientist preprocesses the data and divides it into training, validation, and testing datasets. Your ML model is fitted on the training dataset and evaluated using the validation dataset. Once the model is ready, you can test it using the testing dataset. Taking into account the amount of data and your business case, you need to divide the data into training, validation, and testing sets, perhaps keeping 70% of the data for training, 10% for validation, and 20% for testing.
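The 70/10/20 division described above can be sketched as follows. This is a minimal example that assumes the rows can be shuffled freely (no time ordering or grouping that must be preserved); the function name `split_dataset` and the fixed seed are illustrative choices.

```python
import random

def split_dataset(rows, train=0.7, validation=0.1, seed=42):
    """Shuffle rows and split them into train, validation, and test lists.

    The test set receives whatever remains after train and validation.
    """
    if not 0 < train + validation < 1:
        raise ValueError("train + validation must leave room for a test set")
    shuffled = list(rows)                      # copy so the input is untouched
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

rows = list(range(1000))
train_set, val_set, test_set = split_dataset(rows)
```

Shuffling before splitting avoids accidentally placing, say, all the oldest records in the training set. For time series data you would usually split by date instead.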

Features are independent attributes of your dataset that may or may not influence the outcome. Feature engineering involves finding the right features, which helps to improve model accuracy. The label is your target outcome, which the model predicts from the selected features. To choose the right features, you can apply dimensionality reduction, which filters and extracts the most effective features for your data.
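One simple way to see which features carry signal is to rank them by how strongly they correlate with the label. The sketch below does this with the absolute Pearson correlation. Note that this is a filter-style feature selection heuristic, not a full dimensionality reduction technique such as PCA, and it only detects linear relationships. The feature names and values are invented for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0                  # a constant column carries no signal
    return cov / math.sqrt(var_x * var_y)

def rank_features(features, label):
    """Return (name, |correlation|) pairs, strongest first."""
    scores = {name: abs(pearson(col, label)) for name, col in features.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical housing data: two informative features and one irrelevant one.
features = {
    "size_sqm": [50, 65, 80, 95, 120, 140],
    "rooms": [2, 2, 3, 3, 4, 5],
    "house_number": [17, 4, 88, 23, 9, 51],
}
price = [150, 190, 240, 280, 350, 410]

ranking = rank_features(features, price)
top_two = [name for name, _ in ranking[:2]]
```

In this toy dataset the house number ranks last, which matches the intuition that it should not influence the price.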

Learning: In the learning phase, you select the appropriate ML algorithm as per the business use case and data. The learning phase is the core of the ML workflow, where you train your ML model on your training dataset. To achieve model accuracy, you need to experiment with various hyperparameters and perform model selection.
Evaluation: Once your ML model is trained in the learning phase, you want to evaluate its accuracy with a known dataset. You use the validation dataset set aside during the preprocessing phase to assess your model. If the prediction accuracy on the validation data does not meet expectations, you need to tune the model based on the evaluation result.
Prediction: Prediction is also known as inference. In this phase, you deploy your model and start making predictions. These predictions can be made in real time or in batches.
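The learning and evaluation phases above can be sketched with a toy example: try several values of one hyperparameter, keep the value that scores best on the validation set, and only then measure accuracy on the test set. The sketch assumes a one-dimensional k-nearest-neighbors classifier with `k` as the hyperparameter; the data points are invented, and a real project would normally use an ML library rather than hand-written code.

```python
from collections import Counter

def knn_predict(train, x, k):
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def accuracy(train, dataset, k):
    """Fraction of (feature, label) pairs in dataset predicted correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in dataset)
    return hits / len(dataset)

# Toy (feature, label) pairs; (2.6, "high") is a deliberately noisy point.
train = [(1.0, "low"), (1.5, "low"), (2.0, "low"), (2.6, "high"),
         (3.0, "low"), (6.0, "high"), (6.5, "high"), (7.0, "high"),
         (7.5, "high")]
validation = [(1.2, "low"), (2.7, "low"), (6.2, "high"), (5.2, "high")]
test = [(1.7, "low"), (7.2, "high"), (3.2, "low")]

# Model selection: score each candidate k on the validation set only.
scores = {k: accuracy(train, validation, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)

# Final check on data the model selection never saw.
test_accuracy = accuracy(train, test, best_k)
```

Keeping the test set out of the selection loop is the point of the three-way split: the test score stays an honest estimate because no tuning decision was based on it.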

Depending on your input data, an ML model can often overfit or underfit, which you must take into account to get the right outcome.
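A common first check for these two issues is to compare accuracy on the training data with accuracy on the validation data: low accuracy on both suggests underfitting, while high training accuracy with much lower validation accuracy suggests overfitting. The sketch below encodes that rule of thumb. The function name and the two thresholds are assumptions chosen for illustration, not standard values.

```python
def diagnose_fit(train_accuracy, validation_accuracy,
                 min_accuracy=0.8, max_gap=0.1):
    """Classify a model's fit from its training and validation accuracy.

    min_accuracy and max_gap are illustrative thresholds; pick values
    that match your own business requirements.
    """
    if train_accuracy < min_accuracy:
        # The model cannot even fit the data it was trained on.
        return "underfitting"
    if train_accuracy - validation_accuracy > max_gap:
        # The model fits the training data but generalizes poorly.
        return "overfitting"
    return "reasonable fit"

overfit_case = diagnose_fit(0.99, 0.70)
underfit_case = diagnose_fit(0.60, 0.58)
balanced_case = diagnose_fit(0.90, 0.87)
```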
