Generating new data

Before proceeding further, let's quickly cover a step that is crucial for every machine learning engineer: data generation. We know that machine learning and deep learning techniques require large amounts of data; in simple terms, the bigger, the better. But what if you don't have enough data? You may end up with a model that lacks accuracy. A common workaround (if you are not able to obtain any new data) is to use the majority of the available data for training. The major downside of this is that you end up with a model that does not generalize well or, in other terms, suffers from overfitting.

One solution to this issue is to generate new data, commonly referred to as synthetic data. The key point to note here is that the synthetic data should have features similar to your real data; the more similar it is to the real data, the better for you as an ML engineer. This technique is referred to as data augmentation, where we use various techniques, such as rotation and mirroring, to generate new data based upon the existing data.
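To make the idea concrete, here is a minimal sketch of what image-style augmentation can look like, assuming the data points are NumPy arrays such as grayscale images; the helper name augment_image is our own illustrative choice, not something from our dataset:

import numpy as np

def augment_image(image):
    # Generate simple variants of one image: mirror images and a rotation
    return [
        np.fliplr(image),  # horizontal mirror
        np.flipud(image),  # vertical mirror
        np.rot90(image),   # rotate 90 degrees counterclockwise
    ]

Each variant preserves the content of the original image while changing its orientation, which is what makes it a useful extra training sample.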

Since we are dealing with a hypothetical case here, we can write simple Python code to generate random data, as there are no fixed features for us to reproduce. In a real-world case, you would use data augmentation to generate realistic-looking new data samples. Let's see how to approach this for our case.

Here, the dataset is actually a list of dictionaries, where each dictionary constitutes a single data point containing a patient's blood work, age, and gender, as well as the drug that was prescribed. So, we know that we want to create new dictionaries, and we know the keys to use in each dictionary. The next thing to focus on is the data type of the values in the dictionaries.

We start with age, which is an integer, and then gender, which is either M or F. Similarly, for the other values, we can infer the data types and, in some cases, using common sense, we can even infer the range of values to use.

It is very important to note that common sense and deep learning don't always go well together. This is because you want your model to understand when something is an outlier. For example, we know that it is highly unlikely for someone to be 130 years old, but a well-generalized model should recognize that such a value is an outlier and should not be taken into account. This is why you should always include a small portion of data with such implausible values (we sketch one way to inject them after the generator code below).

Let's see how we can generate some synthetic data for our case:

import random

def generateBasorexiaData(num_entries):
    # We will save our new entries in this list
    list_entries = []
    for entry_count in range(num_entries):
        new_entry = {}
        new_entry['age'] = random.randint(20, 100)
        new_entry['sex'] = random.choice(['M', 'F'])
        new_entry['BP'] = random.choice(['low', 'high', 'normal'])
        new_entry['cholesterol'] = random.choice(['low', 'high', 'normal'])
        # Na and K are the blood sodium and potassium levels, drawn from [0, 1)
        new_entry['Na'] = random.random()
        new_entry['K'] = random.random()
        # The drug that was prescribed serves as the label
        new_entry['drug'] = random.choice(['A', 'B', 'C', 'D'])
        list_entries.append(new_entry)
    return list_entries

We can call the preceding function using entries = generateBasorexiaData(5) if we want to generate five new entries.
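Picking up the earlier point about outliers, here is a minimal sketch of how a small share of implausible values could be injected into the generated entries. The wrapper name generateDataWithOutliers and the outlier_fraction parameter are our own illustrative additions, not part of the original function:

def generateDataWithOutliers(num_entries, outlier_fraction=0.02):
    # Start from the regular synthetic entries
    entries = generateBasorexiaData(num_entries)
    for entry in entries:
        # With a small probability, replace the age with an implausible value
        if random.random() < outlier_fraction:
            entry['age'] = random.randint(120, 150)
    return entries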

Now that we know how to generate the data, let's have a look at what we can do with this data. Can we figure out the doctor's reasoning for prescribing drugs A, B, C, or D? Can we see a relationship between a patient's blood values and the drug that the doctor prescribed?

Chances are, this question is as difficult for you to answer as it is for me. Although the dataset might look random at first glance, I have, in fact, built some clear relationships between a patient's blood values and the prescribed drug into it. Let's see whether a decision tree can uncover these hidden relationships.
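As a preview of that step, here is a minimal sketch of how such a list of dictionaries could be handed to a decision tree, assuming scikit-learn is available; the exact pipeline used later may differ:

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

data = generateBasorexiaData(100)
# Separate the label (the prescribed drug) from the features
targets = [entry.pop('drug') for entry in data]

# DictVectorizer one-hot encodes the string fields (sex, BP, cholesterol)
# and passes the numeric fields (age, Na, K) through unchanged
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(data)

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X, targets)

Keep in mind that randomly generated entries contain no real structure, so this only demonstrates the mechanics; the interesting results come from the hand-crafted dataset with the hidden relationships mentioned above.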
