Creating a toy dataset

The function I'm referring to, make_blobs, resides within scikit-learn's datasets module. Let's create 100 data points, each belonging to one of two possible classes, grouped into two Gaussian blobs. To make the experiment reproducible, we pass an integer seed as random_state. Again, you can pick whatever number you prefer. Here, I went with Thomas Bayes' year of birth (just for kicks):

In [1]: from sklearn import datasets
...     X, y = datasets.make_blobs(100, 2, centers=2,
...                                random_state=1701, cluster_std=2)
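
If you want to convince yourself that the call did what we asked for, a quick sanity check along the following lines might help. This is only a sketch: it repeats the call from above with the first two arguments spelled out as n_samples and n_features, and using np.unique and np.bincount to inspect the labels is my own choice, not something the recipe requires.

```python
import numpy as np
from sklearn import datasets

# Same call as above, with the positional arguments named for readability
X, y = datasets.make_blobs(n_samples=100, n_features=2, centers=2,
                           random_state=1701, cluster_std=2)

print(X.shape)         # (100, 2): 100 points with 2 features each
print(y.shape)         # (100,): one class label per point
print(np.unique(y))    # [0 1]: the two class labels
print(np.bincount(y))  # [50 50]: the points are split evenly between the blobs
```

So X holds the coordinates and y holds the class label (0 or 1) of every point.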

Let's have a look at the dataset we just created using our trusty friend, Matplotlib:

In [2]: import matplotlib.pyplot as plt
...     plt.style.use('ggplot')
...     %matplotlib inline
In [3]: plt.scatter(X[:, 0], X[:, 1], c=y, s=50);

I'm sure this is getting easier every time. We use scatter to create a scatter plot with the first feature (X[:, 0]) on the x axis and the second feature (X[:, 1]) on the y axis, coloring every point by its class label (c=y). This will result in the following output:

In agreement with our specifications, we see two different point clusters. They hardly overlap, so it should be relatively easy to classify them. What do you think—could a linear classifier do the job?

Yes, it could. Recall that a linear classifier tries to draw a straight line through the diagram so that all blue dots end up on one side and all red dots on the other. A diagonal line going from the top-left corner to the bottom-right corner could clearly do the job. So we would expect the classification task to be relatively easy, even for a Naive Bayes classifier.
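
To make the idea of a straight-line decision boundary a bit more concrete, here is a minimal sketch that fits a linear classifier to the blobs. Note the assumptions: I use scikit-learn's LogisticRegression as a stand-in for a generic linear classifier (it is not the Naive Bayes classifier we are working toward), and I score it on the very data it was trained on, which is fine for a quick look but is not a proper evaluation.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression

# Recreate the toy dataset from above
X, y = datasets.make_blobs(100, 2, centers=2,
                           random_state=1701, cluster_std=2)

# LogisticRegression is a stand-in linear classifier (assumption, see text)
clf = LogisticRegression()
clf.fit(X, y)

# The learned decision boundary is the straight line
#     w[0] * x0 + w[1] * x1 + b = 0
w = clf.coef_[0]
b = clf.intercept_[0]

# Fraction of training points that fall on the correct side of the line
train_accuracy = clf.score(X, y)

print("weights:", w, "bias:", b)
print("training accuracy:", train_accuracy)
```

If the two blobs really are as well separated as the plot suggests, the training accuracy should come out close to 1.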

But first, don't forget to split the dataset into training and test sets! Here, I reserve 10% of the data points for testing:

In [4]: import numpy as np
...     from sklearn import model_selection as ms
...     X = X.astype(np.float32)
...     X_train, X_test, y_train, y_test = ms.train_test_split(
...         X, y, test_size=0.1
...     )
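
A quick way to check the split is to look at the shapes of the resulting arrays. The sketch below repeats the steps from above in one self-contained piece. The random_state=42 argument is an assumption on my part: it is not in the call above and the number is arbitrary, but passing a seed makes the split itself repeatable from run to run.

```python
import numpy as np
from sklearn import datasets
from sklearn import model_selection as ms

# Recreate the toy dataset and convert it to 32-bit floats, as above
X, y = datasets.make_blobs(100, 2, centers=2,
                           random_state=1701, cluster_std=2)
X = X.astype(np.float32)

# random_state=42 is an arbitrary seed added for repeatability (assumption)
X_train, X_test, y_train, y_test = ms.train_test_split(
    X, y, test_size=0.1, random_state=42
)

print(X_train.shape, X_test.shape)  # (90, 2) (10, 2)
print(y_train.shape, y_test.shape)  # (90,) (10,)
print(X_train.dtype)                # float32
```

With 100 data points and test_size=0.1, we end up with 90 points for training and 10 for testing.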
