Chapter 12

Stretching Python’s Capabilities

IN THIS CHAPTER

  • Understanding how Scikit-learn works with classes
  • Using sparse matrices and the hashing trick
  • Testing performance and memory consumption
  • Saving time with multicore algorithms

If you’ve gone through the previous chapters, by this point you’ve dealt with all the basic data loading and manipulation methods offered by Python. Now it’s time to start using some more complex instruments for data wrangling (or munging) and for machine learning. The final step of most data science projects is to build a data tool able to automatically summarize, predict, and recommend directly from your data.

Before taking that final step, you still have to process your data by enforcing transformations that are even more radical. That’s the data wrangling or data munging part, where sophisticated transformations are followed by visual and statistical explorations, and then again by further transformations. In the following sections, you learn how to handle huge streams of text, explore the basic characteristics of a dataset, optimize the speed of your experiments, compress data and create new synthetic features, generate new groups and classifications, and detect unexpected or exceptional cases that may cause your project to go wrong.

From here onward, you use the Scikit-learn package more (which means knowing more about it — the full documentation appears at https://scikit-learn.org/stable/documentation.html). The Scikit-learn package offers a single repository containing almost all the tools that you need to be a data scientist and for your data science project to be successful. In this chapter, you discover important characteristics of Scikit-learn, structured in modules, classes, and functions, and some advanced Python time savers for improving performance with big unstructured data and highly time-consuming computational operations.

Remember You don’t have to type the source code for this chapter in by hand. In fact, it’s a lot easier if you use the downloadable source (see the Introduction for download instructions). The source code for this chapter appears in the P4DS4D2_12_Stretching_Pythons_Capabilities.ipynb source code file.

Playing with Scikit-learn

Sometimes the best way to discover how to use something is to spend time playing with it. The more complex a tool, the more important play becomes. Given the complex math tasks you perform using Scikit-learn, playing becomes especially important. The following sections use the idea of playing with Scikit-learn to help you discover important concepts in using Scikit-learn to perform amazing feats of data science work.

Understanding classes in Scikit-learn

Understanding how classes work is an important prerequisite for being able to use the Scikit-learn package appropriately. Scikit-learn is the package for machine learning and data science experimentation favored by most data scientists. It contains a wide range of well-established learning algorithms, error functions, and testing procedures.

At its core, Scikit-learn features some base classes on which all the algorithms are built. Apart from BaseEstimator, the class from which all other classes inherit, there are four class types covering all the basic machine-learning functionalities:

  • Classifying
  • Regressing
  • Grouping by clusters
  • Transforming data

Even though each base class has specific methods and attributes, the core functionalities for data processing and machine learning are guaranteed by one or more series of methods and attributes called interfaces. The interfaces provide a uniform Application Programming Interface (API) to enforce similarity of methods and attributes between all the different algorithms present in the package. There are four Scikit-learn object-based interfaces:

  • estimator: For fitting parameters, learning them from data, according to the algorithm
  • predictor: For generating predictions from the fitted parameters
  • transformer: For transforming data, implementing the fitted parameters
  • model: For reporting goodness of fit or other score measures

The package groups the algorithms built on base classes and one or more object interfaces into modules, each module displaying a specialization in a particular type of machine-learning solution. For example, the linear_model module is for linear modeling, and metrics is for score and loss measures.

To find a specific algorithm in Scikit-learn, you must first find the module containing the same kind of algorithm that interests you, and then select it from the list of contents of the module. The algorithm is typically a class itself, whose methods and attributes are already known because they’re common to other algorithms in Scikit-learn.

Tip Getting accustomed to the Scikit-learn class approach may take some time. However, the API is the same for all the tools available in the package, so learning one class necessarily tells you about all the other classes. The best approach is to learn one class completely and then apply what you know to other classes.
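To see what that uniformity looks like, here is a minimal sketch, with two classifiers chosen purely for illustration, that drives completely different algorithms through the identical instantiate/fit/score sequence:

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
# Different algorithms, same interface: instantiate, fit, score
for learner in (LogisticRegression(max_iter=5000),
                DecisionTreeClassifier()):
    learner.fit(X, y)
    print(type(learner).__name__, learner.score(X, y))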

Defining applications for data science

Figuring out ways to use data science to obtain constructive results is important. For example, you can apply the estimator interface to a

  • Classification problem: Guessing that a new observation is from a certain group
  • Regression problem: Guessing the value of a new observation

It works with the method fit(X, y), where X is the bidimensional array of predictors (the set of observations to learn from) and y is the target outcome (a unidimensional array).

By applying fit, the information in X is related to y, so that, knowing some new information with the same characteristics of X, it’s possible to guess y correctly. In the process, some parameters are estimated internally by the fit method. Using fit makes it possible to distinguish between parameters, which are learned, and hyperparameters, which instead are fixed by you when you instantiate the learner.

Instantiation involves assigning a Scikit-learn class to a Python variable. In addition to hyperparameters, you can also fix other working parameters, such as requiring normalization or setting a random seed to reproduce the same results for each call, given the same input data.
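As a quick sketch of what instantiation looks like (Ridge regression is used purely as an illustration), hyperparameters are fixed when you create the object, and you can inspect them with the get_params method:

from sklearn.linear_model import Ridge

# alpha and random_state are hyperparameters: you fix them at
# instantiation; they aren't learned from the data
learner = Ridge(alpha=0.5, random_state=42)
print(learner.get_params())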

Here is an example with linear regression, a very basic and common machine learning algorithm. The example loads its data from the sample datasets that Scikit-learn provides. The Boston dataset contains predictor variables that the example code can match against house prices, which helps build a predictor that can calculate the value of a house given its characteristics.

from sklearn.datasets import load_boston
boston = load_boston()
X, y = boston.data, boston.target
print("X:%s y:%s" % (X.shape, y.shape))

The returned dimensions for the X and y variables are

X:(506, 13) y:(506,)

The output specifies that both arrays have the same number of rows and that X has 13 features. The shape attribute reports the dimensions of each array.

Tip The number of X rows must equal those in y. You also ensure that X and y correspond, because learning from data happens when the algorithm matches the rows of X with the corresponding element of y. If you randomize the two arrays, no learning is possible.
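If you do need to randomize the order of the observations (for example, before splitting data), shuffle X and y together so that the rows stay paired. A minimal sketch using Scikit-learn's shuffle utility:

from sklearn.utils import shuffle

# A single call keeps each row of X matched with its element of y
X_shuffled, y_shuffled = shuffle(X, y, random_state=101)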

Remember The characteristics of X, expressed as X’s columns, are called variables (a more statistical term) or features (a term more related to machine learning).

After importing the LinearRegression class, you can instantiate a variable called hypothesis and set a parameter indicating the algorithm to standardize (that is, to set mean zero and unit standard deviation for all the variables, a statistical operation for having all the variables at a similar level) before estimating the parameters to learn.

from sklearn.linear_model import LinearRegression
hypothesis = LinearRegression(normalize=True)
hypothesis.fit(X, y)
print(hypothesis.coef_)

Afterwards, the coefficients of the linear regression hypothesis are printed:

[-1.07170557e-01  4.63952195e-02  2.08602395e-02
  2.68856140e+00 -1.77957587e+01  3.80475246e+00
  7.51061703e-04 -1.47575880e+00  3.05655038e-01
 -1.23293463e-02 -9.53463555e-01  9.39251272e-03
 -5.25466633e-01]

After fitting, hypothesis holds the learned parameters, and you can visualize them using the coef_ attribute, which is typical of all the linear models (where the model output is a summation of variables weighted by coefficients). You can also call this fitting activity training (as in, “training a machine learning algorithm”).

Remember A hypothesis is a way to describe a learning algorithm trained with data. The hypothesis defines a possible representation of y given X that you test for validity. Therefore, it’s a hypothesis in both scientific and machine learning language.

Apart from the estimator interface, the predictor and the model interfaces are also important. The predictor interface obtains results for new observations using the predict method (and, for classifiers, the predict_proba method, which predicts the probability of each possible result), as in this script:

import numpy as np
new_observation = np.array([1, 0, 1, 0, 0.5, 7, 59,
                            6, 3, 200, 20, 350, 4],
                           dtype=float).reshape(1, -1)
print(hypothesis.predict(new_observation))

The single observation is thus converted into a prediction:

[25.8972784]

Tip Make sure that new observations have the same feature number and order as in the training X; otherwise, the prediction will be incorrect.

The model interface provides information about the quality of the fit using the score method, as shown here:

hypothesis.score(X, y)

The quality is expressed as a float number:

0.7406077428649427

In this case, score returns the coefficient of determination, R², of the prediction. R² is a measure, ranging from 0 to 1, that compares the predictor to a model that always guesses the mean. Higher values show that the predictor is working well. Different learning algorithms may use different scoring functions. Please consult the online documentation of each algorithm or ask for help on the Python console:

help(LinearRegression)
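As a quick check, you can reproduce the value that score returned by using the r2_score function from the metrics module (a sketch):

from sklearn.metrics import r2_score

# score(X, y) is the same as computing R² on the model's predictions
print(r2_score(y, hypothesis.predict(X)))  # prints 0.7406...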

The transformer interface applies transformations derived from the fitting phase to other data arrays. LinearRegression doesn’t have a transform method, but most preprocessing algorithms do. For example, MinMaxScaler, from the Scikit-learn preprocessing module, can rescale values to a specific range of minimum and maximum values, learning the transformation formula from an example array.

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
scaler.fit(X)
print(scaler.transform(new_observation))

Running the code returns transformed values for the observations:

[[0.01116872 0.         0.01979472 0.
  0.23662551 0.65893849 0.57775489 0.44288845
  0.08695652 0.02480916 0.78723404 0.88173887
  0.06263797]]

In this case, the code applies the min and max values learned from X to the new_observation variable and returns a transformation.
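Under the hood, with feature_range=(0, 1), the scaler applies the formula (x - min) / (max - min) column by column, using the minimum and maximum learned during fit. Here is a sketch that verifies the output by hand:

import numpy as np

# Reproduce the min-max transformation manually, column by column
manual = ((new_observation - X.min(axis=0)) /
          (X.max(axis=0) - X.min(axis=0)))
print(np.allclose(manual, scaler.transform(new_observation)))  # True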

Performing the Hashing Trick

Scikit-learn provides you with most of the data structures and functionality you need to complete your data science project. You can even find classes for the trickiest and most advanced problems.

For instance, when dealing with text, one of the most useful solutions provided by the Scikit-learn package is the hashing trick. You discover how to work with text by using the bag of words model (as shown in the “Using the Bag of Words Model and Beyond” section of Chapter 8) and weighting them with the Term Frequency times Inverse Document Frequency (TF-IDF) transformation. All these powerful transformations can operate properly only if all your text is known and available in the memory of your computer.

A more serious data science challenge is to analyze online-generated text flows, such as from social networks or large, online text repositories. This scenario poses quite a challenge when trying to turn the text into a data matrix suitable for analysis. When working through such problems, knowing the hashing trick can give you quite a few advantages by helping you

  • Handle large data matrices based on text on the fly
  • Fix unexpected values or variables in your textual data
  • Build scalable algorithms for large collections of documents

Using hash functions

Hash functions can transform any input into an output whose characteristics are predictable. Usually they return an output bound within a specific interval, whose extremities span from negative to positive numbers or just through positive numbers. You can imagine them as enforcing a standard on your data: no matter what values you provide, they always return a specific data product.

The most useful characteristic of hash functions is that, given a certain input, they always provide the same numeric output value. Consequently, they’re called deterministic functions. For example, input a word like dog and the hashing function always returns the same number.

In a certain sense, hash functions are like a secret code, transforming everything into numbers. Unlike secret codes, however, you can’t convert the hashed code to its original value. In addition, in some rare cases, different words generate the same hashed result (also called a hash collision).
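A small sketch, using Python’s built-in hash function (described in the next section), makes collisions tangible: mapping more words than available buckets guarantees, by the pigeonhole principle, that at least two words share the same position (which words collide depends on your Python session):

words = ('dog cat fish bird horse cow '
         'sheep goat duck hen pig bee').split()
buckets = {}
for word in words:
    # 12 words squeezed into 10 buckets: a collision is certain
    buckets.setdefault(abs(hash(word)) % 10, []).append(word)
print({k: v for k, v in buckets.items() if len(v) > 1})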

Demonstrating the hashing trick

There are many hash functions, with MD5 (often used to check file integrity, because you can hash entire files) and SHA (used in cryptography) being the most popular. Python possesses a built-in hash function named hash that you can use to compare data objects before storing them in dictionaries. For instance, you can test how Python hashes its name:

print(hash('Python'))

The command returns a large integer number:

-1126740211494229687

Technical Stuff The Python session on your computer may return a different value than the one shown on the preceding line. Don’t worry: the built-in hash function isn’t consistent across computers, or even across sessions, because Python randomizes string hashing for security reasons. When you need consistent output, rely on the Scikit-learn hash functions instead because the output is consistent across machines.

A Scikit-learn hash function can also return an index in a specific positive range. You can obtain something similar using the built-in hash by taking the remainder of a division (the modulo operation):

print(abs(hash('Python')) % 1000)

This time the resulting hash is an integer with fewer digits:

687

When you take the remainder of the absolute value of the hash, you get a number that is always smaller than the value you used for the division. To see how this technique works, pretend that you want to transform a text string from the Internet into a numeric vector (a feature vector) so that you can use it for starting a machine-learning project. A good strategy for managing this data science task is to employ one-hot encoding, which produces a bag of words. Here are the steps for one-hot encoding a string (“Python for data science”) into a vector.

  1. Assign an arbitrary number to each word, for instance, Python=0, for=1, data=2, science=3.
  2. Initialize the vector, counting the number of unique words that you assigned a code in Step 1.
  3. Use the codes assigned in Step 1 as indexes for populating the vector, assigning a value of 1 at the position of each word that exists in the phrase.

The resulting feature vector is expressed as the sequence [1,1,1,1] and made of exactly four elements. You have started the machine-learning process, telling the program to expect sequences of four text features, when suddenly a new phrase arrives and you must convert the following text into a numeric vector as well: “Python for machine learning”. Now you have two new words — “machine learning” — to work with. The following steps help you create the new vectors:

  1. Assign these new codes: machine=4, learning=5. This is called encoding.
  2. Enlarge the previous vector to include the new words: [1,1,1,1,0,0].
  3. Compute the vector for the new string: [1,1,0,0,1,1] (see the sketch after this list).
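Here is a minimal sketch of those manual steps in Python, using the arbitrary codes assigned above:

codes = {'Python': 0, 'for': 1, 'data': 2, 'science': 3,
         'machine': 4, 'learning': 5}

def one_hot(phrase):
    # Place a 1 at the index assigned to each word in the phrase
    vector = [0] * len(codes)
    for word in phrase.split():
        vector[codes[word]] = 1
    return vector

print(one_hot('Python for data science'))      # [1, 1, 1, 1, 0, 0]
print(one_hot('Python for machine learning'))  # [1, 1, 0, 0, 1, 1]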

One-hot encoding is quite optimal because it creates efficient and ordered feature vectors. In Scikit-learn, the CountVectorizer class builds this encoding for you by learning the vocabulary and counting word occurrences:

from sklearn.feature_extraction.text import CountVectorizer
oh_encoder = CountVectorizer()
oh_encoded = oh_encoder.fit_transform([
    'Python for data science', 'Python for machine learning'])
print(oh_encoder.vocabulary_)

The command returns a dictionary containing the words and their encodings:

{'python': 4, 'for': 1, 'data': 0, 'science': 5,
 'machine': 3, 'learning': 2}

Unfortunately, one-hot encoding fails and becomes difficult to handle when your project experiences a lot of variability with regard to its inputs. This is a common situation in data science projects working with text or other symbolic features where flow from the Internet or other online environments can suddenly create or add to your initial data. Using hash functions is a smarter way to handle unpredictability in your inputs:

  1. Define a range for the hash function outputs. All your feature vectors will use that range. The example uses a range of values from 0 to 19.
  2. Compute an index for each word in your string using the hash function.
  3. Assign a unit value to the vector positions that correspond to the word indexes.

In Python, you can define a simple hashing trick by creating a function and checking the results using the two test strings:

string_1 = 'Python for data science'
string_2 = 'Python for machine learning'

def hashing_trick(input_string, vector_size=20):
    feature_vector = [0] * vector_size
    for word in input_string.split(' '):
        index = abs(hash(word)) % vector_size
        feature_vector[index] = 1
    return feature_vector

Now you can test both strings.

print(hashing_trick(
    input_string='Python for data science',
    vector_size=20))

Here is the first string encoded as a vector:

[0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]

As before, your results may not precisely match those in the book because hashes may not match across machines. The code now prints the second string encoded:

print(hashing_trick(
    input_string='Python for machine learning',
    vector_size=20))

Here’s the result for the second string:

[0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

When viewing the feature vectors, you should notice that:

  • You don’t know where each word is located. When it’s important to be able to reverse the process of assigning words to indexes, you must store the relationship between words and their hashed values separately (for example, you can use a dictionary where the keys are the hashed values and the values are the words, as in the sketch following this list).
  • For small values of the vector_size function parameter (for example, vector_size=10), many words overlap in the same positions in the list representing the feature vector. To keep the overlap to a minimum, you must create hash function boundaries that are greater than the number of elements you plan to index later.
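If you need that reversibility, here is a sketch that extends the hashing_trick function with a hypothetical lookup dictionary recording which word landed at which index:

def hashing_trick_with_lookup(input_string, vector_size=20):
    feature_vector = [0] * vector_size
    lookup = {}
    for word in input_string.split(' '):
        index = abs(hash(word)) % vector_size
        feature_vector[index] = 1
        # Record the word-to-index assignment so it can be reversed
        lookup[index] = word
    return feature_vector, lookup

vector, lookup = hashing_trick_with_lookup('Python for data science')
print(lookup)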

The feature vectors in this example are made mostly of zero entries, which wastes memory when compared to the compact output of one-hot encoding. One of the ways in which you can solve this problem is to rely on sparse matrices, as described in the next section.

Working with deterministic selection

Sparse matrices are the answer when dealing with data that has few nonzero values, that is, when most of the matrix values are zeroes. Sparse matrices store just the coordinates of the cells and their values, instead of storing the information for all the cells in the matrix. When an application requests data from an empty cell, the sparse matrix will return a zero value after looking for the coordinates and not finding them. Here’s an example vector:

[1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0]

The following code turns it into a sparse matrix.

from scipy.sparse import csc_matrix
print(csc_matrix([1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
                  0, 0, 0, 1, 0, 1, 0]))

Here is the representation provided by the csc_matrix:

(0, 0)    1
(0, 5)    1
(0, 16)   1
(0, 18)   1

Notice that the data representation is in coordinates (expressed in a tuple of row and column index) and the cell value.

The SciPy package offers a large variety of sparse matrix structures, each one storing the data in a different way and each one performing in a different way. (Some are good at slicing; others are better for computations.) Usually the csc_matrix (a matrix compressed by columns) is a good choice because most Scikit-learn algorithms accept it as input and it’s optimal for matrix operations.
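Converting between SciPy’s sparse formats is cheap, so you can build a matrix in one format and compute in another. A brief sketch (coo_matrix is the coordinate format, convenient for construction):

from scipy.sparse import coo_matrix

coo = coo_matrix([[1, 0, 0, 2], [0, 0, 3, 0]])
print(coo.tocsr())  # row-compressed: efficient row slicing and products
print(coo.tocsc())  # column-compressed: efficient column slicing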

As a data scientist, you don’t have to worry about programming your own version of the hashing trick unless you would like some special implementation of the idea. Scikit-learn offers HashingVectorizer, a class that rapidly transforms any collection of text into a sparse data matrix using the hashing trick. Here’s an example script that replicates the previous example:

import sklearn.feature_extraction.text as txt
htrick = txt.HashingVectorizer(n_features=20,
                               binary=True, norm=None)
hashed_text = htrick.transform(['Python for data science',
                                'Python for machine learning'])
hashed_text

Python reports the size of the sparse matrix and a count of the stored elements present in it:

<2x20 sparse matrix of type '<class 'numpy.float64'>'
    with 8 stored elements in Compressed Sparse Row format>

As soon as new text arrives, CountVectorizer transforms the text based on the previous encoding schema where the new words weren’t present; hence, the result is simply an empty vector of zeros. You can check this by transforming the sparse matrix into a normal, dense one using todense:

oh_encoder.transform(['New text has arrived']).todense()

As expected, the printed matrix is empty:

matrix([[0, 0, 0, 0, 0, 0]], dtype=int64)

Contrast the output from CountVectorizer with HashingVectorizer, which always provides a place for new words in the data matrix:

htrick.transform(['New text has arrived']).todense()

The matrix populated by HashingVectorizer represents the new words:

matrix([[1., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
         0., 0., 0., 0., 0., 0., 1.]])

At worst, a word settles in an already occupied position, causing two different words to be treated as the same one by the algorithm (which won’t noticeably degrade the algorithm’s performance).

Tip HashingVectorizer is the perfect tool to use when your data can’t fit into memory and its features aren’t fixed in advance. In the other cases, consider using the more intuitive CountVectorizer.

Considering Timing and Performance

As the book introduces more and more complex themes, such as Scikit-learn machine-learning classes and SciPy sparse matrices, you may start to wonder how all this processing might influence application speed. The increased processing requirements affect both running time and available memory.

Managing the best use of machine resources is indeed an art, the art of optimization, and it requires time to master. However, you can start becoming proficient in it immediately by taking accurate speed measurements and understanding what your problems really are. Profiling the time that operations require, and measuring how much memory adding more data or transforming it consumes, can help you spot the bottlenecks in your code and start looking for alternative solutions.

As described in Chapter 5, Jupyter is the perfect environment for experimenting, tweaking, and improving your code. Working on blocks of code, recording the results and outputs, and writing additional notes and comments will help your data science solutions take shape in a controlled and reproducible way.

Benchmarking with timeit

While working through the hashing trick example in the “Performing the Hashing Trick” section, earlier in this chapter, you compare two alternatives for encoding textual information into a data matrix that can address different needs:

  • CountVectorizer: Optimally encodes text into a data matrix but cannot address subsequent novelties in the text.
  • HashingVectorizer: Provides flexibility in situations when the application is likely to receive new data, but is less optimal than the exact encoding that CountVectorizer produces.

Although their advantages are quite clear in terms of how they handle the data, you may wonder what impact using one or the other has on your data processing in terms of speed and memory feasibility.

Concerning speed, Jupyter offers an easy, out-of-the-box solution, the line magic %timeit and the cell magic %%timeit:

  • %timeit: Calculates the best performance time for an instruction.
  • %%timeit: Calculates the best time performance for all the instructions in a cell, apart from the one placed on the same cell line as the cell magic (which could therefore be an initialization instruction).

Both magic commands report the average performance over r runs of n loops each. If you don’t add the -r and -n parameters, the notebook chooses the numbers automatically in order to provide a fast answer.

Here is an example of determining the time required to assign a list 10**6 ordinal values by using list comprehension:

%timeit l = [k for k in range(10**6)]

The reported timing is:

109 ms ± 11.8 ms per loop
(mean ± std. dev. of 7 runs, 10 loops each)

You can check the result for the list comprehension by increasing both the number of loops and the number of repetitions of the test:

%timeit -n 20 -r 5 l = [k for k in range(10**6)]

After a while, the timing is reported:

109 ms ± 5.43 ms per loop
(mean ± std. dev. of 5 runs, 20 loops each)

As a comparison, you can check the time required to append the same values in a for loop. Because the for loop requires an entire cell, the example uses the cell magic %%timeit. Any instruction placed on the same line as the cell magic would be excluded from the timing, so that position is where you can put initialization instructions.

%%timeit
l = list()
for k in range(10**6):
    l.append(k)

The resulting timing is

198 ms ± 6.62 ms per loop
(mean ± std. dev. of 7 runs, 10 loops each)

The results show that the list comprehension takes roughly half the time of the for loop (109 ms versus 198 ms). You can then repeat the test using different text encoding strategies:

import sklearn.feature_extraction.text as txt
htrick = txt.HashingVectorizer(n_features=20,
                               binary=True,
                               norm=None)
oh_encoder = txt.CountVectorizer()
texts = ['Python for data science',
         'Python for machine learning']

After performing initial loading of the classes and instantiating them, you can test the two solutions:

%timeit oh_encoded = oh_encoder.fit_transform(texts)

Here is the timing for the word encoder based on the CountVectorizer:

1.15 ms ± 22.5 µs per loop
(mean ± std. dev. of 7 runs, 1000 loops each)

You now run the test on the HashingVectorizer:

%timeit hashing = htrick.transform(texts)

And obtain the following much better timing (µs [microseconds] are smaller than ms [milliseconds]):

186 µs ± 13 µs per loop
(mean ± std. dev. of 7 runs, 10000 loops each)

The hashing trick is faster than the one-hot encoder. The difference arises because CountVectorizer must build and maintain a vocabulary that keeps track of how the words are encoded, something that the hashing trick doesn’t need to do.

Jupyter is the best environment to benchmark the speed of your data science solution code. If you’d like to track performance on the command line or in a script running from an IDE, you can import the timeit class and use the timeit function for tracking performance of the command by providing the input parameter as a string.

If your command needs variables, classes, or functions that aren’t available in the base Python (such as the Scikit-learn classes), you can provide them as a second input parameter. You formulate a string in which Python imports all the necessary objects from the main environment, as shown in the following example:

import timeit
cumulative_time = timeit.timeit(
    "hashing = htrick.transform(texts)",
    "from __main__ import htrick, texts",
    number=10000)
print(cumulative_time / 10000.0)
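As an aside, timeit.timeit also accepts a callable in place of the statement string, which avoids the import string altogether; a small sketch:

import timeit

# Passing a callable avoids the "from __main__ import ..." setup string
cumulative_time = timeit.timeit(lambda: htrick.transform(texts),
                                number=10000)
print(cumulative_time / 10000.0)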

Working with the memory profiler

As you’ve seen when testing your application code for performance (speed) characteristics, you can obtain analogous information about memory usage. Keeping track of memory consumption could tell you about possible problems in the way data is processed or transmitted to the learning algorithms. The memory_profiler package implements the required functionality. This package is not provided as a default Python package and it requires installation. Use the following command to install the package directly from a cell of your Jupyter notebook, as explained by Jake VanderPlas’s post described in the “Using the preferred installer program (pip) and conda” sidebar:

import sys
!{sys.executable} -m pip install memory_profiler

Use the following command for each Jupyter Notebook session you want to monitor:

%load_ext memory_profiler

After performing these tasks, you can easily track how much memory a command consumes:

hashing = htrick.transform(texts)
%memit dense_hashing = hashing.toarray()

The reported peak memory and increment tell you about memory usage:

peak memory: 90.42 MiB, increment: 0.09 MiB

Obtaining a complete overview of memory consumption is possible by saving a notebook cell to disk and then profiling it using the line magic %mprun on an externally imported function. (The line magic works only by operating with external Python scripts.) Profiling produces a detailed report, command by command, as shown in the following example:

%%writefile example_code.py
def comparison_test(text):
    import sklearn.feature_extraction.text as txt
    htrick = txt.HashingVectorizer(n_features=20,
                                   binary=True,
                                   norm=None)
    oh_encoder = txt.CountVectorizer()
    oh_encoded = oh_encoder.fit_transform(text)
    hashing = htrick.transform(text)
    return oh_encoded, hashing

from example_code import comparison_test
text = ['Python for data science',
        'Python for machine learning']
%mprun -f comparison_test comparison_test(text)

You will get an output similar to this one (the output appears in a separate window at the bottom of the Notebook display by default):

Line #    Mem usage    Increment   Line Contents
================================================
     1     94.8 MiB     94.8 MiB   def comparison_test(text):
     2     94.8 MiB      0.0 MiB       import …
     3     94.8 MiB      0.0 MiB       htrick = …
     4     94.8 MiB      0.0 MiB           …
     5     94.8 MiB      0.2 MiB           …
     6     94.8 MiB      0.0 MiB       oh_encoder = …
     7     94.8 MiB      0.0 MiB       oh_encoded = …
     8     94.8 MiB      0.0 MiB       hashing = …
     9     94.8 MiB      0.0 MiB       return …

The resulting report details the memory usage from every line in the function, pointing out the major increments.

Running in Parallel on Multiple Cores

Most computers today are multicore (two or more processors in a single package), some with multiple physical CPUs. One of the most important limitations of Python is that it uses a single core by default. (It was created in a time when single cores were the norm.)

Data science projects require quite a lot of computations. In particular, a part of the scientific aspect of data science relies on repeated tests and experiments on different data matrices. Don’t forget that working with huge data quantities means that most time-consuming transformations repeat observation after observation (for example, identical and not related operations on different parts of a matrix).

Using more CPU cores accelerates a computation by a factor that almost matches the number of cores. For example, having four cores would mean working at best four times faster. You don’t receive a full fourfold increase because there is overhead when starting a parallel process — new running Python instances have to be set up with the right in-memory information and launched; consequently, the improvement will be less than potentially achievable but still significant. Knowing how to use more than one CPU is therefore an advanced but incredibly useful skill for increasing the number of analyses completed, and for speeding up your operations both when setting up and when using your data products.

Remember Multiprocessing works by replicating the same code and memory content in various new Python instances (the workers), calculating the result for each of them, and returning the pooled results to the main original console. If your original instance already occupies much of the available RAM memory, it won’t be possible to create new instances, and your machine may run out of memory.

Performing multicore parallelism

To perform multicore parallelism with Python, you integrate the Scikit-learn package with the joblib package for time-consuming operations, such as replicating models for validating results or for looking for the best hyperparameters. In particular, Scikit-learn allows multiprocessing when

  • Cross-validating: Testing the results of a machine-learning hypothesis using different training and testing data
  • Grid-searching: Systematically changing the hyperparameters of a machine-learning hypothesis and testing the consequent results
  • Multilabel prediction: Running an algorithm multiple times against multiple targets when there are many different target outcomes to predict at the same time
  • Ensemble machine-learning methods: Modeling a large host of classifiers, each one independent from the other, such as when using RandomForest-based modeling

You don’t have to do anything special to take advantage of parallel computations: you can activate parallelism by setting the n_jobs parameter to a number of cores greater than 1 or by setting the value to -1, which means you want to use all the available CPU cores, as in the sketch below.
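For instance, here is a minimal sketch of a parallel grid search; the parameter grid is purely illustrative, and the only parallelism-specific setting is the n_jobs argument:

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
# Try three values of C; n_jobs=-1 spreads the fits over all cores
search = GridSearchCV(SVC(), {'C': [1.0, 10.0, 100.0]},
                      cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)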

Warning If you aren’t running your code from the console or from a Jupyter Notebook, it is extremely important that you separate your code from any package import or global variable assignment in your script by using the if __name__ == '__main__': check at the beginning of any code that executes multicore parallelism. The if statement checks whether the program is directly run or is called by an already-running Python console, avoiding any confusion or error by the multiparallel process (such as recursively calling the parallelism).

Demonstrating multiprocessing

It’s a good idea to use a notebook when you run a demonstration of how multiprocessing can really save you time during data science projects. Using Jupyter provides the advantage of using the %timeit magic command for timing execution. You start by loading a multiclass dataset, a complex machine learning algorithm (the Support Vector Classifier, or SVC), and a cross-validation procedure for estimating reliable scores from all the procedures. You find details about all these tools later in the book. The most important thing to know is that the procedure becomes computationally expensive because the cross-validation fits and tests the SVC 20 times, each time on a different portion of the data.

from sklearn.datasets import load_digits
digits = load_digits()
X, y = digits.data, digits.target
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
%timeit single_core = cross_val_score(SVC(), X, y,
                                      cv=20, n_jobs=1)

As a result, you get the recorded average running time for a single core:

18.2 s ± 265 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)

After this test, you need to activate the multicore parallelism and time the results using the following commands:

%timeit multi_core = cross_val_score(SVC(), X, y,
                                     cv=20, n_jobs=-1)

Running on multiple cores allows for a better average time:

10.8 s ± 137 ms per loop
(mean ± std. dev. of 7 runs, 1 loop each)

The example machine demonstrates a positive advantage from multicore processing, despite using a small dataset on which Python spends most of the time starting consoles and running a part of the code in each one. This overhead of a few seconds is significant, given that the total execution lasts only a handful of seconds. Just imagine what would happen if you worked with larger sets of data: your execution time could easily be cut by a factor of two or three.

Although the code works fine with Jupyter, putting it down in a script and asking Python to run it in a console or using an IDE may cause errors because of the internal operations of a multicore task. The solution, as mentioned before, is to put all the code under an if statement, which checks whether the program started directly and wasn’t called afterward. Here’s an example script:

from sklearn.datasets import load_digits
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

if __name__ == '__main__':
    digits = load_digits()
    X, y = digits.data, digits.target
    multi_core = cross_val_score(SVC(), X, y,
                                 cv=20, n_jobs=-1)
