Introducing Python for data science

The fundamental task of data analysis is to generalize trends and shared properties across a dataset of multiple (probably many) data points. Imagine how that would look in plain Python: you would have a list of, say, Person objects, each with its own values. To compute aggregate statistics, we would have to loop over the objects, pull out their properties, and calculate each statistic by hand. If we need more than a few measurements, the code quickly grows large and unmaintainable.
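To make the problem concrete, here is a minimal sketch of the loop-based approach. The Person class, its fields, and the sample values are hypothetical, invented only for illustration:

```python
from dataclasses import dataclass


@dataclass
class Person:
    # Hypothetical record type; the fields are illustrative only.
    name: str
    age: int
    height_cm: float


people = [
    Person("Ann", 31, 168.0),
    Person("Bob", 45, 180.0),
    Person("Cid", 26, 175.0),
]

# Each statistic needs explicit bookkeeping inside a Python-level loop.
total_age = 0
oldest = None
for person in people:
    total_age += person.age
    if oldest is None or person.age > oldest.age:
        oldest = person

mean_age = total_age / len(people)
print(mean_age, oldest.name)
```

Two statistics already take a dozen lines; every additional measurement adds more bookkeeping to the loop.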

Instead, many computations in data analysis can be vectorized. Here, vectorization is a fancy way of saying that the same loops run in compiled C code rather than in the Python interpreter, which can speed things up by orders of magnitude. At the same time, it means that we don't need to write those loops explicitly, which makes the code cleaner and more readable. For example, we could represent our animals from Chapter 8, Simulation with Classes and Inheritance, as a numeric two-dimensional matrix, with their properties as columns and each particular animal as a row. This could significantly simplify the code. For example, to age every animal, we could write just one line:

animals['age'] += 1

See, no loops! And yet, age will be increased for every animal (row) in the matrix.

Consider the following example:

import pandas as pd
animals = pd.DataFrame({'survival_skill':[1,2,3], 'age':[0,0,0]})
max_age = 2

animals['age'] += 1                         # this adds one to age of all animals
animals = animals[animals.age <= max_age]   # this drops all animals with age above maximum

Here, the animals variable is a table of two columns (survival_skill and age) and three rows, each representing a separate animal. Using that table, we can now run vectorized operations. First, we add 1 to the age of every animal in the table. The next line consists of two vectorized operations: the code within the square brackets creates a Boolean vector of three values, one per animal, each answering whether the corresponding animal's age is below or equal to the maximum threshold (here, all of them are). This mask is then used to filter the rows of the table, keeping only the rows where it is True and so dropping any animals that are too old.
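To see the mask actually remove a row, here is a small variation on the example above. The starting ages are changed (an assumption made only for this sketch) so that one animal exceeds the threshold, and the mask is stored in its own variable so it can be inspected:

```python
import pandas as pd

# Same table as before, but with different starting ages (illustrative values).
animals = pd.DataFrame({'survival_skill': [1, 2, 3], 'age': [0, 1, 2]})
max_age = 2

animals['age'] += 1                 # ages become 1, 2, 3
mask = animals['age'] <= max_age    # Boolean Series: True, True, False
survivors = animals[mask]           # keeps only the rows where the mask is True

print(mask.tolist())
print(survivors)
```

Splitting the filter into two steps does the same work as the one-line version; it only makes the intermediate Boolean vector visible.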

This approach is not only simpler but also faster. Python is dynamically typed: we don't have to declare the type of each variable beforehand, as we would in statically typed languages such as C and Java. That makes code faster to write, but it has a cost: the interpreter has to check the type of each value on every operation, even when we are looping over the age property of thousands of animals. Because the properties of different objects are independent of each other (for example, a property of one object could have a different type than the same property of another), each value is treated as an independent object that is stored, and therefore loaded and cached by the CPU, separately.

Instead, we could declare that a given property has one specific type for all objects (or even that all of their properties share one type), and represent the whole collection of objects (for example, all animals on the island) as a single object, a matrix. The computer can then treat it as one entity: there is no need to check the type each time, and the whole object can be loaded and cached by the CPU together. For that to happen, we have to introduce arrays.
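The contrast between the two representations can be sketched as follows, assuming NumPy is installed. The variable names and values are illustrative and not taken from the chapter:

```python
import numpy as np

# A plain list can mix types, so Python must inspect each element at run time.
mixed = [1, 2.5, "three"]
element_types = [type(x).__name__ for x in mixed]

# A NumPy array declares a single dtype for all of its elements.
ages = np.array([0, 3, 7], dtype=np.int64)
ages += 1   # one vectorized operation, no Python-level loop

print(element_types)
print(ages.dtype, ages.tolist())
```

Because the array commits to one type up front, the type check happens once for the whole operation rather than once per element.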

As you can now see, working with large sets of data and performing operations on them quickly is a crucial task for modern data science. Vectorized code allows us not only to do that but also to write those computations in a clean and expressive fashion. At the core of those fast operations are efficient data structures: first of all, multidimensional data arrays. And when we talk about numeric arrays in pandas, those are likely to be NumPy arrays.
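One quick way to check this relationship, as a small sketch reusing the animals table from earlier, is to ask a numeric column for its underlying values:

```python
import numpy as np
import pandas as pd

animals = pd.DataFrame({'survival_skill': [1, 2, 3], 'age': [0, 0, 0]})

ages = animals['age'].to_numpy()               # the column's values as a NumPy array
is_ndarray = isinstance(ages, np.ndarray)
mean_skill = animals['survival_skill'].mean()  # vectorized aggregate, no explicit loop

print(type(ages).__name__, ages.dtype.kind, mean_skill)
```

The exact integer width of the dtype can depend on the platform and the pandas version, so the sketch only looks at the dtype's kind.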
