© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
T. Sarkar, Productive and Efficient Data Science with Python, https://doi.org/10.1007/978-1-4842-8121-5_3

3. How to Use Python Data Science Packages More Productively

Tirthajyoti Sarkar, Fremont, CA, USA

Python is, without any doubt, the most used and fastest-growing programming language of choice for data scientists (and related professionals such as machine learning engineers or artificial intelligence researchers) all over the world. There are many reasons for this explosive growth of Python as the lingua franca of data science (mostly over the last decade or so). It has an easy learning curve, it supports dynamic typing, it can be written in both a scripting style and an object-oriented fashion, and more.

However, probably the most important reason for its growth is the amazing open-source community activity and the resulting ecosystem of powerful, rich libraries and frameworks focused on data science work. The default, barebones installation of Python cannot be used for any meaningful data science task. However, with minimal extra work, any data scientist can install and use a handful of feature-rich, well-tested, production-grade libraries that can jumpstart their work immediately.

Some of the most popular and widely used among these jump-starter packages are the following:
  • NumPy for numerical computing (used as the foundation of almost all data science Python libraries)

  • pandas for data analytics with tabular, structured data

  • Matplotlib/Seaborn for powerful graphics and statistical visualization

However, just because these libraries provide easy APIs and smooth learning curves does not mean that everybody uses them in a highly productive and efficient manner. One must explore these libraries and understand both their powers and limitations to exploit them fully for productive data science work.

This is the goal of this chapter: to show how and why these libraries should be used in various typical data science tasks for achieving high efficiency. You’ll start with the NumPy library as it is also the foundation of pandas and SciPy. Then you’ll explore the pandas library, followed by a tour of the Matplotlib and Seaborn packages.

It is to be noted, however, that my goal is not to introduce you to typical features and functions of these libraries. There are plenty of excellent courses and books for that purpose. You are expected to have basic knowledge of and experience with using some, if not all, of these libraries. I will show you canonical examples of how to use these packages to do your data science work in a productive manner.

You may also wonder where another widely used Python ML package named scikit-learn fits in this scheme. I cover that in Chapter 4. Additionally, in Chapter 7, I cover how to use some lesser-known Python packages to aid NumPy and pandas to use them more efficiently and productively.

Why NumPy Is Faster Than Regular Python Code and By How Much

NumPy (or Numpy), short for Numerical Python, is the fundamental package used for high-performance scientific computing and data analysis in the Python ecosystem. It is the foundation on which nearly all of the higher-level data science tools and frameworks such as pandas and Scikit-learn are built.

Deep learning libraries such as TensorFlow and PyTorch use NumPy arrays as their fundamental building block, on top of which they build their specialized tensor objects and graph flow routines for deep learning tasks. Most machine learning algorithms make heavy use of linear algebra operations on long lists/vectors/matrices of numbers, for which NumPy code and methods have been optimized.

NumPy Arrays Are Different

The fundamental data structure introduced by NumPy is the ndarray, or N-dimensional numerical array. For beginners in Python, these arrays sometimes look similar to a Python list, but they are anything but similar. Let's demonstrate this with a simple example.

Consider the following code which creates two Python lists. When you use the + operator on them, the second list gets appended to the first one.
lst1 = [i for i in range(1,11)]
lst2 = [i*10 for i in range(1,11)]
print(lst1+lst2)
>> [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

The treatment of the elements in the lists feels object-like, not very numerical, doesn't it? If these were numerical vectors instead of simple lists of numbers, you would expect the + operator to act slightly differently and add the numbers from the first list to the corresponding numbers in the second list, element-wise.

That’s precisely what the NumPy array version of these lists does:
import numpy as np
arr1 = np.array(lst1)
arr2 = np.array(lst2)
arr1+arr2
>> array([ 11,  22,  33,  44,  55,  66,  77,  88,  99, 110])

What is np.array? It is simply the array function called from the NumPy module (the first line of the code imported it with import numpy as np).

Perhaps the easiest way to see the richness of this array representation is to check the list of all methods associated with the data structure. You can do that using the dir function like this:
for p in dir(lst1):
    if '__' not in p:
        print(p, end=', ')
>> append, clear, copy, count, extend, index, insert, pop, remove, reverse, sort,
If you run similar code for the arr1 object, you will see the following output:
>> T, all, any, argmax, argmin, argpartition, argsort, astype, base, byteswap, choose, clip, compress, conj, conjugate, copy, ctypes, cumprod, cumsum, data, diagonal, dot, dtype, dump, dumps, fill, flags, flat, flatten, getfield, imag, item, itemset, itemsize, max, mean, min, nbytes, ndim, newbyteorder, nonzero, partition, prod, ptp, put, ravel, real, repeat, reshape, resize, round, searchsorted, setfield, setflags, shape, size, sort, squeeze, std, strides, sum, swapaxes, take, tobytes, tofile, tolist, tostring, trace, transpose, var, view,

There are many more (and different-looking) functions and attributes available on the NumPy array object. In particular, take note of methods such as mean, std, and sum, as they clearly indicate a focus on numerical/statistical computing with this kind of array object. And these operations are fast too. How fast? You will see that now.

NumPy Array vs. Native Python Computation

NumPy is much faster due to its vectorized implementation and the fact that many of its core routines are written in C (building on the CPython framework). NumPy arrays are densely packed arrays of a homogeneous type. Python lists, by contrast, are arrays of pointers to objects, even when all of the objects are of the same type. With NumPy arrays, we therefore get the benefit of locality of reference.

Many NumPy operations are implemented in C, avoiding the general cost of Python-level loops, pointer indirection, and per-element dynamic type checking. The exact boost in speed depends on which operation you are performing. For data science and ML tasks, this is an invaluable advantage because it avoids explicit looping over long, multi-dimensional arrays.

Locality of reference (www.geeksforgeeks.org/locality-of-reference-and-cache-operation-in-cache-memory/) is one of the main reasons why NumPy arrays are much faster and more efficient than Python list objects. Spatial locality in memory access patterns results in performance gains, notably because of how the CPU cache operates: data is loaded from RAM in contiguous chunks (cache lines) into the CPU cache, a small and very fast memory located close to the processor, so adjacent items in memory are then accessed very efficiently.

NumPy and Native Python Implementation

Let’s illustrate this using the familiar @timing decorator from the last chapter. Here is a code wrapping the decorator around two functions, std_dev and std_dev_python, implementing the calculation of standard deviation of a list/array with NumPy and native Python code, respectively.
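In case you don't have that decorator handy, here is a minimal sketch of what it might look like. This is an assumption about the Chapter 2 version, which may differ in detail, but it prints timings in the same format used below.
from functools import wraps
from time import perf_counter
def timing(func):
    """Minimal timing decorator: print the wall-clock time taken by a function call."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = perf_counter()
        result = func(*args, **kwargs)
        elapsed_ms = (perf_counter() - start) * 1000
        print(f"Function '{func.__name__}' took {elapsed_ms:.3f} milliseconds to run")
        return result
    return wrapper
With the decorator available, the two functions are defined as follows: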
@timing
def std_dev(a):
    if isinstance(a,list):
        a = np.array(a)
    s = a.std()
    return s
from math import sqrt
@timing
def std_dev_python(lst):
    s = sum(lst)
    av = s/len(lst)
    sumsq = 0
    for i in lst:
        sumsq+=(i-av)**2
    sumsq_av = sumsq/len(lst)
    result = sqrt(sumsq_av)
    return result
Next, you define two objects, a NumPy array and a Python list, of the same length (1,000,000) and calculate the time it takes for the standard deviation computation:
a = np.arange(1000000)
lst = [i for i in range(1000000)]
For the NumPy function,
std_dev(a)
>> Function 'std_dev' took 8.996 milliseconds to run
>> 288675.1345946685
For the Python function,
std_dev_python(lst)
>> Function 'std_dev_python' took 212.995 milliseconds to run
>> 288675.1345958226

So, the NumPy implementation is much faster and should be used for data science tasks by default.

Conversion Adds Overhead

If you look at the code for the NumPy function, you will notice a small but significant piece of code for type checking and coercion at the beginning. This is to handle the situation where a NumPy-based function receives a list object instead of the NumPy array it was expecting.
if isinstance(a,list):
        a = np.array(a)
If you pass the lst object to the std_dev function, you may see something like this:
std_dev(lst)
>> Function 'std_dev' took 84.004 milliseconds to run
>> 288675.1345946685

This is interesting. The operation is still quite a bit faster than the native Python implementation, but definitely much slower than the case where a NumPy array was passed into the function. The extra time is spent converting the lst object to a NumPy array inside the function. Note also that the result matches the earlier NumPy run exactly, while the native Python implementation differed after the fifth decimal place because it accumulates floating-point rounding error differently.

Therefore, although type checking and conversion should be part of your code, you should convert numerical lists or tables to NumPy arrays as early as possible in a data science pipeline and work on the arrays afterwards, so that you do not lose extra time at the computation stage.

Using NumPy Efficiently

NumPy offers a dizzying array of functions and methods to use on numerical arrays and matrices for advanced data science and ML engineering. You can find a plethora of resources going deep into those aspects and features of NumPy.

Since this book is about productive data science, I focus more on the fundamentals of how to use NumPy to build efficient programming patterns in data science work. I prefer to illustrate that by showing typical examples of inefficient coding style and how to use NumPy-based code correctly to increase your productivity. Let's start down that path.

Conversion First, Operation Later

Although not a guaranteed outcome, it is almost always better to vectorize your data first (Figure 3-1). In other words, convert it to NumPy arrays as early as possible and run the mathematical operations on those array objects, rather than running native Python functions in a loop and converting the results to an array afterwards.
Figure 3-1

NumPy is best taken advantage of when you vectorize your data first and then do the necessary operations

Here’s a list of numbers and a mathematical operation function:
lst_of_nums = [i for i in range(100000)]
def calc_nums(x):
    return (x+1)/(x+1000)
It is a bad practice to do the following, yet this kind of code pattern is ubiquitous in the data science world:
result_lst = []
for i in lst_of_nums:
    result_lst.append(calc_nums(i))
result_array = np.array(result_lst)
Instead, first convert to the array format and then apply the mathematical operations directly on the array. You don’t even need to write a separate Python function.
array_of_nums = np.array(lst_of_nums)
result_array = (array_of_nums+1)/(array_of_nums+1000)

If you test the execution time, you will see the second option is 2X to 3X faster for this data. For bigger data sizes, this kind of improvement can prove significant.
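If you want to verify the claim on your own machine, a quick measurement with the standard timeit module might look like this (the 2X-3X figure will, of course, vary with your hardware and data size):
import timeit

setup = """
import numpy as np
lst_of_nums = [i for i in range(100000)]
def calc_nums(x):
    return (x+1)/(x+1000)
"""
loop_version = """
result_lst = []
for i in lst_of_nums:
    result_lst.append(calc_nums(i))
result_array = np.array(result_lst)
"""
vectorized_version = """
array_of_nums = np.array(lst_of_nums)
result_array = (array_of_nums+1)/(array_of_nums+1000)
"""
# Total seconds for 100 runs of each version; lower is better
print("Loop then convert  :", timeit.timeit(loop_version, setup=setup, number=100))
print("Convert then vector:", timeit.timeit(vectorized_version, setup=setup, number=100))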

Data in real-life situations comes from business operations and databases, either in streaming or batch mode. It can also arrive through web APIs in formats like JSON or XML. It will almost never come in a nicely NumPy-formatted manner. This is why it is so important to understand the pros and cons of array conversion, operations like appending to and updating an array, and converting back to a Python list in case you must stream the data back to another API through a JSON interface, and so on.

Vectorize Logical Operations

You can also vectorize, directly with NumPy, the case where you need to check a logical condition before doing the mathematical operation. Suppose, continuing the previous example, you want to apply the function only to the numbers that are integral multiples of 7. You may be tempted to write this code:
result_lst = []
for i in lst_of_nums:
    if i%7==0:
        result_lst.append(calc_nums(i))
result_array = np.array(result_lst)
Instead, you should use the NumPy operations directly in this manner:
array_of_nums = np.array(lst_of_nums)
array_div7 =  array_of_nums[array_of_nums%7==0]
result_array = (array_div7+1)/(array_div7+1000)

The second line of this code uses Boolean indexing with NumPy: you create a Boolean NumPy array with array_of_nums%7==0 and then use this array as an index into the main array. This effectively creates an array containing only the elements that are divisible by 7. Finally, you run your operation on this shorter array_div7. In a way, this is also a filtering operation, where you filter the main array down to a shorter array based on a logical check.

Use the Built-In Vectorize Function

NumPy provides a built-in vectorize function that helps many user-defined functions to be vectorized with minimal effort. The exact improvement in speed and efficiency depends on the type and complexity of the specific function in question. Here is an example of a function that works on two floating-point numbers and performs a certain math operation based on their mutual relationship:
from math import sin
def myfunc(x,y):
    if (x>0.5*y and y<0.3):
        return (sin(x-y))
    elif (x<0.5*y):
        return 0
    elif (x>0.2*y):
        return (2*sin(x+2*y))
    else:
        return (sin(y+x))
In such situations, you can almost mechanically apply the numpy.vectorize method in the following way:
vectfunc = np.vectorize(myfunc,
                        otypes=[np.float64],
                        cache=False)
result_array=vectfunc(lst_x,lst_y)

Here you pass the custom function object myfunc as the first argument to np.vectorize and define the output type it should produce with the otypes parameter. The great thing is that although myfunc itself works on individual floating-point numbers x and y, the resulting vectfunc can accept any array (or even a Python list) of np.float64 data (native Python floats are coerced into the np.float64 type automatically).
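For completeness, here is a hedged end-to-end usage sketch in which lst_x and lst_y (not defined in the snippet above) are simply two lists of random floats:
import numpy as np
rng = np.random.default_rng(42)
lst_x = rng.random(10).tolist()  # ten floats between 0 and 1
lst_y = rng.random(10).tolist()

vectfunc = np.vectorize(myfunc, otypes=[np.float64], cache=False)
result_array = vectfunc(lst_x, lst_y)  # element-wise application of myfunc
print(result_array.shape)  # (10,)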

Avoid Using the .append Method

Appending new or incoming data to an array is a common data science operation. Often the situation is that the data is generated by a stochastic or random process (e.g., a financial transaction or a sensor measurement) and it has to be recorded in a NumPy array (for later use in an ML algorithm, for example).

NumPy has an append method, but it is quite inefficient because it copies the entire data array into a new block of memory every time it is called. You have two better choices for appending this kind of random data to a NumPy array:
  • If you know the final length of the array, then initialize an empty NumPy array (with the numpy.empty method) or an array of zeroes/ones and just put the new piece of data in the present index while iterating over the range.

  • Alternatively, you can use a Python list, append to it, and then convert to a NumPy array at the end. You can use this with a while loop until the random process terminates, so you don’t need to know the length beforehand.

You can see this is directly contrary to what we discussed in the subsection “Conversion First, Operation Later.” However, the situation is subtly different here because, in this case, you are updating the array with incoming data that results from an unknown process, so you don’t know what precise mathematical operation to perform on the array.

As an example, the following code initializes an empty NumPy array with a known shape (equal to the known data length of 1,000), records a Gaussian random number 1,000 times, and puts the square of that number in the array:
desired_length = 1000
results = np.empty(desired_length)
for i in range(desired_length):
    sample = np.random.normal()
    results[i] = sample**2
The following code emulates a situation when the length of the data is itself uncertain. The process terminates when the variable TERMINATE itself goes over 2.0.
TERMINATE = np.random.normal()
result_lst = []
while TERMINATE < 2.0:
    sample = np.random.normal()
    result_lst.append(sample**2)
    TERMINATE = np.random.normal()
result_array = np.array(result_lst)

As discussed, because of the uncertainty in the length of the data or the process that generates it, it is advisable to use a Python list to append the data as it comes in. When the data collection is finished, go back to the “conversion first, operation later” principle and convert the Python list to a NumPy array before doing any sophisticated mathematical operation over it.

When does TERMINATE become greater than 2.0?

In the code above, the variable TERMINATE is drawn from a normal distribution with zero mean and unit standard deviation, so any value greater than 2.0 lies more than two standard deviations above the mean. There is therefore roughly a 2.3% chance of drawing such a value at each iteration (about 5% of draws fall more than two standard deviations from the mean in either direction, and only the high half of those exceed 2.0). If you run this code repeatedly, you will get a NumPy array of a different length each time.
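You can sanity-check this number empirically with a couple of lines of NumPy (the exact fraction varies slightly from run to run):
import numpy as np
samples = np.random.normal(size=1_000_000)
print((samples > 2.0).mean())  # roughly 0.023, i.e., about a 2.3% chance per draw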

Utilizing NumPy Reading Utilities

How would you read a text file where numerical data is stored in CSV format into a NumPy array? This situation is extremely common in a regular data science pipeline, as CSV (comma-separated values) remains one of the most popular file formats across all platforms (Windows, Linux, macOS, etc.).

Of course, you can use the csv module that comes with Python and read line by line. But, conveniently enough, NumPy provides many utility functions to read from file or string objects. Using them makes the code cleaner and thereby more productive. These routines are well-optimized for speed too, so your code remains efficient.

Reading from a Flat Text File

The method numpy.fromfile can be used for this purpose. It is a highly efficient way of reading binary data with a known datatype, as well as parsing simply formatted text files. For example, you may be reading a bunch of numeric data written on a text file with a comma separator:
with open('fdata.txt') as f:
    data = f.readline()
data = data.split(',')
fr = np.array(data[:-1],dtype=float)
Note that when you use the native Python readline with an opened file, you get a string object. So, you need to split the string on the comma separator and then read the resulting list into a NumPy array with the dtype set to float (the [:-1] slice drops the last, empty element left behind by a trailing separator in the file). You can do the same reading with just one line of code:
fr = np.fromfile('fdata.txt',sep=',')

It is clear that there is less chance of bugs and errors in this approach than the native Python file-reading code.

Utility for Tabular Data in a Text File

NumPy offers another, similar text-reading utility called loadtxt, which is even more powerful and feature-rich. It works with text files where data is written in tabular format (i.e., in rows and columns) and loads the data directly into a multi-dimensional array, as long as the number of entries in each row remains the same. Figure 3-2 illustrates this.
Figure 3-2

Showing how the loadtxt utility works in NumPy

For example, suppose you have a CSV text file with three rows and three columns of data, as shown in Figure 3-3.
Figure 3-3

A simple text file with tabulated comma-separated data to be read
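If you want to reproduce this example yourself, you can create such a file with a couple of lines of Python (the values below match the array output shown next):
sample_text = """9.2,22.1,-33.6
6.4,2.3,-5.4
12.2,4.5,7.2"""
with open('npread.txt', 'w') as f:
    f.write(sample_text)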

One line of code can read the contents of this file into a 3x3 NumPy array/matrix:
np.loadtxt('npread.txt',delimiter=',')
>> array([[  9.2,  22.1, -33.6],
       [  6.4,   2.3,  -5.4],
       [ 12.2,   4.5,   7.2]])
You can even read selected columns from the file. This is particularly useful if you regularly receive a massive data file from a customer but know that only certain specific columns are useful for your data science work. You can then load only the selected data into memory and make your pipeline fast and efficient.
np.loadtxt('npread.txt',delimiter=',',usecols=(0,2))
>> array([[  9.2, -33.6],
       [  6.4,  -5.4],
       [ 12.2,   7.2]])

Imagine the amount of custom text-reading code you would have to write if you did not have this utility function from NumPy. In the spirit of productive data science and keeping your code clean and readable, use these utilities whenever possible.

Using pandas Productively

After covering some of the best practices and productive utilities of the NumPy library, let’s now look at the most widely used data analytics package in the Python ecosystem: pandas. This package is used by almost every data scientist and analyst that you may come across.

pandas uses NumPy at its foundation and interfaces with other highly popular Python libraries like Scikit-learn, so you can do data analytics and wrangling work in pandas and transport the processed data seamlessly to an ML algorithm. It also provides a rich set of data-reading options for various kinds of common data sources (e.g., HTML web pages, CSV, Microsoft Excel, JSON-formatted objects, and even zip files), which makes it invaluable for data wrangling tasks.

However, it is a large library with many methods and utilities that can be used in myriad ways to accomplish the same end goal. This makes it highly likely that different data scientists (even within the same team) are using different programming styles and patterns with pandas to get the same job done. Some of these patterns yield faster and cleaner execution than others and should be preferred. In this subsection, I cover a few of these areas with simple examples.

Setting Values in a New DataFrame

pandas provides a variety of options to index, select particular data, and set it to a given value. In many situations, you will find yourself with a Python list or NumPy array that you want to set at a particular position (row) in your DataFrame.

For demonstration, let’s define a simple list with six values:
  • First name (a Python string object)

  • Last name (a Python string object)

  • Age (a Python integer object)

  • Address (a Python string object)

  • Price (a Python float object)

  • Date (a Python datetime object)

from datetime import datetime
today = datetime.today()  # the Date field
profile_data = ['First name', 'Last name', 30, 'An address', 25.2, today]

You have a few options for inserting this data into the rows of a DataFrame. Note that in reality you would have a few thousand such lists or dictionaries (all different). Purely for the speed demonstration, I insert the same list repeatedly into the DataFrame.

You can create an empty DataFrame like this, defining the column names explicitly:
import pandas as pd
df = pd.DataFrame(columns = ['FirstName', 'LastName', 'Age', 'Address', 'Price', 'Date'])

Now comes the part where you iterate and insert the data into one row after another.

The .at or .iloc Methods Are Slow

A lot of data scientists use the .at or .iloc methods for indexing and slicing data once they start working with a DataFrame. They are very useful methods to have at your disposal, and they are fine to use for indexing purposes. However, try to avoid them for inserting/setting data when you are building a DataFrame from scratch.

Set N = 2000 for the speed test and run the following code to measure the speed of setting data with these methods:
%%timeit -n5 -r10
for i in range(N):
    df.at[i] = profile_data
>> 207 ms ± 58.6 ms per loop (mean ± std. dev. of 10 runs, 5 loops each)
and
%%timeit -n5 -r10
for i in range(N):
    df.iloc[i] = profile_data
>> 116 ms ± 5.63 ms per loop (mean ± std. dev. of 10 runs, 5 loops each)

In this instance, the .iloc method is noticeably faster, but the exact difference depends on the type of the data and other factors. In general, inserting data this way should be avoided as much as possible.

Use .values to Speed Things Up Significantly

The pandas.DataFrame.values attribute returns a NumPy representation of the DataFrame and is therefore about as fast as it gets. So, if you run the following code, you get a much faster execution time:
%%timeit -n5 -r10
for i in range(N):
    df.values[i] = profile_data
>> 12 ms ± 2.63 ms per loop (mean ± std. dev. of 10 runs, 5 loops each)

Note that for this to work, you must have a pre-existing DataFrame with 2,000 rows; this approach won't work on a newly created, empty DataFrame. With such a DataFrame in place, you can set new values much faster than with the .at or .iloc methods.
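One way to set up such a pre-existing DataFrame is to pre-allocate placeholder rows, roughly as in the sketch below. A caveat: whether .values returns a writable view or a copy depends on the column dtypes, so always verify that the assignments actually land in the DataFrame.
N = 2000
cols = ['FirstName', 'LastName', 'Age', 'Address', 'Price', 'Date']
# Pre-allocate N placeholder rows (every column starts out as object dtype)
df = pd.DataFrame(index=range(N), columns=cols)
for i in range(N):
    df.values[i] = profile_data
print(df.head(2))  # confirm that the writes actually reached the DataFrame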

Specify Data Types Whenever Possible

Making pandas guess data types is one of the most frequent inefficient code patterns, and almost all data scientists are guilty of it. It is inefficient because when you import data into a DataFrame without explicitly telling pandas the data types of the columns, it reads the entire dataset into memory just to figure out the types. Quite naturally, this hogs system memory and results in a highly wasteful process that can be avoided with more explicit code.

So, how do you do this as a standard practice? Reading data from disk is often done from some sort of plain-text file like a CSV. You can read just the first few lines of the CSV file, determine the data types, create a dictionary, and pass it on for the full file read, or reuse it for reading similar files (as long as the column types are unchanged). You can use the dtype parameter in the various pandas reading functions to specify the expected data types.

Here is boilerplate code for accomplishing this task. The function csv_read() accepts a filename (string) argument and returns a DataFrame. Internally, it first reads a sample of the first 20 rows (nrows=20), determines the data types (df_sample.dtypes), creates a dictionary of those types, and then reads the full dataset with the types stated explicitly by passing that dictionary (dtype = dt):
def csv_read(filename):
    """
    Reads a CSV file with explicit data types
    """
    # Reads only the first 20 rows
    df_sample = pd.read_csv(filename, nrows=20)
    # Constructs data type dictionary
    dt = {}
    for col,dtyp in zip(df_sample.columns, df_sample.dtypes):
        dt[col] = dtyp
    # Full read with explicit data type
    df1 = pd.read_csv(filename, dtype = dt)
    return df1
Figure 3-4 shows a visual illustration of the idea of reading sample data first, determining the data type, and then utilizing it for the full reading of the data.
Figure 3-4

Reading large data files in pandas first by determining the data types and then specifying them explicitly while reading

As a practical example, imagine that every morning your data processing pipeline must read a large CSV file containing all the business transactions that were loaded into a data warehouse overnight. The column names and types are unchanged; only the raw data changes every day. You do a lot of data cleaning and processing on the new raw data every day before passing it on to some cool machine learning algorithm. In this situation, you should have your data type dictionary ready and pass it to your file-reading function every morning. You should still run an occasional check to determine whether the data types have changed somehow (e.g., int to float, string to Boolean).
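A hedged sketch of that daily routine might look like the following (the file names are hypothetical); the dtype dictionary is built once from a small sample and then reused for every full read, with an occasional consistency check:
# Build the data type dictionary once from a small sample read
df_sample = pd.read_csv('transactions_sample.csv', nrows=20)
dtype_dict = df_sample.dtypes.to_dict()

def read_daily_file(filename, dtype_dict):
    """Read the full daily dump with explicit, pre-determined column types."""
    return pd.read_csv(filename, dtype=dtype_dict)

df_today = read_daily_file('transactions_latest.csv', dtype_dict)

# Occasional sanity check: have the inferred column types drifted?
df_check = pd.read_csv('transactions_latest.csv', nrows=20)
assert df_check.dtypes.to_dict() == dtype_dict, "Column types have changed!"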

Iterating Over a DataFrame

It is a quite common situation where you are given a large pandas DataFrame and are asked to check some relationships between various fields in the columns, in a row-by-row fashion. The check could be some logical operation or some conditional logic involving a sophisticated mathematical transformation on the raw data.

Essentially, it is a simple case of iterating over the rows of the DataFrame and doing some processing at each iteration. You can choose from the approaches described in the following subsections. Interestingly, some of them are much more efficient than others.

Brute-Force For Loop

The code for this naïve approach will go something like this:
for i in range(len(df)):
    if (some condition is satisfied):
        do some calculation with df.iloc[i]

Essentially, you are iterating over each row (df.iloc[i]) using a generic for loop and processing them one at a time. There’s nothing wrong with the logic and you will get the correct result at the end.

But this is quite inefficient. As you increase the number of columns or the complexity of the calculation (or of the condition checking) done at each iteration, you will see the execution times quickly add up. Therefore, this approach should be avoided as much as possible when building scalable and efficient data science pipelines.

Better Approaches: df.iterrows and df.values

Depending on the situation at hand, you have at least two better approaches for this iteration task.

pandas offers a dedicated method for iterating over rows called iterrows() (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iterrows.html), which might be handy to use in this particular situation. Depending on the DataFrame size and complexity of the row operations, this may reduce the total execution time by ~10X over the for loop approach.

You already saw the pandas attribute for obtaining a NumPy representation of the DataFrame: df.values. Iterating over it can speed things up significantly (even more than iterrows). However, this approach drops the axis labels (column names), so you must use generic NumPy positional indexing like 0, 1 to process the data. The pseudocode looks like the following:
for row in df.values:
    if function(row) satisfies some condition:
        do some calculation with row
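As a concrete (and purely hypothetical) illustration, assuming the profile DataFrame from the earlier example with Age and Price columns, the two patterns could look like this:
# Using iterrows(): each row comes back as an (index, Series) pair with column labels
flagged = []
for idx, row in df.iterrows():
    if row['Price'] > 2 * row['Age']:
        flagged.append(idx)

# Using .values: rows are plain NumPy arrays, so columns are addressed by position
price_pos = df.columns.get_loc('Price')
age_pos = df.columns.get_loc('Age')
flagged = []
for i, row in enumerate(df.values):
    if row[price_pos] > 2 * row[age_pos]:
        flagged.append(i)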

A clear, worked-out example comparing the efficiencies of multiple pandas methods can be found in the article cited below. It also shows how the speed improvement depends on the complexity of the specific operation at each iteration: "Faster Iteration in pandas" (https://medium.com/productive-data-science/faster-iteration-in-pandas-15cac58d8226), Towards Data Science, July 2021.

Using Modern, Optimized File Formats

CSV is a flat-file format used widely in data analytics. It is simple to work with and performs decently in small to medium data regimes. However, as you work with bigger files (and perhaps also pay for their cloud-based storage), there are excellent reasons to move toward file formats based on the columnar data storage principle (www.stitchdata.com/columnardatabase/). The basic idea of columnar data storage (vs. the traditional row-based storage) is illustrated in Figure 3-5.
Figure 3-5

Columnar (vs. traditional row-based) data format illustration

Apache Parquet is one of the most popular of these columnar file formats. It’s an excellent choice in the situation when you have to store and read large data files from disk or cloud storage. Parquet is intimately related to the Apache Arrow framework. But what is Apache Arrow?

As per its website (https://arrow.apache.org/), "Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware."

Therefore, to take advantage of this columnar storage format, you need to use some kind of Python binding or tool to read data stored in Parquet files into the system memory and possibly transform that into a pandas DataFrame for the data analytics tasks. This can be accomplished by using the PyArrow framework.

Impressive Speed Improvement

PyArrow is a Python binding (API) for the Apache Arrow framework. Detailed coverage of Apache Arrow or PyArrow (https://arrow.apache.org/docs/python/) is far beyond the scope of this book, but interested readers can refer to the official documentation at https://arrow.apache.org/ to get started.

Using the PyArrow function read_table, you can demonstrate a considerable improvement in reading speed for large data files over the commonly used pandas read_csv method. For example, Figure 3-6 shows the ratio of pandas to PyArrow reading times for the same data, stored in CSV and Parquet, respectively. The ratio goes up as the data size increases; PyArrow performs considerably better with larger file sizes.
Figure 3-6

pandas vs. PyArrow reading time ratio for CSV (and Parquet) files. Source: https://towardsdatascience.com/how-fast-is-reading-parquet-file-with-arrow-vs-csv-with-pandas-2f8095722e94, author permission granted

This is truly astonishing to ponder. pandas is based on fast and efficient NumPy arrays, yet it cannot match the file-reading performance of the Parquet format. If we think about it, the reason becomes clear: the file-reading operation has almost nothing to do with how pandas optimizes the in-memory organization of the data after it has been loaded. Therefore, while pandas can be a fast and efficient package for in-memory analytics, we don't have to stay dependent on traditional file formats like CSV or Excel to work with it. Instead, we should move toward more modern and efficient formats like Parquet.

Read Only What Is Needed

Often, you may not need to read all the columns from a columnar storage file. For example, you may apply some filter on the data and choose only selected data for the actual in-memory processing. With CSV files or regular SQL databases, this means you are choosing specific rows out of all the data. However, for the columnar database, this effectively means choosing specific columns. Therefore, you do have an advantage in terms of reading speed when you are reading only a small fraction of columns from the Parquet file.
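A minimal sketch of such a selective read with PyArrow (the file and column names here are hypothetical) looks like this:
import pyarrow.parquet as pq
# Read only two columns out of a potentially very wide Parquet file
table = pq.read_table('transactions.parquet', columns=['customer_id', 'amount'])
df_selected = table.to_pandas()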

Figure 3-7 shows the reading advantage as the number of columns increases for the same CSV vs. Parquet comparison. When you read a very small fraction of columns, say < 10 out of 100, the reading speed ratio becomes as large as > 50 (i.e., you get 50X speedup compared to the regular pandas CSV file reading). The speedup tapers off for large fractions of columns and settles down to a stable value.
Figure 3-7

pandas vs. PyArrow reading time ratio for CSV (and Parquet) files as the number of columns vary. Source: https://towardsdatascience.com/how-fast-is-reading-parquet-file-with-arrow-vs-csv-with-pandas-2f8095722e94, author permission granted

Reading selected columns from a large dataset is an extremely common scenario in data analytics and machine learning tasks. Often, subject matter experts with domain knowledge advise data scientists and preselect a few features from a large dataset, even though the default data collection mechanism may store a file with many more columns/features. In these situations, it makes sense to read only what is needed and process just those columns for the ML workload. Storing the data in a columnar format like Parquet pays handsomely in these cases.

PyArrow to pandas and Back

While the results shown above are impressive, the central question is how to take advantage of this fast and efficient file format for pandas-based data analytics tasks. PyArrow's utility methods make this extremely simple, as this boilerplate code illustrates:
import pyarrow as pa
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3],
                   "b":[2.7,-1.2,5.4],
                   "c": ['abc','xyz','pqr']})
# Convert from pandas to Arrow
table = pa.Table.from_pandas(df)
# Convert back to pandas
df_new = table.to_pandas()
So, there are ready-made functions to convert between PyArrow Tables and pandas DataFrames. You can take advantage of this in the scenario illustrated in Figure 3-8.
Figure 3-8

Storing large datasets in Parquet (vs. CSV) may offer overall speed advantage for many processing tasks with pandas

Suppose you have a large CSV file of numeric quantities with ~1 million rows and 14 columns, and you want to calculate basic descriptive stats on this dataset. Not surprisingly, if you use only pandas code, the majority of the time is taken by the file-reading operation, not by the statistical calculation. You can make this task efficient by storing the file in the Parquet format instead of CSV, reading it with the read_table method, converting it to pandas with the to_pandas method, doing the statistical calculation, and then storing the result back in CSV or Parquet. The output consists of only a few rows/columns because it is just the descriptive stats, so its file format does not matter much. A demo example with a speed comparison is shown in the Jupyter notebook accompanying this book.
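A minimal sketch of that workflow (with hypothetical file names; the accompanying notebook contains the fully timed version) is shown below:
import pyarrow.parquet as pq
# Fast columnar read of the large dataset
df_large = pq.read_table('numeric_data.parquet').to_pandas()
# The actual analytics step: basic descriptive statistics
stats = df_large.describe()
# The result is tiny, so its storage format hardly matters
stats.to_csv('numeric_data_stats.csv')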

Other Miscellaneous Ideas

pandas is such a vast and storied library that there are thousands of ways to improve upon inefficient and non-productive code patterns while using it. A few miscellaneous suggestions are mentioned here.

Remove Orphan DataFrames Regularly

A very common programming pattern is the following:
  • Create a DataFrame from an in-memory object or a file on the disk.

  • Drop or fill Null or NaN values.

  • Apply a user-defined function on certain columns.

  • Group the final dataset by some specific column.

  • Further processing on the grouped object…

Often, data scientists create intermediate DataFrames while executing this pipeline and don’t remove them from the active memory space, thereby piling up orphan or unused DataFrames as large memory-hogging garbage.
df1 = pd.read_csv("A large file")
df2 = df1.dropna()
df3 = df2.apply(user_function, columns = [...])
df4 = df3.groupby([column_1, column_2])
df_final = ...
If the only in-memory object that matters is df_final, then you must actively track and delete all intermediate DataFrames as soon as their utility is over:
df1 = pd.read_csv("A large file")
df2 = df1.dropna()
del(df1)
df3 = df2.apply(user_function, columns = [...])
del(df2)
df4 = df3.groupby([column_1, column_2])
del(df3)
df_final = ...

Chaining Methods

Continuing from the example above, it makes perfect sense to let the system handle the tracking and deleting of intermediate DataFrame objects for a productive codebase (unnamed intermediate objects are garbage-collected automatically). pandas allows chaining methods, which makes this a relatively easy approach to implement. The code can read something like this:
df_final = pd.read_csv("A large file").dropna().apply(user_function, columns = [...]).groupby([column_1, column_2])

As long as the methods and the chained code are readable, this is a perfectly sensible approach.

Using Specialized Libraries to Enhance Performance

There are, in fact, quite a few external libraries that can help speed up pandas tasks significantly. Each of them needs significant space to discuss in any reasonable detail, so I cover them separately in later chapters.

Efficient EDA with Matplotlib and Seaborn

Matplotlib and Seaborn are two widely used visualization libraries for data science tasks in the Python ecosystem. Together, they offer unparalleled versatility, rich graphics options, and deep integration with the Python data science ecosystem for doing any kind of visual analytics you can think of.

However, there are a few common situations where you can end up using these fantastic packages in an inefficient manner. You may also waste valuable time writing unnecessary code, or reaching for additional tools, to make your visual analytics end products more presentable, when the same result could have been accomplished with simple modifications to the settings of Matplotlib and Seaborn. In this section, I cover tips and tricks that come in handy for making your data science and analytics tasks productive when using either of these libraries.

Embrace the Object-Oriented Nature of Matplotlib

Matplotlib is built in a thoughtful manner (www.aosabook.org/en/matplotlib.html), following multiple layers of abstraction and an object-oriented design hierarchy (as shown in Figure 3-9). Almost always, a data scientist deals with the scripting layer to draw quick plots (e.g., plt.scatter(x,y)) and change the look and feel of the graphical output (e.g., plt.xlabel("The x-axis variable", fontsize=15)). Sometimes, they venture into the middle artist layer, creating custom Axes and setting the properties of Figure objects. A data scientist usually does not need to work directly with the backend layer for regular data analytics tasks.
Figure 3-9

Matplotlib layers and core abstractions/objects

However, it is a great education for a data scientist to understand this layered architecture deeply and to follow best practices that leverage the strength of a solid object-oriented design. In particular situations, such as those involving subplots, this becomes especially apparent.

Two Approaches for Creating Panels with Subplots

A simple example of a good practice is not to use the following old style of code to create two subplots or panels stacked vertically:
# Create the main figure
plt.figure()
# The first of two panels
plt.subplot(2, 1, 1) # (rows, columns, panel number)
plt.plot(x, np.sin(x))
# The second panel
plt.subplot(2, 1, 2)
plt.plot(x, np.cos(x));
A better alternative is to use the following code:
# Create an array of two Axes objects
fig, ax = plt.subplots(2)
# Call plot() method on the appropriate object
ax[0].plot(x, np.sin(x))
ax[1].plot(x, np.cos(x))
They produce identical graphical output, as shown in Figure 3-10.
Figure 3-10

Matplotlib subplots panel example

But why is the second approach better or more productive? Think about the cognitive load you would have to carry if there were 5 or 15 plots instead of 2, and the chances of bugs creeping in while writing code like plt.subplot(3, 1, 3) or plt.subplot(4, 4, 13). How would you keep track of all those parameters inside the plt.subplot() calls? The second approach frees you from these considerations by letting you pass in a single number like 2 or 15 and repeat the plot statement that many times on the returned Axes objects.

However, an even better approach is to put this code in a proper function that has a little more intelligence to handle any number of plots and that refactors the plotting statements using a loop.

A Better Approach with a Clever Function

Consider the following code defining a function that can produce a panel with an arbitrary number of plots (always in a three-column format, respecting the natural width of a webpage or a book), dynamically adjusting the number of rows to the total number of subplots:
def plot_panels(n):
    """
    Produce a panel consisting of a variable number of rows and 3 columns
    """
    # Compute the number of rows needed to fit n plots in 3 columns
    if n%3==0:
        nrows = int(n/3)
    else:
        nrows = n//3+1
    ncols = 3
    fig, ax = plt.subplots(nrows, ncols, figsize=(15,nrows*3))
    # Flatten the 2-D array of Axes so it can be indexed with a single number
    axes = ax.ravel()
    for i in range(n):
        axes[i].plot(x, np.sin(x))  # x is assumed to be defined earlier in the session

Here, you can change the variable n to any value. Internally, the function always calculates the appropriate number of rows using the logic in the code and sets ncols = 3. Here, ax is a (multi-dimensional) array of Matplotlib Axes objects (https://matplotlib.org/stable/api/axes_api.html) and can therefore be indexed as axes[i] within a loop after you flatten it with the axes = ax.ravel() statement.

When you call this function with plot_panels(5), you get the result shown in Figure 3-11.
Figure 3-11

Matplotlib panel function output with five plots

Note the blank canvas in the last row. This is because the plots must be arranged in a rectangular grid; placing five plots on a 2 x 3 grid leaves the last slot blank. When you call the same function with plot_panels(15), you get the result shown in Figure 3-12.
Figure 3-12

Matplotlib panel function output with 15 plots

It is the object-oriented programming pattern you embraced in the function definition that results in this scalable and efficient mechanism for generating any number of plots without worrying about potential bugs. This type of practice makes the codebase productive in the long run.

Set and Control Image Quality

Matplotlib interacts with the user's graphical output system (web browser or stand-alone window) in a complex manner and optimizes the output image quality with a balanced set of internal settings. However, it is possible to tweak those settings to get exactly the quality you desire.

This becomes particularly important when using Matplotlib in the Jupyter notebook environment, which is an extremely common scenario. The quality of the default image, rendered in the Jupyter notebook's browser, may not be good enough for publication in a book or for further processing. Data scientists often spend additional time and effort enhancing the quality of the visualizations they produce, even though Matplotlib provides a simple and intuitive way to accomplish the same thing.

Setting DPI Directly in plt.figure()

Setting the dots per inch is easily done with just one parameter:
plt.figure(figsize=(6,4),dpi=150)
plt.plot(x,y)

In a Jupyter notebook, the default DPI value is quite low. Depending on your system settings, it is generally between 70 and 100. When you increase it, your figure also gets bigger, so you have to be mindful of not clipping the image in your browser window.

Setting DPI and Output Format for Saving Figures

In addition, or alternatively, you may want to save the plot as a file on your local disk for later use. You can choose the DPI and the output format:
plt.figure(figsize=(6,4))
plt.plot(x,y)
plt.title("Parabola", fontsize=16)
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.savefig("Parabola.png",
            dpi=300,
            format = 'png')

When you choose JPEG as the output format, you can control a host of other settings related to JPEG compression. However, PNG and PDF are better choices for publication-worthy quality since they are lossless formats.

What is a good DPI to choose?

It depends on the intended usage, of course. For print, 150 dpi is considered low quality, even though 72 dpi is the accepted standard for the web (which is why it's not easy to print quality images taken straight from the web). Low-resolution images will show blurring and pixelation (https://en.wikipedia.org/wiki/Pixelation) when printed. Medium-resolution images fall between 200 and 300 dpi. The industry standard for quality photographs and images is typically 300 dpi.

Adjust Global Parameters

The Matplotlib back end provides ultimate flexibility in terms of setting global parameters that control the look and feel of your visualizations. The rcParams settings (https://matplotlib.org/stable/api/matplotlib_configuration_api.html#matplotlib.RcParams) cover just about every option you can think of. Here is a code example:
import matplotlib as mpl
# Data
x = np.arange(-10,10,0.1)
y = x**2
# Set all backend parameters
mpl.rcParams['lines.linewidth'] = 3
mpl.rcParams['text.color'] = 'red'
mpl.rcParams['lines.linestyle'] = '--'
mpl.rcParams['axes.facecolor'] = '#c3e2e6'
mpl.rcParams['figure.dpi'] = 120
mpl.rcParams['font.style'] = 'italic'
mpl.rcParams['font.weight'] = 'heavy'
# Plot
plt.plot(x,y)
plt.title("Parabola", fontsize=16)
plt.xlabel('x-axis')
plt.ylabel('y-axis')
Note how you had to import the Matplotlib module itself with the statement import matplotlib as mpl, not just matplotlib.pyplot as plt. Also note figure.dpi among the many settings you set in this code. A typical result of these settings is shown in Figure 3-13.
Figure 3-13

Matplotlib global rcParams change illustration

If you have decided on a set of image quality and styling settings, you can store them in a local config file and simply read the values at the beginning of your Jupyter notebook or Python script when importing Matplotlib. That way, every image produced by that script or in that Jupyter session will have the same look and feel.
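For example, one way to do this (a sketch; the style-sheet file name and its contents are hypothetical) is to keep the settings in a Matplotlib style sheet and activate it at the top of every notebook or script:
import matplotlib.pyplot as plt
# Contents of a local file named mystyle.mplstyle (one 'key : value' per line), e.g.,
#   lines.linewidth : 3
#   axes.facecolor  : c3e2e6
#   figure.dpi      : 120
plt.style.use('./mystyle.mplstyle')  # every subsequent plot in this session picks up these settings
plt.plot(x, y)  # x and y as defined earlier
plt.show()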

Did you notice that axes.facecolor was set to the hex string #c3e2e6 in the code above? Matplotlib accepts regular color names like red, green, or blue, as well as hex strings, in its various internal settings. You can simply use an online color picker tool (https://imagecolorpicker.com/) and copy-paste the hex code for better styling of your image.

Tricks with Seaborn

Seaborn is a Python library built on top of Matplotlib with a concentrated focus on statistical visualizations like boxplots, histograms, and regression plots. Naturally, it is a great tool for data scientists to use in a typical exploratory data analysis (EDA) phase. A couple of simple tricks with Seaborn can further improve the productivity of your EDA tasks.

Use Sampled Data for Large Datasets

Seaborn provides excellent APIs/methods to generate beautiful visualizations on all features/variables of your dataset:
  • Pairwise plots (relating every variable in a dataset to another one)

  • Histograms

  • Boxplots

It might be tempting to generate all these plots for all the features and their pairwise combinations (for the pair plot). However, depending on the amount of data and the number of possible combinations for the pairwise plot, the sheer number of raw visual elements can overwhelm your system.

One quick fix for this situation is to use a random sample (a small fraction) of the dataset to generate all these plots. If the data is not too skewed, then by looking at a random sample (or a few of them), you should get a good feel for the patterns and distributions in a typical EDA anyway.

A boilerplate code will look like the following:
N = 100
df_sample = df.sample(N)
plot_seaborn(df_sample)
<more code ...>

Here you pass only 100 samples from the original DataFrame to the plotting function. Note that, to maintain readability and data-structure integrity, you should not hand-roll the random sampling (e.g., by picking row indices with NumPy yourself); use the built-in sample() method, which returns another DataFrame that you can pass straight to the plotting function.
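The plot_seaborn function above is just a placeholder; a minimal sketch of what such a helper might do (an assumption, not a Seaborn built-in) is the following:
import seaborn as sns
import matplotlib.pyplot as plt

def plot_seaborn(df_sample):
    """Quick EDA on a sampled DataFrame: a pair plot plus per-column histograms and boxplots."""
    sns.pairplot(df_sample)  # pairwise relationships between the numeric columns
    plt.show()
    for col in df_sample.select_dtypes('number').columns:
        sns.histplot(df_sample[col])
        plt.show()
        sns.boxplot(x=df_sample[col])
        plt.show()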

Use pandas Correlation with Seaborn heatmap

This is a trick to quickly visualize the correlation strengths between multiple features of your dataset with just two lines of code. This kind of trick should be a standard part of your efficient data science toolkit.

Here is a code snippet:
df_mpg = sns.load_dataset('mpg')
# In newer pandas versions, numeric_only=True skips the non-numeric columns
mpg_corr = df_mpg.corr(numeric_only=True)
sns.heatmap(mpg_corr, cbar=True, cmap='plasma')
plt.show()
This loads the famous Auto MPG dataset (https://archive.ics.uci.edu/ml/datasets/auto+mpg) and produces the correlation heatmap shown in Figure 3-14, demonstrating the positive and negative correlation strengths between various numerical features of the dataset. The bright colors and italic/bold axis names of this plot are the result of the Matplotlib style settings you did in the previous section. Unless you change them explicitly or start a new Jupyter notebook session, they remain in effect.
Figure 3-14

Using the pandas correlation function with a Seaborn heatmap to get the correlation visualization quickly for any dataset

Use Special Seaborn Methods to Reduce Work

Seaborn provides some special plotting utilities that can reduce the work of a data scientist in common tasks and thereby improve productivity. These utilities should be put to use at every opportunity. Examples include the following (brief sketches follow the list):
  • Doing a linear regression and creating the plots of residuals with residplot

  • Counting the occurrence of categorical variables and plotting them using countplot

  • Using clustermap to create a hierarchical colored diagram from a matrix dataset
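Brief sketches of these utilities, using the Auto MPG dataset loaded as df_mpg in the previous example (and its correlation matrix mpg_corr), are shown below:
# Linear-regression residuals of mpg vs. horsepower in a single call
sns.residplot(data=df_mpg, x='horsepower', y='mpg')
plt.show()
# Occurrence counts of a categorical variable, counted and plotted in one step
sns.countplot(data=df_mpg, x='origin')
plt.show()
# Hierarchically clustered, colored heatmap of the correlation matrix
sns.clustermap(mpg_corr, cmap='plasma')
plt.show()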

Summary

In this chapter, I started by describing why NumPy is faster than native Python code and quantified its speed advantage in simple scenarios. I talked about the pros and cons of converting Python objects like lists and tuples to NumPy arrays before doing numerical processing. Then, I discussed the importance of vectorizing operations as much as possible for efficient data science pipelines. I also discussed some of the reading utilities that NumPy offers and how they can make your code compact and productive.

Next, I delved into the efficient use of the pandas framework by discussing various methods for iterating over DataFrames and for accessing or setting values. The usage of modern, optimized file storage formats like Parquet (in the context of Apache Arrow and column-oriented data storage) was discussed at length. Miscellaneous ideas like method chaining and cleaning up orphan DataFrames were covered next.

Finally, I showed some tips and tricks for the popular visualization libraries Matplotlib and Seaborn. The object-oriented, layered structure of Matplotlib was shown to be a strong foundation for building efficient plotting code. I also demonstrated various ways of controlling image quality and plot settings globally (i.e., for a Python or Jupyter session). Sampling the data was discussed as a way to control the explosion of plots that can happen with large datasets.

These kinds of tips and tricks are developed over time, based on the data analysis, numerical computing, and exploratory data visualization needs that arise from handling real-life datasets in projects that must be efficient and productive in terms of time and computing resources. As a regular practitioner of data science, you will develop your own tricks and make your data analysis and modeling code efficient. The ideas in this chapter are just introductory pointers to get you thinking in that direction.
