17. Life Outside of Pandas

17.1 The (Scientific) Computing Stack

When Jake VanderPlas1 gave the SciPy2 2015 keynote address,3 he titled his talk “The State of the Stack.” In it, he described how the community of packages surrounding the core Python language developed. Python the language was created in the 1980s. Numerical computing in Python began in 1995 and eventually evolved into the NumPy library in 2006. NumPy is the basis of the Pandas Series objects that we have worked with throughout this book. The core plotting library, Matplotlib, was created in 2002 and is also used within Pandas through the plot method. Pandas’s ability to work with heterogeneous data allows the analyst to clean different types of data for subsequent analysis using the scikits, which stemmed from the SciPy package in 2000.

1. Jake VanderPlas: https://staff.washington.edu/jakevdp/

2. SciPy Conference: https://conference.scipy.org/

3. Jake’s SciPy 2015 keynote address: https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote

There have also been advances in how we interface with Python. In 2001, IPython was created to provide more interactivity with the language and the shell. In 2012, Project Jupyter created the interactive notebook for Python, which further solidified the language as a scientific computing platform, as this tool provides an easy and highly extensible way to do literate programming and much more.

However, the Python ecosystem includes more than just these few libraries and tools. SymPy4 is a fully functional computer algebra system (CAS) in Python that can do symbolic manipulation of mathematical formulas and equations. While Pandas is great for working with rectangular flat files and has support for hierarchical indices, the xarray library5 gives Python the ability to work with n-dimensional arrays. If you think of a Pandas dataframe as a labeled two-dimensional array, then xarray gives you the equivalent of an n-dimensional dataframe. These types of data are frequently encountered within the scientific community. If you often have to work with various data input and output types, you might want to take a look at the odo library (Appendix T).

4. SymPy: www.sympy.org/en/index.html

5. xarray: http://xarray.pydata.org/en/stable/
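
To give a flavor of these libraries, here is a minimal sketch, assuming sympy and xarray are installed (the variable names and data are made up for illustration): SymPy manipulates an expression symbolically, and xarray builds a three-dimensional array with named dimensions.

import numpy as np
import sympy
import xarray as xr

# SymPy: expand a symbolic expression
x = sympy.symbols('x')
expanded = sympy.expand((x + 1) ** 2)  # x**2 + 2*x + 1

# xarray: an n-dimensional array with labeled dimensions
arr = xr.DataArray(np.random.rand(2, 3, 4),
                   dims=('time', 'lat', 'lon'))
time_mean = arr.mean(dim='time')  # reduce along a named dimension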

17.2 Performance

“Premature optimization is the root of all evil,” as Donald Knuth famously put it. Write your Python code in a way that works first and gives you a result you can test. If it’s not fast enough, then you can work on optimizing the code. The SciPy ecosystem has libraries that make Python faster: cython and numba.

17.2.1 Timing Your Code

IPython also comes with “magic commands”6 that provide even more features to enhance the language. For example, the timeit magic times the execution of a Python statement or expression. You can use it to benchmark your code and see which parts are slowing it down. To see how it works, let’s use the examples from Section 9.5.

6. IPython built-in magic commands: http://ipython.readthedocs.io/en/stable/interactive/magics.html

We begin by applying a function with axis=1.

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [10, 20, 30],
                   'b': [20, 30, 40]})

def avg_2_apply(row):
    x = row[0]
    y = row[1]
    if (x == 20):
        return np.nan
    else:
        return (x + y) / 2

%%timeit
df.apply(avg_2_apply, axis=1)

475 μs ± 7.37 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Then we vectorize our function using numpy.

@np.vectorize
def v_avg_2_mod(x, y):
    if (x == 20):
        return np.nan
    else:
        return (x + y) / 2

%%timeit
v_avg_2_mod(df['a'], df['b'])

91.5 μs ± 2.73 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Finally, we time our calculations using numba.

import numba

@numba.vectorize
def v_avg_2_numba(x, y):
    if (int(x) == 20):
        return np.nan
    else:
        return (x + y) / 2

%%timeit
v_avg_2_numba(df['a'].values, df['b'].values)

10.9 μs ± 70.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

You can compare performance by looking at the time per loop each method takes. numba is clearly the fastest method in this example.

17.2.2 Profiling Your Code

Other tools such as cProfile7 and snakeviz8 can help you profile entire scripts and blocks of code, giving a function-by-function breakdown of where time is spent. Additionally, snakeviz comes with an IPython extension!

7. cProfile: https://docs.python.org/3.4/library/profile.html#module-cProfile

8. SnakeViz: https://jiffyclub.github.io/snakeviz/
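
As a minimal sketch of profiling (the slow_sum function here is made up for illustration), cProfile can time a call directly from Python and print a per-function breakdown.

import cProfile

def slow_sum(n):
    # deliberately slow: a pure Python loop
    total = 0
    for i in range(n):
        total += i
    return total

# print per-function timings, sorted by cumulative time
cProfile.run('slow_sum(1_000_000)', sort='cumulative')

In an IPython or Jupyter session, loading the extension with %load_ext snakeviz lets you profile and visualize a statement with the %snakeviz magic.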

17.3 Going Bigger and Faster

Many different libraries and frameworks are available to help scale up your computation. concurrent.futures9 allows you to essentially rewrite your function calls as calls to the built-in map function,10 but have them run in parallel across threads or processes. Dask11 is another library that is geared toward working with large data sets. It allows you to create a computational graph, in which only calculations that are “out of date” need to be recalculated. Dask can also help parallelize calculations on your own (single) machine or across multiple machines in a cluster. It creates a system in which you can write code on your laptop and then quickly scale it up to larger compute clusters. The nicest part of Dask is that its syntax aims to mimic that of Pandas, which lowers the overhead involved in learning the library; a short sketch of both approaches follows the footnotes. A great set of notebooks12 goes over these techniques.

9. concurrent.futures: https://docs.python.org/3/library/concurrent.futures.html

10. Python map: https://docs.python.org/3.6/library/functions.html#map

11. Dask: https://dask.pydata.org/en/latest/

12. Parallel Python tutorial: https://github.com/pydata/parallel-tutorial
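
Here is a minimal sketch of both approaches, assuming dask is installed; the square function, the column names, and the large_file.csv path are made up for illustration.

from concurrent.futures import ProcessPoolExecutor
import dask.dataframe as dd

def square(x):
    return x ** 2

if __name__ == '__main__':
    # concurrent.futures: executor.map is a parallel drop-in for the
    # built-in map (the __main__ guard matters when processes are spawned)
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(square, range(10)))

    # Dask: Pandas-like syntax that builds a lazy task graph;
    # nothing is computed until .compute() is called
    ddf = dd.read_csv('large_file.csv')
    avg_b = ddf.groupby('a')['b'].mean().compute()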
