17. Life Outside of Pandas

17.1 The (Scientific) Computing Stack

When Jake VanderPlas1 gave the SciPy2 2015 keynote address,3 he titled his talk “The State of the Stack.” In it, he described how the community of packages surrounding the core Python language developed. Python the language was created in the 1980s. Numerical computing in Python began in 1995 and eventually evolved into the NumPy library in 2006. NumPy is the basis of the Pandas Series objects that we have worked with throughout this book. The core plotting library, Matplotlib, was created in 2002 and is also used within Pandas through the plot method. Pandas’s ability to work with heterogeneous data allows the analyst to clean different types of data for subsequent analysis using the scikits, which stemmed from the SciPy package in 2000.

1. Jake VanderPlas: https://staff.washington.edu/jakevdp/

2. SciPy Conference: https://conference.scipy.org/

3. Jake’s SciPy 2015 keynote address: https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote

There have also been advances in how we interface with Python. In 2001, IPython was created to provide more interactivity with the language and the shell. In 2012, Project Jupyter created the interactive notebook for Python, which further solidified the language as a scientific computing platform, as this tool provides an easy and highly extensible way to do literate programming and much more.

However, the Python ecosystem includes more than just these few libraries and tools. SymPy4 is a fully functional computer algebra system (CAS) in Python that can do symbolic manipulation of mathematical formulas and equations. While Pandas is great for working with rectangular flat files and has support for hierarchical indices, the xarray library5 gives Python the ability to work with n-dimensional arrays. If you think of a Pandas dataframe as a labeled two-dimensional array, then xarray gives you the equivalent of an n-dimensional dataframe. These types of data are frequently encountered within the scientific community. If you often have to work with various data input and output types, you might want to take a look at the odo library (Appendix T).

4. SymPy: www.sympy.org/en/index.html

5. xarray: http://xarray.pydata.org/en/stable/
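
To give a flavor of these libraries, here is a minimal sketch, assuming sympy and xarray are installed (the variable names and data are made up for illustration): SymPy manipulates an expression symbolically, and xarray builds a three-dimensional array with named dimensions.

import numpy as np
import sympy
import xarray as xr

# SymPy: expand a symbolic expression
x = sympy.symbols('x')
expanded = sympy.expand((x + 1) ** 2)  # x**2 + 2*x + 1

# xarray: an n-dimensional array with labeled dimensions
arr = xr.DataArray(np.random.rand(2, 3, 4),
                   dims=('time', 'lat', 'lon'))
time_mean = arr.mean(dim='time')  # reduce along a named dimension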

17.2 Performance

“Premature optimization is the root of all evil,” as Donald Knuth famously put it. Write your Python code in a way that works first and gives you a result you can test. If it’s not fast enough, then you can work on optimizing the code. The SciPy ecosystem has libraries that make Python faster: cython and numba.

17.2.1 Timing Your Code

IPython also comes with “magic commands”6 that provide even more features to enhance the language. For example, the timeit magic times the execution of a Python statement or expression. You can use it to benchmark your code and see which parts are slowing it down. To see how it works, let’s use the examples from Section 9.5.

6. IPython built-in magic commands: http://ipython.readthedocs.io/en/stable/interactive/magics.html

We begin by applying a function with axis=1.

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': [10, 20, 30],
                   'b': [20, 30, 40]})

def avg_2_apply(row):
    x = row[0]
    y = row[1]
    if (x == 20):
        return np.nan
    else:
        return (x + y) / 2

%%timeit
df.apply(avg_2_apply, axis=1)

475 μs ± 7.37 μs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Then we vectorize our function using numpy.

@np.vectorize
def v_avg_2_mod(x, y):
    if (x == 20):
        return np.nan
    else:
        return (x + y) / 2

%%timeit
v_avg_2_mod(df['a'], df['b'])

91.5 μs ± 2.73 μs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Finally, we time our calculations using numba.

import numba

@numba.vectorize
def v_avg_2_numba(x, y):
    if (int(x) == 20):
        return np.nan
    else:
        return (x + y) / 2

%%timeit
v_avg_2_numba(df['a'].values, df['b'].values)

10.9 μs ± 70.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

You can compare performance by looking at the time per loop each method takes. numba is clearly the fastest method in this example.

17.2.2 Profiling Your Code

Other tools such as cProfile7 and snakeviz8 can help you profile entire scripts and blocks of code, giving a function-by-function breakdown of where time is spent. Additionally, snakeviz comes with an IPython extension!

7. cProfile: https://docs.python.org/3.4/library/profile.html#module-cProfile

8. SnakeViz: https://jiffyclub.github.io/snakeviz/
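
As a minimal sketch of profiling (the slow_sum function here is made up for illustration), cProfile can time a call directly from Python and print a per-function breakdown.

import cProfile

def slow_sum(n):
    # deliberately slow: a pure Python loop
    total = 0
    for i in range(n):
        total += i
    return total

# print per-function timings, sorted by cumulative time
cProfile.run('slow_sum(1_000_000)', sort='cumulative')

In an IPython or Jupyter session, loading the extension with %load_ext snakeviz lets you profile and visualize a statement with the %snakeviz magic.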

17.3 Going Bigger and Faster

Many different libraries and frameworks are available to help scale up your computation. concurrent.futures9 allows you to essentially rewrite your function calls as calls to the built-in map function,10 but have them run in parallel across threads or processes. Dask11 is another library that is geared toward working with large data sets. It allows you to create a computational graph, in which only calculations that are “out of date” need to be recalculated. Dask can also help parallelize calculations on your own (single) machine or across multiple machines in a cluster. It creates a system in which you can write code on your laptop and then quickly scale it up to larger compute clusters. The nicest part of Dask is that its syntax aims to mimic that of Pandas, which lowers the overhead involved in learning the library; a short sketch of both approaches follows the footnotes. A great set of notebooks12 goes over these techniques.

9. concurrent.futures: https://docs.python.org/3/library/concurrent.futures.html

10. Python map: https://docs.python.org/3.6/library/functions.html#map

11. Dask: https://dask.pydata.org/en/latest/

12. Parallel Python tutorial: https://github.com/pydata/parallel-tutorial
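
Here is a minimal sketch of both approaches, assuming dask is installed; the square function, the column names, and the large_file.csv path are made up for illustration.

from concurrent.futures import ProcessPoolExecutor
import dask.dataframe as dd

def square(x):
    return x ** 2

if __name__ == '__main__':
    # concurrent.futures: executor.map is a parallel drop-in for the
    # built-in map (the __main__ guard matters when processes are spawned)
    with ProcessPoolExecutor() as executor:
        results = list(executor.map(square, range(10)))

    # Dask: Pandas-like syntax that builds a lazy task graph;
    # nothing is computed until .compute() is called
    ddf = dd.read_csv('large_file.csv')
    avg_b = ddf.groupby('a')['b'].mean().compute()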
