19

Life Outside of Pandas

19.1 The (Scientific) Computing Stack

When Jake VanderPlas1 gave the SciPy2 2015 keynote address,3 he titled his talk “The State of the Stack”. Jake described how the community of packages surrounding the core Python language developed. Python itself dates to the late 1980s. Numerical computing support began with the Numeric library in 1995, which eventually evolved into the NumPy library in 2006. NumPy is the basis of the Pandas Series objects we have worked with throughout this book. The core plotting library, Matplotlib, was created in 2002 and is used within Pandas via the plot method. Pandas’ ability to work with heterogeneous data lets the analyst clean different types of data for subsequent analysis with the various scikit packages, which grew out of the SciPy library.

1. Jake VanderPlas: http://vanderplas.com/

2. SciPy Conference: https://conference.scipy.org/

3. Jake’s SciPy 2015 keynote address: https://speakerdeck.com/jakevdp/the-state-of-the-stack-scipy-2015-keynote

There have also been advances in how we interface with Python. In 2001, IPython was created to provide more interactivity with the language and the shell. In 2012, the IPython Notebook (which later grew into Project Jupyter) brought the interactive notebook to Python, further solidifying the language as a scientific computing platform, as this tool provides an easy and highly extensible way to do literate programming and much more.

However, the Python ecosystem includes more than just these few libraries and tools. SymPy4 is a fully functional computer algebra system (CAS) in Python that can do symbolic manipulation of mathematical formulas and equations. While Pandas is great for working with rectangular flat files and has support for hierarchical indices, the xarray library5 gives Python the ability to work with n-dimensional arrays. If you think of a Pandas DataFrame as a labeled two-dimensional array, xarray’s objects are the n-dimensional equivalent. These types of data are frequently encountered within the scientific community.

4. SymPy: https://www.sympy.org/

5. Xarray: http://xarray.pydata.org/
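
As a quick illustration of the kind of symbolic manipulation SymPy can do, here is a minimal sketch (the variable and result names are my own examples):

```python
import sympy as sp

x = sp.symbols("x")  # declare a symbolic variable

expanded = sp.expand((x + 1) ** 2)           # x**2 + 2*x + 1
derivative = sp.diff(expanded, x)            # 2*x + 2
solutions = sp.solve(sp.Eq(expanded, 0), x)  # [-1]

print(expanded, derivative, solutions)
```

Everything stays exact and symbolic until you ask for a numeric answer, which is what sets a CAS apart from NumPy-style numerical computing.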

19.2 Performance

“Premature optimization is the root of all evil,” as Donald Knuth famously put it. Write your Python code in a way that works first and gives you a result you can test. If it’s not fast enough, then you can work on optimizing it. The SciPy ecosystem has libraries that make Python faster, such as Cython and Numba.

19.2.1 Timing Your Code

Appendix V gives an example of using the Jupyter %%timeit cell magic to time your code. This can be helpful for comparing different methods or implementations, but it does not necessarily tell you where to focus your optimization efforts.
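
Outside of Jupyter, the standard library’s timeit module gives the same kind of measurement. Here is a minimal sketch comparing two implementations (the functions are my own examples):

```python
import timeit

def with_loop():
    result = []
    for i in range(1000):
        result.append(i * 2)
    return result

def with_comprehension():
    return [i * 2 for i in range(1000)]

# run each implementation many times, like %%timeit does in Jupyter
loop_time = timeit.timeit(with_loop, number=1000)
comp_time = timeit.timeit(with_comprehension, number=1000)
print(f"loop: {loop_time:.4f}s  comprehension: {comp_time:.4f}s")
```

Both calls return the same values, so the only question timeit answers is which one gets there faster.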

19.2.2 Profiling Your Code

Other tools such as cProfile6 and snakeviz7 can help you time entire scripts and blocks of code and give a function-by-function breakdown of their execution. Additionally, snakeviz comes with an IPython extension!

6. cProfile: https://docs.python.org/3/library/profile.html#module-cProfile

7. SnakeViz: https://jiffyclub.github.io/snakeviz/
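
A minimal sketch of profiling with the standard library: cProfile records every function call, and pstats formats the results (the function being profiled is my own example):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    total = 0
    for i in range(n):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_sum(100_000)
profiler.disable()

# print a per-function breakdown, sorted by cumulative time
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
print(buffer.getvalue())
```

snakeviz reads the same profile data (saved with profiler.dump_stats) and renders it as an interactive visualization in the browser.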

19.2.3 Concurrent Futures

Many different libraries and frameworks are available to help scale up your computation. concurrent.futures8 lets you parallelize existing code with minimal changes: you essentially rewrite your function calls in the style of the built-in map function.9

8. concurrent.futures: https://docs.python.org/3/library/concurrent.futures.html

9. Python map(): https://docs.python.org/3/library/functions.html#map
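
A minimal sketch of that pattern: once a loop is written as a call to the built-in map, swapping in an executor’s map runs the same calls concurrently (here with threads; the worker function is my own example):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

numbers = [1, 2, 3, 4, 5]

# built-in map: one call at a time
serial = list(map(square, numbers))

# executor.map has the same call signature, but runs across workers
with ThreadPoolExecutor(max_workers=4) as executor:
    parallel = list(executor.map(square, numbers))

print(serial, parallel)
```

For CPU-bound work you would swap in ProcessPoolExecutor, which has the same interface but sidesteps the global interpreter lock.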

19.3 Dask

Dask is another library that is geared toward working with large data sets.10 It lets you build a computational graph in which only calculations that are out of date need to be recalculated, and it parallelizes those calculations on your own (single) machine or across multiple machines in a cluster. You can write code on your laptop and then quickly scale it up to larger compute clusters. The nicest part of Dask is that its syntax aims to mimic the syntax of Pandas, which lowers the overhead involved in learning to use the library.

10. Dask: https://www.dask.org/

19.4 Siuba

The tidyverse set of packages for the R programming language broke the data processing pipeline down into single steps, so that each step could be turned into a separate function call (aka a verb). This is similar to how method chaining works in Pandas. Siuba builds on top of the Pandas library and tries to port the Tidyverse verbs into Pandas.11

11. Siuba documentation: https://siuba.readthedocs.io

19.5 Ibis

The Ibis project provides a high-level API over tabular data.12 The main benefit is that it gives the user a consistent way to interact with databases, Dask, and Pandas.

12. Ibis project: https://ibis-project.org

19.6 Polars

Polars is a Python (and Rust) dataframe library built on top of Apache Arrow.13 Its API is similar to Pandas, but relies heavily on method chaining and expressions. It also does away with the Pandas index, something this book has largely avoided for the sake of simplicity. The Polars documentation contains a user’s guide that is worth looking into: https://polars.github.io/polars-book

13. Polars Library: https://www.pola.rs/

19.7 PyJanitor

pyjanitor is a Python library that extends Pandas DataFrame objects by providing additional DataFrame methods to make data processing a little easier.14 It is modeled after the R package janitor, and has a lot of convenient methods for common data processing steps.

14. pyjanitor documentation: https://pyjanitor-devs.github.io/pyjanitor/

19.8 Pandera

Many of the steps in the data processing workflow involve checking and validating data. The pandera library provides a mechanism for you to test your data.15 For example, you can use it to make sure a particular column contains only valid values. The tools provided in pandera allow you to check your data and have the code fail when the data does not meet your assumptions, before you model the data and draw conclusions from it.

15. pandera documentation: https://pandera.readthedocs.io/

19.9 Machine Learning

This book aimed to lay a foundation for all the parts of the data science process. It’s hard to be completely inclusive and cover everything that a data scientist might need. Machine learning methods like XGBoost have become extremely popular for their ability to work with a wide variety of data sets and perform well on prediction tasks.16 We touched on scikit-learn pipelines in Section 13.4.17

16. XGBoost: https://xgboost.readthedocs.io/

17. scikit-learn pipelines: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

To use these machine learning models in production, we need to be able to maintain, version control, deploy, and monitor them. This is where MLOps (Machine Learning Operations) comes into play, and tools like vetiver can help with that.18

18. Vetiver: https://vetiver.rstudio.com/

19.10 Publishing

This book was written in a publishing system called Quarto.19 Quarto allows you to do “literate programming”, where you mix prose with code and code output. What I like about Quarto is that it is a single program that lets me write reports, books, websites, presentations, and more. It also allows me to work in R and Python simultaneously, which this book does in Appendix Z.

19. Quarto: https://quarto.org/docs/books

JupyterBook is another literate programming platform that builds on Jupyter Notebooks to create a book format.20

20. JupyterBook: https://jupyterbook.org/

19.11 Dashboards

Over the years, many dashboard libraries have been created for Python. Dash,21 Streamlit,22 Panel,23 and Voilà24 are some of them. I’ve personally done a lot of my data science communication work in the R ecosystem, so I was happy that Shiny for Python25 had recently been announced at the time of writing, since it is similar to what I already know. All these dashboard platforms have pros and cons, and they involve tradeoffs among learning curve, scalability, and flexibility.

21. Dash: https://plotly.com/dash/

22. Streamlit: https://streamlit.io/

23. Panel: https://panel.holoviz.org/

24. Voilà: https://voila.readthedocs.io

25. Shiny for Python: https://shiny.rstudio.com/py/

Conclusion

Pandas is a popular data science library in Python. Its ubiquity has made it the go-to library when working with data in Python. However, it may not meet everyone’s needs, which is why so many other libraries have been built to parallel or extend it. This book mainly focuses on Pandas as the tool to help you think about data processing and to give you the foundation to explore other dataframe libraries.

Look out for additional chapters published for free with the book. Many of the libraries mentioned in this part of the book will be expanded upon and released online.
