7.1 Introduction

The NumPy (Numerical Python) library first appeared in 2006 and is the preferred Python array implementation. It offers a high-performance, richly functional n-dimensional array type called ndarray, which from this point forward we’ll refer to by its synonym, array. NumPy is one of the many open-source libraries that the Anaconda Python distribution installs. Operations on arrays can be up to two orders of magnitude faster than those on lists. In a big-data world in which applications may do massive amounts of processing on vast amounts of array-based data, this performance advantage can be critical. According to libraries.io, over 450 Python libraries depend on NumPy. Many popular data science libraries, such as pandas, SciPy (Scientific Python) and Keras (for deep learning), are built on or depend on NumPy.
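To get a feel for the performance difference, the following minimal sketch times summing one million integers stored in a list and in an array. The variable names and the choice of summation as the test operation are our own assumptions for illustration, and the measured times depend on your machine, so treat the result as a rough comparison rather than a benchmark.

```python
import timeit

import numpy as np

n = 1_000_000  # number of elements (an arbitrary size for illustration)

# The same integers stored in a Python list and in a NumPy array
values_list = list(range(n))
values_array = np.arange(n, dtype=np.int64)

# Time ten summations of each collection
list_seconds = timeit.timeit(lambda: sum(values_list), number=10)
array_seconds = timeit.timeit(lambda: values_array.sum(), number=10)

print(f'list:  {list_seconds:.4f} seconds')
print(f'array: {array_seconds:.4f} seconds')
```

On most systems the array version should finish noticeably sooner, though the exact ratio varies with the operation and the data size.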

In this chapter, we explore array’s basic capabilities. Lists can also have multiple dimensions, but you generally process multi-dimensional lists with nested loops or with list comprehensions that have multiple for clauses. A strength of NumPy is “array-oriented programming,” which uses functional-style programming with internal iteration to make array manipulations concise and straightforward, eliminating the kinds of bugs that can occur with the external iteration of explicitly programmed loops.
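The contrast between external and internal iteration can be sketched as follows. The grades data and variable names are hypothetical, chosen only to illustrate the idea: the list versions spell out every loop, while the array versions let NumPy iterate over the elements internally.

```python
import numpy as np

# A two-dimensional list: two rows of three grades (made-up sample data)
grades_list = [[87, 96, 70], [100, 87, 90]]

# External iteration: nested loops that total every element
total = 0
for row in grades_list:
    for grade in row:
        total += grade

# External iteration: a list comprehension with two for clauses.
# Note that the result is a flattened, one-dimensional list.
curved_list = [grade + 5 for row in grades_list for grade in row]

# Internal iteration: NumPy applies each operation to every element
grades = np.array(grades_list)
array_total = grades.sum()
curved = grades + 5  # the result keeps the two-dimensional shape

print(total, array_total)
print(curved_list)
print(curved)
```

The array expressions contain no loop counters or accumulator variables, which are the usual places where explicitly programmed loops go wrong.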

In this chapter’s Intro to Data Science section, we begin our multi-section introduction to the pandas library that you’ll use in many of the data science case study chapters. Big data applications often need more flexible collections than NumPy’s arrays: collections that support mixed data types, custom indexing, missing data, data that’s not structured consistently and data that needs to be manipulated into forms appropriate for the databases and data analysis packages you use. We’ll introduce pandas’ array-like one-dimensional Series and two-dimensional DataFrames and begin demonstrating their powerful capabilities. After reading this chapter, you’ll be familiar with four array-like collections: lists, arrays, Series and DataFrames. We’ll add a fifth, tensors, in the “Deep Learning” chapter.
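As a brief preview, the sketch below shows a Series with custom indices and a missing value, and a DataFrame whose columns hold different data types. The names and values are invented for illustration, and this is only a small taste of what the pandas sections cover.

```python
import numpy as np
import pandas as pd

# A one-dimensional Series with custom string indices and one missing value
temperatures = pd.Series([98.6, np.nan, 99.1], index=['Wally', 'Eva', 'Sam'])

# A two-dimensional DataFrame whose columns have different data types
students = pd.DataFrame({
    'name': ['Wally', 'Eva', 'Sam'],   # strings
    'grade': [87, 100, 94],            # integers
    'passed': [True, True, True],      # booleans
})

print(temperatures['Sam'])         # access an element by its custom index
print(temperatures.isna().sum())   # count the missing values
print(students.dtypes)             # show each column's data type
```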

Self Check

  1. (Fill-In) The NumPy library provides the ________ data structure, which is typically much faster than lists.
    Answer: ndarray.
