
8. pandas


In Chapter 7, we learned how to import a generic file using basic functions. Here we explore pandas—one of the most important libraries for dataset management.

Libraries for Data Mining

From this point onward, in this chapter and the following ones, we examine the most important and most widely used data mining libraries (a sketch of their conventional import aliases follows this list):
  • pandas: imports, manages, and manipulates data frames in various formats, extracts part of the data, combines two data frames, and also contains some basic statistical functions

  • NumPy: a package for scientific computing that provides high-level mathematical and algebraic functions, random number generation, and the array data structure

  • Matplotlib: a library for creating charts from datasets

  • SciPy: contains more than 60 functions for mathematical and statistical analysis

  • scikit-learn: the most important tool for machine learning and data analysis.
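As a quick orientation, here is a sketch of the conventional import aliases used for these libraries in the chapters that follow (exactly which submodules you import depends on the task at hand):
>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from scipy import stats
>>> from sklearn import datasets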

pandas

In this chapter, we move away from discussions of Python structures and start to look at the most important data mining packages, beginning with the pandas library.

pandas is an open-source Python library that contains various tools for importing, managing, and manipulating data. It has a number of high-level features for manipulating, reorganizing, and scanning structured data, including slicing, managing missing values, restructuring data, extracting parts of a dataset, and importing and parsing data from the Web. pandas is one of the most important data mining libraries. We can
  • Read and import structured data

  • Organize and manipulate the data

  • Calculate some basic statistics

We can import the whole library:
>>> import pandas as pd
or import only the primary pandas structures, Series and DataFrame:
>>> from pandas import Series, DataFrame

As we shall see, pandas, NumPy, SciPy, and Matplotlib typically work together. For clarity, I present them separately at first.

pandas: Series

As a first action, we always download and install the package or packages we need—in this case, pandas. pandas is part of the Anaconda suite, so you do not have to install it if you have Anaconda installed. Call it this way:
>>> import pandas as pd
As you can see, we import the package by creating a shortcut: pd. To call a pandas function at this point, we simply write
pd.function_name()
We can also import specific structures, such as Series and DataFrames:
>>> from pandas import Series, DataFrame

In this case, when we call up or create one of these two structures, we need not specify “pd” at the beginning.

The first pandas structure is Series—a one-dimensional array characterized by an index. Let’s create our first series:
>>> series1 = Series([25, 27, 28, 30],
          index = ["student1", "student2", "student3", "student4"])
>>> print(series1)
student1    25
student2    27
student3    28
student4    30
dtype: int64
# we then use the .describe() function to get some statistical information about the series
>>> series1.describe()
count     4.000000
mean     27.500000
std       2.081666
min      25.000000
25%      26.500000
50%      27.500000
75%      28.500000
max      30.000000
dtype: float64
Let’s look at the first element of the series:
>>> series1[0]

25

We can also extract multiple elements:
>>> series1[[2,3]]
student3    28
student4    30
dtype: int64
Or elements from the index:
>>> series1["student1"]
25
>>> series1[["student1", "student4"]]
student1    25
student4    30
dtype: int64
The .index attribute returns the index of the series:
>>> series1.index
Index(['student1', 'student2', 'student3', 'student4'], dtype="object")
We can also verify the presence of an element in a series:
>>> 'student2' in series1
True
If we did not import Series and DataFrame separately, we can create a series by specifying the pandas shortcut:
>>> series2 = pd.Series([40, 20, 35, 70],
          index = ["price1", "price2", "price3", "price4"])
# if we did not import pandas as pd, we can specify the full package name
>>> pandas.Series([40, 20, 35, 70],
          index = ["price1", "price2", "price3", "price4"])
The pandas series are marked by indexes. If we do not specify the index, it is created automatically:
>>> series3 = Series([25, 27, 28, 30])
series3
0    25
1    27
2    28
3    30
dtype: int64
To replace some of the elements in a series, we use the .replace() method (which returns a new series rather than modifying series3 in place).
>>> series3.replace([25, 27], [125, 127])
0    125
1    127
2     28
3     30
dtype: int64
Let’s create another series: one with random numbers. To do this, we need to load the NumPy library:
>>> import numpy as np
# we create a series of random numbers
>>> series4 = pd.Series(np.random.randn(7))
# we print it and examine the index
series4
0   -0.227393
1   -1.079208
2    0.101591
3    0.157502
4    1.541307
5   -0.182501
6   -0.247327
dtype: float64
# we can replace the index by assigning to the .index attribute
>>> series4.index = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
# and check the index again
series4
a   -0.227393
b   -1.079208
c    0.101591
d    0.157502
e    1.541307
f   -0.182501
g   -0.247327
dtype: float64
# we create another random series
>>> series5 = pd.Series(np.random.randn(7))
# and then concatenate the two series using the pd.concat() function
>>> s45 = pd.concat([series4, series5])
# we print the created series
>>> s45
a   -0.227393
b   -1.079208
c    0.101591
d    0.157502
e    1.541307
f   -0.182501
g   -0.247327
0   -0.610507
1    0.282318
2   -1.142692
3   -1.081449
4   -1.818420
5   -1.133354
6    0.804213
dtype: float64
# we check one of the created structures
>>> type(s45)
pandas.core.series.Series
# we remove some elements of the series
>>> del s45[5]
>>> s45
a   -0.227393
b   -1.079208
c    0.101591
d    0.157502
e    1.541307
f   -0.182501
g   -0.247327
0   -0.610507
1    0.282318
2   -1.142692
3   -1.081449
4   -1.818420
6    0.804213
dtype: float64
>>> del s45['f']
>>> s45
b   -1.079208
c    0.101591
e    1.541307
g   -0.247327
0   -0.610507
1    0.282318
2   -1.142692
3   -1.081449
4   -1.818420
6    0.804213
dtype: float64
# we slice some elements from the series
>>> s45[0]
-0.61050680805211532
>>> s45[2:4]
e    1.541307
g   -0.247327
dtype: float64
# we can slice the first four elements
>>> s45[:4]
# or take everything from the sixth element (position 5) onward
>>> s45[5:]
# or we can extract the last three items
>>> s45[-3:]
# we can extract the first and last elements using the .head() and .tail() methods
>>> s45.head(2)
>>> s45.tail(3)
# we can reverse the series
>>> s45[::-1]
# last, we can create a copy of the series with the .copy() method
>>> copys45 = s45.copy()
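A minimal sketch of why the copy matters: changes to the copy do not affect the original series (the values shown come from the random series generated above, so yours will differ):
>>> copys45['b'] = 0
>>> copys45['b']
0.0
>>> s45['b']    # the original value is unchanged
-1.079208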

pandas: Data Frames

The most important structure in pandas is the DataFrame, a structure which extends the capabilities of the Series and allows us to manage our datasets. Let’s create a small data frame with pandas:
# we import pandas
>>> import pandas as pd
# we use the DataFrame functions to create or import a dataset or file from the computer
# to get help with a particular function, we type
>>> help(pd.DataFrame)
Help on class DataFrame in module pandas.core.frame:
class DataFrame(pandas.core.generic.NDFrame)
  |  Two-dimensional size-mutable, potentially heterogeneous
  |  tabular data structure with labeled axes (rows and columns).
  |  Arithmetic operations align on both row and column labels.
  |  Can be thought of as a dict-like container for Series objects.
  |  The primary pandas data structure
  |  
  |  Parameters
  |  ----------
  |  data : numpy ndarray (structured or homogeneous), dict, or
  |  DataFrame
  |      Dict can contain Series, arrays, constants, or list-like
  |      objects
  |  index : Index or array-like
  |      Index to use for resulting frame. Will default to
  |      np.arange(n) if no indexing information part of input
  |      data and no index provided
  |  columns : Index or array-like
  |      Column labels to use for resulting frame. Will default to
  |      np.arange(n) if no column labels are provided
  |  dtype : dtype, default None
# we create our first dataset
>>> df1 = pd.DataFrame({'Names': ['Simon', 'Kate', 'Francis', 'Laura', 'Mary', 'Julian', 'Rosie'],
        'Height':[180, 165, 170, 164, 163, 175, 166],
        'Weight':[85, 65, 68, 45, 43, 72, 46],
        'Pref_food':['steak', 'pizza', 'pasta', 'pizza', 'vegetables','steak','seafood'],
        'Sex':['m', 'f','m','f', 'f', 'm', 'f']})
# caution: avoid column names that contain spaces; a name such as 'Var 1' works as a dictionary key, but it cannot be used with dot notation (df1.Var 1 is a syntax error), so prefer names such as 'Var_1'
# we print the data frame
>>> df1
   Height    Names   Pref_food Sex  Weight
0     180    Simon       steak   m      85
1     165     Kate       pizza   f      65
2     170  Francis       pasta   m      68
3     164    Laura       pizza   f      45
4     163     Mary  vegetables   f      43
5     175   Julian       steak   m      72
6     166    Rosie     seafood   f      46
# it may be useful to get an overview of our data, which we can do using the .describe() method
>>> df1.describe()
           Height     Weight
count    7.000000   7.000000
mean   169.000000  60.571429
std      6.377042  16.154021
min    163.000000  43.000000
25%    164.500000  45.500000
50%    166.000000  65.000000
75%    172.500000  70.000000
max    180.000000  85.000000
# we can recall variable names
>>> df1.columns
Index(['Height', 'Names', 'Pref_food', 'Sex', 'Weight'],
dtype='object')
At this point, we can continue by setting a variable—in this case, ‘Names’—as index in this way:
>>> df1.set_index('Names')

(output not shown)

# to make the new index permanent on df1, we use the argument inplace = True
>>> df1.set_index('Names', inplace = True)
# we can reset the index as follows:
>>> df1.reset_index()
We can continue to explore our data.
# we check the number of cases and variables
>>> print(df1.shape)
(7, 4)
# and determine variable type
>>> print(df1.dtypes)
Height        int64
Pref_food    object
Sex          object
Weight        int64
dtype: object
Here, too, we can see the first and the last elements using the .head() and .tail() methods. To specify the number of cases to display, we pass it inside the parentheses:
>>> df1.head()
>>> df1.tail(3)
We can also get information about variables using the .info() method:
>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
Height       7 non-null int64
Names        7 non-null object
Pref_food    7 non-null object
Sex          7 non-null object
Weight       7 non-null int64
dtypes: int64(2), object(3)
memory usage: 360.0+ bytes
We can then rearrange the data according to one of the variables:
>>> df1.sort_values(by = 'Weight')

(output not shown)

# we rearranged our data based on weight, from the lowest to the highest value. To reverse these values, from highest to lowest, we specify the ascending argument as False
>>> df1.sort_values(by = 'Weight', ascending = False)

(output not shown)

# if there are any missing values in our dataset (in particular, in the column used for sorting), we can put them all at the beginning or the end of the display in the following ways
>>> df1.sort_values(by = 'Weight', na_position = "last")
>>> df1.sort_values(by = 'Weight', na_position = "first")
# to rearrange the dataset by index, we can use sort_index, instead of sort_values, to sort the data in an ascending or descending way
>>> df1.sort_index()
>>> df1.sort_index(ascending= False)

(output not shown)
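sort_values() also accepts a list of columns, sorting by the first and breaking ties with the next. A sketch, sorting by sex and then by descending weight within each sex:
>>> df1.sort_values(by = ['Sex', 'Weight'], ascending = [True, False])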

Let’s look at some examples of slicing items from a data frame:
# Python starts counting from 0, so let's slice the first three cases (rows 0 to 2):
>>> df1[:3]
      Names    Height Pref_food Sex  Weight
0     Simon      180      steak   m      85
1      Kate      165      pizza   f      65
2   Francis      170      pasta   m      68
# or slice from the fifth case (position 4) to the end
>>> df1[4:]
     Names  Height   Pref_food Sex  Weight
4     Mary     163  vegetables   f      43
5   Julian     175       steak   m      72
6    Rosie     166     seafood   f      46
# or slice one of the columns
>>> df1['Names']
0      Simon
1       Kate
2    Francis
3      Laura
4       Mary
5     Julian
6      Rosie
Name: Names, dtype: object
# by using single square brackets, as we just did, we extract a variable as a Series
# here's another way to do this
>>> df1.Names
0      Simon
1       Kate
2    Francis
3      Laura
4       Mary
5     Julian
6      Rosie
Name: Names, dtype: object
# a third way is to use double square brackets, which returns a data frame
>>> df1[['Names']]

(output not shown)

# we can also select more columns
>>> df1[['Names', 'Sex']]
We can select the value of a variable for a particular element—for example, the third value:
>>> df1['Sex'][2]
'm'
# or some values of some variables
>>> df1['Names'][1:4]
1       Kate
2    Francis
3      Laura
Name: Names, dtype: object
To select items from a data frame we can also use the .loc, .iloc, and .ix methods:
  • .loc: works through the index

  • .iloc: extracts via position

  • .ix: takes both labels and positions into account (deprecated in recent pandas versions; prefer .loc or .iloc)

# .loc
>>> df1.loc[1]
Names         Kate
Height         165
Pref_food    pizza
Sex              f
Weight          65
Name: 1, dtype: object
# we can extract the elements from the first to the fourth
>>> df1.loc[:3]
# .iloc
# we can extract the element found in the first case and in the third variable
>>> df1.iloc[0,2]
'steak'
# .iat is a faster alternative to .iloc for accessing a single value by position
>>> df1.iat[1,3]
'f'
# the .at method instead accesses a single value by label (here, the integer index label and the column name)
>>> df1.at[2,'Sex']
'm'
#  to the left of the comma, we indicate the cases to be extracted; to the right, the columns or variables (table not shown). To extract all elements, we use a colon
# what if we wanted to extract every other case?
>>> df1[::2]

(output not shown)
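With the default integer index used here, .loc[1] and .iloc[1] return the same row, which hides the difference between the two methods. A sketch with a non-default index makes it visible (df_idx is just a temporary copy introduced for this example):
>>> df_idx = df1.set_index('Names')
>>> df_idx.loc['Kate']    # label-based: looks up the index label 'Kate'
>>> df_idx.iloc[1]        # position-based: the second row, which is also Kate here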

Through Boolean operators we can specify extracting conditions. For instance, we can extract the cases of the dataset in which the height of the subjects (Height) is greater than 170:
>>> df1[df1.Height > 170]
     Names  Height  Pref_food  Sex  Weight
0    Simon  180         steak    m      85
5   Julian  175         steak    m      72
We can also specify multiple conditions, such as extracting all females taller than 163:
>>> df1[(df1.Height > 163) & (df1.Sex == 'f')]
Alternatively, we can also use the .query() method:
>>> df1.query("Sex != 'f'")
      Names  Height  Pref_food Sex  Weight
0     Simon     180      steak   m      85
2   Francis     170      pasta   m      68
5    Julian     175      steak   m      72
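The .query() method can also combine several conditions in a single expression, and the @ prefix lets the expression reference Python variables. A sketch:
>>> df1.query("Sex == 'f' and Height > 163")
>>> min_height = 163
>>> df1.query("Sex == 'f' and Height > @min_height")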
Now let’s look at how to rename a column:
>>> df1 = df1.rename(columns = {'Pref_food': 'Food'})
>>> df1.head(2)
    Names  Height   Food  Sex  Weight
0   Simon  180      steak   m      85
1   Kate   165      pizza   f      65
We can create a copy of our data frame:
>>> df2 = df1.copy()
>>> df2

(output not shown)

We can also add a column. For example, let's add a column to the df2 dataset that contains height expressed in meters:
>>> df2['new_col'] = df2.Height/100
>>> df2
     Names  Height        Food Sex  Weight  new_col
0    Simon     180       steak   m      85     1.80
1     Kate     165       pizza   f      65     1.65
2  Francis     170       pasta   m      68     1.70
3    Laura     164       pizza   f      45     1.64
4     Mary     163  vegetables   f      43     1.63
5   Julian     175       steak   m      72     1.75
6    Rosie     166     seafood   f      46     1.66
We can also add a column using the .insert() method. The first argument, 5, is the position where the new column is inserted (the sixth position); the second is the name of the new column; and the third provides its values:
>>> df2.insert(5, column = 'new_col2', value = df2.Height/100)
>>> print(df2.head(2))
    Names  Height   Food Sex  Weight  new_col2  new_col
0   Simon     180  steak   m      85      1.80     1.80
1    Kate     165  pizza   f      65      1.65     1.65
Let’s return to the df1 dataset without the additions (and with the original ‘Pref_food’ column name). By using .groupby(), we can aggregate the dataset around one or more variables.
>>> df1.groupby('Sex')
<pandas.core.groupby.DataFrameGroupBy object at 0x10c8b2be0>
We can create an object that contains aggregated groups according to the ‘Sex’ variable:
>>> grouped1 = df1.groupby('Sex')
# to view the groups, we use group_name.groups
>>> grouped1.groups
{'f': [1, 3, 4, 6], 'm': [0, 2, 5]}
We can also list the cases that belong to each group.
# to see the cases, proceed as follows
>>> for names, groups in grouped1:
...    print(names)
...    print(groups)

(output not shown)

# we can also get only the data belonging to a single group
>>> grouped1.get_group('f')
# we can aggregate cases according to two variables
>>> grouped2 = df1.groupby(['Sex', 'Pref_food'])
>>> grouped2.groups
{('f', 'pizza'): [1, 3],
 ('f', 'seafood'): [6],
 ('f', 'vegetables'): [4],
 ('m', 'pasta'): [2],
 ('m', 'steak'): [0, 5]}
# we can also determine how many cases fall into each of the groups
>>> grouped2.size()
Sex  Pref_food
f    pizza         2
     seafood       1
     vegetables    1
m    pasta         1
     steak         2
dtype: int64
# we can obtain some descriptive statistics about the data
>>> grouped2.describe()

(output not shown)
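A grouped object can also be aggregated directly. A sketch reusing the grouped1 object created earlier (the means follow from the heights in df1):
>>> grouped1['Height'].mean()
Sex
f    164.5
m    175.0
Name: Height, dtype: float64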

We can also count the frequency of a variable using the .value_counts() method.
>>> df1['Sex'].value_counts()
f    4
m    3
Name: Sex, dtype: int64
Returning to the df1 dataset, we can acquire information about the variables using the .info() method.
>>> print(df1)
>>> df1.info()
With the .astype() method, we can change the type of a variable. For example, we can convert a variable from int (numeric) to object (such as a string):
>>> df1['Weight'] = df1['Weight'].astype('object')
>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
Height       7 non-null int64
Names        7 non-null object
Pref_food    7 non-null object
Sex          7 non-null object
Weight       7 non-null object
dtypes: int64(1), object(4)
memory usage: 360.0+ bytes
We can delete cases and variables with the .drop() method:
>>> df1.drop(0)

(output not shown)

# if, on the other hand, we set the index permanently, we drop cases by their index label
>>> df1.set_index('Names', inplace = True)
>>> df1.drop("Laura")
         Height   Pref_food Sex  Weight
Names
Simon       180       steak   m      85
Kate        165       pizza   f      65
Francis     170       pasta   m      68
Mary        163  vegetables   f      43
Julian      175       steak   m      72
Rosie       166     seafood   f      46
# we can remove more than one element at a time
>>> print(df1.drop(["Mary", "Francis"]))
# or remove an entire column
# to do this, we must specify axis 1 (the default axis 0 indicates rows)
>>> df1.drop("Height", axis = 1)
         Pref_food Sex  Weight
Names
Simon        steak   m      85
Kate         pizza   f      65
Francis      pasta   m      68
Laura        pizza   f      45
Mary    vegetables   f      43
Julian       steak   m      72
Rosie      seafood   f      46
To remove multiple columns, include them in a list:
>>> df.drop(["column1", "column2"], axis = 1)
# instead of specifying the axis as 1, we can also specify "columns"
>>> df.drop(["column1", "column2"], axis = "columns")
The pandas package also allows us to create cross-tabs with the crosstab() function. For instance, in the df1 dataset, we can create a cross-tab that correlates gender with favorite food:
>>> pd.crosstab(df1.Sex, df1.Pref_food)
Pref_food  pasta  pizza  seafood  steak  vegetables
Sex
f              0      2        1      0           1
m              1      0        0      2           0
When we add the margins parameter, we also get the row and column totals:
>>> pd.crosstab(df1.Sex, df1.Pref_food, margins = True)
Pref_food  pasta  pizza  seafood  steak  vegetables  All
Sex
f              0      2        1      0           1    4
m              1      0        0      2           0    3
All            1      2        1      2           1    7
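The crosstab() function also accepts a normalize argument that turns the counts into proportions. A sketch normalizing within each row:
>>> pd.crosstab(df1.Sex, df1.Pref_food, normalize = 'index')
# each row now sums to 1: for instance, half of the females (0.50) prefer pizza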

pandas: Importing and Exporting Data

The pandas library is also very important with regard to importing files from your computer in various formats, including .csv:
import pandas as pd
>>> df2 = pd.read_csv("file_address.csv")
# we can import a file by specifying that the first line does NOT contain variable names
>>> df2 = pd.read_csv("file_address.csv", header = None)
# or by specifying which row contains variable names
>>> df2 = pd.read_csv("file_address.csv", header = 0)
>>> df2 = pd.read_csv("file_address.csv", header = 1)
# we can also specify column names
>>> df2 = pd.read_csv("file_address.csv", names = ['var1', 'var2, 'var3, 'var4])
# or we can import only a part of the columns
>>> df2 = pd.read_csv("file_address.csv", usecols = [1,2,3])
# we can also specify the element separating our data
>>> df2 = pd.read_csv("file_address.csv", sep = " ")
# we can now take a look at the first records of the data frame
>>> df2.head()
# or the last
>>> df2.tail()
# we pass the number of cases to be displayed inside the parentheses
>>> df2.head(9)
>>> df2.tail(5)
# and we can see a summary of the dataset
>>> df2.describe()
>>> df2.info()
>>> df2.dtypes # shows the type of each column

The pd.read_csv() function can also be used to import files from the Web. The UCI machine learning repository ( https://archive.ics.uci.edu/ml/index.php ) features several data sets used for machine learning.

Let’s look at the Iris dataset. Figures 8-1 through 8-3 show how to navigate to its page and copy the link to the dataset file.
Figure 8-1 Iris dataset on the UCI machine learning web site; click ‘Iris’ to get to the description page (image not shown)

Figure 8-2 Iris dataset on the UCI machine learning web site (image not shown)

Figure 8-3 Iris dataset on the UCI machine learning web site; click ‘Data Folder’ to access the page containing the dataset (image not shown)

First, we copy the link featuring the dataset.
# we import pandas
>>> import pandas as pd
# we create an object featuring the dataset, which we are importing with pd.read_csv()
>>> iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
# we check the import by displaying the first few rows with .head()
>>> iris.head(3)
   5.1  3.5  1.4  0.2  Iris-setosa
0  4.9  3.0  1.4  0.2  Iris-setosa
1  4.7  3.2  1.3  0.2  Iris-setosa
2  4.6  3.1  1.5  0.2  Iris-setosa
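Note that the iris.data file has no header row, so in the output above pandas used the first observation as the column names. A sketch that avoids this; the column names here are our own choice, not part of the file:
>>> cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
>>> iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
          header = None, names = cols)
>>> iris.head(2)
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa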
The structure of the pandas function we used, pd.read_csv(), also includes other arguments we did not use for this import:
pd.read_csv(filepath, sep = ',', header = 'infer', dtype = None, skiprows = None, index_col = None, skip_blank_lines = True, na_filter = True)
  • filepath: the address of the file (on the computer or external)

  • sep = ',': the separator dividing the data, such as a comma or semicolon

  • dtype=: a means of specifying the column formats

  • header=: the row that contains the variable names, if any

  • skiprows=: a means of importing only part of the cases; for example, skiprows = 50 skips the first 50 rows and reads data from the 51st case onward

  • index_col=: a means of setting a column as the data index

  • skip_blank_lines=: a means of skipping any blank lines in the dataset

  • na_filter=: detects missing values; if set to False, missing values are not detected (which can speed up the import of large files without missing data)

In addition, the usecols= argument can be used to import only a few columns in a dataset. For instance, let’s say we have the small dataset shown in Figure 8-4, from which we want to extract the following variables:
>>> df3 = pd.read_csv("~students.csv",usecols =["ID", "mark1"], index_col= ["ID"])
# we verify data
df3
    mark1
ID
1    17.0
2    24.0
3    17.0
4    27.0
5    30.0
6    30.0
7    23.0
8    17.0
9    21.0
10   24.0
11   24.0
12   25.0
13   24.0
14    NaN
15   22.0
16   30.0
17   29.0
18   29.0
19    NaN
# as expected, we loaded only 'ID' (used as the index) and the 'mark1' variable
Figure 8-4 Dataset students.csv (image not shown)

The mirror-image function is df.to_csv(“filename”), which allows us to write a .csv file into our working directory. We can specify the index=False argument to avoid writing the index together with the data.

As an alternative to this file import function, we can use the pd.read_table(filepath, sep = ',') function, which sets both the file path (filepath) and the separator.

Other pandas features allow us to read Excel .xls or .xlsx files. To do this, we use a generic formula:
df = pd.read_excel(filepath, "sheet_name")

In this case, we must specify not only the address of the file but also, if the Excel file contains more than one data sheet, the name of the sheet from which we want to read data. As with .csv, for Excel formats we also have a function that allows us to write an Excel file in the working directory of our computer: df.to_excel(). The pandas package also contains a function for reading files in JSON—pd.read_json()—and it allows us to read tables from the Web via the pd.read_html(url) function and to convert a data frame to an HTML table via the df.to_html() method.

Let’s look at an example of creating and exporting a file in JSON:
>>> jf = pd.DataFrame(np.arange(16).reshape(4,4), index = ['A', 'B', 'C', 'D'], columns = ['var1', 'var2', 'var3', 'var4'])
>>> jf
   var1  var2  var3  var4
A     0     1     2     3
B     4     5     6     7
C     8     9    10    11
D    12    13    14    15
>>> jf.to_json('jf.json')
# all that is left is to check that the file has been created correctly in our work directory
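The pd.read_html() function deserves a brief sketch: it returns a list of DataFrames, one for each HTML table found on the page, and it requires an HTML parser such as lxml to be installed (the URL below is only an example):
>>> tables = pd.read_html("https://en.wikipedia.org/wiki/Iris_flower_data_set")
>>> len(tables)        # number of tables found on the page
>>> tables[0].head()   # the first table as a data frame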

pandas: Data Manipulation

We have seen how to create groups and manipulate data frames. Now let’s look at more manipulation elements through pandas. First, let’s import pandas and create a small dataset:
>>> df2 = pd.DataFrame({'Names': ['Simon', 'Kate', 'Francis', 'Laura', 'Mary', 'Julian', 'Rosie', 'Simon', 'Laura'],
                   'Height':[180, 165, 170, 164, 163, 175, 166, 180, 164],
                   'Weight':[85, 65, 68, 45, 43, 72, 46, 85, 45],
                   'Pref_food': ['steak', 'pizza', 'pasta', 'pizza', 'vegetables', 'steak', 'seafood', 'steak', 'pizza'],
                   'Sex': ['m','f','m','f','f','m','f', 'm', 'f']})
Starting from the ‘Sex’ variable, let’s create two dummy variables with pandas:
>>> df_dummy = pd.get_dummies(df2['Sex'], prefix = 'Sex')
>>> df_dummy
   Sex_f  Sex_m
0    0.0    1.0
1    1.0    0.0
2    0.0    1.0
3    1.0    0.0
4    1.0    0.0
5    0.0    1.0
6    1.0    0.0
7    0.0    1.0
8    1.0    0.0
At this point, we can join these two dummy variables to the original dataset:
>>> df2.join(df_dummy)
   Height    Names   Pref_food Sex  Weight  Sex_f  Sex_m
0     180    Simon       steak   m      85    0.0    1.0
1     165     Kate       pizza   f      65    1.0    0.0
2     170  Francis       pasta   m      68    0.0    1.0
3     164    Laura       pizza   f      45    1.0    0.0
4     163     Mary  vegetables   f      43    1.0    0.0
5     175   Julian       steak   m      72    0.0    1.0
6     166    Rosie     seafood   f      46    1.0    0.0
7     180    Simon       steak   m      85    0.0    1.0
8     164    Laura       pizza   f      45    1.0    0.0
We save the joined dataset featuring the dummy variables in a new object:
>>> df3 = df2.join(df_dummy)
# let's check
>>> print(df3.head(2))
   Height  Names Pref_food Sex  Weight  Sex_f  Sex_m
0     180  Simon     steak   m      85    0.0    1.0
1     165   Kate     pizza   f      65    1.0    0.0
Now let’s remove the original ‘Sex’ variable:
>>> del df3['Sex']
# double-check
>>> print(df3.head(2))
   Height  Names  Pref_food  Weight  Sex_f  Sex_m
0     180  Simon      steak      85    0.0    1.0
1     165   Kate      pizza      65    1.0    0.0
As you can see, there are a couple of duplicate cases inside our dataset. First, we need to identify them:
>>> df2.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7     True
8     True
dtype: bool
# our dataset has two duplicate cases
To remove duplicate cases, we use .drop_duplicates():
>>> df2.drop_duplicates()
To get a dataset without duplicates, we overwrite the old one or create another object:
>>> df3 = df2.drop_duplicates()
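drop_duplicates() can also look only at a subset of columns when deciding what counts as a duplicate, and the keep argument controls which occurrence is retained. A sketch:
>>> df2.drop_duplicates(subset = 'Names', keep = 'last')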
We can also delete a case, this time using drop:
>>> df2.drop(2)
   Height   Names   Pref_food Sex  Weight
0     180   Simon       steak   m      85
1     165    Kate       pizza   f      65
3     164   Laura       pizza   f      45
4     163    Mary  vegetables   f      43
5     175  Julian       steak   m      72
6     166   Rosie     seafood   f      46
7     180   Simon       steak   m      85
8     164   Laura       pizza   f      45
# in this case, we delete the third case
The stack() and unstack() functions allow us to reorganize our data in a different way:
>>> df3.stack()
0  Height              180
   Names             Simon
   Pref_food         steak
   Sex                   m
   Weight               85
1  Height              165
   Names              Kate
   Pref_food         pizza
   Sex                   f
   Weight               65
2  Height              170
   Names           Francis
   Pref_food         pasta
   Sex                   m
   Weight               68
3  Height              164
   Names             Laura
   Pref_food         pizza
   Sex                   f
   Weight               45
4  Height              163
   Names              Mary
   Pref_food    vegetables
   Sex                   f
   Weight               43
[...]
# we save the result in an object
>>> stacked = df3.stack()
# and now we can return it to its original format using .unstack()
>>> unstacked = stacked.unstack()
>>> print(unstacked)
  Height    Names   Pref_food Sex Weight
0    180    Simon       steak   m     85
1    165     Kate       pizza   f     65
2    170  Francis       pasta   m     68
3    164    Laura       pizza   f     45
4    163     Mary  vegetables   f     43
5    175   Julian       steak   m     72
6    166    Rosie     seafood   f     46
We can also reorganize our data via the melt() function:
>>> pd.melt(df3)
     variable       value
0      Height         180
1      Height         165
2      Height         170
3      Height         164
4      Height         163
5      Height         175
6      Height         166
7       Names       Simon
8       Names        Kate
9       Names     Francis
10      Names       Laura
11      Names        Mary
12      Names      Julian
13      Names       Rosie
[...]
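melt() becomes more useful when we keep one or more identifier columns fixed with id_vars and restrict the melted columns with value_vars. A sketch:
>>> pd.melt(df3, id_vars = 'Names', value_vars = ['Height', 'Weight'])
# 14 rows: one per name for Height, followed by one per name for Weight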
Let’s go back to the initial dataset
>>> df2.head(2)
   Height  Names Pref_food Sex  Weight
0     180  Simon     steak   m      85
1     165   Kate     pizza   f      65
and use the .T attribute to transpose rows and columns:
# we invert rows and columns
>>> df2.T

(output not shown)

We can also extract a random sample of our data using the .sample() function:
>>> df2.sample(n=2)
   Height  Names   Pref_food Sex  Weight
6     166  Rosie     seafood   f      46
4     163   Mary  vegetables   f      43

With n=2 we specified the number of cases to extract. How can we make a random extraction repeatable, so that it always returns the same cases? We set a seed with the np.random.seed() function. The specific number passed to this function does not really matter; what matters is that anyone using the same dataset and the same seed extracts the same cases.

To do this, we need to import the NumPy package:
>>> import numpy as np
>>> np.random.seed(1)
>>> df2.sample(n=2)
   Height  Names Pref_food Sex  Weight
3     164  Laura     pizza   f      45
7     180  Simon     steak   m      85
>>> np.random.seed(1)
>>> df2.sample(n=2)
   Height  Names Pref_food Sex  Weight
3     164  Laura     pizza   f      45
7     180  Simon     steak   m      85
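Alternatively, .sample() itself accepts a random_state argument, which makes a single extraction reproducible without touching NumPy's global seed:
>>> df2.sample(n = 2, random_state = 1)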
Instead of extracting a number of cases, we can extract a fraction, with the argument frac= instead of n=:
>>> df2.sample(frac = .1)
>>> df2.sample(frac = .5)
Using the argument frac=, we specify the fraction to sample: 0.1 is 10%, 0.5 is 50%, and so on.
# we can view cases with the highest values for a certain variable by specifying the number of cases (in this case, 3) and the column (in this case, Weight)
# let's consider the df3 dataset, which does not feature duplicates
>>> df3.nlargest(3, "Weight")
   Height    Names Pref_food Sex  Weight
0     180    Simon     steak   m      85
5     175   Julian     steak   m      72
2     170  Francis     pasta   m      68
Let’s now call the cases with the lowest values for a certain variable, and specify the number of cases:
>>> df3.nsmallest(4, "Weight")
   Height  Names   Pref_food Sex  Weight
4     163   Mary  vegetables   f      43
3     164  Laura       pizza   f      45
6     166  Rosie     seafood   f      46
1     165   Kate       pizza   f      65
Last, we can create a small dataset and reorganize data with the pivot_table() function:
>>> df4 = pd.DataFrame({'Date': ['2017-01-01', '2017-01-01', '2017-01-02', '2017-01-01', '2017-01-02', '2017-01-02', '2017-01-03', '2017-01-02', '2017-01-03', '2017-01-03'],
                   'Type':['x', 'x', 'y', 'x', 'y', 'x', 'z', 'y', 'z', 'y'],
                   'Value':[185, 265, 168, 245, 143, 172, 346, 285, 145, 128],
                   })
>>> print(df4)
         Date Type  Value
0  2017-01-01    x    185
1  2017-01-01    x    265
2  2017-01-02    y    168
3  2017-01-01    x    245
4  2017-01-02    y    143
5  2017-01-02    x    172
6  2017-01-03    z    346
7  2017-01-02    y    285
8  2017-01-03    z    145
9  2017-01-03    y    128
>>> print(pd.pivot_table(df4, index = "Date", values = "Value", columns = "Type"))
Type                 x           y      z
Date
2017-01-01  231.666667         NaN    NaN
2017-01-02  172.000000  198.666667    NaN
2017-01-03         NaN  128.000000  245.5
# we aggregate the values by date
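By default, pivot_table() aggregates with the mean; the aggfunc argument lets us choose a different function. A sketch using the sum:
>>> pd.pivot_table(df4, index = "Date", values = "Value", columns = "Type", aggfunc = "sum")
Type            x      y      z
Date
2017-01-01  695.0    NaN    NaN
2017-01-02  172.0  596.0    NaN
2017-01-03    NaN  128.0  491.0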

pandas: Missing Values

Now let’s examine how we can manage datasets with missing data. After importing pandas, we load a dataset that has missing values:
>>> import pandas as pd
>>> df_missing = pd.read_csv('df_missing.csv')
# I created this tiny dataset with missing data, as you can see
>>> df_missing
    A     B     C      D   E
0  15  96.0  74.0   31.0  50
1  41  27.0  74.0  279.0  57
2  21  32.0   NaN   99.0  96
3  48  97.0  50.0    NaN  69
4  63  98.0  74.0   44.0  55
5  43  11.0  80.0   74.0  33
6  39  38.0  81.0   20.0  41
7  58  31.0   NaN   76.0  91
8  85  94.0  37.0   65.0  60
9  98   NaN  19.0   43.0  32
We can verify the number of cases and variables:
>>> print(df_missing.shape)
(10, 5)
We can check the complete cases:
>>> print(df_missing.dropna().shape)
(6, 5)
We can also use .isnull(), which answers the question: Is a value missing?
>>> pd.isnull(df_missing)
       A      B      C      D      E
0  False  False  False  False  False
1  False  False  False  False  False
2  False  False   True  False  False
3  False  False  False   True  False
4  False  False  False  False  False
5  False  False  False  False  False
6  False  False  False  False  False
7  False  False   True  False  False
8  False  False  False  False  False
9  False   True  False  False  False
The converse, .notnull(), answers the question: Is a value not missing?
>>> pd.notnull(df_missing)
      A      B      C      D     E
0  True   True   True   True  True
1  True   True   True   True  True
2  True   True  False   True  True
3  True   True   True  False  True
4  True   True   True   True  True
5  True   True   True   True  True
6  True   True   True   True  True
7  True   True  False   True  True
8  True   True   True   True  True
9  True  False   True   True  True
We can count the missing values per column:
>>> df_missing.isnull().sum()
A    0
B    1
C    2
D    1
E    0
# or determine the total of missing data
>>> df_missing.isnull().sum().sum()
4
Missing values can be treated by deletion (removing the cases that contain them) or by imputation (replacing missing values with other values, such as a fixed value or an average). To perform a deletion, proceed as follows:
>>> df_missing.dropna()
    A     B     C      D   E
0  15  96.0  74.0   31.0  50
1  41  27.0  74.0  279.0  57
4  63  98.0  74.0   44.0  55
5  43  11.0  80.0   74.0  33
6  39  38.0  81.0   20.0  41
8  85  94.0  37.0   65.0  60
We eliminated the lines containing missing values. We can also delete the columns with missing values:
>>> df_missing.dropna(axis = 1, how = 'any')
    A   E
0  15  50
1  41  57
2  21  96
3  48  69
4  63  55
5  43  33
6  39  41
7  58  91
8  85  60
9  98  32
As mentioned, using imputation methods, we can replace a missing value with another value. Let’s replace all missing values with a fixed value—in this case, zero:
>>> df_missing.fillna(0)
    A     B     C      D   E
0  15  96.0  74.0   31.0  50
1  41  27.0  74.0  279.0  57
2  21  32.0   0.0   99.0  96
3  48  97.0  50.0    0.0  69
4  63  98.0  74.0   44.0  55
5  43  11.0  80.0   74.0  33
6  39  38.0  81.0   20.0  41
7  58  31.0   0.0   76.0  91
8  85  94.0  37.0   65.0  60
9  98   0.0  19.0   43.0  32
We can also replace missing values with a word, such as “missing”:
>>> df_missing.fillna('missing')
    A        B        C        D   E
0  15       96       74       31  50
1  41       27       74      279  57
2  21       32  missing       99  96
3  48       97       50  missing  69
4  63       98       74       44  55
5  43       11       80       74  33
6  39       38       81       20  41
7  58       31  missing       76  91
8  85       94       37       65  60
9  98  missing       19       43  32
Or impute missing values to the average:
>>> df_missing.fillna(df_missing.mean())
    A          B       C           D   E
0  15  96.000000  74.000   31.000000  50
1  41  27.000000  74.000  279.000000  57
2  21  32.000000  61.125   99.000000  96
3  48  97.000000  50.000   81.222222  69
4  63  98.000000  74.000   44.000000  55
5  43  11.000000  80.000   74.000000  33
6  39  38.000000  81.000   20.000000  41
7  58  31.000000  61.125   76.000000  91
8  85  94.000000  37.000   65.000000  60
9  98  58.222222  19.000   43.000000  32
Let’s consider a variable, such as variable ‘C’. The following line of code will replace every missing value in the variable C with the mean of C:
>>> df_missing['C'].fillna(df_missing['C'].mean())
0    74.000
1    74.000
2    61.125
3    50.000
4    74.000
5    80.000
6    81.000
7    61.125
8    37.000
9    19.000
Name: C, dtype: float64
Two important options for the method argument of .fillna() can be used to impute missing values: ffill and backfill:
>>> df_missing['C'].fillna(method = 'ffill')
0    74.0
1    74.0
2    74.0
3    50.0
4    74.0
5    80.0
6    81.0
7    81.0
8    37.0
9    19.0
Name: C, dtype: float64
>>> df_missing['C'].fillna(method = 'backfill')
0    74.0
1    74.0
2    50.0
3    50.0
4    74.0
5    80.0
6    81.0
7    37.0
8    37.0
9    19.0
Name: C, dtype: float64
# ffill replaces a missing value with the nonmissing value that precedes it, whereas backfill replaces it with the nonmissing value that follows it. To save the result of one of these replacement operations, we must create a new data frame or overwrite the initial one.
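In recent pandas releases, the method= argument of .fillna() has been deprecated in favor of the dedicated .ffill() and .bfill() methods, which behave in the same way:
>>> df_missing['C'].ffill()
>>> df_missing['C'].bfill()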

pandas: Merging Two Datasets

There are many ways to combine two data frames. We can delete overlapping cases, juxtapose two files, and join new cases or new variables. Let’s look at some pandas examples. First, we need to create sample data frames:
>>> df1 = pd.DataFrame({'id': ['A', 'B', 'C', 'D'], 'var1' : [37, 36, 43, 23], 'var2': [120, 117, 230, 315]})
>>> df1
  id  var1  var2
0  A    37   120
1  B    36   117
2  C    43   230
3  D    23   315
>>> df2 = pd.DataFrame({'id': ['A', 'B', 'C', 'D'], 'var3' : [12, 16, 13, 18], 'var4': [75, 54, 21, 36]})
>>> df2
  id  var3  var4
0  A    12    75
1  B    16    54
2  C    13    21
3  D    18    36
>>> df3 = pd.DataFrame({'id': ['A', 'A', 'B', 'B'], 'var3' : [2, 6, 3, 8], 'var4': [7, 5, 2, 6]})
>>> df3
  id  var3  var4
0  A     2     7
1  A     6     5
2  B     3     2
3  B     8     6
The easiest way to merge two datasets is to use the .append() method:
>>> df1.append(df2)
  id  var1   var2  var3  var4
0  A  37.0  120.0   NaN   NaN
1  B  36.0  117.0   NaN   NaN
2  C  43.0  230.0   NaN   NaN
3  D  23.0  315.0   NaN   NaN
0  A   NaN    NaN  12.0  75.0
1  B   NaN    NaN  16.0  54.0
2  C   NaN    NaN  13.0  21.0
3  D   NaN    NaN  18.0  36.0
>>> df3.append(df1)
  id  var1   var2  var3  var4
0  A   NaN    NaN   2.0   7.0
1  A   NaN    NaN   6.0   5.0
2  B   NaN    NaN   3.0   2.0
3  B   NaN    NaN   8.0   6.0
0  A  37.0  120.0   NaN   NaN
1  B  36.0  117.0   NaN   NaN
2  C  43.0  230.0   NaN   NaN
3  D  23.0  315.0   NaN   NaN
As you can see, the two datasets are simply stacked each time, without pandas adjusting the index. (Note that in pandas 2.0 and later, the .append() method has been removed; pd.concat() is its replacement.) To combine more than two datasets, we can use pd.concat():
>>> pd.concat([df1, df2, df3])
  id  var1   var2  var3  var4
0  A  37.0  120.0   NaN   NaN
1  B  36.0  117.0   NaN   NaN
2  C  43.0  230.0   NaN   NaN
3  D  23.0  315.0   NaN   NaN
0  A   NaN    NaN  12.0  75.0
1  B   NaN    NaN  16.0  54.0
2  C   NaN    NaN  13.0  21.0
3  D   NaN    NaN  18.0  36.0
0  A   NaN    NaN   2.0   7.0
1  A   NaN    NaN   6.0   5.0
2  B   NaN    NaN   3.0   2.0
3  B   NaN    NaN   8.0   6.0
In this way, we concatenated the datasets vertically, one below the other. We can also concatenate them horizontally, side by side, by specifying the axis as 1 (in the previous example, the axis was not specified, so it defaulted to zero):
>>> pd.concat([df1, df2, df3], axis = 1)
# the code above and this line below will give us the same result
>>> pd.concat([df1, df2, df3], axis = 'columns')
  id  var1  var2 id  var3  var4 id  var3  var4
0  A    37   120  A    12    75  A     2     7
1  B    36   117  B    16    54  A     6     5
2  C    43   230  C    13    21  B     3     2
3  D    23   315  D    18    36  B     8     6
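If we do not want the original indexes repeated in the result, pd.concat() also accepts the ignore_index=True argument, which builds a fresh 0 to n-1 index for the row-wise concatenation shown earlier:
>>> pd.concat([df1, df2, df3], ignore_index = True)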
We can combine two datasets in an even more advanced way, starting from two tables that have some rows and columns in common, and merging them through shared data. The pandas function that allows us to make this type of join is merge(). First, we need to create two datasets:
>>> stud1 = pd.DataFrame({'ID': [1,2,3,4,5], 'names' : ['John', 'Greta', 'Francis', 'Laura', 'Charles'], 'Logic': [28, 27, 23, 25, 30]})
>>> stud2 = pd.DataFrame({'ID': [1,6,3,4,7], 'names' : ['John', 'Kate', 'Francis', 'Laura', 'Simon'], 'Analysis': [23, 24, 28, 29, 22]})
# by default, the merge() function performs an 'inner' join: it keeps only the records that have matching key values in both datasets
>>> pd.merge(stud1, stud2, on = 'ID')
   ID  Logic  names_x  Analysis  names_y
0   1     28     John        23     John
1   3     23  Francis        28  Francis
2   4     25    Laura        29    Laura
There are other ways of merging two datasets: ‘left’, ‘right’, and ‘outer’. In each case, we use the ‘ID’ column as the key on which the datasets are merged.
>>> print(pd.merge(stud1, stud2, on = 'ID', how = 'left'))
   ID  Logic  names_x  Analysis  names_y
0   1     28     John      23.0     John
1   2     27    Greta       NaN      NaN
2   3     23  Francis      28.0  Francis
3   4     25    Laura      29.0    Laura
4   5     30  Charles       NaN      NaN
The ‘left’ option keeps all the cases of the first dataset and adds the columns of the second dataset only where the key matches.
>>> print(pd.merge(stud1, stud2, on = 'ID', how = 'right'))
   ID  Logic  names_x  Analysis  names_y
0   1   28.0     John        23     John
1   3   23.0  Francis        28  Francis
2   4   25.0    Laura        29    Laura
3   6    NaN      NaN        24     Kate
4   7    NaN      NaN        22    Simon
The ‘right’ option keeps all the cases of the second dataset and adds the columns of the first dataset only where the key matches.
>>> print(pd.merge(stud1, stud2, on = 'ID', how = 'outer'))
   ID  Logic  names_x  Analysis  names_y
0   1   28.0     John      23.0     John
1   2   27.0    Greta       NaN      NaN
2   3   23.0  Francis      28.0  Francis
3   4   25.0    Laura      29.0    Laura
4   5   30.0  Charles       NaN      NaN
5   6    NaN      NaN      24.0     Kate
6   7    NaN      NaN      22.0    Simon

When we use ‘outer’, we keep the cases of both data frames, holding a single copy of the cases that appear in both, without losing data from either of the initial datasets.

pandas: Basic Statistics

The pandas library has a lot of methods for statistics, which we can use to get descriptive statistical information. As always, we use our data frame:
>>> df = pd.DataFrame({'Names': ['Simon', 'Kate', 'Francis', 'Laura', 'Mary', 'Julian', 'Rosie'],
        'Height':[180, 165, 170, 164, 163, 175, 166],
        'Weight':[85, 65, 68, 45, 43, 72, 46],
        'Pref_food':['steak', 'pizza', 'pasta', 'pizza', 'vegetables','steak','seafood'],
        'Sex':['m', 'f','m','f', 'f', 'm', 'f']})
# some information on numeric data
>>> df.describe()
           Height     Weight
count    7.000000   7.000000
mean   169.000000  60.571429
std      6.377042  16.154021
min    163.000000  43.000000
25%    164.500000  45.500000
50%    166.000000  65.000000
75%    172.500000  70.000000
max    180.000000  85.000000
# we can also rearrange the statistics with the .transpose method to make it more readable
>>> df.describe().transpose()
        count        mean        std    min    25%    50%    75%    max
Height    7.0  169.000000   6.377042  163.0  164.5  166.0  172.5  180.0
Weight    7.0   60.571429  16.154021   43.0   45.5   65.0   70.0   85.0
# value counts
>>> df.count()
Height       7
Names        7
Pref_food    7
Sex          7
Weight       7
dtype: int64
# median calculation
>>> df['Height'].median()
166.0
# average calculation
>>> df['Height'].mean()
169.0
# minimum value
>>> df['Height'].min()
163
# maximum value
>>> df['Height'].max()
180
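Other statistical methods follow the same pattern; for example:
# standard deviation and variance of a single column
>>> df['Height'].std()
>>> df['Height'].var()
# pairwise correlation between the numeric columns
>>> df[['Height', 'Weight']].corr()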
Table 8-1 provides a summary of statistical methods.
Table 8-1 Statistical Methods

Method      Description
.describe   Provides some descriptive statistics
.count      Returns the number of values per variable
.mean       Returns the average
.median     Returns the median
.mode       Returns the mode
.min        Returns the lowest value
.max        Returns the highest value
.std        Returns the standard deviation
.var        Returns the variance
.skew       Returns skewness
.kurt       Returns kurtosis

Summary

In this chapter, we learned some easier ways to import and manage our data using pandas—one of the most important libraries for data manipulation and data science. In Chapter 9, we examine another package that is important for data manipulation: NumPy.
