
8. pandas


In Chapter 7, we learned how to import a generic file using basic functions. Here we explore pandas—one of the most important libraries for dataset management.

Libraries for Data Mining

From this point onward, in this chapter and the following ones, we examine the most important and most widely used data mining libraries (a sketch of their conventional import aliases follows this list):
  • pandas: imports, manages, and manipulates data frames in various formats, extracts part of the data, combines two data frames, and also contains some basic statistical functions

  • NumPy: a package for scientific computing that provides high-level mathematical and algebraic functions, random number generation, and the array data structure

  • Matplotlib: a library for creating charts from datasets

  • SciPy: contains more than 60 functions for mathematical and statistical analysis

  • scikit-learn: the most important tool for machine learning and data analysis.
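As a quick orientation, here is a sketch of the conventional import aliases used for these libraries in the chapters that follow (exactly which submodules you import depends on the task at hand):
>>> import pandas as pd
>>> import numpy as np
>>> import matplotlib.pyplot as plt
>>> from scipy import stats
>>> from sklearn import datasets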

pandas

In this chapter, we move away from discussions of Python structures and start to look at the most important data mining packages, beginning with the pandas library.

pandas is an open-source Python library that contains various tools for importing, managing, and manipulating data. It has a number of high-level features for manipulating, reorganizing, and scanning structured data, including slicing, managing missing values, restructuring data, extracting parts of a dataset, and importing and parsing data from the Web. pandas is one of the most important data mining libraries. We can
  • Read and import structured data

  • Organize and manipulate the data

  • Calculate some basic statistics

We can import the whole library:
>>> import pandas as pd
or import only the primary pandas structures, Series and DataFrame:
>>> from pandas import Series, DataFrame

As we shall see, pandas, NumPy, SciPy, and Matplotlib typically work together. For clarity, I present them separately at first.

pandas: Series

As a first action, we always download and install the package or packages we need—in this case, pandas. pandas is part of the Anaconda suite, so you do not have to install it if you have Anaconda installed. Call it this way:
>>> import pandas as pd
As you can see, we import the package by creating a shortcut: pd. To call a pandas function at this point, we simply write
pd.function_name()
We can also import specific structures, such as Series and DataFrames:
>>> from pandas import Series, DataFrame

In this case, when we call up or create one of these two structures, we need not specify “pd” at the beginning.

The first pandas structure is Series—a one-dimensional array characterized by an index. Let’s create our first series:
>>> series1 = Series([25, 27, 28, 30],
          index = ["student1", "student2", "student3", "student4"])
>>> print(series1)
student1    25
student2    27
student3    28
student4    30
dtype: int64
# we then use the .describe() function to get some statistical information about the series
>>> series1.describe()
count     4.000000
mean     27.500000
std       2.081666
min      25.000000
25%      26.500000
50%      27.500000
75%      28.500000
max      30.000000
dtype: float64
Let’s look at the first element of the series:
>>> series1[0]

25

We can also extract multiple elements:
>>> series1[[2,3]]
student3    28
student4    30
dtype: int64
Or elements from the index:
>>> series1["student1"]
25
>>> series1[["student1", "student4"]]
student1    25
student4    30
dtype: int64
The .index attribute returns the index of the series:
>>> series1.index
Index(['student1', 'student2', 'student3', 'student4'], dtype="object")
We can also verify the presence of an element in a series:
>>> 'student2' in series1
True
If we did not import Series and DataFrame separately, we can create a series by specifying the pandas shortcut:
>>> series2 = pd.Series([40, 20, 35, 70],
          index = ["price1", "price2", "price3", "price4"])
# if we did not import pandas as pd, we can specify the full package name
>>> pandas.Series([40, 20, 35, 70],
          index = ["price1", "price2", "price3", "price4"])
The pandas series are marked by indexes. If we do not specify the index, it is created automatically:
>>> series3 = Series([25, 27, 28, 30])
series3
0    25
1    27
2    28
3    30
dtype: int64
To replace some of the elements in a series, we use the .replace() method (which returns a new series rather than modifying series3 in place).
>>> series3.replace([25, 27], [125, 127])
0    125
1    127
2     28
3     30
dtype: int64
Let’s create another series: one with random numbers. To do this, we need to load the NumPy library:
>>> import numpy as np
# we create a series of random numbers
>>> series4 = pd.Series(np.random.randn(7))
# we print it and examine the index
series4
0   -0.227393
1   -1.079208
2    0.101591
3    0.157502
4    1.541307
5   -0.182501
6   -0.247327
dtype: float64
# we can replace the index by assigning to the .index attribute
>>> series4.index = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
# and check the index again
series4
a   -0.227393
b   -1.079208
c    0.101591
d    0.157502
e    1.541307
f   -0.182501
g   -0.247327
dtype: float64
# we create another random series
>>> series5 = pd.Series(np.random.randn(7))
# and then concatenate the two series using the pd.concat() function
>>> s45 = pd.concat([series4, series5])
# we print the created series
>>> s45
a   -0.227393
b   -1.079208
c    0.101591
d    0.157502
e    1.541307
f   -0.182501
g   -0.247327
0   -0.610507
1    0.282318
2   -1.142692
3   -1.081449
4   -1.818420
5   -1.133354
6    0.804213
dtype: float64
# we check one of the created structures
>>> type(s45)
pandas.core.series.Series
# we remove some elements of the series
>>> del s45[5]
>>> s45
a   -0.227393
b   -1.079208
c    0.101591
d    0.157502
e    1.541307
f   -0.182501
g   -0.247327
0   -0.610507
1    0.282318
2   -1.142692
3   -1.081449
4   -1.818420
6    0.804213
dtype: float64
>>> del s45['f']
>>> s45
b   -1.079208
c    0.101591
e    1.541307
g   -0.247327
0   -0.610507
1    0.282318
2   -1.142692
3   -1.081449
4   -1.818420
6    0.804213
dtype: float64
# we slice some elements from the series
>>> s45[0]
-0.61050680805211532
>>> s45[2:4]
e    1.541307
g   -0.247327
dtype: float64
# we can slice the first four elements
>>> s45[:4]
# or take everything from the sixth element (position 5) onward
>>> s45[5:]
# or we can extract the last three items
>>> s45[-3:]
# we can extract the first and last elements using the .head() and .tail() methods
>>> s45.head(2)
>>> s45.tail(3)
# we can reverse the series
>>> s45[::-1]
# last, we can create a copy of the series with the .copy() method
>>> copys45 = s45.copy()
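A minimal sketch of why the copy matters: changes to the copy do not affect the original series (the values shown come from the random series generated above, so yours will differ):
>>> copys45['b'] = 0
>>> copys45['b']
0.0
>>> s45['b']    # the original value is unchanged
-1.079208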

pandas: Data Frames

The most important structure in pandas is the DataFrame, a structure which extends the capabilities of the Series and allows us to manage our datasets. Let’s create a small data frame with pandas:
# we import pandas
>>> import pandas as pd
# we use the DataFrame functions to create or import a dataset or file from the computer
# to get help with a particular function, we type
>>> help(pd.DataFrame)
Help on class DataFrame in module pandas.core.frame:
class DataFrame(pandas.core.generic.NDFrame)
  |  Two-dimensional size-mutable, potentially heterogeneous
  |  tabular data structure with labeled axes (rows and columns).
  |  Arithmetic operations align on both row and column labels.
  |  Can be thought of as a dict-like container for Series objects.
  |  The primary pandas data structure
  |  
  |  Parameters
  |  ----------
  |  data : numpy ndarray (structured or homogeneous), dict, or
  |  DataFrame
  |      Dict can contain Series, arrays, constants, or list-like
  |      objects
  |  index : Index or array-like
  |      Index to use for resulting frame. Will default to
  |      np.arange(n) if no indexing information part of input
  |      data and no index provided
  |  columns : Index or array-like
  |      Column labels to use for resulting frame. Will default to
  |      np.arange(n) if no column labels are provided
  |  dtype : dtype, default None
# we create our first dataset
>>> df1 = pd.DataFrame({'Names': ['Simon', 'Kate', 'Francis', 'Laura', 'Mary', 'Julian', 'Rosie'],
        'Height':[180, 165, 170, 164, 163, 175, 166],
        'Weight':[85, 65, 68, 45, 43, 72, 46],
        'Pref_food':['steak', 'pizza', 'pasta', 'pizza', 'vegetables','steak','seafood'],
        'Sex':['m', 'f','m','f', 'f', 'm', 'f']})
# caution: avoid column names that contain spaces; a name such as 'Var 1' works as a dictionary key, but it cannot be used with dot notation (df1.Var 1 is a syntax error), so prefer names such as 'Var_1'
# we print the data frame
>>> df1
   Height    Names   Pref_food Sex  Weight
0     180    Simon       steak   m      85
1     165     Kate       pizza   f      65
2     170  Francis       pasta   m      68
3     164    Laura       pizza   f      45
4     163     Mary  vegetables   f      43
5     175   Julian       steak   m      72
6     166    Rosie     seafood   f      46
# it may be useful to get an overview of our data, which we can do using the .describe() method
>>> df1.describe()
           Height     Weight
count    7.000000   7.000000
mean   169.000000  60.571429
std      6.377042  16.154021
min    163.000000  43.000000
25%    164.500000  45.500000
50%    166.000000  65.000000
75%    172.500000  70.000000
max    180.000000  85.000000
# we can recall variable names
>>> df1.columns
Index(['Height', 'Names', 'Pref_food', 'Sex', 'Weight'],
dtype='object')
At this point, we can continue by setting a variable—in this case, ‘Names’—as index in this way:
>>> df1.set_index('Names')

(output not shown)

# to make the new index permanent on df1, we use the argument inplace = True
>>> df1.set_index('Names', inplace = True)
# we can reset the index as follows:
>>> df1.reset_index()
We can continue to explore our data.
# we check the number of cases and variables
>>> print(df1.shape)
(7, 4)
# and determine variable type
>>> print(df1.dtypes)
Height        int64
Pref_food    object
Sex          object
Weight        int64
dtype: object
Here, too, we can see the first and the last elements using the .head() and .tail() methods. To specify the number of cases to display, we pass it inside the parentheses:
>>> df1.head()
>>> df1.tail(3)
We can also get information about variables using the .info() method:
>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
Height       7 non-null int64
Names        7 non-null object
Pref_food    7 non-null object
Sex          7 non-null object
Weight       7 non-null int64
dtypes: int64(2), object(3)
memory usage: 360.0+ bytes
We can then rearrange the data according to one of the variables:
>>> df1.sort_values(by = 'Weight')

(output not shown)

# we rearranged our data based on weight, from the lowest to the highest value. To reverse these values, from highest to lowest, we specify the ascending argument as False
>>> df1.sort_values(by = 'Weight', ascending = False)

(output not shown)

# if there are any missing values in our dataset (in particular, in the column used for sorting), we can put them all at the beginning or the end of the display in the following ways
>>> df1.sort_values(by = 'Weight', na_position = "last")
>>> df1.sort_values(by = 'Weight', na_position = "first")
# to rearrange the dataset by index, we can use sort_index, instead of sort_values, to sort the data in an ascending or descending way
>>> df1.sort_index()
>>> df1.sort_index(ascending= False)

(output not shown)
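sort_values() also accepts a list of columns, sorting by the first and breaking ties with the next. A sketch, sorting by sex and then by descending weight within each sex:
>>> df1.sort_values(by = ['Sex', 'Weight'], ascending = [True, False])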

Let’s look at some examples of slicing items from a data frame:
# Python starts counting from 0, so let's slice the first three cases (rows 0 to 2):
>>> df1[:3]
      Names    Height Pref_food Sex  Weight
0     Simon      180      steak   m      85
1      Kate      165      pizza   f      65
2   Francis      170      pasta   m      68
# or slice from the fifth case (position 4) to the end
>>> df1[4:]
     Names  Height   Pref_food Sex  Weight
4     Mary     163  vegetables   f      43
5   Julian     175       steak   m      72
6    Rosie     166     seafood   f      46
# or slice one of the columns
>>> df1['Names']
0      Simon
1       Kate
2    Francis
3      Laura
4       Mary
5     Julian
6      Rosie
Name: Names, dtype: object
# by using single square brackets, as we just did, we extract a variable as a Series
# here's another way to do this
>>> df1.Names
0      Simon
1       Kate
2    Francis
3      Laura
4       Mary
5     Julian
6      Rosie
Name: Names, dtype: object
# a third way is to use double square brackets, which returns a data frame
>>> df1[['Names']]

(output not shown)

# we can also select more columns
>>> df1[['Names', 'Sex']]
We can select the value of a variable for a particular element—for example, the third value:
>>> df1['Sex'][2]
'm'
# or some values of some variables
>>> df1['Names'][1:4]
1       Kate
2    Francis
3      Laura
Name: Names, dtype: object
To select items from a data frame we can also use the .loc, .iloc, and .ix methods:
  • .loc: works through the index

  • .iloc: extracts via position

  • .ix: takes both labels and positions into account (deprecated in recent pandas versions; prefer .loc or .iloc)

# .loc
>>> df1.loc[1]
Names         Kate
Height         165
Pref_food    pizza
Sex              f
Weight          65
Name: 1, dtype: object
# we can extract the elements from the first to the fourth
>>> df1.loc[:3]
# .iloc
# we can extract the element found in the first case and in the third variable
>>> df1.iloc[0,2]
'steak'
# .iat is a faster alternative to .iloc for accessing a single value by position
>>> df1.iat[1,3]
'f'
# the .at method instead accesses a single value by label (here, the integer index label and the column name)
>>> df1.at[2,'Sex']
'm'
#  to the left of the comma, we indicate the cases to be extracted; to the right, the columns or variables (table not shown). To extract all elements, we use a colon
# what if we wanted to extract every other case?
>>> df1[::2]

(output not shown)
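With the default integer index used here, .loc[1] and .iloc[1] return the same row, which hides the difference between the two methods. A sketch with a non-default index makes it visible (df_idx is just a temporary copy introduced for this example):
>>> df_idx = df1.set_index('Names')
>>> df_idx.loc['Kate']    # label-based: looks up the index label 'Kate'
>>> df_idx.iloc[1]        # position-based: the second row, which is also Kate here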

Through Boolean operators we can specify extracting conditions. For instance, we can extract the cases of the dataset in which the height of the subjects (Height) is greater than 170:
>>> df1[df1.Height > 170]
     Names  Height  Pref_food  Sex  Weight
0    Simon  180         steak    m      85
5   Julian  175         steak    m      72
We can also specify multiple conditions, such as extracting all females taller than 163:
>>> df1[(df1.Height > 163) & (df1.Sex == 'f')]
Alternatively, we can also use the .query() method:
>>> df1.query("Sex != 'f'")
      Names  Height  Pref_food Sex  Weight
0     Simon     180      steak   m      85
2   Francis     170      pasta   m      68
5    Julian     175      steak   m      72
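The .query() method can also combine several conditions in a single expression, and the @ prefix lets the expression reference Python variables. A sketch:
>>> df1.query("Sex == 'f' and Height > 163")
>>> min_height = 163
>>> df1.query("Sex == 'f' and Height > @min_height")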
Now let’s look at how to rename a column:
>>> df1 = df1.rename(columns = {'Pref_food': 'Food'})
>>> df1.head(2)
    Names  Height   Food  Sex  Weight
0   Simon  180      steak   m      85
1   Kate   165      pizza   f      65
We can create a copy of our data frame:
>>> df2 = df1.copy()
>>> df2

(output not shown)

We can also add a column. For example, let's add a column to the df2 dataset that contains height expressed in meters:
>>> df2['new_col'] = df2.Height/100
>>> df2
     Names  Height        Food Sex  Weight  new_col
0    Simon     180       steak   m      85     1.80
1     Kate     165       pizza   f      65     1.65
2  Francis     170       pasta   m      68     1.70
3    Laura     164       pizza   f      45     1.64
4     Mary     163  vegetables   f      43     1.63
5   Julian     175       steak   m      72     1.75
6    Rosie     166     seafood   f      46     1.66
We can also add a column using the .insert() method. The first argument, 5, is the position where the new column is inserted (the sixth position); the second is the name of the new column; and the third provides its values:
>>> df2.insert(5, column = 'new_col2', value = df2.Height/100)
>>> print(df2.head(2))
    Names  Height   Food Sex  Weight  new_col2  new_col
0   Simon     180  steak   m      85      1.80     1.80
1    Kate     165  pizza   f      65      1.65     1.65
Let’s return to the df1 dataset without the additions (and with the original ‘Pref_food’ column name). By using .groupby(), we can aggregate the dataset around one or more variables.
>>> df1.groupby('Sex')
<pandas.core.groupby.DataFrameGroupBy object at 0x10c8b2be0>
We can create an object that contains aggregated groups according to the ‘Sex’ variable:
>>> grouped1 = df1.groupby('Sex')
# to view the groups, we use group_name.groups
>>> grouped1.groups
{'f': [1, 3, 4, 6], 'm': [0, 2, 5]}
We can also list the cases that belong to each group.
# to see the cases, proceed as follows
>>> for names, groups in grouped1:
...    print(names)
...    print(groups)

(output not shown)

# we can also get only the data belonging to a single group
>>> grouped1.get_group('f')
# we can aggregate cases according to two variables
>>> grouped2 = df1.groupby(['Sex', 'Pref_food'])
>>> grouped2.groups
{('f', 'pizza'): [1, 3],
 ('f', 'seafood'): [6],
 ('f', 'vegetables'): [4],
 ('m', 'pasta'): [2],
 ('m', 'steak'): [0, 5]}
# we can also determine how many cases fall into each of the groups
>>> grouped2.size()
Sex  Pref_food
f    pizza         2
     seafood       1
     vegetables    1
m    pasta         1
     steak         2
dtype: int64
# we can obtain some descriptive statistics about the data
>>> grouped2.describe()

(output not shown)
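A grouped object can also be aggregated directly. A sketch reusing the grouped1 object created earlier (the means follow from the heights in df1):
>>> grouped1['Height'].mean()
Sex
f    164.5
m    175.0
Name: Height, dtype: float64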

We can also count the frequency of a variable using the .value_counts() method.
>>> df1['Sex'].value_counts()
f    4
m    3
Name: Sex, dtype: int64
Returning to the df1 dataset, we can acquire information about the variables using the .info() method.
>>> print(df1)
>>> df1.info()
With the .astype() method, we can change the type of a variable. For example, we can convert a variable from int (numeric) to object (such as a string):
>>> df1['Weight'] = df1['Weight'].astype('object')
>>> df1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 5 columns):
Height       7 non-null int64
Names        7 non-null object
Pref_food    7 non-null object
Sex          7 non-null object
Weight       7 non-null object
dtypes: int64(1), object(4)
memory usage: 360.0+ bytes
We can delete cases and variables with the .drop() method:
>>> df1.drop(0)

(output not shown)

# if, on the other hand, we set the index permanently, we drop cases by their index label
>>> df1.set_index('Names', inplace = True)
>>> df1.drop("Laura")
         Height   Pref_food Sex  Weight
Names
Simon       180       steak   m      85
Kate        165       pizza   f      65
Francis     170       pasta   m      68
Mary        163  vegetables   f      43
Julian      175       steak   m      72
Rosie       166     seafood   f      46
# we can remove more than one element at a time
>>> print(df1.drop(["Mary", "Francis"]))
# or remove an entire column
# to do this, we must specify axis 1 (the default axis 0 indicates rows)
>>> df1.drop("Height", axis = 1)
         Pref_food Sex  Weight
Names
Simon        steak   m      85
Kate         pizza   f      65
Francis      pasta   m      68
Laura        pizza   f      45
Mary    vegetables   f      43
Julian       steak   m      72
Rosie      seafood   f      46
To remove multiple columns, include them in a list:
>>> df.drop(["column1", "column2"], axis = 1)
# instead of specifying the axis as 1, we can also specify "columns"
>>> df.drop(["column1", "column2"], axis = "columns")
The pandas package also allows us to create cross-tabs with the crosstab() function. For instance, in the df1 dataset, we can create a cross-tab that correlates gender with favorite food:
>>> pd.crosstab(df1.Sex, df1.Pref_food)
Pref_food  pasta  pizza  seafood  steak  vegetables
Sex
f              0      2        1      0           1
m              1      0        0      2           0
When we add the margins parameter, we also get the row and column totals:
>>> pd.crosstab(df1.Sex, df1.Pref_food, margins = True)
Pref_food  pasta  pizza  seafood  steak  vegetables  All
Sex
f              0      2        1      0           1    4
m              1      0        0      2           0    3
All            1      2        1      2           1    7
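The crosstab() function also accepts a normalize argument that turns the counts into proportions. A sketch normalizing within each row:
>>> pd.crosstab(df1.Sex, df1.Pref_food, normalize = 'index')
# each row now sums to 1: for instance, half of the females (0.50) prefer pizza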

pandas: Importing and Exporting Data

The pandas library is also very important with regard to importing files from your computer in various formats, including .csv:
import pandas as pd
>>> df2 = pd.read_csv("file_address.csv")
# we can import a file by specifying that the first line does NOT contain variable names
>>> df2 = pd.read_csv("file_address.csv", header = None)
# or by specifying which row contains variable names
>>> df2 = pd.read_csv("file_address.csv", header = 0)
>>> df2 = pd.read_csv("file_address.csv", header = 1)
# we can also specify column names
>>> df2 = pd.read_csv("file_address.csv", names = ['var1', 'var2, 'var3, 'var4])
# or we can import only a part of the columns
>>> df2 = pd.read_csv("file_address.csv", usecols = [1,2,3])
# we can also specify the element separating our data
>>> df2 = pd.read_csv("file_address.csv", sep = " ")
# we can now take a look at the first records of the data frame
>>> df2.head()
# or the last
>>> df2.tail()
# we pass the number of cases to be displayed inside the parentheses
>>> df2.head(9)
>>> df2.tail(5)
# and we can see a summary of the dataset
>>> df2.describe()
>>> df2.info()
>>> df2.dtypes # shows the type of each column

The pd.read_csv() function can also be used to import files from the Web. The UCI machine learning repository ( https://archive.ics.uci.edu/ml/index.php ) features several data sets used for machine learning.

Let’s look at the Iris dataset. Figures 8-1 through 8-3 show how to navigate to its page and copy the link to the dataset file.
Figure 8-1 Iris dataset on the UCI machine learning web site; click ‘Iris’ to get to the description page (image not shown)

Figure 8-2 Iris dataset on the UCI machine learning web site (image not shown)

Figure 8-3 Iris dataset on the UCI machine learning web site; click ‘Data Folder’ to access the page containing the dataset (image not shown)

First, we copy the link featuring the dataset.
# we import pandas
>>> import pandas as pd
# we create an object featuring the dataset, which we are importing with pd.read_csv()
>>> iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data")
# we check the import by displaying the first few rows with .head()
>>> iris.head(3)
   5.1  3.5  1.4  0.2  Iris-setosa
0  4.9  3.0  1.4  0.2  Iris-setosa
1  4.7  3.2  1.3  0.2  Iris-setosa
2  4.6  3.1  1.5  0.2  Iris-setosa
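Note that the iris.data file has no header row, so in the output above pandas used the first observation as the column names. A sketch that avoids this; the column names here are our own choice, not part of the file:
>>> cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
>>> iris = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
          header = None, names = cols)
>>> iris.head(2)
   sepal_length  sepal_width  petal_length  petal_width      species
0           5.1          3.5           1.4          0.2  Iris-setosa
1           4.9          3.0           1.4          0.2  Iris-setosa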
The structure of the pandas function we used, pd.read_csv(), also includes other arguments we did not use for this import:
pd.read_csv(filepath, sep = ',', header = 'infer', dtype = None, skiprows = None, index_col = None, skip_blank_lines = True, na_filter = True)
  • filepath: the address of the file (on the computer or external)

  • sep = ',': the separator dividing the data, such as a comma or semicolon

  • dtype=: a means of specifying the column formats

  • header=: the row that contains the variable names, if any

  • skiprows=: a means of importing only part of the cases; for example, skiprows = 50 skips the first 50 rows and reads data from the 51st case onward

  • index_col=: a means of setting a column as the data index

  • skip_blank_lines=: a means of skipping any blank lines in the dataset

  • na_filter=: detects missing values; if set to False, missing values are not detected (which can speed up the import of large files without missing data)

In addition, the usecols= argument can be used to import only a few columns in a dataset. For instance, let’s say we have the small dataset shown in Figure 8-4, from which we want to extract the following variables:
>>> df3 = pd.read_csv("~students.csv",usecols =["ID", "mark1"], index_col= ["ID"])
# we verify data
df3
    mark1
ID
1    17.0
2    24.0
3    17.0
4    27.0
5    30.0
6    30.0
7    23.0
8    17.0
9    21.0
10   24.0
11   24.0
12   25.0
13   24.0
14    NaN
15   22.0
16   30.0
17   29.0
18   29.0
19    NaN
# as expected, we loaded only 'ID' (used as the index) and the 'mark1' variable
Figure 8-4 Dataset students.csv (image not shown)

The mirror-image function is df.to_csv(“filename”), which allows us to write a .csv file into our working directory. We can specify the index=False argument to avoid writing the index together with the data.

As an alternative to this file import function, we can use the pd.read_table(filepath, sep = ',') function, which sets both the file path (filepath) and the separator.

Other pandas features allow us to read Excel .xls or .xlsx files. To do this, we use a generic formula:
df = pd.read_excel(filepath, "sheet_name")

In this case, we must specify not only the address of the file but also, if the Excel file contains more than one data sheet, the name of the sheet from which we want to read data. As with .csv, for Excel formats we also have a function that allows us to write an Excel file in the working directory of our computer: df.to_excel(). The pandas package also contains a function for reading files in JSON—pd.read_json()—and it allows us to read tables from the Web via the pd.read_html(url) function and to convert a data frame to an HTML table via the df.to_html() method.

Let’s look at an example of creating and exporting a file in JSON:
>>> jf = pd.DataFrame(np.arange(16).reshape(4,4), index = ['A', 'B', 'C', 'D'], columns = ['var1', 'var2', 'var3', 'var4'])
>>> jf
   var1  var2  var3  var4
A     0     1     2     3
B     4     5     6     7
C     8     9    10    11
D    12    13    14    15
>>> jf.to_json('jf.json')
# all that is left is to check that the file has been created correctly in our work directory
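The pd.read_html() function deserves a brief sketch: it returns a list of DataFrames, one for each HTML table found on the page, and it requires an HTML parser such as lxml to be installed (the URL below is only an example):
>>> tables = pd.read_html("https://en.wikipedia.org/wiki/Iris_flower_data_set")
>>> len(tables)        # number of tables found on the page
>>> tables[0].head()   # the first table as a data frame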

pandas: Data Manipulation

We have seen how to create groups and manipulate data frames. Now let’s look at more manipulation elements through pandas. First, let’s import pandas and create a small dataset:
>>> df2 = pd.DataFrame({'Names': ['Simon', 'Kate', 'Francis', 'Laura', 'Mary', 'Julian', 'Rosie', 'Simon', 'Laura'],
                   'Height':[180, 165, 170, 164, 163, 175, 166, 180, 164],
                   'Weight':[85, 65, 68, 45, 43, 72, 46, 85, 45],
                   'Pref_food': ['steak', 'pizza', 'pasta', 'pizza', 'vegetables', 'steak', 'seafood', 'steak', 'pizza'],
                   'Sex': ['m','f','m','f','f','m','f', 'm', 'f']})
Starting from the ‘Sex’ variable, let’s create two dummy variables with pandas:
>>> df_dummy = pd.get_dummies(df2['Sex'], prefix = 'Sex')
>>> df_dummy
   Sex_f  Sex_m
0    0.0    1.0
1    1.0    0.0
2    0.0    1.0
3    1.0    0.0
4    1.0    0.0
5    0.0    1.0
6    1.0    0.0
7    0.0    1.0
8    1.0    0.0
At this point, we can join these two dummy variables to the original dataset:
>>> df2.join(df_dummy)
   Height    Names   Pref_food Sex  Weight  Sex_f  Sex_m
0     180    Simon       steak   m      85    0.0    1.0
1     165     Kate       pizza   f      65    1.0    0.0
2     170  Francis       pasta   m      68    0.0    1.0
3     164    Laura       pizza   f      45    1.0    0.0
4     163     Mary  vegetables   f      43    1.0    0.0
5     175   Julian       steak   m      72    0.0    1.0
6     166    Rosie     seafood   f      46    1.0    0.0
7     180    Simon       steak   m      85    0.0    1.0
8     164    Laura       pizza   f      45    1.0    0.0
We save the joined dataset featuring the dummy variables in a new object:
>>> df3 = df2.join(df_dummy)
# let's check
>>> print(df3.head(2))
   Height  Names Pref_food Sex  Weight  Sex_f  Sex_m
0     180  Simon     steak   m      85    0.0    1.0
1     165   Kate     pizza   f      65    1.0    0.0
Now let’s remove the original ‘Sex’ variable:
>>> del df3['Sex']
# double-check
>>> print(df3.head(2))
   Height  Names  Pref_food  Weight  Sex_f  Sex_m
0     180  Simon      steak      85    0.0    1.0
1     165   Kate      pizza      65    1.0    0.0
As you can see, there are a couple of duplicate cases inside our dataset. First, we need to identify them:
>>> df2.duplicated()
0    False
1    False
2    False
3    False
4    False
5    False
6    False
7     True
8     True
dtype: bool
# our dataset has two duplicate cases
To remove duplicate cases, we use .drop_duplicates():
>>> df2.drop_duplicates()
To get a dataset without duplicates, we overwrite the old one or create another object:
>>> df3 = df2.drop_duplicates()
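drop_duplicates() can also look only at a subset of columns when deciding what counts as a duplicate, and the keep argument controls which occurrence is retained. A sketch:
>>> df2.drop_duplicates(subset = 'Names', keep = 'last')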
We can also delete a case, this time using drop:
>>> df2.drop(2)
   Height   Names   Pref_food Sex  Weight
0     180   Simon       steak   m      85
1     165    Kate       pizza   f      65
3     164   Laura       pizza   f      45
4     163    Mary  vegetables   f      43
5     175  Julian       steak   m      72
6     166   Rosie     seafood   f      46
7     180   Simon       steak   m      85
8     164   Laura       pizza   f      45
# in this case, we delete the third case
The stack() and unstack() functions allow us to reorganize our data in a different way:
>>> df3.stack()
0  Height              180
   Names             Simon
   Pref_food         steak
   Sex                   m
   Weight               85
1  Height              165
   Names              Kate
   Pref_food         pizza
   Sex                   f
   Weight               65
2  Height              170
   Names           Francis
   Pref_food         pasta
   Sex                   m
   Weight               68
3  Height              164
   Names             Laura
   Pref_food         pizza
   Sex                   f
   Weight               45
4  Height              163
   Names              Mary
   Pref_food    vegetables
   Sex                   f
   Weight               43
[...]
# we save the result in an object
>>> stacked = df3.stack()
# and now we can return it to its original format using .unstack()
>>> unstacked = stacked.unstack()
>>> print(unstacked)
  Height    Names   Pref_food Sex Weight
0    180    Simon       steak   m     85
1    165     Kate       pizza   f     65
2    170  Francis       pasta   m     68
3    164    Laura       pizza   f     45
4    163     Mary  vegetables   f     43
5    175   Julian       steak   m     72
6    166    Rosie     seafood   f     46
We can also reorganize our data via the melt() function:
>>> pd.melt(df3)
     variable       value
0      Height         180
1      Height         165
2      Height         170
3      Height         164
4      Height         163
5      Height         175
6      Height         166
7       Names       Simon
8       Names        Kate
9       Names     Francis
10      Names       Laura
11      Names        Mary
12      Names      Julian
13      Names       Rosie
[...]
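melt() becomes more useful when we keep one or more identifier columns fixed with id_vars and restrict the melted columns with value_vars. A sketch:
>>> pd.melt(df3, id_vars = 'Names', value_vars = ['Height', 'Weight'])
# 14 rows: one per name for Height, followed by one per name for Weight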
Let’s go back to the initial dataset
>>> df2.head(2)
   Height  Names Pref_food Sex  Weight
0     180  Simon     steak   m      85
1     165   Kate     pizza   f      65
and use the .T attribute to transpose rows and columns:
# we invert rows and columns
>>> df2.T

(output not shown)

We can also extract a random sample of our data using the .sample() function:
>>> df2.sample(n=2)
   Height  Names   Pref_food Sex  Weight
6     166  Rosie     seafood   f      46
4     163   Mary  vegetables   f      43

With n=2 we specified the number of cases to extract. How can we make a random extraction repeatable, so that it always returns the same cases? We set a seed with the np.random.seed() function. The specific number passed to this function does not really matter; what matters is that anyone using the same dataset and the same seed extracts the same cases.

To do this, we need to import the NumPy package:
>>> import numpy as np
>>> np.random.seed(1)
>>> df2.sample(n=2)
   Height  Names Pref_food Sex  Weight
3     164  Laura     pizza   f      45
7     180  Simon     steak   m      85
>>> np.random.seed(1)
>>> df2.sample(n=2)
   Height  Names Pref_food Sex  Weight
3     164  Laura     pizza   f      45
7     180  Simon     steak   m      85
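Alternatively, .sample() itself accepts a random_state argument, which makes a single extraction reproducible without touching NumPy's global seed:
>>> df2.sample(n = 2, random_state = 1)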
Instead of extracting a number of cases, we can extract a fraction, with the argument frac= instead of n=:
>>> df2.sample(frac = .1)
>>> df2.sample(frac = .5)
Using the argument frac=, we specify the fraction to sample: 0.1 is 10%, 0.5 is 50%, and so on.
# we can view cases with the highest values for a certain variable by specifying the number of cases (in this case, 3) and the column (in this case, Weight)
# let's consider the df3 dataset, which does not feature duplicates
>>> df3.nlargest(3, "Weight")
   Height    Names Pref_food Sex  Weight
0     180    Simon     steak   m      85
5     175   Julian     steak   m      72
2     170  Francis     pasta   m      68
Let’s now call the cases with the lowest values for a certain variable, and specify the number of cases:
>>> df3.nsmallest(4, "Weight")
   Height  Names   Pref_food Sex  Weight
4     163   Mary  vegetables   f      43
3     164  Laura       pizza   f      45
6     166  Rosie     seafood   f      46
1     165   Kate       pizza   f      65
Last, we can create a small dataset and reorganize data with the pivot_table() function:
>>> df4 = pd.DataFrame({'Date': ['2017-01-01', '2017-01-01', '2017-01-02', '2017-01-01', '2017-01-02', '2017-01-02', '2017-01-03', '2017-01-02', '2017-01-03', '2017-01-03'],
                   'Type':['x', 'x', 'y', 'x', 'y', 'x', 'z', 'y', 'z', 'y'],
                   'Value':[185, 265, 168, 245, 143, 172, 346, 285, 145, 128],
                   })
>>> print(df4)
         Date Type  Value
0  2017-01-01    x    185
1  2017-01-01    x    265
2  2017-01-02    y    168
3  2017-01-01    x    245
4  2017-01-02    y    143
5  2017-01-02    x    172
6  2017-01-03    z    346
7  2017-01-02    y    285
8  2017-01-03    z    145
9  2017-01-03    y    128
>>> print(pd.pivot_table(df4, index = "Date", values = "Value", columns = "Type"))
Type                 x           y      z
Date
2017-01-01  231.666667         NaN    NaN
2017-01-02  172.000000  198.666667    NaN
2017-01-03         NaN  128.000000  245.5
# we aggregate the values by date
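By default, pivot_table() aggregates with the mean; the aggfunc argument lets us choose a different function. A sketch using the sum:
>>> pd.pivot_table(df4, index = "Date", values = "Value", columns = "Type", aggfunc = "sum")
Type            x      y      z
Date
2017-01-01  695.0    NaN    NaN
2017-01-02  172.0  596.0    NaN
2017-01-03    NaN  128.0  491.0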

pandas: Missing Values

Now let’s examine how we can manage datasets with missing data. After importing pandas, we load a dataset that has missing values:
>>> import pandas as pd
>>> df_missing = pd.read_csv('df_missing.csv')
# I created this tiny dataset with missing data, as you can see
>>> df_missing
    A     B     C      D   E
0  15  96.0  74.0   31.0  50
1  41  27.0  74.0  279.0  57
2  21  32.0   NaN   99.0  96
3  48  97.0  50.0    NaN  69
4  63  98.0  74.0   44.0  55
5  43  11.0  80.0   74.0  33
6  39  38.0  81.0   20.0  41
7  58  31.0   NaN   76.0  91
8  85  94.0  37.0   65.0  60
9  98   NaN  19.0   43.0  32
We can verify the number of cases and variables:
>>> print(df_missing.shape)
(10, 5)
We can check the complete cases:
>>> print(df_missing.dropna().shape)
(6, 5)
We can also use .isnull(), which answers the question: Is a value missing?
>>> pd.isnull(df_missing)
       A      B      C      D      E
0  False  False  False  False  False
1  False  False  False  False  False
2  False  False   True  False  False
3  False  False  False   True  False
4  False  False  False  False  False
5  False  False  False  False  False
6  False  False  False  False  False
7  False  False   True  False  False
8  False  False  False  False  False
9  False   True  False  False  False
The converse, .notnull(), answers the question: Is a value not missing?
>>> pd.notnull(df_missing)
      A      B      C      D     E
0  True   True   True   True  True
1  True   True   True   True  True
2  True   True  False   True  True
3  True   True   True  False  True
4  True   True   True   True  True
5  True   True   True   True  True
6  True   True   True   True  True
7  True   True  False   True  True
8  True   True   True   True  True
9  True  False   True   True  True
We can count the missing values per column:
>>> df_missing.isnull().sum()
A    0
B    1
C    2
D    1
E    0
# or determine the total of missing data
>>> df_missing.isnull().sum().sum()
4
Missing values can be treated by deletion (removing the cases that contain them) or by imputation (replacing missing values with other values, such as a fixed value or an average). To perform a deletion, proceed as follows:
>>> df_missing.dropna()
    A     B     C      D   E
0  15  96.0  74.0   31.0  50
1  41  27.0  74.0  279.0  57
4  63  98.0  74.0   44.0  55
5  43  11.0  80.0   74.0  33
6  39  38.0  81.0   20.0  41
8  85  94.0  37.0   65.0  60
We eliminated the lines containing missing values. We can also delete the columns with missing values:
>>> df_missing.dropna(axis = 1, how = 'any')
    A   E
0  15  50
1  41  57
2  21  96
3  48  69
4  63  55
5  43  33
6  39  41
7  58  91
8  85  60
9  98  32
As mentioned, using imputation methods, we can replace a missing value with another value. Let’s replace all missing values with a fixed value—in this case, zero:
>>> df_missing.fillna(0)
    A     B     C      D   E
0  15  96.0  74.0   31.0  50
1  41  27.0  74.0  279.0  57
2  21  32.0   0.0   99.0  96
3  48  97.0  50.0    0.0  69
4  63  98.0  74.0   44.0  55
5  43  11.0  80.0   74.0  33
6  39  38.0  81.0   20.0  41
7  58  31.0   0.0   76.0  91
8  85  94.0  37.0   65.0  60
9  98   0.0  19.0   43.0  32
We can also replace missing values with a word, such as “missing”:
>>> df_missing.fillna('missing')
    A        B        C        D   E
0  15       96       74       31  50
1  41       27       74      279  57
2  21       32  missing       99  96
3  48       97       50  missing  69
4  63       98       74       44  55
5  43       11       80       74  33
6  39       38       81       20  41
7  58       31  missing       76  91
8  85       94       37       65  60
9  98  missing       19       43  32
Or impute missing values to the average:
>>> df_missing.fillna(df_missing.mean())
    A          B       C           D   E
0  15  96.000000  74.000   31.000000  50
1  41  27.000000  74.000  279.000000  57
2  21  32.000000  61.125   99.000000  96
3  48  97.000000  50.000   81.222222  69
4  63  98.000000  74.000   44.000000  55
5  43  11.000000  80.000   74.000000  33
6  39  38.000000  81.000   20.000000  41
7  58  31.000000  61.125   76.000000  91
8  85  94.000000  37.000   65.000000  60
9  98  58.222222  19.000   43.000000  32
Let’s consider a variable, such as variable ‘C’. The following line of code will replace every missing value in the variable C with the mean of C:
>>> df_missing['C'].fillna(df_missing['C'].mean())
0    74.000
1    74.000
2    61.125
3    50.000
4    74.000
5    80.000
6    81.000
7    61.125
8    37.000
9    19.000
Name: C, dtype: float64
Two important options for the method argument of .fillna() can be used to impute missing values: ffill and backfill:
>>> df_missing['C'].fillna(method = 'ffill')
0    74.0
1    74.0
2    74.0
3    50.0
4    74.0
5    80.0
6    81.0
7    81.0
8    37.0
9    19.0
Name: C, dtype: float64
>>> df_missing['C'].fillna(method = 'backfill')
0    74.0
1    74.0
2    50.0
3    50.0
4    74.0
5    80.0
6    81.0
7    37.0
8    37.0
9    19.0
Name: C, dtype: float64
# ffill replaces a missing value with the nonmissing value that precedes it, whereas backfill replaces it with the nonmissing value that follows it. To save the result of one of these replacement operations, we must create a new data frame or overwrite the initial one.
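In recent pandas releases, the method= argument of .fillna() has been deprecated in favor of the dedicated .ffill() and .bfill() methods, which behave in the same way:
>>> df_missing['C'].ffill()
>>> df_missing['C'].bfill()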

pandas: Merging Two Datasets

There are many ways to combine two data frames. We can delete overlapping cases, juxtapose two files, and join new cases or new variables. Let’s look at some pandas examples. First, we need to create sample data frames:
>>> df1 = pd.DataFrame({'id': ['A', 'B', 'C', 'D'], 'var1' : [37, 36, 43, 23], 'var2': [120, 117, 230, 315]})
>>> df1
  id  var1  var2
0  A    37   120
1  B    36   117
2  C    43   230
3  D    23   315
>>> df2 = pd.DataFrame({'id': ['A', 'B', 'C', 'D'], 'var3' : [12, 16, 13, 18], 'var4': [75, 54, 21, 36]})
>>> df2
  id  var3  var4
0  A    12    75
1  B    16    54
2  C    13    21
3  D    18    36
>>> df3 = pd.DataFrame({'id': ['A', 'A', 'B', 'B'], 'var3' : [2, 6, 3, 8], 'var4': [7, 5, 2, 6]})
>>> df3
  id  var3  var4
0  A     2     7
1  A     6     5
2  B     3     2
3  B     8     6
The easiest way to merge two datasets is to use the .append() method:
>>> df1.append(df2)
  id  var1   var2  var3  var4
0  A  37.0  120.0   NaN   NaN
1  B  36.0  117.0   NaN   NaN
2  C  43.0  230.0   NaN   NaN
3  D  23.0  315.0   NaN   NaN
0  A   NaN    NaN  12.0  75.0
1  B   NaN    NaN  16.0  54.0
2  C   NaN    NaN  13.0  21.0
3  D   NaN    NaN  18.0  36.0
>>> df3.append(df1)
  id  var1   var2  var3  var4
0  A   NaN    NaN   2.0   7.0
1  A   NaN    NaN   6.0   5.0
2  B   NaN    NaN   3.0   2.0
3  B   NaN    NaN   8.0   6.0
0  A  37.0  120.0   NaN   NaN
1  B  36.0  117.0   NaN   NaN
2  C  43.0  230.0   NaN   NaN
3  D  23.0  315.0   NaN   NaN
As you can see, the two datasets are simply stacked each time, without pandas adjusting the index. (Note that in pandas 2.0 and later, the .append() method has been removed; pd.concat() is its replacement.) To combine more than two datasets, we can use pd.concat():
>>> pd.concat([df1, df2, df3])
  id  var1   var2  var3  var4
0  A  37.0  120.0   NaN   NaN
1  B  36.0  117.0   NaN   NaN
2  C  43.0  230.0   NaN   NaN
3  D  23.0  315.0   NaN   NaN
0  A   NaN    NaN  12.0  75.0
1  B   NaN    NaN  16.0  54.0
2  C   NaN    NaN  13.0  21.0
3  D   NaN    NaN  18.0  36.0
0  A   NaN    NaN   2.0   7.0
1  A   NaN    NaN   6.0   5.0
2  B   NaN    NaN   3.0   2.0
3  B   NaN    NaN   8.0   6.0
In this way, we concatenated the datasets vertically, one below the other. We can also concatenate them horizontally, side by side, by specifying the axis as 1 (in the previous example, the axis was not specified, so it defaulted to zero):
>>> pd.concat([df1, df2, df3], axis = 1)
# the code above and this line below will give us the same result
>>> pd.concat([df1, df2, df3], axis = 'columns')
  id  var1  var2 id  var3  var4 id  var3  var4
0  A    37   120  A    12    75  A     2     7
1  B    36   117  B    16    54  A     6     5
2  C    43   230  C    13    21  B     3     2
3  D    23   315  D    18    36  B     8     6
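If we do not want the original indexes repeated in the result, pd.concat() also accepts the ignore_index=True argument, which builds a fresh 0 to n-1 index for the row-wise concatenation shown earlier:
>>> pd.concat([df1, df2, df3], ignore_index = True)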
We can combine two datasets in an even more advanced way, starting from two tables that have some rows and columns in common, and merging them through shared data. The pandas function that allows us to make this type of join is merge(). First, we need to create two datasets:
>>> stud1 = pd.DataFrame({'ID': [1,2,3,4,5], 'names' : ['John', 'Greta', 'Francis', 'Laura', 'Charles'], 'Logic': [28, 27, 23, 25, 30]})
>>> stud2 = pd.DataFrame({'ID': [1,6,3,4,7], 'names' : ['John', 'Kate', 'Francis', 'Laura', 'Simon'], 'Analysis': [23, 24, 28, 29, 22]})
# by default, the merge() function performs an 'inner' join: it keeps only the records that have matching key values in both datasets
>>> pd.merge(stud1, stud2, on = 'ID')
   ID  Logic  names_x  Analysis  names_y
0   1     28     John        23     John
1   3     23  Francis        28  Francis
2   4     25    Laura        29    Laura
There are other ways of merging two datasets: ‘left’, ‘right’, and ‘outer’. In each case, we use the ‘ID’ column as the key on which the datasets are merged.
>>> print(pd.merge(stud1, stud2, on = 'ID', how = 'left'))
   ID  Logic  names_x  Analysis  names_y
0   1     28     John      23.0     John
1   2     27    Greta       NaN      NaN
2   3     23  Francis      28.0  Francis
3   4     25    Laura      29.0    Laura
4   5     30  Charles       NaN      NaN
The ‘left’ option keeps all the cases of the first dataset and adds the columns of the second dataset only where the key matches.
>>> print(pd.merge(stud1, stud2, on = 'ID', how = 'right'))
   ID  Logic  names_x  Analysis  names_y
0   1   28.0     John        23     John
1   3   23.0  Francis        28  Francis
2   4   25.0    Laura        29    Laura
3   6    NaN      NaN        24     Kate
4   7    NaN      NaN        22    Simon
The ‘right’ option keeps all the cases of the second dataset and adds the columns of the first dataset only where the key matches.
>>> print(pd.merge(stud1, stud2, on = 'ID', how = 'outer'))
   ID  Logic  names_x  Analysis  names_y
0   1   28.0     John      23.0     John
1   2   27.0    Greta       NaN      NaN
2   3   23.0  Francis      28.0  Francis
3   4   25.0    Laura      29.0    Laura
4   5   30.0  Charles       NaN      NaN
5   6    NaN      NaN      24.0     Kate
6   7    NaN      NaN      22.0    Simon

When we use ‘outer’, we keep the cases of both data frames, holding a single copy of the cases that appear in both, without losing data from either of the initial datasets.

pandas: Basic Statistics

The pandas library has a lot of methods for statistics, which we can use to get descriptive statistical information. As always, we use our data frame:
>>> df = pd.DataFrame({'Names': ['Simon', 'Kate', 'Francis', 'Laura', 'Mary', 'Julian', 'Rosie'],
        'Height':[180, 165, 170, 164, 163, 175, 166],
        'Weight':[85, 65, 68, 45, 43, 72, 46],
        'Pref_food':['steak', 'pizza', 'pasta', 'pizza', 'vegetables','steak','seafood'],
        'Sex':['m', 'f','m','f', 'f', 'm', 'f']})
# some information on numeric data
>>> df.describe()
           Height     Weight
count    7.000000   7.000000
mean   169.000000  60.571429
std      6.377042  16.154021
min    163.000000  43.000000
25%    164.500000  45.500000
50%    166.000000  65.000000
75%    172.500000  70.000000
max    180.000000  85.000000
# we can also rearrange the statistics with the .transpose method to make it more readable
>>> df.describe().transpose()
        count        mean        std    min    25%    50%    75%    max
Height    7.0  169.000000   6.377042  163.0  164.5  166.0  172.5  180.0
Weight    7.0   60.571429  16.154021   43.0   45.5   65.0   70.0   85.0
# value counts
>>> df.count()
Height       7
Names        7
Pref_food    7
Sex          7
Weight       7
dtype: int64
# median calculation
>>> df['Height'].median()
166.0
# average calculation
>>> df['Height'].mean()
169.0
# minimum value
>>> df['Height'].min()
163
# maximum value
>>> df['Height'].max()
180
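Other statistical methods follow the same pattern; for example:
# standard deviation and variance of a single column
>>> df['Height'].std()
>>> df['Height'].var()
# pairwise correlation between the numeric columns
>>> df[['Height', 'Weight']].corr()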
Table 8-1 provides a summary of statistical methods.
Table 8-1 Statistical Methods

Method      Description
.describe   Provides some descriptive statistics
.count      Returns the number of values per variable
.mean       Returns the average
.median     Returns the median
.mode       Returns the mode
.min        Returns the lowest value
.max        Returns the highest value
.std        Returns the standard deviation
.var        Returns the variance
.skew       Returns skewness
.kurt       Returns kurtosis

Summary

In this chapter, we learned some easier ways to import and manage our data using pandas—one of the most important libraries for data manipulation and data science. In Chapter 9, we examine another package that is important for data manipulation: NumPy.
