© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
A. Danial, Python for MATLAB Development, https://doi.org/10.1007/978-1-4842-7223-7_13

13. Tables and Dataframes

Albert Danial1  
(1)
Redondo Beach, CA, USA
 

The Pandas module for Python [2] came out in 2008, five years before the MathWorks added tables to MATLAB in the R2013b release. Since its release, Pandas has become a core tool for data scientists, financial analysts, statisticians, and machine learning specialists. A comprehensive discussion of Pandas is far beyond the scope of this book; instead, here we show the functional similarities and notational differences between Pandas dataframes and MATLAB tables.

Comma-separated value files are a frequent source of inputs for tables and dataframes, so we begin with a pair of CSV files, pets.csv and pets_ns.csv (“ns” for “no spaces”), containing information on a few animals. The files have 11 lines—1 line of column names followed by 10 lines of data—and differ only in whitespace:

pets_ns.csv

type,MF,weight,birthday

dog,M,9.82,2023-11-26

cat,F,4.31,2023-11-18

bunny,F,8.40,2023-11-23

bunny,M,4.65,2023-11-24

dog,F,9.82,2023-11-25

cat,M,5.57,2023-11-24

bunny,M,1.12,2023-11-17

cat,M,2.96,2023-11-21

cat,M,5.08,2023-11-21

bunny,F,1.53,2023-11-15

pets.csv

type  , MF, weight, birthday

dog   ,  M,  9.82 , 2023-11-26

cat   ,  F,  4.31 , 2023-11-18

bunny ,  F,  8.40 , 2023-11-23

bunny ,  M,  4.65 , 2023-11-24

dog   ,  F,  9.82 , 2023-11-25

cat   ,  M,  5.57 , 2023-11-24

bunny ,  M,  1.12 , 2023-11-17

cat   ,  M,  2.96 , 2023-11-21

cat   ,  M,  5.08 , 2023-11-21

bunny ,  F,  1.53 , 2023-11-15

Although the two files contain identical information, the file readers’ default behaviors differ:
  • Pandas ignores leading and trailing whitespace for row data but preserves whitespace for the column titles.

  • MATLAB preserves whitespace throughout.

  • Pandas does not automatically recognize date fields and so populates the fourth column with string literals.

  • MATLAB recognizes dates and times in a variety of formats and correctly sets the type of the fourth column to datetime.
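Since Pandas preserves whitespace in the column titles of pets.csv, a post-load cleanup is one Python-side remedy. The following is a minimal sketch (using an inline stand-in for the first lines of pets.csv) that strips the column names and every string column, and defensively re-parses the weights as numbers:

```python
import io
import pandas as pd

# Inline stand-in for the first lines of pets.csv, with the same padding.
csv_text = ("type  , MF, weight, birthday\n"
            "dog   ,  M,  9.82 , 2023-11-26\n"
            "cat   ,  F,  4.31 , 2023-11-18\n")
df = pd.read_csv(io.StringIO(csv_text))

# Strip whitespace from the column names and from every string column.
df.columns = df.columns.str.strip()
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.strip()

# Guarantee a numeric weight column regardless of how it was parsed.
df['weight'] = pd.to_numeric(df['weight'])
```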

Additional data manipulation and cleaning are often necessary after loading a file. An aphorism in data science holds that data cleaning typically takes four times as much effort as the actual data analysis. Frequently performed data cleaning tasks are covered in Section 13.3.

13.1 Loading Tables from Files

MATLAB and Pandas can load tables and dataframes from CSV, Excel, and text files. Pandas can additionally read HDF5, HTML, JSON, and Python pickle files and pull data from websites or database connections through SQLAlchemy. These commands show how to create a MATLAB table and Pandas dataframe from our pets.csv file :

MATLAB:

Python:

>> tb = readtable('pets.csv');

>> tb

tb =

  10×6 table

  Var1   Var2  Var3 Var4 Var5   Var6

_______ _____  ____ ____ ____  _____

{'dog'  }{','}{'M,'}9.82{','}2023-11-26

{'cat'  }{','}{'F,'}4.31{','}2023-11-18

{'bunny'}{','}{'F,'} 8.4{','}2023-11-23

{'bunny'}{','}{'M,'}4.65{','}2023-11-24

{'dog'  }{','}{'F,'}9.82{','}2023-11-25

{'cat'  }{','}{'M,'}5.57{','}2023-11-24

{'bunny'}{','}{'M,'}1.12{','}2023-11-17

{'cat'  }{','}{'M,'}2.96{','}2023-11-21

{'cat'  }{','}{'M,'}5.08{','}2023-11-21

{'bunny'}{','}{'F,'}1.53{','}2023-11-15

In : import pandas as pd

In : df = pd.read_csv('pets.csv')

In : df

Out:

   type   MF weight   birthday

0  dog     M   9.82 2023-11-26

1  cat     F   4.31 2023-11-18

2  bunny   F   8.40 2023-11-23

3  bunny   M   4.65 2023-11-24

4  dog     F   9.82 2023-11-25

5  cat     M   5.57 2023-11-24

6  bunny   M   1.12 2023-11-17

7  cat     M   2.96 2023-11-21

8  cat     M   5.08 2023-11-21

9  bunny   F   1.53 2023-11-15

MATLAB’s table is unexpected; instead of four columns, it returned six and also failed to recognize column names. There are two alternatives: either work with a file like pets_ns.csv whose entries have been trimmed of whitespace or use explicit column-specific format options via detectImportOptions followed by setvaropts. Here, we’ll use the CSV file trimmed of whitespace:

MATLAB:

Python:

>> tb = readtable('pets_ns.csv');

>> tb

tb =

  10×4 table

  type      MF  weight  birthday

 _________ _____ ______ __________

 {'dog'  } {'M'}  9.82  2023-11-26

 {'cat'  } {'F'}  4.31  2023-11-18

 {'bunny'} {'F'}   8.4  2023-11-23

 {'bunny'} {'M'}  4.65  2023-11-24

 {'dog'  } {'F'}  9.82  2023-11-25

 {'cat'  } {'M'}  5.57  2023-11-24

 {'bunny'} {'M'}  1.12  2023-11-17

 {'cat'  } {'M'}  2.96  2023-11-21

 {'cat'  } {'M'}  5.08  2023-11-21

 {'bunny'} {'F'}  1.53  2023-11-15

In : import pandas as pd

In : import numpy as np

In : df = pd.read_csv('pets_ns.csv')

In : df

Out:

   type   MF weight   birthday

0  dog     M   9.82 2023-11-26

1  cat     F   4.31 2023-11-18

2  bunny   F   8.40 2023-11-23

3  bunny   M   4.65 2023-11-24

4  dog     F   9.82 2023-11-25

5  cat     M   5.57 2023-11-24

6  bunny   M   1.12 2023-11-17

7  cat     M   2.96 2023-11-21

8  cat     M   5.08 2023-11-21

9  bunny   F   1.53 2023-11-15

String and floating-point types are obvious, but the birthday column type is not. Let’s investigate by checking the data type of the first row in birthday (table and dataframe indexing is covered in Section 13.6.3). The .iloc[] dataframe accessor, covered in more detail in Section 13.6, lets us select one or more rows by their (zero-based) indices.

MATLAB:

Python:

>> tb{1,'birthday'}

  datetime

   2023-11-26

In : df.iloc[0]['birthday']

Out: '2023-11-26'

In : type(df.iloc[0]['birthday'])

Out: str

MATLAB’s birthday column is recognized as a datetime, but Python’s is just a string—not ideal. The Pandas read_csv() function has 48 (!) optional keyword arguments, two of which govern date handling: parse_dates takes a list of column indices to interpret as dates, and infer_datetime_format=True enables the automatic date parser. This invocation of read_csv() will give us a time-aware type for birthday:

Python:
In : df = pd.read_csv('pets_ns.csv',parse_dates=[3],
...:                  infer_datetime_format=True)
In : type(df['birthday'].loc[0])
Out: pandas._libs.tslibs.timestamps.Timestamp

For now, we’ll load the dataframe df using only default arguments to read_csv() and keep the birthday entries as strings. We show how to convert the strings to time-aware objects in the data cleaning section, Section 13.3.4.

13.2 Table Summaries

Most real-world tables have far more rows than can fit on a page. MATLAB and Pandas have summary functions that give rapid glimpses into the table contents.

13.2.1 Table Size, Column Names, Column Types

Table dimensions can be found the same way as with numeric arrays:

MATLAB:

Python:

>> height(tb)

    10

>> width(tb)

     4

>> size(tb)

    10      4

In : len(df)

Out: 10

In : df.shape

Out: (10, 4)

Column names are stored in a cell array in MATLAB and an index array in Pandas:

MATLAB:

Python:

>> tb.Properties.VariableNames

  1×4 cell array

 {'type'} {'MF'} {'weight'}

     {'birthday'}

In : df.columns

Out: Index(['type', 'MF', 'weight',

            'birthday'], dtype="object")

Column type information is a bit more obscure; the clearest method is to iterate over the columns and explicitly print each type. MATLAB’s column types are revealed by invoking class() on each column. Pandas dataframes have a .dtypes attribute that holds per-column type information, but the types often come back as just object, which is not very helpful. In Pandas, too, we therefore explicitly loop over columns and report the type of the entry in the first row:

MATLAB:

Python:

cols=tb.Properties.VariableNames;

for C = cols

  fprintf('%-8s %s\n', ...

      C{1}, class(tb.(C{1})))

end

type     cell

MF       cell

weight   double

birthday datetime

In : df.dtypes

Out:

type         object

MF           object

weight      float64

birthday     object

dtype: object

In : for C in df.columns:

...:   tp = type(df.iloc[0][C])

...:   print(f'{C:<8s} {tp}')

type     <class 'str'>

MF       <class 'str'>

weight   <class 'numpy.float64'>

birthday <class 'str'>

13.2.2 summary() and .info()/.describe()

MATLAB’s summary() function and Pandas’s .info() method for dataframes give information on each column:

MATLAB:

Python:

>> summary(tb)

Variables:

  type: 10×1 cell array

       of character vectors

MF: 10×1 cell array of

character vectors

  weight: 10×1 double

      Values:

          Min         1.12

          Median     4.865

          Max         9.82

  birthday: 10×1 datetime

      Values:

          Min       2023-11-15

          Median    2023-11-22

          Max       2023-11-26

In : df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 10 entries, 0 to 9

Data columns (total 4 columns):

# Column   Non-Null Count Dtype

- ------   -------------- -----

0 type     10 non-null    object

1 MF       10 non-null    object

2 weight   10 non-null    float64

3 birthday 10 non-null    object

dtypes: float64(1), object(3)

memory usage: 448.0+ bytes

Pandas dataframes also have a .describe() method that gives value distributions for columns holding numeric values:

Python:
In : df.describe()
Out:
          weight
count  10.000000
mean    5.326000
std     3.147289
min     1.120000
25%     3.297500
50%     4.865000
75%     7.692500
max     9.820000

The MATLAB summary information is more comprehensive as it gives extremal values for dates as well as numeric columns.

13.2.3 groupsummary() and .value_counts()

Another typical summary operation is obtaining unique counts of entries in desired columns. This can be done with groupsummary() in MATLAB and a column’s .value_counts() method in Pandas. Counts of each type of animal are obtained as follows:

MATLAB:

Python:

>> groupsummary(tb,'type')

   type    GroupCount

 _________ __________

 {'bunny'}     4

 {'cat'  }     4

 {'dog'  }     2

In : df['type'].value_counts()

Out:

cat      4

bunny    4

dog      2

13.2.4 head() and tail()

Head and tail functions print the desired number of rows at the beginning and end of the table:

MATLAB:

Python:

>> head(tb,3)

   type     MF   weight  birthday

 _________ _____ ______ __________

 {'dog'  } {'M'}  9.82  2023-11-26

 {'cat'  } {'F'}  4.31  2023-11-18

 {'bunny'} {'F'}   8.4  2023-11-23

>> tail(tb,2)

   type     MF   weight  birthday

 _________ _____ ______ __________

 {'cat'  } {'M'}  5.08  2023-11-21

 {'bunny'} {'F'}  1.53  2023-11-15

In : df.head(3)

Out:

   type   MF weight   birthday

0  dog     M   9.82 2023-11-26

1  cat     F   4.31 2023-11-18

2  bunny   F   8.40 2023-11-23

In : df.tail(2)

Out:

   type   MF weight   birthday

8  cat     M   5.08 2023-11-21

9  bunny   F   1.53 2023-11-15

13.3 Cleaning Data

Cleaning data refers to operations like renaming, adding, and deleting columns; editing strings to remove spurious whitespace and normalize spelling, hyphenation, and case; changing data types; and deleting duplicate rows or those with missing values.
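The subsections that follow cover renaming, type changes, and string edits; removing duplicate rows and rows with missing values is a one-liner each in Pandas. A minimal sketch on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one repeated row and one missing weight.
df = pd.DataFrame({'type':   ['dog', 'cat', 'cat', 'bunny'],
                   'weight': [9.82, 4.31, 4.31, np.nan]})

no_dupes = df.drop_duplicates()   # drops the repeated cat row
complete = df.dropna()            # drops rows containing NaN values
```

Both methods return new dataframes; the original df is unchanged.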

13.3.1 Renaming Columns

Data sources may not have suitable or convenient column names. Renaming columns is simple, though. A dataframe’s .rename() method takes a map of before:after names for keyword columns. In MATLAB, one simply updates the table’s Properties. This example renames type to species and weight to kg:

MATLAB:

Python:

>> head(tb,1)

   type    MF   weight  birthday

 _______ _____ ______ __________

 {'dog'} {'M'}  9.82  2023-11-26

>> tb.Properties.VariableNames(...

        [1 3]) = {'species' 'kg'};

>> head(tb,1)

 species  MF    kg   birthday

 _______ _____ ____ __________

 {'dog'} {'M'} 9.82 2023-11-26

In : df.head(1)

Out:

  type MF  weight    birthday

0  dog  M    9.82  2023-11-26

In : df.rename(columns={

        'type':'species',

        'weight':'kg'}).head(1)

Out:

  species MF    kg    birthday

0     dog  M  9.82  2023-11-26

Modifying tb.Properties.VariableNames causes MATLAB to change the table in-place. This differs from the .rename() method in Pandas which returns a new dataframe unless the option inplace=True is also given.

13.3.2 Changing Column Data Types

Operations involving multiple columns can typically only be performed if the columns have consistent data types. For example, if you have home prices for one year stored as strings in one column and floats in another year, you won’t be able to compute differences between the years until the strings are converted to floats or integers.
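The home-price scenario can be sketched in Pandas (the column names here are hypothetical):

```python
import pandas as pd

# Hypothetical home prices: 2020 values arrived as strings, 2021 as floats.
homes = pd.DataFrame({'price_2020': ['250000', '312500'],
                      'price_2021': [265000.0, 300000.0]})

# Subtraction only works once the string column is cast to a numeric type.
homes['price_2020'] = homes['price_2020'].astype(float)
homes['change'] = homes['price_2021'] - homes['price_2020']
```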

Joins between multiple tables, covered in Section 13.11, also require matching data types in addition to common column names.

Column types can be changed with the .astype() method in Pandas and by conventional typecasting in MATLAB. Here, we change the “male/female” column MF from a cell array to a string array in MATLAB and from the vague “object” type to byte strings in Pandas.

MATLAB:

Python:

>> head(tb,2)

   type    MF  weight  birthday

 _________ ___ ______ __________

 {'dog'  } {M}  9.82  2023-11-26

 {'cat'  } {F}  4.31  2023-11-18

>> tb.MF = string(tb.MF);

>> head(tb,2)

   type    MF  weight  birthday

 _________ ___ ______ __________

 {'dog'  } "M"  9.82  2023-11-26

 {'cat'  } "F"  4.31  2023-11-18

In : df.head(2)

Out:

  type MF weight   birthday

0  dog  M   9.82 2023-11-26

1  cat  F   4.31 2023-11-18

In : df.MF = df.MF.astype(bytes)

In : df.head(2)

Out:

  type   MF weight   birthday

0  dog b'M'   9.82 2023-11-26

1  cat b'F'   4.31 2023-11-18

13.3.3 Changing Column Data

Dataframe columns have an .apply() method which invokes a user-supplied function on each value in the column. This example defines a function caps() which makes the first letter of a string uppercase, then applies this function to the type column:

Python:
In : df['type'].head(5)
Out:
0      dog
1      cat
2    bunny
3    bunny
4      dog
Name: type, dtype: object
In : def caps(x):
...:    return x.capitalize()
In : df['type'].apply(caps).head(5)
Out:
0      Dog
1      Cat
2    Bunny
3    Bunny
4      Dog
Name: type, dtype: object

The type column itself is not modified unless it is reassigned to the output of .apply() with

Python:
df['type'] = df['type'].apply(caps)

13.3.4 Making Timestamps from Strings

String-to-time conversions are handled directly with the Pandas to_datetime() function, provided the string follows conventional time formats.

Python:
In : df.iloc[0]['birthday']
Out: '2023-11-26'
In : type(df.iloc[0]['birthday'])
Out: str
In : df['birthday'] = pd.to_datetime(df['birthday'])
In : df.iloc[0]['birthday']
Out: Timestamp('2023-11-26 00:00:00')
In : type(df.iloc[0]['birthday'])
Out: pandas._libs.tslibs.timestamps.Timestamp
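When the strings do not follow a conventional layout, to_datetime() also accepts an explicit format string; a small sketch with hypothetical day.month.year dates:

```python
import pandas as pd

# Hypothetical dates in day.month.year order, which the automatic
# parser would not be safe to guess.
s = pd.Series(['26.11.2023', '18.11.2023'])
t = pd.to_datetime(s, format='%d.%m.%Y')
```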

13.4 Creating Tables Programmatically

Tables and dataframes need not come solely from files; they can also be created programmatically. MATLAB has a collection of X2table() functions—array2table(), cell2table(), struct2table()—that perform the expected conversions. The Pandas DataFrame() function accepts lists and dictionaries and returns a dataframe.

struct2table() and DataFrame() with a dictionary argument are convenient because the structure attributes and dictionary keys become column names:

MATLAB:

Python:

>> x.year = [1867 1776 1810]';

>> x.n_prov  = [9 50 31]';

>> x.country = [{'Canada'},...

       {'USA'}, {'Mexico'}]';

>> tb = struct2table(x)

3×3 table

year n_prov  country

____ ______ __________

1867    9   {'Canada'}

1776   50   {'USA'   }

1810   31   {'Mexico'}

In : x = {

...:   'year' : [1867,1776,1810],

...:   'nprov' : [9,50,31],

...:   'country' : ['Canada',

...:       'USA', 'Mexico'] }

In : df = pd.DataFrame(x)

In : df

Out:

   year  nprov country

0  1867      9  Canada

1  1776     50     USA

2  1810     31  Mexico
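DataFrame() also accepts a list of rows, in which case the column names are supplied with the columns= keyword argument; a sketch using the same country data:

```python
import pandas as pd

# The same country data as a list of rows instead of a dictionary.
rows = [[1867,  9, 'Canada'],
        [1776, 50, 'USA'],
        [1810, 31, 'Mexico']]
df = pd.DataFrame(rows, columns=['year', 'nprov', 'country'])
```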

13.5 Sorting Rows

Head and tail results of a sorted table reveal extremal values. Here, we find the lightest and heaviest animals by sorting on weight—and where weights are equal, we do a secondary sort on birthday.

A side note on sorting dates and times: Ideally, the column stores these with a time-aware data type. However, proper chronological sorting is possible with strings if the date is formatted from most significant to least significant time entries. This is the case with our year-month-day string; a lexical sort is equivalent to a chronological sort.

MATLAB:

Python:

>> head(sortrows(tb,{'weight',...

                'birthday'}),2)

   type     MF   weight  birthday

 _________ _____ ______ __________

 {'bunny'} {'M'}  1.12  2023-11-17

 {'bunny'} {'F'}  1.53  2023-11-15

>> tail(sortrows(tb,{'weight',...

                'birthday'}),2)

  type    MF   weight  birthday

 _______ _____ ______ __________

 {'dog'} {'F'}  9.82  2023-11-25

 {'dog'} {'M'}  9.82  2023-11-26

In : df.sort_values(by=['weight',

...:         'birthday']).head(2)

Out:

    type MF  weight    birthday

6  bunny  M    1.12  2023-11-17

9  bunny  F    1.53  2023-11-15

In : df.sort_values(by=['weight',

...:         'birthday']).tail(2)

Out:

  type MF  weight    birthday

4  dog  F    9.82  2023-11-25

0  dog  M    9.82  2023-11-26

13.6 Table Subsets

Portions of tables can be referenced through row and column slices using indexing techniques similar to those for MATLAB matrices and NumPy ndarrays. As with numeric arrays (Section 11.​1.​15), table indexing in MATLAB is a bit more convenient than in Python. MATLAB accepts any integer array as a table index, but Pandas needs special attributes .loc[] for label and logical indexing and .iloc[] for arbitrary numeric indexing.

The biggest difference between .loc[] and .iloc[] is the kind of index each accepts: .loc[] takes labels and Boolean expressions that filter rows, while .iloc[] takes only (zero-based) integer positions. Both accessors can combine row and column selections in a single two-part subscript.
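The distinction can be sketched on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'type':   ['dog', 'cat', 'bunny'],
                   'weight': [9.82, 4.31, 8.40]})

by_position = df.iloc[[0, 2]]         # rows 0 and 2, by integer position
by_mask     = df.loc[df.weight > 5]   # rows selected by a Boolean test
```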

13.6.1 All Rows, Selected Columns

New tables containing only weight and type can be created by specifying these column names in the indexing notation shown as follows:

MATLAB:

Python:

>> tb(:,{'weight','type'})

    weight      type

    ______    _________

     9.82     {'dog'  }

     4.31     {'cat'  }

     8.4      {'bunny'}

     4.65     {'bunny'}

     9.82     {'dog'  }

     5.57     {'cat'  }

     1.12     {'bunny'}

     2.96     {'cat'  }

     5.08     {'cat'  }

     1.53     {'bunny'}

In : df[['weight','type']]

Out:

   weight   type

0    9.82    dog

1    4.31    cat

2    8.40  bunny

3    4.65  bunny

4    9.82    dog

5    5.57    cat

6    1.12  bunny

7    2.96    cat

8    5.08    cat

9    1.53  bunny

13.6.2 All Columns, Selected Rows

New tables containing all columns but only selected rows are produced using the following indexing:

MATLAB:

Python:

>> tb([2,3],:)

   type     MF   weight  birthday

 _________ _____ ______ __________

 {'cat'  } {'F'}  4.31  2023-11-18

 {'bunny'} {'F'}   8.4  2023-11-23

In : df.loc[[1,2]]

Out:

    type  MF  weight    birthday

1    cat   F    4.31  2023-11-18

2  bunny   F    8.40  2023-11-23

13.6.3 Selected Rows, Selected Columns

A subset of rows can be similarly targeted using a row slice with either a strided range or an arbitrary list of integers. These select every third row starting with the first:

MATLAB:

Python:

>> tb(1:3:end,{'weight','type'})

 weight   type

 ______ _________

9.82  {'dog'  }

4.65  {'bunny'}

1.12  {'bunny'}

1.53  {'bunny'}

In : df[::3][['weight', 'type']]

Out:

   weight   type

0    9.82    dog

3    4.65  bunny

6    1.12  bunny

9    1.53  bunny

Selecting arbitrary rows of a dataframe requires the .iloc[] attribute:

MATLAB:

Python:

>> tb([1,6,2],{'weight','type'})

 weight  type

 ______ _______

9.82  {'dog'}

5.57  {'cat'}

4.31  {'cat'}

In : df[['weight','type']].iloc[[0,5,1]]

Out:

   weight type

0    9.82  dog

5    5.57  cat

1    4.31  cat

A unique feature of dataframes is that row and column indexing can be given in any order:

Python:

Python:

In : df.iloc[[0,5,1]][

          ['weight','type']]

Out:

   weight type

0    9.82  dog

5    5.57  cat

1    4.31  cat

In : df[['weight','type']

        ].iloc[[0,5,1]]

Out:

weight type

0    9.82  dog

5    5.57  cat

1    4.31  cat

This is possible because each indexing component returns a new table which can subsequently be indexed independently. df.iloc[[0,5,1]] returns the first, sixth, and second rows, and df.iloc[[0,5,1]][['weight','type']] returns the weight and type columns from those three rows. Similarly, df[['weight','type']] is a table containing all rows of the weight and type columns, and df[['weight','type']].iloc[[0,5,1]] returns the first, sixth, and second rows from those two columns.

MATLAB tables, by contrast, take their row and column selections together in a single (rows, columns) subscript.

13.6.4 Filter Rows by Conditional Operations

In addition to numeric indexing, rows of MATLAB tables and Pandas dataframes can be indexed with Boolean expressions on columns. This example returns males weighing more than four units:

MATLAB:

Python:

>> tb((tb.weight > 4) & (...

       cell2mat(tb.MF) == 'M'),:)

4×4 table

   type     MF   weight  birthday

 _________ _____ ______ __________

 {'dog'  } {'M'}  9.82  2023-11-26

 {'bunny'} {'M'}  4.65  2023-11-24

 {'cat'  } {'M'}  5.57  2023-11-24

 {'cat'  } {'M'}  5.08  2023-11-21

In : df[(df.weight > 4) &

...:    (df.MF == 'M')]

Out:

    type MF  weight    birthday

0    dog  M    9.82  2023-11-26

3  bunny  M    4.65  2023-11-24

5    cat  M    5.57  2023-11-24

8    cat  M    5.08  2023-11-21

Note that in MATLAB, the cell array of characters in column MF must first be converted to a character matrix with cell2mat() to allow an elementwise equality test.
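On the Pandas side, the same filter can also be written as a single expression string with the dataframe’s .query() method; a sketch on a hypothetical three-row frame:

```python
import pandas as pd

df = pd.DataFrame({'type':   ['dog', 'cat', 'bunny'],
                   'MF':     ['M',   'F',   'M'],
                   'weight': [9.82,  4.31,  4.65]})

# Column names can be referenced directly inside the expression string.
males_over_4 = df.query('weight > 4 and MF == "M"')
```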

13.7 Iterating over Rows

Interaction with tables generally happens with entire columns of data. Occasionally though, access to individual rows is necessary. Rows of MATLAB tables are indexed using brace subscripts. Pandas rows may be indexed numerically using the .iloc[] accessor (Section 13.6.3), but when iterating over an entire table, the dataframe’s .iterrows() generator invoked in a for loop is more convenient. Column values within a row may be indexed numerically or with column names:

MATLAB:

Python:

>> for i = 1:height(tb)

  fprintf("%d %s %s ",i,string(...

       tb{i,1}),string(tb{i,'MF'}))

end

1 dog M

2 cat F

3 bunny F

4 bunny M

5 dog F

6 cat M

7 bunny M

8 cat M

9 cat M

10 bunny F

In : for i,row in df.iterrows():

...:   print(i,row[0],row['MF'])

0 dog M

1 cat F

2 bunny F

3 bunny M

4 dog F

5 cat M

6 bunny M

7 cat M

8 cat M

9 bunny F
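When row values are only read, the .itertuples() generator is a commonly used alternative to .iterrows(); it yields one namedtuple per row, with the dataframe index available in the Index field. A sketch on a hypothetical two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'type': ['dog', 'cat'], 'MF': ['M', 'F']})

lines = []
for row in df.itertuples():
    # row is a namedtuple; columns are accessed as attributes.
    lines.append(f'{row.Index} {row.type} {row.MF}')
```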

13.8 Pivot Tables

Pivot tables allow one to aggregate data sharing common values. They help answer questions such as “what were sales by region?” or “how many points did home teams score compared to visiting teams?” Pandas has explicit tools for these (the pd.pivot_table() function and the dataframe .groupby() method), while MATLAB relies on different techniques depending on the complexity of the desired aggregation.

13.8.1 Single-Level Aggregation

MATLAB can produce a pivot table grouped by a single variable through its generic varfun() function with "GroupingVariables" as the third argument and the column name to group by as the fourth argument. The Pandas pd.pivot_table() function, which can also be invoked as a dataframe method, takes a list of column names to group by with the keyword argument index. This example returns a new table or dataframe with a count of animals based on type. It duplicates the result of .value_counts() from Section 13.2.3:

MATLAB:

Python:

>> varfun(@length,tb,...

     "GroupingVariables","type",...

     "InputVariables","type")

   type    GroupCount length_type

 _________ __________ ___________

 {'bunny'}     4           4

 {'cat'  }     4           4

 {'dog'  }     2           2

In : pd.pivot_table(df,

       index=['type'], aggfunc="count")

Out:

       MF  birthday  weight

type

bunny   4         4       4

cat     4         4       4

dog     2         2       2

The aggregation function is provided in MATLAB as a function handle and in Pandas with the optional keyword argument aggfunc. Pandas uses mean() if this argument is not supplied.

varfun() and pd.pivot_table() return a new table and dataframe which can be sliced, sorted, and so on. The next example sums the weight of males and females using the method form of pivot_table():

MATLAB:

Python:

>> varfun(@sum,tb,...

     "GroupingVariables","MF",...

     "InputVariables","weight")

MF   GroupCount sum_weight

_____ __________ __________

{'F'}     4        24.06

{'M'}     6         29.2

In : df.pivot_table(index=['MF'],

...:  values=['weight'],aggfunc='sum')

Out:

    weight

MF

F    24.06

M    29.20

The Pandas .groupby() method returns the same result, but only because weight is the only numeric column in df. Had there been others, sums would be returned for those too.

Python:
In : df.groupby(['MF']).sum()
Out:
    weight
MF
F    24.06
M    29.20
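A .groupby() can also compute several statistics in one pass by handing a list of function names to .agg(); a sketch with hypothetical weights:

```python
import pandas as pd

df = pd.DataFrame({'MF':     ['M', 'F', 'M'],
                   'weight': [9.82, 4.31, 5.57]})

# One row per group, one column per aggregation function.
stats = df.groupby('MF')['weight'].agg(['sum', 'mean', 'count'])
```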

13.8.2 Multilevel Aggregation

Aggregation may be nested to multiple levels in Pandas simply by adding more columns to the index= keyword argument. Despite the plural form of "GroupingVariables", the same is not true for MATLAB’s varfun(); it only allows grouping by one column. Multilevel pivot tables in MATLAB require the three-step “split-apply-combine” method involving several intermediate variables. Here, we’ll get a more detailed view into animal weights by separating the aggregation by animal type in addition to gender:

MATLAB:

Python:

>> [Idx, MF, Type] = ...

      findgroups(tb.MF, tb.type);

>> sum_weight = splitapply(...

      @sum, tb.weight, Idx);

>> table(MF, Type, sum_weight)

  6×3 table

  MF     Type    sum_weight

 _____ _________ __________

 {'F'} {'bunny'}    9.93

 {'F'} {'cat'  }    4.31

 {'F'} {'dog'  }    9.82

 {'M'} {'bunny'}    5.77

 {'M'} {'cat'  }   13.61

 {'M'} {'dog'  }    9.82

In : pd.pivot_table(df,

index=['MF','type'],

values=['weight'],

aggfunc='sum')

Out:

weight

MF type

F  bunny    9.93

cat      4.31

dog      9.82

M  bunny    5.77

cat     13.61

dog      9.82
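The multilevel Pandas result carries a two-level row index (a MultiIndex); calling .reset_index() turns the grouping levels back into ordinary columns, which is often convenient for further slicing or sorting. A sketch on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'MF':     ['M', 'F', 'M'],
                   'type':   ['cat', 'cat', 'dog'],
                   'weight': [5.57, 4.31, 9.82]})

pt = pd.pivot_table(df, index=['MF', 'type'],
                    values=['weight'], aggfunc='sum')
flat = pt.reset_index()   # MF and type become regular columns again
```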

13.9 Adding Columns

New columns derived from data in existing columns can be helpful in subsequent aggregation and join operations. They can be added simply by indexing the table or dataframe with a new column name in an assignment statement. For example, tb.New = 4 and df['New'] = 4 create a column “New” with a value of 4 for all rows in table tb and dataframe df. In the following, we’ll add a new column, wd, containing the integer weekday of each animal’s birthday. The Pandas birthday column needs to be a timestamp rather than a string (Section 13.3.4).

One complication is that MATLAB weekday values run from Sunday = 1 to Saturday = 7, while Python’s run from Monday = 0 to Sunday = 6. We’ll map the Python values to match MATLAB’s with mod(i + 1, 7) + 1, where i is Python’s weekday value.

MATLAB:

Python:

>> tb.wd = weekday(tb{:,'birthday'})

  type     MF  weight  birthday  wd

_________ ____ ______ __________ __

{'dog'  } {'M'} 9.82  2023-11-26 1

{'cat'  } {'F'} 4.31  2023-11-18 7

{'bunny'} {'F'}  8.4  2023-11-23 5

{'bunny'} {'M'} 4.65  2023-11-24 6

{'dog'  } {'F'} 9.82  2023-11-25 7

{'cat'  } {'M'} 5.57  2023-11-24 6

{'bunny'} {'M'} 1.12  2023-11-17 6

{'cat'  } {'M'} 2.96  2023-11-21 3

{'cat'  } {'M'} 5.08  2023-11-21 3

{'bunny'} {'F'} 1.53  2023-11-15 4

In : i = df['birthday'].dt.weekday

In : df['wd'] = ((i+1) % 7) + 1

In : df

    type MF  weight   birthday  wd

0    dog  M    9.82 2023-11-26   1

1    cat  F    4.31 2023-11-18   7

2  bunny  F    8.40 2023-11-23   5

3  bunny  M    4.65 2023-11-24   6

4    dog  F    9.82 2023-11-25   7

5    cat  M    5.57 2023-11-24   6

6  bunny  M    1.12 2023-11-17   6

7    cat  M    2.96 2023-11-21   3

8    cat  M    5.08 2023-11-21   3

9  bunny  F    1.53 2023-11-15   4

If we count the number of animals born by weekday using a pivot table, sort the result by counts, then tail that, we’ll see the weekday on which most animals were born. The selection of MF as the “value” column to sum over is arbitrary; the row count is the same regardless of the column chosen to count:

MATLAB:

Python:

>> tail(sortrows(varfun(@length,...

    tb,"GroupingVariables","wd",...

    "InputVariables","MF"),...

    "GroupCount"),1)

wd GroupCount length_MF

 __ __________ _________

 6      3          3

In : pd.pivot_table(df,index=['wd'],

values=['MF'],aggfunc='count'

).sort_values(by=['MF']).tail(1)

Out:

MF

wd

6    3

We find that three of the animals were born on wd = 6, that is, Friday.

13.10 Deleting Columns

Columns can be deleted as easily as they are added. A column can be removed from a MATLAB table by reassigning it to an empty array. In Pandas, a column is deleted the same way as a dictionary entry. Here, we remove the weekday column added in the previous section:

MATLAB:

Python:

>> head(tb,2)

  type    MF   weight  birthday  wd

 _______ _____ ______ __________ __

 {'dog'} {'M'}  9.82  2023-11-26 1

 {'cat'} {'F'}  4.31  2023-11-18 7

>> tb.wd = []; % remove wd column

>> head(tb,2)

  type    MF   weight  birthday

 _______ _____ ______ __________

 {'dog'} {'M'}  9.82  2023-11-26

 {'cat'} {'F'}  4.31  2023-11-18

In : df.head(2)

Out:

  type MF  weight   birthday  wd

0  dog  M    9.82 2023-11-26   1

1  cat  F    4.31 2023-11-18   7

In : del df['wd'] # remove wd column

In : df.head(2)

Out:

  type MF  weight   birthday

0  dog  M    9.82 2023-11-26

1  cat  F    4.31 2023-11-18

Multiple columns can be deleted from a dataframe with the .drop() method and removevars() in MATLAB:

MATLAB:

Python:

>> head(tb,2)

  type    MF   weight  birthday

 _______ _____ ______ __________

 {'dog'} {'M'}  9.82  2023-11-26

 {'cat'} {'F'}  4.31  2023-11-18

>> head(removevars(tb,...

      {'MF','weight'}),2)

  type     birthday

 _______  __________

 {'dog'}  2023-11-26

 {'cat'}  2023-11-18

In : df.head(2)

Out:

  type MF  weight   birthday

0  dog  M    9.82 2023-11-26

1  cat  F    4.31 2023-11-18

In : df.drop(columns=['MF','weight']).head(2)

Out:

  type   birthday

0  dog   2023-11-26

1  cat   2023-11-18

In both cases, a new table is returned; the original tb and df remain unchanged.

13.11 Joins Across Tables

Complex datasets may span multiple tables. MATLAB and Pandas support join operations that combine data related by key columns common to the tables. Joins come in four flavors (inner, full/outer, left, right) that are analogous to intersection, union, and difference operations between two sets. Inner joins are seen most frequently because they return results only where both tables have matching keys.
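The four flavors can be sketched with a pair of hypothetical frames in which one animal type has no price entry:

```python
import pandas as pd

animals = pd.DataFrame({'type': ['dog', 'cat', 'ferret']})
prices  = pd.DataFrame({'type': ['dog', 'cat'], 'price': [1.2, 3.5]})

inner = pd.merge(animals, prices, on='type')               # drops ferret
left  = pd.merge(animals, prices, on='type', how='left')   # keeps ferret, price NaN
outer = pd.merge(animals, prices, on='type', how='outer')  # union of keys
```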

As a simple example, say we have to feed our animals. We’ll introduce a second table that has the cost of a unit of food for each animal type:

MATLAB:

Python:

>> food.type = {'dog','cat',...

                'bunny'}';

>> food.price = [1.2 3.5 0.7]';

>> cost_tb = struct2table(food)

  3×2 table

      type       price

    _________    _____

    {'dog'  }     1.2

    {'cat'  }     3.5

    {'bunny'}     0.7

In : food = {

...:  'type' : ['dog','cat','bunny'],

...:  'price' : [ 1.2, 3.5, 0.7]}

In : cost_df = pd.DataFrame(food)

In : cost_df

Out:

    type  price

0    dog    1.2

1    cat    3.5

2  bunny    0.7

We can create a new table containing all the animal entries plus the type-specific price by joining the two tables on type. Joins are done in MATLAB with join(), innerjoin(), or outerjoin() and in Pandas with pd.merge(). pd.merge()’s default join type, “inner,” can be changed with the how= keyword option.

MATLAB:

Python:

>> J = innerjoin(tb,cost_tb, 'Keys',...

                 {'type'})

type     MF   weight  birthday  price

________ _____ ______ _________ ______

{'bunny'} {'F'}   8.4  2023-11-23  0.7

{'bunny'} {'M'}  4.65  2023-11-24  0.7

{'bunny'} {'M'}  1.12  2023-11-17  0.7

{'bunny'} {'F'}  1.53  2023-11-15  0.7

{'cat'  } {'F'}  4.31  2023-11-18  3.5

{'cat'  } {'M'}  5.57  2023-11-24  3.5

{'cat'  } {'M'}  2.96  2023-11-21  3.5

{'cat'  } {'M'}  5.08  2023-11-21  3.5

{'dog'  } {'M'}  9.82  2023-11-26  1.2

{'dog'  } {'F'}  9.82  2023-11-25  1.2

In : J = pd.merge(df,cost_df,on='type')

In : J

Out:

type MF weight   birthday price

0   dog  M   9.82 2023-11-26   1.2

1   dog  F   9.82 2023-11-25   1.2

2   cat  F   4.31 2023-11-18   3.5

3   cat  M   5.57 2023-11-24   3.5

4   cat  M   2.96 2023-11-21   3.5

5   cat  M   5.08 2023-11-21   3.5

6 bunny  F   8.40 2023-11-23   0.7

7 bunny  M   4.65 2023-11-24   0.7

8 bunny  M   1.12 2023-11-17   0.7

9 bunny  F   1.53 2023-11-15   0.7

The tables match although rows appear in different orders. The joined table J lets us compute the cost to feed each animal. If each animal eats its weight in food, the costs are

MATLAB:

Python:

>> J.weight .* J.price

ans=

    5.8800

    3.2550

    0.7840

    1.0710

   15.0850

   19.4950

   10.3600

   17.7800

   11.7840

   11.7840

>> sum(J.weight .* J.price)

ans=

   97.2780

In : J.weight*J.price

Out:

0    11.784

1    11.784

2    15.085

3    19.495

4    10.360

5    17.780

6     5.880

7     3.255

8     0.784

9     1.071

dtype: float64

In : sum(J.weight*J.price)

Out: 97.27799999999999

What’s the cost breakdown by animal type? A pivot table gives the answer, but first we’ll need another temporary column storing the cost per animal computed earlier. We’ll call this new column wxp for “weight times price”:

MATLAB:

Python:

>> J.wxp = J.weight .* J.price;

varfun(@sum,J,"GroupingVariables",...

       "type","InputVariables","wxp")

3×3 table

   type    GroupCt   sum_wxp

 _________ _______ __________

 {'bunny'} 4.e+00  1.0990e+01

 {'cat'  } 4.e+00  6.2720e+01

 {'dog'  } 2.e+00  2.3568e+01

In : J['wxp'] = J.weight*J.price

In : pd.pivot_table(J,index=['type'],

...:     values=['wxp'],aggfunc=sum)

Out:

wxp

type

bunny  10.990

cat    62.720

dog    23.568
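The same per-type totals can be computed with a groupby instead of a pivot table; a short sketch, assuming a small dataframe shaped like J above (values hypothetical):

```python
import pandas as pd

# a dataframe shaped like J above, with a 'type' and a 'wxp' column
J = pd.DataFrame({'type': ['dog', 'cat', 'bunny', 'cat'],
                  'wxp':  [11.8, 15.1, 5.9, 19.5]})

# groupby().sum() gives the same per-type totals as pd.pivot_table
# with aggfunc=sum; the result is a Series indexed by type
totals = J.groupby('type')['wxp'].sum()
print(totals)
```

For this single-aggregate case the two approaches are interchangeable; pivot_table earns its keep when aggregating over multiple index and column variables at once.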

13.12 GeoPandas

GeoPandas [1] adds spatial data types and operations to Pandas. It allows one to query and aggregate information from Pandas dataframes spatially (“how do home prices vary by postal code and time?”), perform geographic set operations (“over which countries did the airplane fly?”), and create maps with data overlays. Here, we’ll explore that last capability.

GeoPandas needs geographic data in the form of GeoJSON files or shapefiles. In Section 12.​4.​3.​2, we used shapefiles downloaded from the Natural Earth project. The US government’s data.gov website is another source of shapefiles; among others, it contains a 500 MB compressed file with high-resolution outlines of every zip code in the United States.1

My goal is to make a map of Los Angeles County with zip codes colored by home prices. Zillow Research2 generously makes available real estate price data grouped by zip code. From their portal, I downloaded the “ZVHI All Homes (SFR/Condo/Coop)” dataset, a 70 MB CSV file.

Loading a CSV file into Pandas is easy (Section 13.1). The novel part here is making a dataframe with geographic data of zip code boundaries. Fortunately, GeoPandas makes loading a shapefile as easy as loading a CSV file. The US postal code file, tl_2019_us_zcta510.zip, expands to seven files, the largest of which is the shapefile tl_2019_us_zcta510.shp. All we need to do is read it with the GeoPandas read_file() function. These commands
In : import pandas as pd
In : import geopandas as gp
In : home_prices  = pd.read_csv('Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv')
In : zipcode_geom = gp.read_file('tl_2019_us_zcta510.shp')
create two Pandas dataframes, home_prices, containing Zillow’s home prices by zip code, and zipcode_geom, containing shapes of all zip codes. Their contents look like this:
In : home_prices
Out:
       RegionID  SizeRank  RegionName  ... 2021-03-31 2021-04-30 2021-05-31
0         61639         0       10025  ...  1053436.0  1071606.0  1089141.0
1         84654         1       60657  ...   514068.0   515876.0   517752.0
2         61637         2       10023  ...  1096338.0  1100727.0  1104368.0
3         91982         3       77494  ...   370188.0   376591.0   383493.0
4         84616         4       60614  ...   656960.0   659558.0   661174.0
...         ...       ...         ...  ...        ...        ...        ...
30832     62532     34430       12345  ...   167810.0   170959.0   174757.0
30833     87060     34430       66045  ...   219054.0   224651.0   233820.0
30834     58379     34430        1470  ...   398219.0   402924.0   407785.0
30835     58117     35187         822  ...   168439.0   169986.0   173515.0
30836     58110     35187         801  ...    37344.0    36687.0    36030.0
[30837 rows x 314 columns]
In : zipcode_geom
Out:
      ZCTA5CE10  ...                                           geometry
0         43451  ...  POLYGON ((-83.70873 41.32733, -83.70815 41.327...
1         43452  ...  POLYGON ((-83.08698 41.53780, -83.08256 41.537...
2         43456  ...  MULTIPOLYGON (((-82.83558 41.71082, -82.83515 ...
3         43457  ...  POLYGON ((-83.49650 41.25371, -83.48382 41.253...
4         43458  ...  POLYGON ((-83.22229 41.53102, -83.22228 41.532...
...         ...  ...                                                ...
33139     84044  ...  POLYGON ((-112.26022 40.76909, -112.25333 40.7...
33140     84045  ...  MULTIPOLYGON (((-111.92421 40.17034, -111.9240...
33141     84046  ...  POLYGON ((-110.00072 40.99745, -110.00036 40.9...
33142     84047  ...  POLYGON ((-111.92141 40.62772, -111.92134 40.6...
33143     84049  ...  POLYGON ((-111.59394 40.57707, -111.59386 40.5...
[33144 rows x 10 columns]
All we need to do is join the two dataframes on zip code and then invoke the .plot() method on the result to see a map. Some data cleaning is needed to prepare the dataframes for a join though; the two dataframes must use the same name and data type for the join column.
In : zipcode_geom.rename(columns={'ZCTA5CE10':'ZipCode'}, inplace=True)
In : home_prices.rename(columns={'RegionName':'ZipCode'}, inplace=True)
In : zipcode_geom['ZipCode'] = zipcode_geom['ZipCode'].astype(int)

How did I know to change the data type of zipcode_geom['ZipCode'] to integer? Initially, I didn’t. I attempted the join with only renamed columns and then got a Pandas error saying the join failed because of a type mismatch, “ValueError: You are trying to merge an object and int64 columns.”
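One way to catch such a mismatch before attempting the join is to compare the dtypes of the key columns. A sketch with hypothetical miniature dataframes reproducing the situation (zip codes as strings in one table, integers in the other):

```python
import pandas as pd

# zip codes as strings in one table, integers in the other
geom   = pd.DataFrame({'ZipCode': ['90266', '90277'], 'area':  [1.2, 3.4]})
prices = pd.DataFrame({'ZipCode': [90266, 90277],     'price': [2.1e6, 1.8e6]})

# comparing dtypes of the key columns reveals the problem up front
print(geom['ZipCode'].dtype, prices['ZipCode'].dtype)  # object int64

# convert, then the merge succeeds
geom['ZipCode'] = geom['ZipCode'].astype(int)
merged = geom.merge(prices, on='ZipCode')
print(len(merged))  # 2
```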

After renaming and fixing the data type, I can do the join, but the resulting dataframe includes every zip code in the United States—too much for my needs. I’ll first subset the data to just LA County, and only for May 2020 and 2021:
In : LA_prices = home_prices[
...:                  home_prices['CountyName'] == 'Los Angeles County'][
...:                  ['ZipCode','2020-05-31','2021-05-31']]
Finally, I’ll do the join and create the plot:
In : LA_price_map = zipcode_geom.merge(LA_prices,on='ZipCode')  # join
In : LA_price_map['2020-05-31'] /= 1.0e6  # change units to million dollars
In : LA_price_map.plot(column='2020-05-31',cmap='plasma',legend=True)
In : plt.show()

Not bad for just a dozen lines of code. There’s room for improvement, though. The following program adds a title, axis labels, and identifiers for every zip code and zooms in to the South Bay region:

Python:
#!/usr/bin/env python3
# file: code/geopandas/LA_home_prices.py
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.transforms import Bbox
import geopandas as gp
import pandas as pd
zipcode_geom = gp.read_file('usa_zip/tl_2019_us_zcta510.shp')
home_prices  = pd.read_csv('Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv')
# rename zip code columns to enable join
zipcode_geom.rename(columns={'ZCTA5CE10':'ZipCode'}, inplace=True)
home_prices.rename(columns={'RegionName':'ZipCode'}, inplace=True)
# change type from 'object' to 'int64'
zipcode_geom['ZipCode'] = zipcode_geom['ZipCode'].astype(int)
# subset to just LA County
LA_prices = home_prices[
                  home_prices['CountyName'] == 'Los Angeles County'][
                  ['ZipCode','2020-05-31','2021-05-31']]
LA_price_map = zipcode_geom.merge(LA_prices,on='ZipCode')
LA_price_map['2020-05-31'] /= 1.0e6  # change units to million dollars
LA_price_map.plot(column='2020-05-31',cmap='plasma',
        legend=True,figsize=(10,14))
plt.xlim(left=-118.55,right=-118.17)
plt.ylim(top=34.14,bottom=33.7)
plt.title('Los Angeles County Home Prices 2020-05-31, $M')
plt.ylabel('Latitude [deg]')
plt.xlabel('Longitude [deg]')
# label each zip code
for i,row in LA_price_map.iterrows():
    center = row['geometry'].centroid.xy
    xy = center[0][0], center[1][0]
    plt.annotate(text=row['ZipCode'], xy=xy,
                 horizontalalignment='center')
plt.savefig('LA_home_prices.png', bbox_inches=Bbox([[0.9,0.9],[9,13]]),
            transparent=True)
plt.show()

13.13 Recipe 13-1: Maps with GeoPandas

The map of Los Angeles County home prices created in Section 13.12 can also be made from MATLAB by calling GeoPandas. The Pandas idioms of filtering dataframe rows and columns with brackets (Section 13.6) and calling a dataframe’s .iterrows() method to traverse rows are not directly available from MATLAB, though. We’ll need a bridge module to supply the following capabilities:
  • Filter rows based on Boolean operations

  • Extract selected columns

  • Rename columns

  • Change column data type

  • Scale column values

  • Iterate over rows

The last item, row iteration, is surprisingly challenging. An obvious solution is to create a Python generator that returns a new row each time it is called, but MATLAB doesn’t seem to work with Python generators. Instead, I implemented iterrows() in the bridge module as a closure.3

Python:
# file: code/matlab_py/bridge_geopandas.py
def row_filter(dframe, A, op, B):
    operation = {
            '==' : lambda DF, A, B : DF[DF[A] == B],
            '!=' : lambda DF, A, B : DF[DF[A] != B],
            '<=' : lambda DF, A, B : DF[DF[A] <= B],
            '>=' : lambda DF, A, B : DF[DF[A] >= B],
            '<'  : lambda DF, A, B : DF[DF[A] <  B],
            '>'  : lambda DF, A, B : DF[DF[A] >  B],
            }
    if op not in operation:
        print(f'bridge_geopandas.row_filter: "{op}" not recognized')
        return None
    return operation[op](dframe, A, B)
def col_filter(dframe, cols):
    return dframe[[*cols]]
def rename_col(dframe, A, B):
    dframe.rename(columns={A:B}, inplace=True)
def astype(dframe, col, new_type):
    if new_type == 'int':
        dframe[col] = dframe[col].astype(int)
    elif new_type == 'float':
        dframe[col] = dframe[col].astype(float)
    elif new_type == 'str':
        dframe[col] = dframe[col].astype(str)
    else:
        print(f'bridge_geopandas.astype: "{new_type}" not recognized')
def mult_by(dframe, col, scale):
    dframe[col] *= scale
def iterrows(dframe):
    i = -1
    n_rows = len(dframe)
    def increment():
        nonlocal i
        i += 1
        if i < n_rows:
            return dframe.iloc[i]
        else:
            return None
    return increment
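
Before wiring the closure into MATLAB, it can be exercised from Python directly. A minimal check, with the iterrows() definition restated so the snippet is self-contained:

```python
import pandas as pd

def iterrows(dframe):
    # same closure as in bridge_geopandas.py: each call returns the
    # next row, then None once the rows are exhausted
    i = -1
    n_rows = len(dframe)
    def increment():
        nonlocal i
        i += 1
        return dframe.iloc[i] if i < n_rows else None
    return increment

df = pd.DataFrame({'type': ['dog', 'cat'], 'weight': [9.82, 4.31]})
next_row = iterrows(df)
row = next_row()
while row is not None:
    print(row['type'], row['weight'])
    row = next_row()
```

The MATLAB loop in the program below follows the same pattern, testing for py.NoneType instead of None.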

Calls to bridge functions in the following MATLAB program are preceded by commented Python code which implements the equivalent Pandas operation. Additionally, the sequence of commands differs a bit from the pure Python program of Section 13.12 to reduce the memory footprint: rows of the two dataframes are filtered immediately after they are read.

MATLAB 2020b:
% file: code/geopandas/LA_home_prices.m
Im  = @py.importlib.import_module;
pd  = Im('pandas');
gp  = Im('geopandas');
BG  = Im('bridge_geopandas');
plt = Im('matplotlib.pyplot');
trx = Im('matplotlib.transforms');
mpl = Im('matplotlib');
if ispc
    mpl.use('WXAgg')
else
    mpl.use('TkAgg')
end
zipcode_geom = gp.read_file('usa_zip/tl_2019_us_zcta510.shp');
% zipcode_geom.rename(columns={'ZCTA5CE10':'ZipCode'}, inplace=True)
BG.rename_col(zipcode_geom, 'ZCTA5CE10','ZipCode');
% zipcode_geom['ZipCode'] = zipcode_geom['ZipCode'].astype(int)
BG.astype(zipcode_geom, 'ZipCode', 'int');
% zipcode_geom = zipcode_geom[ zipcode_geom['ZipCode'] > 90000]
zipcode_geom = BG.row_filter(zipcode_geom, 'ZipCode', '>', 90000);
home_prices = pd.read_csv('Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv');
% home_prices.rename(columns={'RegionName':'ZipCode'}, inplace=True)
BG.rename_col(home_prices, 'RegionName', 'ZipCode');
% home_prices = home_prices[['ZipCode','CountyName','2020-05-31','2021-05-31']]
home_prices = BG.col_filter(home_prices, py.list({'ZipCode',...
                            'CountyName','2020-05-31','2021-05-31'}));
% LA_prices = home_prices[home_prices['CountyName'] == 'Los Angeles County']
LA_prices = BG.row_filter(home_prices, 'CountyName', '==', 'Los Angeles County');
LA_price_map = zipcode_geom.merge(LA_prices,pyargs('on','ZipCode'));
% LA_price_map['2020-05-31'] /= 1.0e6  # change units to million dollars
BG.mult_by(LA_price_map, '2020-05-31', 1.0/1.0e6);
size_10_14 = py.tuple([10,14]);
LA_price_map.plot(pyargs('column','2020-05-31','cmap','plasma',...
        'legend',py.True,'figsize',size_10_14));
plt.xlim(pyargs('left',-118.55,'right',-118.17));
plt.ylim(pyargs('top',34.14,'bottom',33.7));
plt.title('Los Angeles County Home Prices 2020-05-31, $M');
plt.ylabel('Latitude [deg]');
plt.xlabel('Longitude [deg]');
% label each zip code
next_row = BG.iterrows(LA_price_map);
while 1
    row = next_row();
    if strcmp(class(row), 'py.NoneType')
        break
    end
    geo = row.get('geometry');
    xy = py.tuple([geo.centroid.x, geo.centroid.y]);
    plt.annotate(pyargs('text',row.get('ZipCode'),'xy',xy,...
                 'horizontalalignment','center'));
end
bbox = trx.Bbox([0.9 0.9; 9 13]);
plt.savefig('LA_m.png', pyargs('bbox_inches',bbox,'transparent',py.True));
plt.show()

Unfortunately, as of November 2021, attempting to use geopandas through MATLAB on Windows is problematic due to DLL load issues. Visit this book’s GitHub repository4 for updates.

13.14 References

  1. [1] Kelsey Jordahl et al. geopandas/geopandas: v0.8.1. Version v0.8.1. July 2020. DOI: 10.5281/zenodo.3946761. URL: https://doi.org/10.5281/zenodo.3946761

  2. [2] Wes McKinney. Python for Data Analysis, 2nd ed. O’Reilly Media, 2017.