The Pandas module for Python [2] came out in 2008, five years before the MathWorks added tables to MATLAB in the R2013b release. Since its release, Pandas has become a core tool for data scientists, financial analysts, statisticians, and machine learning specialists. A comprehensive discussion of Pandas is far beyond the scope of this book; instead, here we show the functional similarities and notational differences between Pandas dataframes and MATLAB tables.
pets_ns.csv type,MF,weight,birthday dog,M,9.82,2023-11-26 cat,F,4.31,2023-11-18 bunny,F,8.40,2023-11-23 bunny,M,4.65,2023-11-24 dog,F,9.82,2023-11-25 cat,M,5.57,2023-11-24 bunny,M,1.12,2023-11-17 cat,M,2.96,2023-11-21 cat,M,5.08,2023-11-21 bunny,F,1.53,2023-11-15 | pets.csv type , MF, weight, birthday dog , M, 9.82 , 2023-11-26 cat , F, 4.31 , 2023-11-18 bunny , F, 8.40 , 2023-11-23 bunny , M, 4.65 , 2023-11-24 dog , F, 9.82 , 2023-11-25 cat , M, 5.57 , 2023-11-24 bunny , M, 1.12 , 2023-11-17 cat , M, 2.96 , 2023-11-21 cat , M, 5.08 , 2023-11-21 bunny , F, 1.53 , 2023-11-15 |
Pandas ignores leading and trailing whitespace for row data but preserves whitespace for the column titles.
MATLAB preserves whitespace throughout.
Pandas does not automatically recognize date fields and so populates the fourth column with string literals.
MATLAB recognizes dates and times in a variety of formats and correctly sets the type of the fourth column to datetime.
Additional data manipulation and cleaning is often necessary after loading a file. An aphorism in data science is that data cleaning typically takes four times as much effort as the actual data analysis. Frequently performed data cleaning tasks are covered in Section 13.3.
13.1 Loading Tables from Files
MATLAB: | Python: |
---|---|
>> tb = readtable('pets.csv'); >> tb tb = 10×6 table Var1 Var2 Var3 Var4 Var5 Var6 _______ _____ ____ ____ ____ _____ {'dog' }{','}{'M,'}9.82{','}2023-11-26 {'cat' }{','}{'F,'}4.31{','}2023-11-18 {'bunny'}{','}{'F,'} 8.4{','}2023-11-23 {'bunny'}{','}{'M,'}4.65{','}2023-11-24 {'dog' }{','}{'F,'}9.82{','}2023-11-25 {'cat' }{','}{'M,'}5.57{','}2023-11-24 {'bunny'}{','}{'M,'}1.12{','}2023-11-17 {'cat' }{','}{'M,'}2.96{','}2023-11-21 {'cat' }{','}{'M,'}5.08{','}2023-11-21 {'bunny'}{','}{'F,'}1.53{','}2023-11-15 | In : import pandas as pd In : df = pd.read_csv('pets.csv') In : df Out: type MF weight birthday 0 dog M 9.82 2023-11-26 1 cat F 4.31 2023-11-18 2 bunny F 8.40 2023-11-23 3 bunny M 4.65 2023-11-24 4 dog F 9.82 2023-11-25 5 cat M 5.57 2023-11-24 6 bunny M 1.12 2023-11-17 7 cat M 2.96 2023-11-21 8 cat M 5.08 2023-11-21 9 bunny F 1.53 2023-11-15 |
MATLAB: | Python: |
---|---|
>> tb = readtable('pets_ns.csv'); >> tb tb = 10×4 table type MF weight birthday _________ _____ ______ __________ {'dog' } {'M'} 9.82 2023-11-26 {'cat' } {'F'} 4.31 2023-11-18 {'bunny'} {'F'} 8.4 2023-11-23 {'bunny'} {'M'} 4.65 2023-11-24 {'dog' } {'F'} 9.82 2023-11-25 {'cat' } {'M'} 5.57 2023-11-24 {'bunny'} {'M'} 1.12 2023-11-17 {'cat' } {'M'} 2.96 2023-11-21 {'cat' } {'M'} 5.08 2023-11-21 {'bunny'} {'F'} 1.53 2023-11-15 | In : import pandas as pd In : import numpy as np In : df = pd.read_csv('pets_ns.csv') In : df Out: type MF weight birthday 0 dog M 9.82 2023-11-26 1 cat F 4.31 2023-11-18 2 bunny F 8.40 2023-11-23 3 bunny M 4.65 2023-11-24 4 dog F 9.82 2023-11-25 5 cat M 5.57 2023-11-24 6 bunny M 1.12 2023-11-17 7 cat M 2.96 2023-11-21 8 cat M 5.08 2023-11-21 9 bunny F 1.53 2023-11-15 |
MATLAB: | Python: |
---|---|
>> tb{1,'birthday'} datetime 2023-11-26 | In : df.iloc[0]['birthday'] Out: '2023-11-26' In : type(df.iloc[0]['birthday']) Out: str |
MATLAB’s birthday column is recognized as a datetime, but Python’s is just a string—not ideal. The Pandas read_csv() function has 48 (!) optional keyword arguments, two of which govern date handling: parse_dates takes a list of column indices to interpret as dates, and infer_datetime_format=True enables the automatic date parser. This invocation of read_csv() will give us a time-aware type for birthday:
For now, we’ll load the dataframe df using only default arguments to read_csv() and keep the birthday entries as strings. We show how to convert the strings to time-aware objects in the data cleaning section, Section 13.3.4.
13.2 Table Summaries
Most real-world tables have far more rows than can fit on a page. MATLAB and Pandas have summary functions that give rapid glimpses into the table contents.
13.2.1 Table Size, Column Names, Column Types
MATLAB: | Python: |
---|---|
>> height(tb) 10 >> width(tb) 4 >> size(tb) 10 4 | In : len(df) Out: 10 In : df.shape Out: (10, 4) |
MATLAB: | Python: |
---|---|
>> tb.Properties.VariableNames 1×4 cell array {'type'} {'MF'} {'weight'} {'birthday'} | In : df.columns Out: Index(['type', 'MF', 'weight', 'birthday'], dtype="object") |
MATLAB: | Python: |
---|---|
cols=tb.Properties.VariableNames; for C = cols fprintf('%-8s %s ', ... C{1}, class(tb.(C{1}))) end type cell MF cell weight double birthday datetime | In : df.dtypes Out: type object MF object weight float64 birthday object dtype: object In : for C in df.columns: ...: tp = type(df.iloc[0][C]) ...: print(f'{C:<8s} {tp}') type <class 'str'> MF <class 'str'> weight <class 'numpy.float64'> birthday <class 'str'> |
13.2.2 summary() and .info()/.describe()
MATLAB: | Python: |
---|---|
>> summary(tb) Variables: type: 10×1 cell array of character vectors MF: 10×1 cell array of character vectors weight: 10×1 double Values: Min 1.12 Median 4.865 Max 9.82 birthday: 10×1 datetime Values: Min 2023-11-15 Median 2023-11-22 Max 2023-11-26 | In : df.info() <class 'pandas.core.frame.DataFrame'> RangeIndex: 10 entries, 0 to 9 Data columns (total 4 columns): # Column Non-Null Count Dtype - ------ -------------- ----- 0 type 10 non-null object 1 MF 10 non-null object 2 weight 10 non-null float64 3 birthday 10 non-null object dtypes: float64(1), object(3) memory usage: 448.0+ bytes |
Pandas dataframes also have a .describe() method that gives value distributions for columns holding numeric values:
The MATLAB summary information is more comprehensive as it gives extremal values for dates as well as numeric columns.
13.2.3 groupsummary() and .value_counts()
MATLAB: | Python: |
---|---|
>> groupsummary(tb,'type') type GroupCount _________ __________ {'bunny'} 4 {'cat' } 4 {'dog' } 2 | In : df['type'].value_counts() Out: cat 4 bunny 4 dog 2 |
13.2.4 head() and tail()
MATLAB: | Python: |
---|---|
>> head(tb,3) type MF weight birthday _________ _____ ______ __________ {'dog' } {'M'} 9.82 2023-11-26 {'cat' } {'F'} 4.31 2023-11-18 {'bunny'} {'F'} 8.4 2023-11-23 >> tail(tb,2) type MF weight birthday _________ _____ ______ __________ {'cat' } {'M'} 5.08 2023-11-21 {'bunny'} {'F'} 1.53 2023-11-15 | In : df.head(3) Out: type MF weight birthday 0 dog M 9.82 2023-11-26 1 cat F 4.31 2023-11-18 2 bunny F 8.40 2023-11-23 In : df.tail(2) Out: type MF weight birthday 8 cat M 5.08 2023-11-21 9 bunny F 1.53 2023-11-15 |
13.3 Cleaning Data
Cleaning data refers to operations like renaming, adding, and deleting columns; editing strings to remove spurious whitespace and normalize spelling, hyphenation, and case; changing data types; and deleting duplicate rows or those with missing values.
13.3.1 Renaming Columns
MATLAB: | Python: |
---|---|
>> head(tb,1) type MF weight birthday _______ _____ ______ __________ {'dog'} {'M'} 9.82 2023-11-26 >> tb.Properties.VariableNames(... [1 3]) = {'species' 'kg'}; >> head(tb,1) species MF kg birthday _______ _____ ____ __________ {'dog'} {'M'} 9.82 2023-11-26 | In : df.head(1) Out: type MF weight birthday 0 dog M 9.82 2023-11-26 In : df.rename(columns={ 'type':'species', 'weight':'kg'}).head(1) Out: species MF kg birthday 0 dog M 9.82 2023-11-26 |
Modifying tb.Properties.VariableNames causes MATLAB to change the table in-place. This differs from the .rename() method in Pandas which returns a new dataframe unless the option inplace=True is also given.
13.3.2 Changing Column Data Types
Operations involving multiple columns can typically only be performed if the columns have consistent data types. For example, if you have home prices for one year stored as strings in one column and floats in another year, you won’t be able to compute differences between the years until the strings are converted to floats or integers.
Joins between multiple tables, covered in Section 13.11, also require matching data types in addition to common column names.
MATLAB: | Python: |
---|---|
>> head(tb,2) type MF weight birthday _________ ___ ______ __________ {'dog' } {M} 9.82 2023-11-26 {'cat' } {F} 4.31 2023-11-18 >> tb.MF = string(tb.MF); >> head(tb,2) type MF weight birthday _________ ___ ______ __________ {'dog' } "M" 9.82 2023-11-26 {'cat' } "F" 4.31 2023-11-18 | In : df.head(2) Out: type MF weight birthday 0 dog M 9.82 2023-11-26 1 cat F 4.31 2023-11-18 In : df.MF = df.MF.astype(bytes) In : df.head(2) Out: type MF weight birthday 0 dog b'M' 9.82 2023-11-26 1 cat b'F' 4.31 2023-11-18 |
13.3.3 Changing Column Data
Dataframe columns have an .apply() method which invokes a user-supplied function on each value in the column. This example defines a function caps() which makes the first letter of a string uppercase, then applies this function to the type column:
The type column is not modified unless it is reassigned to the output of the .apply() with
13.3.4 Making Timestamps from Strings
String-to-time conversions are handled directly with the Pandas to_datetime() function , provided the string follows conventional time formats.
13.4 Creating Tables Programmatically
Tables and dataframes need not come solely from files; they can also be created programmatically. MATLAB has a collection of X2table() functions—array2table(), cell2table(), struct2table()—that perform the expected conversions. The Pandas DataFrame() function accepts lists and dictionaries and returns a dataframe.
MATLAB: | Python: |
---|---|
>> x.year = [1867 1776 1810]'; >> x.n_prov = [9 50 31]'; >> x.country = [{'Canada'},... {'USA'}, {'Mexico'}]'; >> tb = struct2table(x) 3×3 table year n_prov country ____ ______ __________ 1867 9 {'Canada'} 1776 50 {'USA' } 1810 31 {'Mexico'} | In : x = { ...: 'year' : [1867,1776,1810], ...: 'nprov' : [9,50,31], ...: 'country' : ['Canada', ...: 'USA', 'Mexico'] } In : df = pd.DataFrame(x) In : df Out: year nprov country 0 1867 9 Canada 1 1776 50 USA 2 1810 31 Mexico |
13.5 Sorting Rows
Head and tail results of a sorted table reveal extremal values. Here, we find the lightest and heaviest animals by sorting on weight—and where weights are equal, we do a secondary sort on birthday.
MATLAB: | Python: |
---|---|
>> head(sortrows(tb,{'weight',... 'birthday'}),2) type MF weight birthday _________ _____ ______ __________ {'bunny'} {'M'} 1.12 2023-11-17 {'bunny'} {'F'} 1.53 2023-11-15 >> tail(sortrows(tb,{'weight',... 'birthday'}),2) type MF weight birthday _______ _____ ______ __________ {'dog'} {'F'} 9.82 2023-11-25 {'dog'} {'M'} 9.82 2023-11-26 | In : df.sort_values(by=['weight', ...: 'birthday']).head(2) Out: type MF weight birthday 6 bunny M 1.12 2023-11-17 9 bunny F 1.53 2023-11-15 In : df.sort_values(by=['weight', ...: 'birthday']).tail(2) Out: type MF weight birthday 4 dog F 9.82 2023-11-25 0 dog M 9.82 2023-11-26 |
13.6 Table Subsets
Portions of tables can be referenced through row and column slices using indexing techniques similar to those for MATLAB matrices and NumPy ndarrays. As with numeric arrays (Section 11.1.15), table indexing in MATLAB is a bit more convenient than in Python. MATLAB accepts any integer array as a table index, but Pandas needs special attributes .loc[] for label and logical indexing and .iloc[] for arbitrary numeric indexing.
The biggest differences between .loc[] and .iloc[] are that .loc[] permits Boolean expressions to filter on rows (in contrast, .iloc[] only accepts numeric indices), and .iloc[] enables one to select row and column subsets (.loc[] only works on rows).
13.6.1 All Rows, Selected Columns
MATLAB: | Python: |
---|---|
>> tb(:,{'weight','type'}) weight type ______ _________ 9.82 {'dog' } 4.31 {'cat' } 8.4 {'bunny'} 4.65 {'bunny'} 9.82 {'dog' } 5.57 {'cat' } 1.12 {'bunny'} 2.96 {'cat' } 5.08 {'cat' } 1.53 {'bunny'} | In : df[['weight','type']] Out: weight type 0 9.82 dog 1 4.31 cat 2 8.40 bunny 3 4.65 bunny 4 9.82 dog 5 5.57 cat 6 1.12 bunny 7 2.96 cat 8 5.08 cat 9 1.53 bunny |
13.6.2 All Columns, Selected Rows
MATLAB: | Python: |
---|---|
>> tb([2,3],:) type MF weight birthday _________ _____ ______ __________ {'cat' } {'F'} 4.31 2023-11-18 {'bunny'} {'F'} 8.4 2023-11-23 | In : df.loc[[1,2]] Out: type MF weight birthday 1 cat F 4.31 2023-11-18 2 bunny F 8.40 2023-11-23 |
13.6.3 Selected Rows, Selected Columns
MATLAB: | Python: |
---|---|
>> tb(1:3:end,{'weight','type'}) weight type ______ _________ 9.82 {'dog' } 4.65 {'bunny'} 1.12 {'bunny'} 1.53 {'bunny'} | In : df[::3][['weight', 'type']] Out: weight type 0 9.82 dog 3 4.65 bunny 6 1.12 bunny 9 1.53 bunny |
MATLAB: | Python: |
---|---|
> tb([1,6,2],{'weight','type'}) weight type ______ _______ 9.82 {'dog'} 5.57 {'cat'} 4.31 {'cat'} | In : df[['weight','type']].iloc[[0,5,1]] Out: weight type 0 9.82 dog 5 5.57 cat 1 4.31 cat |
Python: | Python: |
---|---|
In : df.iloc[[0,5,1]][ ['weight','type']] Out: weight type 0 9.82 dog 5 5.57 cat 1 4.31 cat | In : df[['weight','type'] ].iloc[[0,5,1]] Out: weight type 0 9.82 dog 5 5.57 cat 1 4.31 cat |
This is possible because each indexing component returns a new table which can subsequently be indexed independently. df.iloc[[0,5,1]] returns the first, sixth, and second rows, and df.iloc[[0,5,1]][['weight','type']] returns the weight and size columns from the three specified rows. Similarly, df[['weight','type']] is a table containing all rows of weight and type columns, and df[['weight','type']].iloc[[0,5,1]] returns the first, sixth, and second rows from those two columns.
MATLAB tables are indexed via {columns} or {rows, columns}.
13.6.4 Filter Rows by Conditional Operations
MATLAB: | Python: |
---|---|
>> tb((tb.weight > 4) & (... cell2mat(tb.MF) == 'M'),:) 4×4 table type MF weight birthday _________ _____ ______ __________ {'dog' } {'M'} 9.82 2023-11-26 {'bunny'} {'M'} 4.65 2023-11-24 {'cat' } {'M'} 5.57 2023-11-24 {'cat' } {'M'} 5.08 2023-11-21 | In : df[(df.weight > 4) & ...: (df.MF == 'M')] Out: type MF weight birthday 0 dog M 9.82 2023-11-26 3 bunny M 4.65 2023-11-24 5 cat M 5.57 2023-11-24 8 cat M 5.08 2023-11-21 |
Note that in MATLAB, string values (as in column MF) must be converted to a matrix to allow a string equivalence test.
13.7 Iterating over Rows
MATLAB: | Python: |
---|---|
>> for i = 1:height(tb) fprintf("%d %s %s ",i,string(... tb{i,1}),string(tb{i,'MF'})) end 1 dog M 2 cat F 3 bunny F 4 bunny M 5 dog F 6 cat M 7 bunny M 8 cat M 9 cat M 10 bunny F | In : for i,row in df.iterrows(): ...: print(i,row[0],row['MF']) 0 dog M 1 cat F 2 bunny F 3 bunny M 4 dog F 5 cat M 6 bunny M 7 cat M 8 cat M 9 bunny F |
13.8 Pivot Tables
Pivot tables allow one to aggregate data sharing common values. They help answer questions such as “what were sales by region?” or “how many points did home teams score compared to visiting teams?” Pandas has explicit functions for these—pd.pivot_table() and pd.groupby()—while MATLAB relies on different techniques depending on the complexity of the desired aggregation.
13.8.1 Single-Level Aggregation
MATLAB: | Python: |
---|---|
>> varfun(@length,tb,... "GroupingVariables","type",... "InputVariables","type") type GroupCount length_type _________ __________ ___________ {'bunny'} 4 4 {'cat' } 4 4 {'dog' } 2 2 | In : pd.pivot_table(df, index=['type'], aggfunc="count") Out: MF birthday weight type bunny 4 4 4 cat 4 4 4 dog 2 2 2 |
The aggregation function is provided in MATLAB as a function handle and in Pandas with the optional keyword argument aggfunc. Pandas uses mean() if this argument is not supplied.
MATLAB: | Python: |
---|---|
>> varfun(@sum,tb,... "GroupingVariables","MF",... "InputVariables","weight") MF GroupCount sum_weight _____ __________ __________ {'F'} 4 24.06 {'M'} 6 29.2 | In : df.pivot_table(index=['MF'], ...: values=['weight'],aggfunc='sum') Out: weight MF F 24.06 M 29.20 |
The Pandas .groupby() function returns the same result, but only because weight is the only numeric column in df. Had there been others, sums would be returned for those too.
13.8.2 Multilevel Aggregation
MATLAB: | Python: |
---|---|
>> [Idx, MF, Type] = ... findgroups(tb.MF, tb.type); >> sum_weight = splitapply(... @sum, tb.weight, Idx); >> table(MF, Type, sum_weight) 6×3 table MF Type sum_weight _____ _________ __________ {'F'} {'bunny'} 9.93 {'F'} {'cat' } 4.31 {'F'} {'dog' } 9.82 {'M'} {'bunny'} 5.77 {'M'} {'cat' } 13.61 {'M'} {'dog' } 9.82 | In : pd.pivot_table(df, index=['MF','type'], values=['weight'], aggfunc='sum') Out: weight MF type F bunny 9.93 cat 4.31 dog 9.82 M bunny 5.77 cat 13.61 dog 9.82 |
13.9 Adding Columns
New columns derived from data in existing columns can be helpful in subsequent aggregation and join operations. They can be added simply by indexing the table or dataframe with a new column name in an assignment statement. For example, tb.New = 4 and df['New' = 4] create column “New” with a value of 4 for all rows in table tb and dataframe df. In the following, we’ll add a new column, wd, containing the integer weekday of each animal’s birthday. The Pandas birthday column needs to be a timestamp rather than string (Section 13.3.4).
MATLAB: | Python: |
---|---|
>> tb.wd = weekday(tb{:,'birthday'}) type MF weight birthday wd _________ ____ ______ __________ __ {'dog' } {'M'} 9.82 2023-11-26 1 {'cat' } {'F'} 4.31 2023-11-18 7 {'bunny'} {'F'} 8.4 2023-11-23 5 {'bunny'} {'M'} 4.65 2023-11-24 6 {'dog' } {'F'} 9.82 2023-11-25 7 {'cat' } {'M'} 5.57 2023-11-24 6 {'bunny'} {'M'} 1.12 2023-11-17 6 {'cat' } {'M'} 2.96 2023-11-21 3 {'cat' } {'M'} 5.08 2023-11-21 3 {'bunny'} {'F'} 1.53 2023-11-15 4 | In : i = df['birthday'].dt.weekday In : df['wd'] = ((i+1) % 7) + 1 In : df type MF weight birthday wd 0 dog M 9.82 2023-11-26 1 1 cat F 4.31 2023-11-18 7 2 bunny F 8.40 2023-11-23 5 3 bunny M 4.65 2023-11-24 6 4 dog F 9.82 2023-11-25 7 5 cat M 5.57 2023-11-24 6 6 bunny M 1.12 2023-11-17 6 7 cat M 2.96 2023-11-21 3 8 cat M 5.08 2023-11-21 3 9 bunny F 1.53 2023-11-15 4 |
MATLAB: | Python: |
---|---|
>> tail(sortrows(varfun(@length,... tb,"GroupingVariables","wd",... "InputVariables","MF"),... "GroupCount"),1) wd GroupCount length_MF __ __________ _________ 6 3 3 | In : pd.pivot_table(df,index=['wd'], values=['MF'],aggfunc='count' ).sort_values(by=['MF']).tail(1) Out: MF wd 6 3 |
We find that three of the animals were born on wd = 6, that is, Friday.
13.10 Deleting Columns
MATLAB: | Python: |
---|---|
>> head(tb,2) type MF weight birthday wd _______ _____ ______ __________ __ {'dog'} {'M'} 9.82 2023-11-26 1 {'cat'} {'F'} 4.31 2023-11-18 7 >> tb.wd = []; % remove wd column >> head(tb,2) type MF weight birthday _______ _____ ______ __________ {'dog'} {'M'} 9.82 2023-11-26 {'cat'} {'F'} 4.31 2023-11-18 | In : df.head(2) Out: type MF weight birthday wd 0 dog M 9.82 2023-11-26 1 1 cat F 4.31 2023-11-18 7 In : del df['wd'] # remove wd column In : df.head(2) Out: type MF weight birthday 0 dog M 9.82 2023-11-26 1 cat F 4.31 2023-11-18 |
MATLAB: | Python: |
---|---|
>> head(tb,2) type MF weight birthday _______ _____ ______ __________ {'dog'} {'M'} 9.82 2023-11-26 {'cat'} {'F'} 4.31 2023-11-18 >> head(removevars(tb,... {'MF','weight'}),2) type birthday _______ __________ {'dog'} 2023-11-26 {'cat'} 2023-11-18 | In : df.head(2) Out: type MF weight birthday 0 dog M 9.82 2023-11-26 1 cat F 4.31 2023-11-18 In df.drop(columns=['MF','weight']).head(2) Out: type birthday 0 dog 2023-11-26 1 cat 2023-11-18 |
In both cases, a new table is returned; the original tb and df remain unchanged.
13.11 Joins Across Tables
Complex datasets may span multiple tables. MATLAB and Pandas support join operations that yield results from data related by indices common to the tables. Joins come in four flavors—inner, full, left, right—that are the Boolean equivalents of intersection, union, and subtraction operations between two sets. Inner joins are seen frequently because they return results only where both tables have common indices.
MATLAB: | Python: |
---|---|
>> food.type = {'dog','cat',... 'bunny'}'; >> food.price = [1.2 3.5 0.7]'; >> cost_tb = struct2table(food) 3×2 table type price _________ _____ {'dog' } 1.2 {'cat' } 3.5 {'bunny'} 0.7 | In : food = { ...: 'type' : ['dog','cat','bunny'], ...: 'price' : [ 1.2, 3.5, 0.7]} In : cost_df = pd.DataFrame(food) In : cost_df Out: type price 0 dog 1.2 1 cat 3.5 2 bunny 0.7 |
MATLAB: | Python: |
---|---|
>> J = innerjoin(tb,cost_tb, 'Keys',... {'type'}) type MF weight birthday price ________ _____ ______ _________ ______ {'bunny'} {'F'} 8.4 2023-11-23 0.7 {'bunny'} {'M'} 4.65 2023-11-24 0.7 {'bunny'} {'M'} 1.12 2023-11-17 0.7 {'bunny'} {'F'} 1.53 2023-11-15 0.7 {'cat' } {'F'} 4.31 2023-11-18 3.5 {'cat' } {'M'} 5.57 2023-11-24 3.5 {'cat' } {'M'} 2.96 2023-11-21 3.5 {'cat' } {'M'} 5.08 2023-11-21 3.5 {'dog' } {'M'} 9.82 2023-11-26 1.2 {'dog' } {'F'} 9.82 2023-11-25 1.2 | In : J = pd.merge(df,cost_df,on='type') In : J Out: type MF weight birthday price 0 dog M 9.82 2023-11-26 1.2 1 dog F 9.82 2023-11-25 1.2 2 cat F 4.31 2023-11-18 3.5 3 cat M 5.57 2023-11-24 3.5 4 cat M 2.96 2023-11-21 3.5 5 cat M 5.08 2023-11-21 3.5 6 bunny F 8.40 2023-11-23 0.7 7 bunny M 4.65 2023-11-24 0.7 8 bunny M 1.12 2023-11-17 0.7 9 bunny F 1.53 2023-11-15 0.7 |
MATLAB: | Python: |
---|---|
>> J.weight .* J.price ans= 5.8800 3.2550 0.7840 1.0710 15.0850 19.4950 10.3600 17.7800 11.7840 11.7840 >> sum(J.weight .* J.price) ans= 97.2780 | In : J.weight*J.price Out: 0 11.784 1 11.784 2 15.085 3 19.495 4 10.360 5 17.780 6 5.880 7 3.255 8 0.784 9 1.071 dtype: float64 In : sum(J.weight*J.price) Out: 97.27799999999999 |
MATLAB: | Python: |
---|---|
>> J.wxp = J.weight .* J.price; varfun(@sum,J,"GroupingVariables",... "type","InputVariables","wxp") 3×3 table type GroupCt sum_wxp _________ _______ __________ {'bunny'} 4.e+00 1.0990e+01 {'cat' } 4.e+00 6.2720e+01 {'dog' } 2.e+00 2.3568e+01 | In : J['wxp'] = J.weight*J.price In : pd.pivot_table(J,index=['type'], ...: values=['wxp'],aggfunc=sum) Out: wxp type bunny 10.990 cat 62.720 dog 23.568 |
13.12 GeoPandas
GeoPandas [1] adds spatial data types and operations to Pandas. It allows one to query and aggregate information from Pandas dataframes spatially (“how do home prices vary by postal code and time?”), perform geographic set operations (“over which countries did the airplane fly?”), and create maps with data overlays. Here, we’ll explore that last capability.
GeoPandas needs geographic data in the form of GeoJSON files or shapefiles. In Section 12.4.3.2, we used shapefiles downloaded from the Natural Earth project. The US government’s data.gov website is another source of shapefiles; among others, it contains a 500 MB compressed file with high-resolution outlines of every zip code in the United States.1
My goal is to make a map of Los Angeles County with zip codes colored by home prices. Zillow Research2 generously makes available real estate price data grouped by zip code. From their portal, I downloaded the “ZVHI All Homes (SFR/Condo/Coop)” dataset, a 70 MB CSV file.
How did I know to change the data type of zipcode_geom['ZipCode'] to integer? Initially, I didn’t. I attempted the join with only renamed columns and then got a Pandas error saying the join failed because of a type mismatch, “ValueError: You are trying to merge an object and int64 columns.”
Not bad for just a dozen lines of code. There’s room for improvement, though. The following program adds a title, axis labels, and identifiers for every zip code and zooms in to the South Bay region:
13.13 Recipe 13-1: Maps with GeoPandas
Filter rows based on Boolean operations
Extract selected columns
Rename columns
Change column data type
Scale column values
Iterate over rows
The last item, row iteration, is surprisingly challenging. An obvious solution is to create a Python generator that returns a new row each time it is called, but MATLAB doesn’t seem to work with Python generators. Instead, I implemented iterrows() in the bridge module as a closure.3
Calls to bridge functions in the following MATLAB program are preceded by commented Python code which implement the Pandas operation. Additionally, the sequence of commands differs a bit from the pure Python program of Section 13.12 to reduce the memory footprint. Rows of the two dataframes are filtered immediately after they are read.
Unfortunately, as of November 2021, attempting to use geopandas through MATLAB on Windows is problematic due to DLL load issues. Visit this book’s GitHub repository4 for updates.
13.14 References
- [1]
Kelsey Jordahl et al. geopandas/geopandas: v0.8.1. Version v0.8.1. July 2020. DOI: 10.5281/zenodo. 3946761. URL: https://doi.org/10.5281/zenodo.3946761
- [2]
Wes McKinney.Python for Data Analysis, 2nd ed. O’Reilly Media, 2017.