© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
A. Danial, Python for MATLAB Development, https://doi.org/10.1007/978-1-4842-7223-7_13

13. Tables and Dataframes

Albert Danial1  
(1)
Redondo Beach, CA, USA
 

The Pandas module for Python [2] came out in 2008, five years before the MathWorks added tables to MATLAB in the R2013b release. Since its release, Pandas has become a core tool for data scientists, financial analysts, statisticians, and machine learning specialists. A comprehensive discussion of Pandas is far beyond the scope of this book; instead, here we show the functional similarities and notational differences between Pandas dataframes and MATLAB tables.

Comma-separated value files are a frequent source of inputs for tables and dataframes, so we begin with a pair of CSV files, pets.csv and pets_ns.csv (“ns” for “no spaces”), containing information on a few animals. The files have 11 lines—1 line of column names followed by 10 lines of data—and differ only in whitespace:

pets_ns.csv

type,MF,weight,birthday

dog,M,9.82,2023-11-26

cat,F,4.31,2023-11-18

bunny,F,8.40,2023-11-23

bunny,M,4.65,2023-11-24

dog,F,9.82,2023-11-25

cat,M,5.57,2023-11-24

bunny,M,1.12,2023-11-17

cat,M,2.96,2023-11-21

cat,M,5.08,2023-11-21

bunny,F,1.53,2023-11-15

pets.csv

type  , MF, weight, birthday

dog   ,  M,  9.82 , 2023-11-26

cat   ,  F,  4.31 , 2023-11-18

bunny ,  F,  8.40 , 2023-11-23

bunny ,  M,  4.65 , 2023-11-24

dog   ,  F,  9.82 , 2023-11-25

cat   ,  M,  5.57 , 2023-11-24

bunny ,  M,  1.12 , 2023-11-17

cat   ,  M,  2.96 , 2023-11-21

cat   ,  M,  5.08 , 2023-11-21

bunny ,  F,  1.53 , 2023-11-15

Although the two files contain identical information, the file readers’ default behaviors differ:
  • Pandas ignores leading and trailing whitespace for row data but preserves whitespace for the column titles.

  • MATLAB preserves whitespace throughout.

  • Pandas does not automatically recognize date fields and so populates the fourth column with string literals.

  • MATLAB recognizes dates and times in a variety of formats and correctly sets the type of the fourth column to datetime.
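Since Pandas preserves whitespace in the column titles of pets.csv, a post-load cleanup is one Python-side remedy. The following is a minimal sketch (using an inline stand-in for the first lines of pets.csv) that strips the column names and every string column, and defensively re-parses the weights as numbers:

```python
import io
import pandas as pd

# Inline stand-in for the first lines of pets.csv, with the same padding.
csv_text = ("type  , MF, weight, birthday\n"
            "dog   ,  M,  9.82 , 2023-11-26\n"
            "cat   ,  F,  4.31 , 2023-11-18\n")
df = pd.read_csv(io.StringIO(csv_text))

# Strip whitespace from the column names and from every string column.
df.columns = df.columns.str.strip()
for col in df.columns:
    if df[col].dtype == object:
        df[col] = df[col].str.strip()

# Guarantee a numeric weight column regardless of how it was parsed.
df['weight'] = pd.to_numeric(df['weight'])
```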

Additional data manipulation and cleaning are often necessary after loading a file. An aphorism in data science holds that data cleaning typically takes four times as much effort as the actual data analysis. Frequently performed data cleaning tasks are covered in Section 13.3.

13.1 Loading Tables from Files

MATLAB and Pandas can load tables and dataframes from CSV, Excel, and text files. Pandas can additionally read HDF5, HTML, JSON, and Python pickle files and pull data from websites or database connections through SQLAlchemy. These commands show how to create a MATLAB table and Pandas dataframe from our pets.csv file :

MATLAB:

Python:

>> tb = readtable('pets.csv');

>> tb

tb =

  10×6 table

  Var1   Var2  Var3 Var4 Var5   Var6

_______ _____  ____ ____ ____  _____

{'dog'  }{','}{'M,'}9.82{','}2023-11-26

{'cat'  }{','}{'F,'}4.31{','}2023-11-18

{'bunny'}{','}{'F,'} 8.4{','}2023-11-23

{'bunny'}{','}{'M,'}4.65{','}2023-11-24

{'dog'  }{','}{'F,'}9.82{','}2023-11-25

{'cat'  }{','}{'M,'}5.57{','}2023-11-24

{'bunny'}{','}{'M,'}1.12{','}2023-11-17

{'cat'  }{','}{'M,'}2.96{','}2023-11-21

{'cat'  }{','}{'M,'}5.08{','}2023-11-21

{'bunny'}{','}{'F,'}1.53{','}2023-11-15

In : import pandas as pd

In : df = pd.read_csv('pets.csv')

In : df

Out:

   type   MF weight   birthday

0  dog     M   9.82 2023-11-26

1  cat     F   4.31 2023-11-18

2  bunny   F   8.40 2023-11-23

3  bunny   M   4.65 2023-11-24

4  dog     F   9.82 2023-11-25

5  cat     M   5.57 2023-11-24

6  bunny   M   1.12 2023-11-17

7  cat     M   2.96 2023-11-21

8  cat     M   5.08 2023-11-21

9  bunny   F   1.53 2023-11-15

MATLAB’s table is unexpected; instead of four columns, it returned six and also failed to recognize column names. There are two alternatives: either work with a file like pets_ns.csv whose entries have been trimmed of whitespace or use explicit column-specific format options via detectImportOptions followed by setvaropts. Here, we’ll use the CSV file trimmed of whitespace:

MATLAB:

Python:

>> tb = readtable('pets_ns.csv');

>> tb

tb =

  10×4 table

  type      MF  weight  birthday

 _________ _____ ______ __________

 {'dog'  } {'M'}  9.82  2023-11-26

 {'cat'  } {'F'}  4.31  2023-11-18

 {'bunny'} {'F'}   8.4  2023-11-23

 {'bunny'} {'M'}  4.65  2023-11-24

 {'dog'  } {'F'}  9.82  2023-11-25

 {'cat'  } {'M'}  5.57  2023-11-24

 {'bunny'} {'M'}  1.12  2023-11-17

 {'cat'  } {'M'}  2.96  2023-11-21

 {'cat'  } {'M'}  5.08  2023-11-21

 {'bunny'} {'F'}  1.53  2023-11-15

In : import pandas as pd

In : import numpy as np

In : df = pd.read_csv('pets_ns.csv')

In : df

Out:

   type   MF weight   birthday

0  dog     M   9.82 2023-11-26

1  cat     F   4.31 2023-11-18

2  bunny   F   8.40 2023-11-23

3  bunny   M   4.65 2023-11-24

4  dog     F   9.82 2023-11-25

5  cat     M   5.57 2023-11-24

6  bunny   M   1.12 2023-11-17

7  cat     M   2.96 2023-11-21

8  cat     M   5.08 2023-11-21

9  bunny   F   1.53 2023-11-15

String and floating-point types are obvious, but the birthday column type is not. Let’s investigate by checking the data type of the first row in birthday (table and dataframe indexing is covered in Section 13.6.3). The .iloc[] dataframe accessor, covered in more detail in Section 13.6, lets us select one or more rows by their (zero-based) indices.

MATLAB:

Python:

>> tb{1,'birthday'}

  datetime

   2023-11-26

In : df.iloc[0]['birthday']

Out: '2023-11-26'

In : type(df.iloc[0]['birthday'])

Out: str

MATLAB’s birthday column is recognized as a datetime, but Python’s is just a string—not ideal. The Pandas read_csv() function has 48 (!) optional keyword arguments, two of which govern date handling: parse_dates takes a list of column indices to interpret as dates, and infer_datetime_format=True enables the automatic date parser. This invocation of read_csv() will give us a time-aware type for birthday:

Python:
In : df = pd.read_csv('pets_ns.csv',parse_dates=[3],
...:                  infer_datetime_format=True)
In : type(df['birthday'].loc[0])
Out: pandas._libs.tslibs.timestamps.Timestamp

For now, we’ll load the dataframe df using only default arguments to read_csv() and keep the birthday entries as strings. We show how to convert the strings to time-aware objects in the data cleaning section, Section 13.3.4.

13.2 Table Summaries

Most real-world tables have far more rows than can fit on a page. MATLAB and Pandas have summary functions that give rapid glimpses into the table contents.

13.2.1 Table Size, Column Names, Column Types

Table dimensions can be found the same way as with numeric arrays:

MATLAB:

Python:

>> height(tb)

    10

>> width(tb)

     4

>> size(tb)

    10      4

In : len(df)

Out: 10

In : df.shape

Out: (10, 4)

Column names are stored in a cell array in MATLAB and an index array in Pandas:

MATLAB:

Python:

>> tb.Properties.VariableNames

  1×4 cell array

 {'type'} {'MF'} {'weight'}

     {'birthday'}

In : df.columns

Out: Index(['type', 'MF', 'weight',

            'birthday'], dtype="object")

Column type information is a bit more obscure; the clearest method is to iterate over the columns and explicitly print each type. MATLAB’s column types are revealed by invoking class() on each column. Pandas dataframes have a .dtypes attribute that holds per-column type information, but the types often come back as just object, which is not very helpful. In Pandas, too, we therefore explicitly loop over columns and report the type of the entry in the first row:

MATLAB:

Python:

cols=tb.Properties.VariableNames;

for C = cols

  fprintf('%-8s %s\n', ...

      C{1}, class(tb.(C{1})))

end

type     cell

MF       cell

weight   double

birthday datetime

In : df.dtypes

Out:

type         object

MF           object

weight      float64

birthday     object

dtype: object

In : for C in df.columns:

...:   tp = type(df.iloc[0][C])

...:   print(f'{C:<8s} {tp}')

type     <class 'str'>

MF       <class 'str'>

weight   <class 'numpy.float64'>

birthday <class 'str'>

13.2.2 summary() and .info()/.describe()

MATLAB’s summary() function and Pandas’s .info() method for dataframes give information on each column:

MATLAB:

Python:

>> summary(tb)

Variables:

  type: 10×1 cell array

       of character vectors

MF: 10×1 cell array of

character vectors

  weight: 10×1 double

      Values:

          Min         1.12

          Median     4.865

          Max         9.82

  birthday: 10×1 datetime

      Values:

          Min       2023-11-15

          Median    2023-11-22

          Max       2023-11-26

In : df.info()

<class 'pandas.core.frame.DataFrame'>

RangeIndex: 10 entries, 0 to 9

Data columns (total 4 columns):

# Column   Non-Null Count Dtype

- ------   -------------- -----

0 type     10 non-null    object

1 MF       10 non-null    object

2 weight   10 non-null    float64

3 birthday 10 non-null    object

dtypes: float64(1), object(3)

memory usage: 448.0+ bytes

Pandas dataframes also have a .describe() method that gives value distributions for columns holding numeric values:

Python:
In : df.describe()
Out:
          weight
count  10.000000
mean    5.326000
std     3.147289
min     1.120000
25%     3.297500
50%     4.865000
75%     7.692500
max     9.820000

The MATLAB summary information is more comprehensive as it gives extremal values for dates as well as numeric columns.

13.2.3 groupsummary() and .value_counts()

Another typical summary operation is obtaining unique counts of entries in desired columns. This can be done with groupsummary() in MATLAB and a column’s .value_counts() method in Pandas. Counts of each type of animal are obtained as follows:

MATLAB:

Python:

>> groupsummary(tb,'type')

   type    GroupCount

 _________ __________

 {'bunny'}     4

 {'cat'  }     4

 {'dog'  }     2

In : df['type'].value_counts()

Out:

cat      4

bunny    4

dog      2

13.2.4 head() and tail()

Head and tail functions print the desired number of rows at the beginning and end of the table:

MATLAB:

Python:

>> head(tb,3)

   type     MF   weight  birthday

 _________ _____ ______ __________

 {'dog'  } {'M'}  9.82  2023-11-26

 {'cat'  } {'F'}  4.31  2023-11-18

 {'bunny'} {'F'}   8.4  2023-11-23

>> tail(tb,2)

   type     MF   weight  birthday

 _________ _____ ______ __________

 {'cat'  } {'M'}  5.08  2023-11-21

 {'bunny'} {'F'}  1.53  2023-11-15

In : df.head(3)

Out:

   type   MF weight   birthday

0  dog     M   9.82 2023-11-26

1  cat     F   4.31 2023-11-18

2  bunny   F   8.40 2023-11-23

In : df.tail(2)

Out:

   type   MF weight   birthday

8  cat     M   5.08 2023-11-21

9  bunny   F   1.53 2023-11-15

13.3 Cleaning Data

Cleaning data refers to operations like renaming, adding, and deleting columns; editing strings to remove spurious whitespace and normalize spelling, hyphenation, and case; changing data types; and deleting duplicate rows or those with missing values.
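The subsections that follow cover renaming, type changes, and string edits; removing duplicate rows and rows with missing values is a one-liner each in Pandas. A minimal sketch on a small hypothetical frame:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one repeated row and one missing weight.
df = pd.DataFrame({'type':   ['dog', 'cat', 'cat', 'bunny'],
                   'weight': [9.82, 4.31, 4.31, np.nan]})

no_dupes = df.drop_duplicates()   # drops the repeated cat row
complete = df.dropna()            # drops rows containing NaN values
```

Both methods return new dataframes; the original df is unchanged.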

13.3.1 Renaming Columns

Data sources may not have suitable or convenient column names. Renaming columns is simple, though. A dataframe’s .rename() method takes a map of before:after names for keyword columns. In MATLAB, one simply updates the table’s Properties. This example renames type to species and weight to kg:

MATLAB:

Python:

>> head(tb,1)

   type    MF   weight  birthday

 _______ _____ ______ __________

 {'dog'} {'M'}  9.82  2023-11-26

>> tb.Properties.VariableNames(...

        [1 3]) = {'species' 'kg'};

>> head(tb,1)

 species  MF    kg   birthday

 _______ _____ ____ __________

 {'dog'} {'M'} 9.82 2023-11-26

In : df.head(1)

Out:

  type MF  weight    birthday

0  dog  M    9.82  2023-11-26

In : df.rename(columns={

        'type':'species',

        'weight':'kg'}).head(1)

Out:

  species MF    kg    birthday

0     dog  M  9.82  2023-11-26

Modifying tb.Properties.VariableNames causes MATLAB to change the table in-place. This differs from the .rename() method in Pandas which returns a new dataframe unless the option inplace=True is also given.

13.3.2 Changing Column Data Types

Operations involving multiple columns can typically only be performed if the columns have consistent data types. For example, if you have home prices for one year stored as strings in one column and floats in another year, you won’t be able to compute differences between the years until the strings are converted to floats or integers.
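The home-price scenario can be sketched in Pandas (the column names here are hypothetical):

```python
import pandas as pd

# Hypothetical home prices: 2020 values arrived as strings, 2021 as floats.
homes = pd.DataFrame({'price_2020': ['250000', '312500'],
                      'price_2021': [265000.0, 300000.0]})

# Subtraction only works once the string column is cast to a numeric type.
homes['price_2020'] = homes['price_2020'].astype(float)
homes['change'] = homes['price_2021'] - homes['price_2020']
```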

Joins between multiple tables, covered in Section 13.11, also require matching data types in addition to common column names.

Column types can be changed with the .astype() method in Pandas and by conventional typecasting in MATLAB. Here, we change the “male/female” column MF from a cell array to a string array in MATLAB and from the vague “object” type to byte strings in Pandas.

MATLAB:

Python:

>> head(tb,2)

   type    MF  weight  birthday

 _________ ___ ______ __________

 {'dog'  } {M}  9.82  2023-11-26

 {'cat'  } {F}  4.31  2023-11-18

>> tb.MF = string(tb.MF);

>> head(tb,2)

   type    MF  weight  birthday

 _________ ___ ______ __________

 {'dog'  } "M"  9.82  2023-11-26

 {'cat'  } "F"  4.31  2023-11-18

In : df.head(2)

Out:

  type MF weight   birthday

0  dog  M   9.82 2023-11-26

1  cat  F   4.31 2023-11-18

In : df.MF = df.MF.astype(bytes)

In : df.head(2)

Out:

  type   MF weight   birthday

0  dog b'M'   9.82 2023-11-26

1  cat b'F'   4.31 2023-11-18

13.3.3 Changing Column Data

Dataframe columns have an .apply() method which invokes a user-supplied function on each value in the column. This example defines a function caps() which makes the first letter of a string uppercase, then applies this function to the type column:

Python:
In : df['type'].head(5)
Out:
0      dog
1      cat
2    bunny
3    bunny
4      dog
Name: type, dtype: object
In : def caps(x):
...:    return x.capitalize()
In : df['type'].apply(caps).head(5)
Out:
0      Dog
1      Cat
2    Bunny
3    Bunny
4      Dog
Name: type, dtype: object

The type column itself is not modified unless it is reassigned to the output of .apply() with

Python:
df['type'] = df['type'].apply(caps)

13.3.4 Making Timestamps from Strings

String-to-time conversions are handled directly with the Pandas to_datetime() function, provided the string follows conventional time formats.

Python:
In : df.iloc[0]['birthday']
Out: '2023-11-26'
In : type(df.iloc[0]['birthday'])
Out: str
In : df['birthday'] = pd.to_datetime(df['birthday'])
In : df.iloc[0]['birthday']
Out: Timestamp('2023-11-26 00:00:00')
In : type(df.iloc[0]['birthday'])
Out: pandas._libs.tslibs.timestamps.Timestamp
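When the strings do not follow a conventional layout, to_datetime() also accepts an explicit format string; a small sketch with hypothetical day.month.year dates:

```python
import pandas as pd

# Hypothetical dates in day.month.year order, which the automatic
# parser would not be safe to guess.
s = pd.Series(['26.11.2023', '18.11.2023'])
t = pd.to_datetime(s, format='%d.%m.%Y')
```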

13.4 Creating Tables Programmatically

Tables and dataframes need not come solely from files; they can also be created programmatically. MATLAB has a collection of X2table() functions—array2table(), cell2table(), struct2table()—that perform the expected conversions. The Pandas DataFrame() function accepts lists and dictionaries and returns a dataframe.

struct2table() and DataFrame() with a dictionary argument are convenient because the structure attributes and dictionary keys become column names:

MATLAB:

Python:

>> x.year = [1867 1776 1810]';

>> x.n_prov  = [9 50 31]';

>> x.country = [{'Canada'},...

       {'USA'}, {'Mexico'}]';

>> tb = struct2table(x)

3×3 table

year n_prov  country

____ ______ __________

1867    9   {'Canada'}

1776   50   {'USA'   }

1810   31   {'Mexico'}

In : x = {

...:   'year' : [1867,1776,1810],

...:   'nprov' : [9,50,31],

...:   'country' : ['Canada',

...:       'USA', 'Mexico'] }

In : df = pd.DataFrame(x)

In : df

Out:

   year  nprov country

0  1867      9  Canada

1  1776     50     USA

2  1810     31  Mexico
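DataFrame() also accepts a list of rows, in which case the column names are supplied with the columns= keyword argument; a sketch using the same country data:

```python
import pandas as pd

# The same country data as a list of rows instead of a dictionary.
rows = [[1867,  9, 'Canada'],
        [1776, 50, 'USA'],
        [1810, 31, 'Mexico']]
df = pd.DataFrame(rows, columns=['year', 'nprov', 'country'])
```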

13.5 Sorting Rows

Head and tail results of a sorted table reveal extremal values. Here, we find the lightest and heaviest animals by sorting on weight—and where weights are equal, we do a secondary sort on birthday.

A side note on sorting dates and times: Ideally, the column stores these with a time-aware data type. However, proper chronological sorting is possible with strings if the date is formatted from most significant to least significant time entries. This is the case with our year-month-day string; a lexical sort is equivalent to a chronological sort.

MATLAB:

Python:

>> head(sortrows(tb,{'weight',...

                'birthday'}),2)

   type     MF   weight  birthday

 _________ _____ ______ __________

 {'bunny'} {'M'}  1.12  2023-11-17

 {'bunny'} {'F'}  1.53  2023-11-15

>> tail(sortrows(tb,{'weight',...

                'birthday'}),2)

  type    MF   weight  birthday

 _______ _____ ______ __________

 {'dog'} {'F'}  9.82  2023-11-25

 {'dog'} {'M'}  9.82  2023-11-26

In : df.sort_values(by=['weight',

...:         'birthday']).head(2)

Out:

    type MF  weight    birthday

6  bunny  M    1.12  2023-11-17

9  bunny  F    1.53  2023-11-15

In : df.sort_values(by=['weight',

...:         'birthday']).tail(2)

Out:

  type MF  weight    birthday

4  dog  F    9.82  2023-11-25

0  dog  M    9.82  2023-11-26

13.6 Table Subsets

Portions of tables can be referenced through row and column slices using indexing techniques similar to those for MATLAB matrices and NumPy ndarrays. As with numeric arrays (Section 11.​1.​15), table indexing in MATLAB is a bit more convenient than in Python. MATLAB accepts any integer array as a table index, but Pandas needs special attributes .loc[] for label and logical indexing and .iloc[] for arbitrary numeric indexing.

The biggest difference between .loc[] and .iloc[] is the kind of index each accepts: .loc[] takes labels and Boolean expressions that filter rows, while .iloc[] takes only (zero-based) integer positions. Both accessors can combine row and column selections in a single two-part subscript.
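The distinction can be sketched on a small hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'type':   ['dog', 'cat', 'bunny'],
                   'weight': [9.82, 4.31, 8.40]})

by_position = df.iloc[[0, 2]]         # rows 0 and 2, by integer position
by_mask     = df.loc[df.weight > 5]   # rows selected by a Boolean test
```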

13.6.1 All Rows, Selected Columns

New tables containing only weight and type can be created by specifying these column names in the indexing notation shown as follows:

MATLAB:

Python:

>> tb(:,{'weight','type'})

    weight      type

    ______    _________

     9.82     {'dog'  }

     4.31     {'cat'  }

     8.4      {'bunny'}

     4.65     {'bunny'}

     9.82     {'dog'  }

     5.57     {'cat'  }

     1.12     {'bunny'}

     2.96     {'cat'  }

     5.08     {'cat'  }

     1.53     {'bunny'}

In : df[['weight','type']]

Out:

   weight   type

0    9.82    dog

1    4.31    cat

2    8.40  bunny

3    4.65  bunny

4    9.82    dog

5    5.57    cat

6    1.12  bunny

7    2.96    cat

8    5.08    cat

9    1.53  bunny

13.6.2 All Columns, Selected Rows

New tables containing all columns but only selected rows are produced using the following indexing:

MATLAB:

Python:

>> tb([2,3],:)

   type     MF   weight  birthday

 _________ _____ ______ __________

 {'cat'  } {'F'}  4.31  2023-11-18

 {'bunny'} {'F'}   8.4  2023-11-23

In : df.loc[[1,2]]

Out:

    type  MF  weight    birthday

1    cat   F    4.31  2023-11-18

2  bunny   F    8.40  2023-11-23

13.6.3 Selected Rows, Selected Columns

A subset of rows can be similarly targeted using a row slice with either a strided range or an arbitrary list of integers. These select every third row starting with the first:

MATLAB:

Python:

>> tb(1:3:end,{'weight','type'})

 weight   type

 ______ _________

9.82  {'dog'  }

4.65  {'bunny'}

1.12  {'bunny'}

1.53  {'bunny'}

In : df[::3][['weight', 'type']]

Out:

   weight   type

0    9.82    dog

3    4.65  bunny

6    1.12  bunny

9    1.53  bunny

Selecting arbitrary rows of a dataframe requires the .iloc[] attribute:

MATLAB:

Python:

>> tb([1,6,2],{'weight','type'})

 weight  type

 ______ _______

9.82  {'dog'}

5.57  {'cat'}

4.31  {'cat'}

In : df[['weight','type']].iloc[[0,5,1]]

Out:

   weight type

0    9.82  dog

5    5.57  cat

1    4.31  cat

A unique feature of dataframes is that row and column indexing can be given in any order:

Python:

Python:

In : df.iloc[[0,5,1]][

          ['weight','type']]

Out:

   weight type

0    9.82  dog

5    5.57  cat

1    4.31  cat

In : df[['weight','type']

        ].iloc[[0,5,1]]

Out:

weight type

0    9.82  dog

5    5.57  cat

1    4.31  cat

This is possible because each indexing component returns a new table which can subsequently be indexed independently. df.iloc[[0,5,1]] returns the first, sixth, and second rows, and df.iloc[[0,5,1]][['weight','type']] returns the weight and type columns from those three rows. Similarly, df[['weight','type']] is a table containing all rows of the weight and type columns, and df[['weight','type']].iloc[[0,5,1]] returns the first, sixth, and second rows from those two columns.

MATLAB tables, by contrast, take their row and column selections together in a single (rows, columns) subscript.

13.6.4 Filter Rows by Conditional Operations

In addition to numeric indexing, rows of MATLAB tables and Pandas dataframes can be indexed with Boolean expressions on columns. This example returns males weighing more than four units:

MATLAB:

Python:

>> tb((tb.weight > 4) & (...

       cell2mat(tb.MF) == 'M'),:)

4×4 table

   type     MF   weight  birthday

 _________ _____ ______ __________

 {'dog'  } {'M'}  9.82  2023-11-26

 {'bunny'} {'M'}  4.65  2023-11-24

 {'cat'  } {'M'}  5.57  2023-11-24

 {'cat'  } {'M'}  5.08  2023-11-21

In : df[(df.weight > 4) &

...:    (df.MF == 'M')]

Out:

    type MF  weight    birthday

0    dog  M    9.82  2023-11-26

3  bunny  M    4.65  2023-11-24

5    cat  M    5.57  2023-11-24

8    cat  M    5.08  2023-11-21

Note that in MATLAB, the cell array of characters in column MF must first be converted to a character matrix with cell2mat() to allow an elementwise equality test.
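On the Pandas side, the same filter can also be written as a single expression string with the dataframe’s .query() method; a sketch on a hypothetical three-row frame:

```python
import pandas as pd

df = pd.DataFrame({'type':   ['dog', 'cat', 'bunny'],
                   'MF':     ['M',   'F',   'M'],
                   'weight': [9.82,  4.31,  4.65]})

# Column names can be referenced directly inside the expression string.
males_over_4 = df.query('weight > 4 and MF == "M"')
```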

13.7 Iterating over Rows

Interaction with tables generally happens with entire columns of data. Occasionally though, access to individual rows is necessary. Rows of MATLAB tables are indexed using brace subscripts. Pandas rows may be indexed numerically using the .iloc[] accessor (Section 13.6.3), but when iterating over an entire table, the dataframe’s .iterrows() generator invoked in a for loop is more convenient. Column values within a row may be indexed numerically or with column names:

MATLAB:

Python:

>> for i = 1:height(tb)

  fprintf("%d %s %s ",i,string(...

       tb{i,1}),string(tb{i,'MF'}))

end

1 dog M

2 cat F

3 bunny F

4 bunny M

5 dog F

6 cat M

7 bunny M

8 cat M

9 cat M

10 bunny F

In : for i,row in df.iterrows():

...:   print(i,row[0],row['MF'])

0 dog M

1 cat F

2 bunny F

3 bunny M

4 dog F

5 cat M

6 bunny M

7 cat M

8 cat M

9 bunny F
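When row values are only read, the .itertuples() generator is a commonly used alternative to .iterrows(); it yields one namedtuple per row, with the dataframe index available in the Index field. A sketch on a hypothetical two-row frame:

```python
import pandas as pd

df = pd.DataFrame({'type': ['dog', 'cat'], 'MF': ['M', 'F']})

lines = []
for row in df.itertuples():
    # row is a namedtuple; columns are accessed as attributes.
    lines.append(f'{row.Index} {row.type} {row.MF}')
```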

13.8 Pivot Tables

Pivot tables allow one to aggregate data sharing common values. They help answer questions such as “what were sales by region?” or “how many points did home teams score compared to visiting teams?” Pandas has explicit tools for these (the pd.pivot_table() function and the dataframe .groupby() method), while MATLAB relies on different techniques depending on the complexity of the desired aggregation.

13.8.1 Single-Level Aggregation

MATLAB can produce a pivot table grouped by a single variable through its generic varfun() function with "GroupingVariables" as the third argument and the column name to group by as the fourth argument. The Pandas pd.pivot_table() function, which can also be invoked as a dataframe method, takes a list of column names to group by with the keyword argument index. This example returns a new table or dataframe with a count of animals based on type. It duplicates the result of .value_counts() from Section 13.2.3:

MATLAB:

Python:

>> varfun(@length,tb,...

     "GroupingVariables","type",...

     "InputVariables","type")

   type    GroupCount length_type

 _________ __________ ___________

 {'bunny'}     4           4

 {'cat'  }     4           4

 {'dog'  }     2           2

In : pd.pivot_table(df,

       index=['type'], aggfunc="count")

Out:

       MF  birthday  weight

type

bunny   4         4       4

cat     4         4       4

dog     2         2       2

The aggregation function is provided in MATLAB as a function handle and in Pandas with the optional keyword argument aggfunc. Pandas uses mean() if this argument is not supplied.

varfun() and pd.pivot_table() return a new table and dataframe which can be sliced, sorted, and so on. The next example sums the weight of males and females using the method form of pivot_table():

MATLAB:

Python:

>> varfun(@sum,tb,...

     "GroupingVariables","MF",...

     "InputVariables","weight")

MF   GroupCount sum_weight

_____ __________ __________

{'F'}     4        24.06

{'M'}     6         29.2

In : df.pivot_table(index=['MF'],

...:  values=['weight'],aggfunc='sum')

Out:

    weight

MF

F    24.06

M    29.20

The Pandas .groupby() method returns the same result, but only because weight is the only numeric column in df. Had there been others, sums would be returned for those too.

Python:
In : df.groupby(['MF']).sum()
Out:
    weight
MF
F    24.06
M    29.20
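A .groupby() can also compute several statistics in one pass by handing a list of function names to .agg(); a sketch with hypothetical weights:

```python
import pandas as pd

df = pd.DataFrame({'MF':     ['M', 'F', 'M'],
                   'weight': [9.82, 4.31, 5.57]})

# One row per group, one column per aggregation function.
stats = df.groupby('MF')['weight'].agg(['sum', 'mean', 'count'])
```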

13.8.2 Multilevel Aggregation

Aggregation may be nested to multiple levels in Pandas simply by adding more columns to the index= keyword argument. Despite the plural form of "GroupingVariables", the same is not true for MATLAB’s varfun(); it only allows grouping by one column. Multilevel pivot tables in MATLAB require the three-step “split-apply-combine” method involving several intermediate variables. Here, we’ll get a more detailed view into animal weights by separating the aggregation by animal type in addition to gender:

MATLAB:

Python:

>> [Idx, MF, Type] = ...

      findgroups(tb.MF, tb.type);

>> sum_weight = splitapply(...

      @sum, tb.weight, Idx);

>> table(MF, Type, sum_weight)

  6×3 table

  MF     Type    sum_weight

 _____ _________ __________

 {'F'} {'bunny'}    9.93

 {'F'} {'cat'  }    4.31

 {'F'} {'dog'  }    9.82

 {'M'} {'bunny'}    5.77

 {'M'} {'cat'  }   13.61

 {'M'} {'dog'  }    9.82

In : pd.pivot_table(df,

index=['MF','type'],

values=['weight'],

aggfunc='sum')

Out:

weight

MF type

F  bunny    9.93

cat      4.31

dog      9.82

M  bunny    5.77

cat     13.61

dog      9.82
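The multilevel Pandas result carries a two-level row index (a MultiIndex); calling .reset_index() turns the grouping levels back into ordinary columns, which is often convenient for further slicing or sorting. A sketch on a hypothetical frame:

```python
import pandas as pd

df = pd.DataFrame({'MF':     ['M', 'F', 'M'],
                   'type':   ['cat', 'cat', 'dog'],
                   'weight': [5.57, 4.31, 9.82]})

pt = pd.pivot_table(df, index=['MF', 'type'],
                    values=['weight'], aggfunc='sum')
flat = pt.reset_index()   # MF and type become regular columns again
```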

13.9 Adding Columns

New columns derived from data in existing columns can be helpful in subsequent aggregation and join operations. They can be added simply by indexing the table or dataframe with a new column name in an assignment statement. For example, tb.New = 4 and df['New'] = 4 create a column “New” with a value of 4 for all rows in table tb and dataframe df. In the following, we’ll add a new column, wd, containing the integer weekday of each animal’s birthday. The Pandas birthday column needs to be a timestamp rather than a string (Section 13.3.4).

One complication is that MATLAB weekday values run from Sunday = 1 to Saturday = 7, while Python’s run from Monday = 0 to Sunday = 6. We’ll map the Python values to match MATLAB’s with mod(i + 1, 7) + 1, where i is Python’s weekday value.

MATLAB:

Python:

>> tb.wd = weekday(tb{:,'birthday'})

  type     MF  weight  birthday  wd

_________ ____ ______ __________ __

{'dog'  } {'M'} 9.82  2023-11-26 1

{'cat'  } {'F'} 4.31  2023-11-18 7

{'bunny'} {'F'}  8.4  2023-11-23 5

{'bunny'} {'M'} 4.65  2023-11-24 6

{'dog'  } {'F'} 9.82  2023-11-25 7

{'cat'  } {'M'} 5.57  2023-11-24 6

{'bunny'} {'M'} 1.12  2023-11-17 6

{'cat'  } {'M'} 2.96  2023-11-21 3

{'cat'  } {'M'} 5.08  2023-11-21 3

{'bunny'} {'F'} 1.53  2023-11-15 4

In : i = df['birthday'].dt.weekday

In : df['wd'] = ((i+1) % 7) + 1

In : df

    type MF  weight   birthday  wd

0    dog  M    9.82 2023-11-26   1

1    cat  F    4.31 2023-11-18   7

2  bunny  F    8.40 2023-11-23   5

3  bunny  M    4.65 2023-11-24   6

4    dog  F    9.82 2023-11-25   7

5    cat  M    5.57 2023-11-24   6

6  bunny  M    1.12 2023-11-17   6

7    cat  M    2.96 2023-11-21   3

8    cat  M    5.08 2023-11-21   3

9  bunny  F    1.53 2023-11-15   4

If we count the number of animals born by weekday using a pivot table, sort the result by counts, then tail that, we’ll see the weekday on which most animals were born. The selection of MF as the “value” column to sum over is arbitrary; the row count is the same regardless of the column chosen to count:

MATLAB:

Python:

>> tail(sortrows(varfun(@length,...

    tb,"GroupingVariables","wd",...

    "InputVariables","MF"),...

    "GroupCount"),1)

wd GroupCount length_MF

 __ __________ _________

 6      3          3

In : pd.pivot_table(df,index=['wd'],

values=['MF'],aggfunc='count'

).sort_values(by=['MF']).tail(1)

Out:

MF

wd

6    3

We find that three of the animals were born on wd = 6, that is, Friday.

13.10 Deleting Columns

Columns can be deleted as easily as they are added. A column can be removed from a MATLAB table by reassigning it to an empty array. In Pandas, a column is deleted the same way as a dictionary entry. Here, we remove the weekday column added in the previous section:

MATLAB:

Python:

>> head(tb,2)

  type    MF   weight  birthday  wd

 _______ _____ ______ __________ __

 {'dog'} {'M'}  9.82  2023-11-26 1

 {'cat'} {'F'}  4.31  2023-11-18 7

>> tb.wd = []; % remove wd column

>> head(tb,2)

  type    MF   weight  birthday

 _______ _____ ______ __________

 {'dog'} {'M'}  9.82  2023-11-26

 {'cat'} {'F'}  4.31  2023-11-18

In : df.head(2)

Out:

  type MF  weight   birthday  wd

0  dog  M    9.82 2023-11-26   1

1  cat  F    4.31 2023-11-18   7

In : del df['wd'] # remove wd column

In : df.head(2)

Out:

  type MF  weight   birthday

0  dog  M    9.82 2023-11-26

1  cat  F    4.31 2023-11-18

Multiple columns can be deleted from a dataframe with the .drop() method and removevars() in MATLAB:

MATLAB:

Python:

>> head(tb,2)

  type    MF   weight  birthday

 _______ _____ ______ __________

 {'dog'} {'M'}  9.82  2023-11-26

 {'cat'} {'F'}  4.31  2023-11-18

>> head(removevars(tb,...

      {'MF','weight'}),2)

  type     birthday

 _______  __________

 {'dog'}  2023-11-26

 {'cat'}  2023-11-18

In : df.head(2)

Out:

  type MF  weight   birthday

0  dog  M    9.82 2023-11-26

1  cat  F    4.31 2023-11-18

In : df.drop(columns=['MF','weight']).head(2)

Out:

  type   birthday

0  dog   2023-11-26

1  cat   2023-11-18

In both cases, a new table is returned; the original tb and df remain unchanged.

13.11 Joins Across Tables

Complex datasets may span multiple tables. MATLAB and Pandas support join operations that combine data related by key columns common to the tables. Joins come in four flavors (inner, full/outer, left, right) that are analogous to intersection, union, and difference operations between two sets. Inner joins are seen most frequently because they return results only where both tables have matching keys.
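The four flavors can be sketched with a pair of hypothetical frames in which one animal type has no price entry:

```python
import pandas as pd

animals = pd.DataFrame({'type': ['dog', 'cat', 'ferret']})
prices  = pd.DataFrame({'type': ['dog', 'cat'], 'price': [1.2, 3.5]})

inner = pd.merge(animals, prices, on='type')               # drops ferret
left  = pd.merge(animals, prices, on='type', how='left')   # keeps ferret, price NaN
outer = pd.merge(animals, prices, on='type', how='outer')  # union of keys
```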

As a simple example, say we have to feed our animals. We’ll introduce a second table that has the cost of a unit of food for each animal type:

MATLAB:

Python:

>> food.type = {'dog','cat',...

                'bunny'}';

>> food.price = [1.2 3.5 0.7]';

>> cost_tb = struct2table(food)

  3×2 table

      type       price

    _________    _____

    {'dog'  }     1.2

    {'cat'  }     3.5

    {'bunny'}     0.7

In : food = {

...:  'type' : ['dog','cat','bunny'],

...:  'price' : [ 1.2, 3.5, 0.7]}

In : cost_df = pd.DataFrame(food)

In : cost_df

Out:

    type  price

0    dog    1.2

1    cat    3.5

2  bunny    0.7

We can create a new table containing all the animal entries plus the type-specific price by joining the two tables on type. Joins are done in MATLAB with join(), innerjoin(), or outerjoin() and in Pandas with pd.merge(). pd.merge()’s default join type, “inner,” can be changed with the how= keyword option.

MATLAB:

Python:

>> J = innerjoin(tb,cost_tb, 'Keys',...

                 {'type'})

type     MF   weight  birthday  price

________ _____ ______ _________ ______

{'bunny'} {'F'}   8.4  2023-11-23  0.7

{'bunny'} {'M'}  4.65  2023-11-24  0.7

{'bunny'} {'M'}  1.12  2023-11-17  0.7

{'bunny'} {'F'}  1.53  2023-11-15  0.7

{'cat'  } {'F'}  4.31  2023-11-18  3.5

{'cat'  } {'M'}  5.57  2023-11-24  3.5

{'cat'  } {'M'}  2.96  2023-11-21  3.5

{'cat'  } {'M'}  5.08  2023-11-21  3.5

{'dog'  } {'M'}  9.82  2023-11-26  1.2

{'dog'  } {'F'}  9.82  2023-11-25  1.2

In : J = pd.merge(df,cost_df,on='type')

In : J

Out:

type MF weight   birthday price

0   dog  M   9.82 2023-11-26   1.2

1   dog  F   9.82 2023-11-25   1.2

2   cat  F   4.31 2023-11-18   3.5

3   cat  M   5.57 2023-11-24   3.5

4   cat  M   2.96 2023-11-21   3.5

5   cat  M   5.08 2023-11-21   3.5

6 bunny  F   8.40 2023-11-23   0.7

7 bunny  M   4.65 2023-11-24   0.7

8 bunny  M   1.12 2023-11-17   0.7

9 bunny  F   1.53 2023-11-15   0.7

The tables match although rows appear in different orders. The joined table J lets us compute the cost to feed each animal. If each animal eats its weight in food, the costs are

MATLAB:

Python:

>> J.weight .* J.price

ans=

    5.8800

    3.2550

    0.7840

    1.0710

   15.0850

   19.4950

   10.3600

   17.7800

   11.7840

   11.7840

>> sum(J.weight .* J.price)

ans=

   97.2780

In : J.weight*J.price

Out:

0    11.784

1    11.784

2    15.085

3    19.495

4    10.360

5    17.780

6     5.880

7     3.255

8     0.784

9     1.071

dtype: float64

In : sum(J.weight*J.price)

Out: 97.27799999999999

What’s the cost breakdown by animal type? A pivot table gives the answer, but first we’ll need another temporary column storing the cost per animal computed earlier. We’ll call this new column wxp for “weight times price”:

MATLAB:

Python:

>> J.wxp = J.weight .* J.price;

varfun(@sum,J,"GroupingVariables",...

       "type","InputVariables","wxp")

3×3 table

   type    GroupCt   sum_wxp

 _________ _______ __________

 {'bunny'} 4.e+00  1.0990e+01

 {'cat'  } 4.e+00  6.2720e+01

 {'dog'  } 2.e+00  2.3568e+01

In : J['wxp'] = J.weight*J.price

In : pd.pivot_table(J,index=['type'],

...:     values=['wxp'],aggfunc=sum)

Out:

wxp

type

bunny  10.990

cat    62.720

dog    23.568
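The same per-type totals can be computed with a groupby instead of a pivot table; a short sketch, assuming a small dataframe shaped like J above (values hypothetical):

```python
import pandas as pd

# a dataframe shaped like J above, with a 'type' and a 'wxp' column
J = pd.DataFrame({'type': ['dog', 'cat', 'bunny', 'cat'],
                  'wxp':  [11.8, 15.1, 5.9, 19.5]})

# groupby().sum() gives the same per-type totals as pd.pivot_table
# with aggfunc=sum; the result is a Series indexed by type
totals = J.groupby('type')['wxp'].sum()
print(totals)
```

For this single-aggregate case the two approaches are interchangeable; pivot_table earns its keep when aggregating over multiple index and column variables at once.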

13.12 GeoPandas

GeoPandas [1] adds spatial data types and operations to Pandas. It allows one to query and aggregate information from Pandas dataframes spatially (“how do home prices vary by postal code and time?”), perform geographic set operations (“over which countries did the airplane fly?”), and create maps with data overlays. Here, we’ll explore that last capability.

GeoPandas needs geographic data in the form of GeoJSON files or shapefiles. In Section 12.​4.​3.​2, we used shapefiles downloaded from the Natural Earth project. The US government’s data.gov website is another source of shapefiles; among others, it contains a 500 MB compressed file with high-resolution outlines of every zip code in the United States.1

My goal is to make a map of Los Angeles County with zip codes colored by home prices. Zillow Research2 generously makes available real estate price data grouped by zip code. From their portal, I downloaded the “ZVHI All Homes (SFR/Condo/Coop)” dataset, a 70 MB CSV file.

Loading a CSV file into Pandas is easy (Section 13.1). The novel part here is making a dataframe with geographic data of zip code boundaries. Fortunately, GeoPandas makes loading a shapefile as easy as loading a CSV file. The US postal code file, tl_2019_us_zcta510.zip, expands to seven files, the largest of which is the shapefile tl_2019_us_zcta510.shp. All we need to do is read it with the GeoPandas read_file() function. These commands
In : import pandas as pd
In : import geopandas as gp
In : home_prices  = pd.read_csv('Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv')
In : zipcode_geom = gp.read_file('tl_2019_us_zcta510.shp')
create two Pandas dataframes, home_prices, containing Zillow’s home prices by zip code, and zipcode_geom, containing shapes of all zip codes. Their contents look like this:
In : home_prices
Out:
       RegionID  SizeRank  RegionName  ... 2021-03-31 2021-04-30 2021-05-31
0         61639         0       10025  ...  1053436.0  1071606.0  1089141.0
1         84654         1       60657  ...   514068.0   515876.0   517752.0
2         61637         2       10023  ...  1096338.0  1100727.0  1104368.0
3         91982         3       77494  ...   370188.0   376591.0   383493.0
4         84616         4       60614  ...   656960.0   659558.0   661174.0
...         ...       ...         ...  ...        ...        ...        ...
30832     62532     34430       12345  ...   167810.0   170959.0   174757.0
30833     87060     34430       66045  ...   219054.0   224651.0   233820.0
30834     58379     34430        1470  ...   398219.0   402924.0   407785.0
30835     58117     35187         822  ...   168439.0   169986.0   173515.0
30836     58110     35187         801  ...    37344.0    36687.0    36030.0
[30837 rows x 314 columns]
In : zipcode_geom
Out:
      ZCTA5CE10  ...                                           geometry
0         43451  ...  POLYGON ((-83.70873 41.32733, -83.70815 41.327...
1         43452  ...  POLYGON ((-83.08698 41.53780, -83.08256 41.537...
2         43456  ...  MULTIPOLYGON (((-82.83558 41.71082, -82.83515 ...
3         43457  ...  POLYGON ((-83.49650 41.25371, -83.48382 41.253...
4         43458  ...  POLYGON ((-83.22229 41.53102, -83.22228 41.532...
...         ...  ...                                                ...
33139     84044  ...  POLYGON ((-112.26022 40.76909, -112.25333 40.7...
33140     84045  ...  MULTIPOLYGON (((-111.92421 40.17034, -111.9240...
33141     84046  ...  POLYGON ((-110.00072 40.99745, -110.00036 40.9...
33142     84047  ...  POLYGON ((-111.92141 40.62772, -111.92134 40.6...
33143     84049  ...  POLYGON ((-111.59394 40.57707, -111.59386 40.5...
[33144 rows x 10 columns]
All we need to do is join the two dataframes on zip code and then invoke the .plot() method on the result to see a map. Some data cleaning is needed to prepare the dataframes for a join though; the two dataframes must use the same name and data type for the join column.
In : zipcode_geom.rename(columns={'ZCTA5CE10':'ZipCode'}, inplace=True)
In : home_prices.rename(columns={'RegionName':'ZipCode'}, inplace=True)
In : zipcode_geom['ZipCode'] = zipcode_geom['ZipCode'].astype(int)

How did I know to change the data type of zipcode_geom['ZipCode'] to integer? Initially, I didn’t. I attempted the join with only renamed columns and then got a Pandas error saying the join failed because of a type mismatch, “ValueError: You are trying to merge an object and int64 columns.”
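One way to catch such a mismatch before attempting the join is to compare the dtypes of the key columns. A sketch with hypothetical miniature dataframes reproducing the situation (zip codes as strings in one table, integers in the other):

```python
import pandas as pd

# zip codes as strings in one table, integers in the other
geom   = pd.DataFrame({'ZipCode': ['90266', '90277'], 'area':  [1.2, 3.4]})
prices = pd.DataFrame({'ZipCode': [90266, 90277],     'price': [2.1e6, 1.8e6]})

# comparing dtypes of the key columns reveals the problem up front
print(geom['ZipCode'].dtype, prices['ZipCode'].dtype)  # object int64

# convert, then the merge succeeds
geom['ZipCode'] = geom['ZipCode'].astype(int)
merged = geom.merge(prices, on='ZipCode')
print(len(merged))  # 2
```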

After renaming and fixing the data type, I can do the join, but the resulting dataframe includes every zip code in the United States—too much for my needs. I’ll first subset the data to just LA County, and only for May 2020 and 2021:
In : LA_prices = home_prices[
...:                  home_prices['CountyName'] == 'Los Angeles County'][
...:                  ['ZipCode','2020-05-31','2021-05-31']]
Finally, I’ll do the join and create the plot:
In : LA_price_map = zipcode_geom.merge(LA_prices,on='ZipCode')  # join
In : LA_price_map['2020-05-31'] /= 1.0e6  # change units to million dollars
In : LA_price_map.plot(column='2020-05-31',cmap='plasma',legend=True)
In : plt.show()

Not bad for just a dozen lines of code. There’s room for improvement, though. The following program adds a title, axis labels, and identifiers for every zip code and zooms in to the South Bay region:

Python:
#!/usr/bin/env python3
# file: code/geopandas/LA_home_prices.py
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.transforms import Bbox
import geopandas as gp
import pandas as pd
zipcode_geom = gp.read_file('usa_zip/tl_2019_us_zcta510.shp')
home_prices  = pd.read_csv('Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv')
# rename zip code columns to enable join
zipcode_geom.rename(columns={'ZCTA5CE10':'ZipCode'}, inplace=True)
home_prices.rename(columns={'RegionName':'ZipCode'}, inplace=True)
# change type from 'object' to 'int64'
zipcode_geom['ZipCode'] = zipcode_geom['ZipCode'].astype(int)
# subset to just LA County
LA_prices = home_prices[
                  home_prices['CountyName'] == 'Los Angeles County'][
                  ['ZipCode','2020-05-31','2021-05-31']]
LA_price_map = zipcode_geom.merge(LA_prices,on='ZipCode')
LA_price_map['2020-05-31'] /= 1.0e6  # change units to million dollars
LA_price_map.plot(column='2020-05-31',cmap='plasma',
        legend=True,figsize=(10,14))
plt.xlim(left=-118.55,right=-118.17)
plt.ylim(top=34.14,bottom=33.7)
plt.title('Los Angeles County Home Prices 2020-05-31, $M')
plt.ylabel('Latitude [deg]')
plt.xlabel('Longitude [deg]')
# label each zip code
for i,row in LA_price_map.iterrows():
    center = row['geometry'].centroid.xy
    xy = center[0][0], center[1][0]
    plt.annotate(text=row['ZipCode'], xy=xy,
                 horizontalalignment='center')
plt.savefig('LA_home_prices.png', bbox_inches=Bbox([[0.9,0.9],[9,13]]),
            transparent=True)
plt.show()

13.13 Recipe 13-1: Maps with GeoPandas

The map of Los Angeles County home prices created in Section 13.12 can also be made from MATLAB by calling GeoPandas. The Pandas idioms of filtering dataframe rows and columns with brackets (Section 13.6) and calling a dataframe’s .iterrows() method to traverse rows are not directly available from MATLAB, though. We’ll need a bridge module to supply the following capabilities:
  • Filter rows based on Boolean operations

  • Extract selected columns

  • Rename columns

  • Change column data type

  • Scale column values

  • Iterate over rows

The last item, row iteration, is surprisingly challenging. An obvious solution is to create a Python generator that returns a new row each time it is called, but MATLAB doesn’t seem to work with Python generators. Instead, I implemented iterrows() in the bridge module as a closure.3

Python:
# file: code/matlab_py/bridge_geopandas.py
def row_filter(dframe, A, op, B):
    operation = {
            '==' : lambda DF, A, B : DF[DF[A] == B],
            '!=' : lambda DF, A, B : DF[DF[A] != B],
            '<=' : lambda DF, A, B : DF[DF[A] <= B],
            '>=' : lambda DF, A, B : DF[DF[A] >= B],
            '<'  : lambda DF, A, B : DF[DF[A] <  B],
            '>'  : lambda DF, A, B : DF[DF[A] >  B],
            }
    if op not in operation:
        print(f'bridge_geopandas.row_filter: "{op}" not recognized')
        return None
    return operation[op](dframe, A, B)
def col_filter(dframe, cols):
    return dframe[[*cols]]
def rename_col(dframe, A, B):
    dframe.rename(columns={A:B}, inplace=True)
def astype(dframe, col, new_type):
    if new_type == 'int':
        dframe[col] = dframe[col].astype(int)
    elif new_type == 'float':
        dframe[col] = dframe[col].astype(float)
    elif new_type == 'str':
        dframe[col] = dframe[col].astype(str)
    else:
        print(f'bridge_geopandas.astype: "{new_type}" not recognized')
def mult_by(dframe, col, scale):
    dframe[col] *= scale
def iterrows(dframe):
    i = -1
    n_rows = len(dframe)
    def increment():
        nonlocal i
        i += 1
        if i < n_rows:
            return dframe.iloc[i]
        else:
            return None
    return increment
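
Before wiring the closure into MATLAB, it can be exercised from Python directly. A minimal check, with the iterrows() definition restated so the snippet is self-contained:

```python
import pandas as pd

def iterrows(dframe):
    # same closure as in bridge_geopandas.py: each call returns the
    # next row, then None once the rows are exhausted
    i = -1
    n_rows = len(dframe)
    def increment():
        nonlocal i
        i += 1
        return dframe.iloc[i] if i < n_rows else None
    return increment

df = pd.DataFrame({'type': ['dog', 'cat'], 'weight': [9.82, 4.31]})
next_row = iterrows(df)
row = next_row()
while row is not None:
    print(row['type'], row['weight'])
    row = next_row()
```

The MATLAB loop in the program below follows the same pattern, testing for py.NoneType instead of None.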

Calls to bridge functions in the following MATLAB program are preceded by commented Python code which implements the equivalent Pandas operation. Additionally, the sequence of commands differs a bit from the pure Python program of Section 13.12 to reduce the memory footprint: rows of the two dataframes are filtered immediately after they are read.

MATLAB 2020b:
% file: code/geopandas/LA_home_prices.m
Im  = @py.importlib.import_module;
pd  = Im('pandas');
gp  = Im('geopandas');
BG  = Im('bridge_geopandas');
plt = Im('matplotlib.pyplot');
trx = Im('matplotlib.transforms');
mpl = Im('matplotlib');
if ispc
    mpl.use('WXAgg')
else
    mpl.use('TkAgg')
end
zipcode_geom = gp.read_file('usa_zip/tl_2019_us_zcta510.shp');
% zipcode_geom.rename(columns={'ZCTA5CE10':'ZipCode'}, inplace=True)
BG.rename_col(zipcode_geom, 'ZCTA5CE10','ZipCode');
% zipcode_geom['ZipCode'] = zipcode_geom['ZipCode'].astype(int)
BG.astype(zipcode_geom, 'ZipCode', 'int');
% zipcode_geom = zipcode_geom[ zipcode_geom['ZipCode'] > 90000]
zipcode_geom = BG.row_filter(zipcode_geom, 'ZipCode', '>', 90000);
home_prices = pd.read_csv('Zip_zhvi_uc_sfrcondo_tier_0.33_0.67_sm_sa_mon.csv');
% home_prices.rename(columns={'RegionName':'ZipCode'}, inplace=True)
BG.rename_col(home_prices, 'RegionName', 'ZipCode');
% home_prices = home_prices[['ZipCode','CountyName','2020-05-31','2021-05-31']]
home_prices = BG.col_filter(home_prices, py.list({'ZipCode',...
                            'CountyName','2020-05-31','2021-05-31'}));
% LA_prices = home_prices[home_prices['CountyName'] == 'Los Angeles County']
LA_prices = BG.row_filter(home_prices, 'CountyName', '==', 'Los Angeles County');
LA_price_map = zipcode_geom.merge(LA_prices,pyargs('on','ZipCode'));
% LA_price_map['2020-05-31'] /= 1.0e6  # change units to million dollars
BG.mult_by(LA_price_map, '2020-05-31', 1.0/1.0e6);
size_10_14 = py.tuple([10,14]);
LA_price_map.plot(pyargs('column','2020-05-31','cmap','plasma',...
        'legend',py.True,'figsize',size_10_14));
plt.xlim(pyargs('left',-118.55,'right',-118.17));
plt.ylim(pyargs('top',34.14,'bottom',33.7));
plt.title('Los Angeles County Home Prices 2020-05-31, $M');
plt.ylabel('Latitude [deg]');
plt.xlabel('Longitude [deg]');
% label each zip code
next_row = BG.iterrows(LA_price_map);
while 1
    row = next_row();
    if strcmp(class(row), 'py.NoneType')
        break
    end
    geo = row.get('geometry');
    xy = py.tuple([geo.centroid.x, geo.centroid.y]);
    plt.annotate(pyargs('text',row.get('ZipCode'),'xy',xy,...
                 'horizontalalignment','center'));
end
bbox = trx.Bbox([0.9 0.9; 9 13]);
plt.savefig('LA_m.png', pyargs('bbox_inches',bbox,'transparent',py.True));
plt.show()

Unfortunately, as of November 2021, attempting to use geopandas through MATLAB on Windows is problematic due to DLL load issues. Visit this book’s GitHub repository4 for updates.

13.14 References

  1. [1] Kelsey Jordahl et al. geopandas/geopandas: v0.8.1. Version v0.8.1. July 2020. DOI: 10.5281/zenodo.3946761. URL: https://doi.org/10.5281/zenodo.3946761

  2. [2] Wes McKinney. Python for Data Analysis, 2nd ed. O’Reilly Media, 2017.