How to do it...

  1. Let's read in stock data for 2016, 2017, and 2018 into a list of DataFrames using a loop instead of three different calls to the read_csv function. Jupyter notebooks currently only allow a single DataFrame to be displayed on one line. However, there is a way to customize the HTML output with help from the IPython library. The user-defined display_frames function accepts a list of DataFrames and outputs them all in a single row:
>>> from IPython.display import display_html

>>> years = 2016, 2017, 2018
>>> stock_tables = [pd.read_csv('data/stocks_{}.csv'.format(year),
index_col='Symbol')
for year in years]

>>> def display_frames(frames, num_spaces=0):
t_style = '<table style="display: inline;"'
tables_html = [df.to_html().replace('<table', t_style)
for df in frames]

space = '&nbsp;' * num_spaces
display_html(space.join(tables_html), raw=True)

>>> display_frames(stock_tables, 30)
>>> stocks_2016, stocks_2017, stocks_2018 = stock_tables
  1. The concat function is the only one able to combine DataFrames vertically. Let's do this by passing it the list stock_tables:
>>> pd.concat(stock_tables, keys=[2016, 2017, 2018])
  1. It can also combine DataFrames horizontally by changing the axis parameter to columns:
>>> pd.concat(dict(zip(years,stock_tables)), axis='columns')
  1. Now that we have started combining DataFrames horizontally, we can use the join and merge methods to replicate this functionality of concat. Here, we use the join method to combine the stock_2016 and stock_2017 DataFrames. By default, the DataFrames align on their index. If any of the columns have the same names, then you must supply a value to the lsuffix or rsuffix parameters to distinguish them in the result:
>>> stocks_2016.join(stocks_2017, lsuffix='_2016',
rsuffix='_2017', how='outer')
  1. To exactly replicate the output of the concat function from step 3, we can pass a list of DataFrames to the join method:
>>> other = [stocks_2017.add_suffix('_2017'),
stocks_2018.add_suffix('_2018')]
>>> stocks_2016.add_suffix('_2016').join(other, how='outer')
  1. Let's check whether they actually are exactly equal:
>>> stock_join = stocks_2016.add_suffix('_2016').join(other, 
how='outer')
>>> stock_concat = pd.concat(dict(zip(years,stock_tables)),
axis='columns')
>>> level_1 = stock_concat.columns.get_level_values(1)
>>> level_0 = stock_concat.columns.get_level_values(0).astype(str)
>>> stock_concat.columns = level_1 + '_' + level_0
>>> stock_join.equals(stock_concat)
True
  1. Now, let's turn to merge that, unlike concat and join, can combine exactly two DataFrames together. By default, merge attempts to align the values in the columns that have the same name for each of the DataFrames. However, you can choose to have it align on the index by setting the boolean parameters left_index and right_index to True. Let's merge the 2016 and 2017 stock data together:
>>> stocks_2016.merge(stocks_2017, left_index=True, 
right_index=True)
  1. By default, merge uses an inner join and automatically supplies suffixes for identically named columns. Let's change to an outer join and then perform another outer join of the 2018 data to exactly replicate concat:
>>> step1 = stocks_2016.merge(stocks_2017, left_index=True, 
right_index=True, how='outer',
suffixes=('_2016', '_2017'))

>>> stock_merge = step1.merge(stocks_2018.add_suffix('_2018'),
left_index=True, right_index=True,
how='outer')

>>> stock_concat.equals(stock_merge)
True
  1. Now let's turn our comparison to datasets where we are interested in aligning together the values of columns and not the index or column labels themselves. The merge method is built exactly for this situation. Let's take a look at two new small datasets, food_prices and food_transactions:
>>> names = ['prices', 'transactions']
>>> food_tables = [pd.read_csv('data/food_{}.csv'.format(name))
for name in names]
>>> food_prices, food_transactions = food_tables
>>> display_frames(food_tables, 30)
  1. If we wanted to find the total amount of each transaction, we would need to join these tables on the item and store columns:
>>> food_transactions.merge(food_prices, on=['item', 'store'])
  1. The price is now aligned correctly with its corresponding item and store, but there is a problem. Customer 2 has a total of four steak items. As the steak item appears twice in each table for store B, a Cartesian product takes place between them, resulting in four rows. Also, notice that the item, coconut, is missing because there was no corresponding price for it. Let's fix both of these issues:
>>> food_transactions.merge(food_prices.query('Date == 2017'),
how='left')
  1. We can replicate this with the join method but we must first put the joining columns of the food_prices DataFrame into the index:
>>> food_prices_join = food_prices.query('Date == 2017') 
.set_index(['item', 'store'])
>>> food_prices_join
  1. The join method only aligns with the index of the passed DataFrame but can use the index or the columns of the calling DataFrame. To use columns for alignment on the calling DataFrame, you will need to pass them to the on parameter:
>>> food_transactions.join(food_prices_join, on=['item', 'store'])
  1. The output matches the result from step 11 exactly. To replicate this with the concat method, you would need to put the item and store columns into the index of both DataFrames. However, in this particular case, an error would be produced as a duplicate index value occurs in at least one of the DataFrames (with item steak and store B):
>>> pd.concat([food_transactions.set_index(['item', 'store']), 
food_prices.set_index(['item', 'store'])],
axis='columns')
Exception: cannot handle a non-unique multi-index!
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.189.178.53