How it works...

It can get tedious to repeatedly write the read_csv function when importing many DataFrames at the same time. One way to automate this process is to put all the file names in a list and iterate through them with a for loop. This was done in step 1 with a list comprehension.
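A minimal sketch of the list-comprehension pattern follows. The file names, the Symbol and Shares columns, and the temporary directory are hypothetical stand-ins for the recipe's data, used only so the example runs on its own.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical sample data standing in for the recipe's CSV files.
samples = {
    2016: "Symbol,Shares\nAAPL,80\nTSLA,50\n",
    2017: "Symbol,Shares\nAAPL,50\nGE,100\n",
}

# Write the samples to a temporary directory so read_csv has files to load.
data_dir = Path(tempfile.mkdtemp())
for year, text in samples.items():
    (data_dir / f"stocks_{year}.csv").write_text(text)

# One read_csv call inside a list comprehension replaces one call per file.
years = [2016, 2017]
stock_tables = [
    pd.read_csv(data_dir / f"stocks_{year}.csv", index_col="Symbol")
    for year in years
]

# Unpack the list into individually named variables.
stocks_2016, stocks_2017 = stock_tables
```

Keeping the list around is useful because it can later be passed directly to concat.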

The rest of this step builds a function to display multiple DataFrames on the same line of output in a Jupyter notebook. All DataFrames have a to_html method, which returns a raw HTML string representation of the table. The CSS (Cascading Style Sheets) of each table is changed by setting the display property to inline so that the tables are displayed horizontally next to one another rather than stacked vertically. To properly render the tables in the notebook, you must use the helper function display_html provided by the IPython library.
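The idea can be sketched as below. The function name frames_to_inline_html and the two small DataFrames are illustrative assumptions, not the recipe's exact code. The sketch only builds the HTML string, so it also runs outside a notebook; the rendering call is left as a comment.

```python
import pandas as pd


def frames_to_inline_html(frames, num_spaces=0):
    """Return one HTML string with each table styled as display: inline."""
    t_style = '<table style="display: inline;"'
    # Patch only the opening <table tag of each frame's HTML.
    tables_html = [df.to_html().replace("<table", t_style, 1) for df in frames]
    # Non-breaking spaces separate neighbouring tables.
    spacer = "&nbsp;" * num_spaces
    return spacer.join(tables_html)


left = pd.DataFrame({"a": [1, 2]})
right = pd.DataFrame({"b": [3, 4]})
html = frames_to_inline_html([left, right], num_spaces=5)

# Inside a Jupyter notebook, render the string with IPython:
# from IPython.display import display_html
# display_html(html, raw=True)
```

Passing raw=True tells display_html that the argument is already HTML rather than an object to be formatted.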

At the end of step 1, we unpack the list of DataFrames into their own appropriately named variables so that each individual table may be easily and clearly referenced. The nice thing about having a list of DataFrames is that it is exactly what the concat function requires, as seen in step 2. Notice how step 2 uses the keys parameter to name each chunk of data. This can also be accomplished by passing a dictionary to concat, as done in step 3.
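As a rough illustration, the two ways of labeling each chunk give the same result. The DataFrames and the level names Year and Symbol below are hypothetical.

```python
import pandas as pd

# Hypothetical yearly tables indexed by ticker symbol.
s16 = pd.DataFrame(
    {"Shares": [80, 50]}, index=pd.Index(["AAPL", "TSLA"], name="Symbol")
)
s17 = pd.DataFrame(
    {"Shares": [50, 100]}, index=pd.Index(["AAPL", "GE"], name="Symbol")
)

# Label each chunk with the keys parameter...
by_keys = pd.concat([s16, s17], keys=[2016, 2017], names=["Year", "Symbol"])

# ...or let the dictionary keys supply the labels.
by_dict = pd.concat({2016: s16, 2017: s17}, names=["Year", "Symbol"])

print(by_keys)
print(by_keys.equals(by_dict))
```

Either form produces an outer index level that identifies which table each row came from.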

In step 4, we must change the type of join to outer to include all of the rows in the passed DataFrame whose index labels are not present in the calling DataFrame. In step 5, the passed list of DataFrames cannot have any columns in common. Although there is an rsuffix parameter, it only works when passing a single DataFrame and not a list of them. To work around this limitation, we change the names of the columns beforehand with the add_suffix method, and then call the join method.
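The add_suffix workaround might look like the following sketch. The three tables and their Shares column are hypothetical; the point is that every frame gets unique column names before join receives a list.

```python
import pandas as pd

# Hypothetical yearly tables that all share the column name "Shares".
s16 = pd.DataFrame({"Shares": [80, 50]}, index=["AAPL", "TSLA"])
s17 = pd.DataFrame({"Shares": [50, 100]}, index=["AAPL", "GE"])
s18 = pd.DataFrame({"Shares": [40, 60]}, index=["AAPL", "AMZN"])

# Rename the columns first so that no two frames overlap.
others = [s17.add_suffix("_2017"), s18.add_suffix("_2018")]

# how="outer" keeps index labels that appear in any of the frames.
joined = s16.add_suffix("_2016").join(others, how="outer")
print(joined)
```

Labels missing from a given year show up as NaN in that year's column.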

In step 7, we use merge, which defaults to aligning on all column names that are the same in both DataFrames. To change this default behavior, and align on the index of either one or both, set the left_index or right_index parameters to True. Step 8 finishes the replication with two calls to merge. As you can see, when you are aligning multiple DataFrames on their index, concat is usually going to be a far better choice than merge.
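To make the comparison concrete, here is a small sketch with two hypothetical tables. It aligns them on their index once with merge and once with concat.

```python
import pandas as pd

# Hypothetical tables, already given unique column names.
s16 = pd.DataFrame({"Shares_2016": [80, 50]}, index=["AAPL", "TSLA"])
s17 = pd.DataFrame({"Shares_2017": [50, 100]}, index=["AAPL", "GE"])

# merge aligns on the index only when told to with left_index/right_index.
merged = s16.merge(s17, left_index=True, right_index=True, how="outer")

# concat aligns on the index by default and accepts any number of frames.
concatenated = pd.concat([s16, s17], axis="columns")

# Sort both results before comparing, since row order may differ.
print(merged.sort_index().equals(concatenated.sort_index()))
```

With more than two DataFrames, merge needs one chained call per extra frame, while concat still takes a single list.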

In step 9, we switch gears to focus on a situation where merge has the advantage. The merge method is the only one capable of aligning both the calling and passed DataFrame by column values. Step 10 shows you how easy it is to merge two DataFrames. The on parameter is not necessary here, but it is provided for clarity.

Unfortunately, it is very easy to duplicate or drop data when combining DataFrames, as shown in step 10. It is vital to take some time to do some sanity checks after combining data. In this instance, the food_prices dataset had a duplicate price for steak in store B, so we eliminated this row by querying for only the current year in step 11. We also changed to a left join to ensure that each transaction is kept regardless of whether a price is present.
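The sketch below reproduces the problem on a small scale. The rows, the custid, quantity and year column names, and the use of merge's validate and indicator parameters as sanity checks are assumptions added for illustration, not the recipe's exact data or steps.

```python
import pandas as pd

# Hypothetical prices: steak in store B has an outdated second price.
food_prices = pd.DataFrame({
    "item": ["pear", "steak", "steak", "steak"],
    "store": ["A", "A", "B", "B"],
    "price": [0.99, 5.99, 6.99, 7.49],
    "year": [2017, 2017, 2017, 2015],
})

# Hypothetical transactions: coconut has no price at all.
food_transactions = pd.DataFrame({
    "custid": [1, 1, 2, 3, 3],
    "item": ["pear", "steak", "steak", "steak", "coconut"],
    "store": ["A", "A", "B", "B", "B"],
    "quantity": [5, 3, 2, 1, 4],
})

# Default inner merge: steak/B rows are duplicated and coconut is dropped.
naive = food_transactions.merge(food_prices, on=["item", "store"])

# Keep only current prices, then use a left join to keep every transaction.
current = food_prices.query("year == 2017")
checked = food_transactions.merge(
    current,
    on=["item", "store"],
    how="left",
    validate="many_to_one",  # raises if a price key is duplicated
    indicator=True,          # adds a _merge column showing the row's origin
)

print(len(food_transactions), len(naive), len(checked))
print(checked[checked["_merge"] == "left_only"])
```

Comparing row counts before and after the merge, and inspecting the rows marked left_only, is one inexpensive way to catch both duplication and silent drops.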

It is possible to use join in these instances, but the columns used for alignment in the passed DataFrame must be moved into the index first. Finally, concat is going to be a poor choice whenever you intend to align data by values in their columns.
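A minimal sketch of the join alternative, using hypothetical data: the key columns of the passed DataFrame are moved into its index with set_index, and the calling DataFrame names its matching columns through the on parameter.

```python
import pandas as pd

# Hypothetical current-year prices and transactions.
prices = pd.DataFrame({
    "item": ["pear", "steak", "steak"],
    "store": ["A", "A", "B"],
    "price": [0.99, 5.99, 6.99],
})
transactions = pd.DataFrame({
    "custid": [1, 1, 2, 3],
    "item": ["pear", "steak", "steak", "coconut"],
    "store": ["A", "A", "B", "B"],
    "quantity": [5, 3, 2, 1],
})

# Move the key columns of the passed DataFrame into its index.
prices_indexed = prices.set_index(["item", "store"])

# join matches the calling frame's columns against the passed frame's index.
joined = transactions.join(prices_indexed, on=["item", "store"])

# The equivalent merge needs no index manipulation.
merged = transactions.merge(prices, on=["item", "store"], how="left")
print(joined)
```

Because join defaults to a left join, the coconut transaction is kept with a missing price, just as in the left merge.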
