Merging database-style dataframes

Many beginner developers get confused when working with pandas dataframes, especially regarding when to use append, concat, merge, or join. In this section, we are going to check out the separate use cases for each of these.

Let's assume that you are working at a university as a professor teaching a Software Engineering course and an Introduction to Machine Learning course, and there are enough students to split into two classes. The examination for each class was done in two separate buildings and graded by two different professors. They gave you two different dataframes. In the first example, let's only consider one subject— the Software Engineering course.

Check out the following screenshot:

In the preceding dataset, the first column contains information about student identifiers and the second column contains their respective scores in any subject. The structure of the dataframes is the same in both cases. In this case, we would need to concatenate them. 

We can do that by using the pandas concat() method:

dataframe = pd.concat([dataFrame1, dataFrame2], ignore_index=True)
dataframe

The output of the preceding code is a single dataframe combining both of the tables. These tables would be merged into a single one as shown in the following screenshot:

The ignore_index argument creates a new index; in its absence, we'd keep the original indices. Note that we combined the dataframes along axis=0, that is to say, we combined them together in the same direction. What if we want to combine them side by side? Then we have to specify axis=1.

See the difference using the following code:

pd.concat([dataFrame1, dataFrame2], axis=1)

The output of the preceding code is shown in the following screenshot:

Note the difference in the output. When we specify axis=1, the concatenation happens on a side-by-side basis. 

Let's continue using the same case we discussed in the preceding code. In the first example, you received two dataframe files for the same subject. Now, consider another use case where you are teaching two courses: Software Engineering and Introduction to Machine Learning. You will get two dataframes from each subject:

  • Two for the Software Engineering course 
  • Another two for the Introduction to Machine Learning course

Check the following dataframes:

In case you missed it, there are important details you need to note in the preceding dataframes:

  • There are some students who are not taking the software engineering exam. 
  • There are some students who are not taking the machine learning exam.
  • There are students who appeared in both courses. 

Now, assume your head of department walked up to your desk and started bombarding you with a series of questions:

  • How many students appeared for the exams in total?
  • How many students only appeared for the Software Engineering course?
  • How many students only appeared for the Machine Learning course?

There are several ways in which you can answer these questions. Using the EDA technique is one of them. In this section, we are going to use the pandas library to answer the preceding questions.

Let's check the dataframes for both subjects:

import pandas as pd

df1SE = pd.DataFrame({ 'StudentID': [9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29], 'ScoreSE' : [22, 66, 31, 51, 71, 91, 56, 32, 52, 73, 92]})
df2SE = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30], 'ScoreSE': [98, 93, 44, 77, 69, 56, 31, 53, 78, 93, 56, 77, 33, 56, 27]})

df1ML = pd.DataFrame({ 'StudentID': [1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29], 'ScoreML' : [39, 49, 55, 77, 52, 86, 41, 77, 73, 51, 86, 82, 92, 23, 49]})
df2ML = pd.DataFrame({'StudentID': [2, 4, 6, 8, 10, 12, 14, 16, 18, 20], 'ScoreML': [93, 44, 78, 97, 87, 89, 39, 43, 88, 78]})`

As you can see in the preceding dataset, you have two dataframes for each subject. So, the first task is to concatenate these two subjects into one. Secondly, these students have taken the Introduction to Machine Learning course as well as the Software Engineering course. So, we need to merge these scores into the same dataframes. There are several ways to do this. Let's explore some options.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.136.18.141