Retrieving and cleaning data

First things first: we need data, and we need to clean it before drawing anything. The wbstats package, which retrieves data from the World Bank Data API, will handle the retrieval. This section demonstrates how to use wbstats; the data obtained and cleaned here will be used later to make plots.

We will look for worldwide data about inequality, education, and population, all of which can be retrieved from the World Bank database. Start by installing wbstats if you don't have it yet. If you are not sure whether you have it, simply run the following code:

if(!require('wbstats')){install.packages('wbstats')}

Load the wbstats library and enter wbcache() to download an updated list of available countries, indicators, and source information:

library(wbstats)
update_cache <- wbcache()

One can investigate the available information by entering str(update_cache) in the console.

Throughout this chapter, worldwide information about income inequality (as the GINI Index), education (as mean years of schooling), and population is going to be used. Let's begin with the GINI Index. Try wbsearch(pattern = 'gini') and assign the result to an object for further investigation, just like in the next code block:

gini_vars <- wbsearch(pattern = 'gini')

The gini_vars object is a DataFrame containing 178 rows and 2 columns: indicatorID and indicator. It lists all of the available indicators that match the pattern gini. The pattern argument is how you tell wbsearch() which topic you are interested in, so choosing it well is essential.

I encourage the reader to try head(gini_vars) at this point. Notice the indicator column. It briefly describes each of the available topics; indicatorID shows the corresponding ID for that information. After investigating gini_vars a little bit, I found the information I was looking for; it was described under the indicatorID 3.0.Gini:

gini_vars[gini_vars$indicatorID == '3.0.Gini',]
#    indicatorID        indicator
# 10784 3.0.Gini Gini Coefficient
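
When a search returns many rows, it can help to narrow them down by description before picking an ID. The following is a minimal sketch of that idea using base R only; search_results is a toy stand-in that I built by hand to mimic the two columns returned by wbsearch(), and its rows are illustrative, not a real query result:

```r
# Toy stand-in for a wbsearch() result; rows are illustrative only
search_results <- data.frame(
  indicatorID = c('3.0.Gini', 'SI.POV.GINI', 'SP.POP.TOTL'),
  indicator   = c('Gini Coefficient',
                  'GINI index (World Bank estimate)',
                  'Population, total'),
  stringsAsFactors = FALSE
)

# Keep the rows whose description mentions "gini", ignoring case
is_gini   <- grepl('gini', search_results$indicator, ignore.case = TRUE)
gini_rows <- search_results[is_gini, ]
print(gini_rows)
```

The same grepl() filter can be applied to gini_vars itself once you have it.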

Calling wb() with the desired indicator returns the complete DataFrame for that topic:

dt_gini <- wb(indicator = '3.0.Gini')

The previous code block requests the 3.0.Gini indicator and stores the result in dt_gini. When I ran it, the DataFrame contained 232 observations of seven variables; your numbers may differ. The variable names are the defaults assigned by wb().

Only three variables will be required by this chapter: value, date, and country. The first one, value, holds the value for the queried indicator; the last two are pretty much self-explanatory. The next code block is getting data about education:

edu_vars <- wbsearch(pattern = 'years of schooling')
edu_vars[edu_vars$indicatorID == 'UIS.EA.MEAN.1T6.AG25T99',]
#                indicatorID                                                     indicator
#499 UIS.EA.MEAN.1T6.AG25T99 UIS: Mean years of schooling of the population age 25+. Total
dt_edu <- wb(indicator = 'UIS.EA.MEAN.1T6.AG25T99')

The mean years of schooling of the population aged 25 and older was the variable selected to represent education. The next code block retrieves data about population:

pop_vars <- wbsearch(pattern = 'total population')
pop_vars[pop_vars$indicatorID == 'SP.POP.TOTL',]
dt_pop <- wb(indicator = 'SP.POP.TOTL')

At this point, all of the data that we need is split into different datasets. Having your data stored in a minimal DataFrame is a good practice, so that is something to work on. Each DataFrame has seven variables, but I am only interested in three of them for each frame. Let's reduce dt_gini:

dt_gini <- dt_gini[, c('date', 'value', 'country')]

Except for the variables named inside the brackets, all of the other variables were dropped. We can check the remaining ones using names():

names(dt_gini)
#[1] "date"    "value"   "country"

By default, the queried indicator is always named value. This may cause confusion when the time comes to merge the different DataFrames. We can rename this variable using names():

names(dt_gini)[2] <- 'gini'

Notice how a single index was used inside the brackets. This way, we changed the name of the value variable alone. The next code block does the same for the DataFrames dt_edu and dt_pop:

dt_edu <- dt_edu[, c('date', 'value', 'country')]
names(dt_edu)[2] <- 'mean_yrs_schooling'
dt_pop <- dt_pop[, c('date', 'value', 'country')]
names(dt_pop)[2] <- 'population'
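
Since the same two steps are repeated for every DataFrame, they could also be wrapped in a small function. The sketch below is only one way to do it: keep_and_rename() is a hypothetical helper of my own (it is not part of wbstats), it renames the column by name rather than by position, and the toy DataFrame holds made-up values so the example runs without calling the API:

```r
# Hypothetical helper (not part of wbstats): keep three columns and
# rename 'value' by name instead of by position
keep_and_rename <- function(df, new_name) {
  out <- df[, c('date', 'value', 'country')]
  names(out)[names(out) == 'value'] <- new_name
  out
}

# Toy input with made-up values, shaped loosely like a wb() result
toy <- data.frame(
  iso3c   = c('AAA', 'BBB'),
  date    = c('2010', '2010'),
  value   = c(50, 45),
  country = c('Country A', 'Country B'),
  stringsAsFactors = FALSE
)

toy_gini <- keep_and_rename(toy, 'gini')
names(toy_gini)
```

Renaming by name keeps working even if the column order inside the brackets changes later.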

The next thing to do is to merge the datasets. To do so, we must have a single matching ID for each DataFrame—neither date nor country alone could do it. The solution is to create a new variable combining both. The following code block uses paste() to do so:

dt_gini$merge_key <- paste(dt_gini$date, dt_gini$country, 
                           sep = '_')
dt_edu$merge_key <- paste(dt_edu$date, dt_edu$country,
                          sep = '_')
dt_pop$merge_key <- paste(dt_pop$date, dt_pop$country, 
                          sep = '_')
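
To see why the combined key is needed, it may help to check for repeated values on a small example. This is a sketch on toy data that I made up (two countries, two years); anyDuplicated() is a base R function that returns 0 when a vector has no repeated values and a positive index otherwise:

```r
# Toy data: two countries observed in two years
toy <- data.frame(
  date    = c('2010', '2011', '2010', '2011'),
  country = c('Country A', 'Country A', 'Country B', 'Country B'),
  stringsAsFactors = FALSE
)
toy$merge_key <- paste(toy$date, toy$country, sep = '_')

# Neither column is unique on its own, but the combination is
date_repeats    <- anyDuplicated(toy$date) > 0
country_repeats <- anyDuplicated(toy$country) > 0
key_is_unique   <- anyDuplicated(toy$merge_key) == 0

c(date_repeats, country_repeats, key_is_unique)
```

Running the same anyDuplicated() check on dt_gini$merge_key, dt_edu$merge_key, and dt_pop$merge_key is a cheap way to confirm the keys are unique before merging.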

Now that we have matching unique IDs for the rows in all DataFrames, merging them is actually pretty easy. For the sake of organization, the merged DataFrame will be stored in a new variable, dt. Merge dt_edu with dt_gini:

dt <- merge(dt_edu, dt_gini, by = 'merge_key', all = F)

The merge() function is doing the heavy lifting. The first two arguments are the DataFrames to be merged. Next comes the by argument, which names the variable used to match rows between the two datasets. Setting all to FALSE (F) prevents the new dataset from containing any observation that is not present in both DataFrames at the same time. Merge dt with dt_pop:

dt <- merge(dt, dt_pop, by = 'merge_key', all = F)
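
The effect of the all argument is easier to see on a small example. The sketch below uses two toy DataFrames with made-up values that overlap in a single year. It also shows an alternative to the pasted key: merge() accepts a vector of column names in by, so the two columns can be matched directly. That is a variation on the approach used in this section, not a replacement for it:

```r
# Toy DataFrames with made-up values; only 2011 appears in both
edu <- data.frame(
  date    = c('2010', '2011'),
  country = c('Country A', 'Country A'),
  mean_yrs_schooling = c(7.0, 7.2),
  stringsAsFactors = FALSE
)
gini <- data.frame(
  date    = c('2011', '2012'),
  country = c('Country A', 'Country A'),
  gini    = c(50, 49),
  stringsAsFactors = FALSE
)

# all = FALSE keeps only rows present in both DataFrames
inner <- merge(edu, gini, by = c('date', 'country'), all = FALSE)

# all = TRUE keeps every row and fills the gaps with NA
outer <- merge(edu, gini, by = c('date', 'country'), all = TRUE)

nrow(inner)  # one row: 2011
nrow(outer)  # three rows: 2010, 2011, and 2012
```

Matching on both columns directly also avoids ending up with suffixed duplicates of date and country in the merged result.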

The data is now gathered in a single DataFrame. Nonetheless, it has far more variables than we need. To keep the DataFrame minimal, the next code block keeps only the variables that are going to be used later:

dt <- dt[,c('gini', 'population', 'mean_yrs_schooling', 'date', 'country')]
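
Before plotting, a quick sanity check on the types and missing values can save trouble later. The following is a sketch on a toy DataFrame with made-up values. It assumes that date arrives as a character string, which is worth confirming with str(dt) on your own data, and it drops incomplete rows with complete.cases() from base R:

```r
# Toy stand-in for dt, with made-up values and one missing gini
toy_dt <- data.frame(
  gini               = c(50, NA),
  population         = c(1000, 2000),
  mean_yrs_schooling = c(7, 8),
  date               = c('2010', '2011'),
  country            = c('Country A', 'Country B'),
  stringsAsFactors = FALSE
)

# Assumption: date is stored as text, so convert it to an integer year
toy_dt$date <- as.integer(toy_dt$date)

# Keep only the rows without missing values
toy_dt <- toy_dt[complete.cases(toy_dt), ]
str(toy_dt)
```

Whether dropping incomplete rows is appropriate depends on the plot you have in mind, so treat this step as optional.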

In the real world, data is rarely ready to go into a plot. That is one really good reason for you to master data manipulation, and there are many more. This section showed how to retrieve data from the World Bank Data API and how to put different indicators together while keeping a minimal DataFrame. The next section uses this data to build bubble plots and a map using different packages.
