Big data linear regression analysis

In this section, we will illustrate how to load large datasets from a URL with the help of the ff package and how to use the biglm package to fit a linear regression model to datasets that are larger than the available memory. The biglm package can handle datasets that would otherwise exceed the computer's RAM because it works on the data in chunks. It processes one chunk, updates the sufficient statistics required for the model, then discards the chunk and loads the next one. This process is repeated until all the data has been processed.
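The chunk-by-chunk update described above can be sketched in base R. This is only an illustration of the idea, not the biglm implementation: it accumulates the cross-products X'X and X'y and solves the normal equations, whereas biglm updates a QR decomposition incrementally. The simulated data frame, the chunk size, and all variable names are assumptions made for the example.

```r
# Simulated data stands in for a file too large to load at once (assumption).
set.seed(42)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = runif(n))
dat$y <- 1.5 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n)

chunk_size <- 1000
xtx <- matrix(0, 3, 3)  # running sum of X'X (intercept, x1, x2)
xty <- matrix(0, 3, 1)  # running sum of X'y

for (start in seq(1, n, by = chunk_size)) {
  rows <- start:min(start + chunk_size - 1, n)
  X <- model.matrix(~ x1 + x2, data = dat[rows, ])
  y <- dat$y[rows]
  # Only the small 3 x 3 and 3 x 1 summaries are kept between chunks.
  xtx <- xtx + crossprod(X)
  xty <- xty + crossprod(X, y)
}

# Coefficients from the accumulated statistics
beta_chunked <- drop(solve(xtx, xty))

# Reference fit on the full data, possible here because the data is small
beta_full <- coef(lm(y ~ x1 + x2, data = dat))
```

The two coefficient vectors agree up to numerical precision, which is why a chunked fit never needs the whole dataset in memory at once.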

The following example models the unemployment compensation amount as a linear function of a few socioeconomic variables.

Loading big data

To perform a big data linear regression analysis, we first need to install and load the ff package, which we will use to open large files in R, and the biglm package, which we will use to fit the linear regression model on our data:

install.packages("ff")
install.packages("biglm")
library(ff)
library(biglm)

For the big data linear regression analysis, we used the Individual Income Tax ZIP Code Data provided by the U.S. government agency, the Internal Revenue Service (IRS). The ZIP code-level data shows selected income and tax items classified by state, ZIP code, and income class. We used the 2012 data; this dataset is moderate in size, yet large enough to highlight the functionality of the big data packages.

We will first download the required dataset from the URL to a local file named soi.csv with the following command:

download.file("http://www.irs.gov/file_source/pub/irs-soi/12zpallagi.csv","soi.csv")

Once we have downloaded the data, we will use the read.csv.ffdf function, which reads the file into an ffdf object supported by the ff package. It is a convenience wrapper around read.table.ffdf for csv files, and read.table.ffdf itself works very much like the read.table function:

x <- read.csv.ffdf(file="soi.csv",header=TRUE)
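To see what read.csv.ffdf returns without downloading the IRS file, the following minimal sketch writes a small temporary CSV and reads it back in chunks. The file, its columns a, b, and c, and the chunk sizes passed to first.rows and next.rows are assumptions chosen for illustration only.

```r
library(ff)

# Write a small numeric CSV to a temporary file (hypothetical stand-in data).
set.seed(7)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = rnorm(50), b = runif(50), c = rnorm(50)),
          tmp, row.names = FALSE)

# Read the first 10 rows, then continue in chunks of 20 rows.
small <- read.csv.ffdf(file = tmp, header = TRUE,
                       first.rows = 10, next.rows = 20)

class(small)   # an ffdf object backed by files on disk
dim(small)     # number of rows and columns
names(small)   # column names taken from the header

# Indexing rows pulls only that slice into memory as a regular data frame.
head_rows <- small[1:5, ]
```

For a genuinely large file, larger chunk sizes are usually more efficient; the small values here only make the chunking visible.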

Now that the dataset is available as an ffdf object, we can use the biglm package to perform the linear regression analysis.

Using this dataset of almost 167,000 observations across 77 variables, we will investigate whether the location-level amount of unemployment compensation (the A02300 variable) can be explained by the total salary and wages amount (A00200), the number of residents by income category (AGI_STUB), the number of dependents (NUMDEP), and the number of married people (MARS2) in the given location.

Fitting a linear regression model on large datasets

For the linear regression analysis, we will use the biglm function, so the biglm package must be loaded before we specify our model. We loaded it earlier; calling require again is harmless:

require(biglm)

As the next step, we will define the formula and fit the model on our data. With the summary function, we can obtain the coefficients and the significance levels of the variables in the fitted model. As the printed summary does not include the R-squared value, we need to extract it from the summary object with a separate command:

mymodel <- biglm(A02300 ~ A00200 + AGI_STUB + NUMDEP + MARS2, data = x)
summary(mymodel)
Large data regression model: biglm(A02300 ~ A00200 + AGI_STUB + NUMDEP + MARS2, data = x)
Sample size =  166904 
                Coef     (95%      CI)      SE      p
(Intercept) 131.9412  44.3847 219.4977 43.7782 0.0026
A00200       -0.0019  -0.0019  -0.0018  0.0000 0.0000
AGI_STUB    -40.1597 -62.6401 -17.6794 11.2402 0.0004
NUMDEP        0.9270   0.9235   0.9306  0.0018 0.0000
MARS2        -0.1451  -0.1574  -0.1327  0.0062 0.0000
summary(mymodel)$rsq
[1] 0.8609021

We can conclude from the coefficient table that all the variables contribute significantly to the model, as every p-value is below 0.01. The independent variables explain 86.09 percent of the total variance of the unemployment compensation amount, indicating a good fit of the model.
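The chunked workflow can also be driven by hand with the update method that biglm provides for its fitted objects. The following is a minimal sketch on simulated data; the data frame, the five-way split, and the variable names are assumptions and are unrelated to the IRS dataset.

```r
library(biglm)

# Simulated data split into five chunks (hypothetical stand-in for a big file).
set.seed(1)
n <- 5000
dat <- data.frame(x1 = rnorm(n), x2 = runif(n))
dat$y <- 3 + 0.8 * dat$x1 - 1.2 * dat$x2 + rnorm(n)
chunks <- split(dat, rep(1:5, each = n / 5))

# Fit on the first chunk, then feed the remaining chunks one at a time.
fit <- biglm(y ~ x1 + x2, data = chunks[[1]])
for (chunk in chunks[-1]) {
  fit <- update(fit, chunk)
}

summary(fit)               # coefficients, confidence intervals, SE, p-values
rsq <- summary(fit)$rsq    # R-squared, extracted as in the example above

# Reference fit on the full data for comparison
full <- lm(y ~ x1 + x2, data = dat)
```

After the last update, the coefficients match those of an ordinary lm fit on the complete data, although only one chunk was passed to the fitting routine at a time.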
