In this section, we will illustrate how to download a large dataset from a URL, open it with the help of the ff package, and use the biglm package to fit a linear regression model to datasets that are larger than the available memory. The biglm package can handle datasets even if they exceed the RAM of the computer, because it loads the data into memory in chunks. It processes one chunk and updates the sufficient statistics required for the model. It then discards the chunk and loads the next one. This process is repeated until all the data has been processed.
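The chunk-by-chunk idea can be sketched in base R. The snippet below is only an illustration under simplifying assumptions: it accumulates the cross-products X'X and X'y as the sufficient statistics and solves the normal equations at the end, whereas biglm itself updates a QR decomposition incrementally. The simulated data and all variable names are made up for the example.

```r
# Sketch: least squares fitted by accumulating X'X and X'y one chunk at a time.
# This is NOT biglm's internal algorithm, only the same idea in its simplest form.
set.seed(1)
n <- 1000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 2 + 3 * dat$x1 - 1.5 * dat$x2 + rnorm(n)

chunk_size <- 100
xtx <- matrix(0, 3, 3)  # running sum of X'X
xty <- matrix(0, 3, 1)  # running sum of X'y

for (start in seq(1, n, by = chunk_size)) {
  rows  <- start:min(start + chunk_size - 1, n)
  chunk <- dat[rows, ]                         # the only rows used in this step
  X     <- model.matrix(~ x1 + x2, data = chunk)
  xtx   <- xtx + crossprod(X)                  # update the sufficient statistics
  xty   <- xty + crossprod(X, chunk$y)
}                                              # the chunk is overwritten on the next pass

beta_chunked <- drop(solve(xtx, xty))          # solve the normal equations once
beta_full    <- coef(lm(y ~ x1 + x2, data = dat))
```

Comparing beta_chunked with beta_full shows that the estimates do not depend on whether the data was processed at once or in pieces. Only the small 3 x 3 and 3 x 1 accumulators have to stay in memory between chunks.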
The following example models the unemployment compensation amount as a linear function of a few socioeconomic variables.
To perform a big data linear regression analysis, we first need to install and load the ff package, which we will use to open large files in R, and the biglm package, which we will use to fit the linear regression model on our data:
install.packages("ff")
install.packages("biglm")
library(ff)
library(biglm)
For the big data linear regression analysis, we used the Individual Income Tax ZIP Code Data provided by the U.S. government agency, the Internal Revenue Service (IRS). The ZIP code-level data shows selected income and tax items classified by state, ZIP code, and income class. We used the 2012 data; this dataset is moderate in size, yet large enough to highlight the functionality of the big data packages.
We will download the required dataset from the URL to a local file with the following command:
download.file("http://www.irs.gov/file_source/pub/irs-soi/12zpallagi.csv","soi.csv")
Once we have downloaded the data, we will use the read.csv.ffdf function, which reads the file into an ffdf object supported by the ff package. It is a convenience wrapper around the read.table.ffdf function, which works very much like the read.table function, just as read.csv wraps read.table:
x <- read.csv.ffdf(file="soi.csv",header=TRUE)
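To try the function without downloading the IRS file, the following minimal sketch reads a small temporary CSV file instead. The file is written from the built-in mtcars data, and the chunk sizes passed through first.rows and next.rows are arbitrary values chosen for illustration; for a real file, much larger values would be appropriate.

```r
library(ff)

# Write a small, all-numeric data frame to a temporary CSV file
tmp <- tempfile(fileext = ".csv")
write.csv(mtcars, tmp, row.names = FALSE)

# Read the file in chunks of 10 rows; the result is stored on disk by ff
small <- read.csv.ffdf(file = tmp, header = TRUE,
                       first.rows = 10, next.rows = 10)

class(small)   # an "ffdf" object
dim(small)     # same shape as mtcars: 32 rows, 11 columns
small[1:3, ]   # indexing pulls the selected rows into an ordinary data frame
```

Because the ffdf object keeps its columns in files on disk, only the rows requested through indexing are brought into memory.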
After we have read the dataset into an ffdf object, we can use the biglm package to perform the linear regression analysis.
Using this dataset of almost 167,000 observations and 77 variables, we will investigate whether the location-level amount of unemployment compensation (the A02300 variable) can be explained by the total salary and wages amount (A00200), the income category (AGI_STUB), the number of dependents (NUMDEP), and the number of joint returns filed by married couples (MARS2) in the given location.
For the linear regression analysis, we will use the biglm function; therefore, before we specify our model, we need to make sure that the package is loaded:
require(biglm)
As the next step, we will define the formula and fit the model on our data. With the summary function, we can obtain the coefficients of the fitted model along with their confidence intervals, standard errors, and p-values. As the printed summary does not include the R-squared value, we need to extract it from the summary object with a separate command:
mymodel <- biglm(A02300 ~ A00200 + AGI_STUB + NUMDEP + MARS2, data = x)
summary(mymodel)
Large data regression model: biglm(A02300 ~ A00200 + AGI_STUB + NUMDEP + MARS2, data = x)
Sample size =  166904
                Coef     (95%      CI)      SE      p
(Intercept) 131.9412  44.3847 219.4977 43.7782 0.0026
A00200       -0.0019  -0.0019  -0.0018  0.0000 0.0000
AGI_STUB    -40.1597 -62.6401 -17.6794 11.2402 0.0004
NUMDEP        0.9270   0.9235   0.9306  0.0018 0.0000
MARS2        -0.1451  -0.1574  -0.1327  0.0062 0.0000
summary(mymodel)$rsq
[1] 0.8609021
We can conclude from the coefficient table that all the variables contribute significantly to the model, as every p-value is below 0.01. The independent variables explain 86.09 percent of the total variance of the unemployment compensation amount, indicating a good fit of the model.
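The chunk-wise updating that biglm relies on can also be driven by hand through its update method. The following is a minimal sketch on the built-in mtcars data rather than the IRS file; the three-way split and the choice of predictors are arbitrary and serve only as an illustration.

```r
library(biglm)

# Split a small built-in dataset into three chunks of rows
chunks <- split(mtcars, rep(1:3, length.out = nrow(mtcars)))

# Fit the model on the first chunk, then fold in the remaining chunks
fit <- biglm(mpg ~ wt + hp, data = chunks[[1]])
for (chunk in chunks[-1]) {
  fit <- update(fit, chunk)
}

coef(fit)          # estimates after all chunks have been processed
summary(fit)$rsq   # R-squared value, extracted as in the main example

# Reference fit on the whole dataset at once, for comparison
full <- lm(mpg ~ wt + hp, data = mtcars)
coef(full)
```

The coefficients obtained chunk by chunk should agree with those of the ordinary lm fit up to numerical precision, which is what makes this approach usable when the full dataset does not fit into memory.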