Introduction to big data analysis in R

Big data refers to situations in which the volume, velocity, or variety of data exceeds the capacity of our computational resources to process, store, and analyze it. Big data analysis has to deal not only with large datasets but also with computationally intensive analyses, simulations, and models with many parameters.

Leveraging large data samples can provide significant advantages in the field of quantitative finance: we can relax the assumptions of linearity and normality, generate better prediction models, or identify low-frequency events.

However, the analysis of large datasets raises two challenges. First, most tools of quantitative analysis have limited capacity to handle massive data, so even simple calculations and data-management tasks can be difficult to perform. Second, even without capacity limits, computation on large datasets may be extremely time consuming.

Although R is a powerful and robust program with a rich set of statistical algorithms and capabilities, one of its biggest shortcomings is its limited ability to scale to large data sizes. The reason for this is that R requires the data it operates on to be loaded into memory first. However, on a 32-bit system, the operating system and system architecture can only address approximately 4 GB of memory. If a dataset approaches the RAM limit of the computer, it can become practically impossible to work with on a standard computer using a standard algorithm. Sometimes, even smaller datasets can cause serious computation problems in R, as R must hold in memory the largest object created during the analysis process.
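To see why memory becomes the bottleneck, consider how much RAM a plain numeric matrix consumes. The following minimal sketch uses only base R; the matrix dimensions are our own illustrative choice, not taken from the text:

# A numeric matrix is stored as 8-byte doubles, so one million rows
# and ten columns already occupy roughly 80 MB of RAM
x <- matrix(rnorm(1e6 * 10), nrow = 1e6, ncol = 10)
print(object.size(x), units = "MB")
rm(x); gc()  # release the memory once the object is no longer needed

Intermediate copies made during an analysis can multiply this footprint, which is why even moderately sized datasets can exhaust the available memory.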

R, however, has a few packages that bridge this gap and provide efficient support for big data analysis. In this section, we will introduce two particular packages that are useful tools for creating, storing, accessing, and manipulating massive data.

First, we will introduce the bigmemory package, which is a widely used option for large-scale statistical computing. The package and its sister packages (biganalytics, bigtabulate, and bigalgebra) address two challenges in handling and analyzing massive datasets: data management and statistical analysis. These tools can implement massive matrices that do not fit into R's runtime memory and support their manipulation and exploration.
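As a minimal sketch of this workflow (the backing file names and matrix dimensions below are arbitrary illustrative choices), a file-backed big.matrix keeps the data on disk while exposing familiar matrix indexing, and biganalytics supplies summary statistics that operate on it directly:

library(bigmemory)
library(biganalytics)

# Create a file-backed big.matrix: the data lives on disk and is
# memory-mapped, so it need not fit into RAM
x <- filebacked.big.matrix(nrow = 1e6, ncol = 3, type = "double",
                           backingfile = "example.bin",
                           descriptorfile = "example.desc")

# big.matrix objects support ordinary matrix-style indexing
x[, 1] <- rnorm(1e6)
x[, 2] <- runif(1e6)
x[, 3] <- rbinom(1e6, size = 1, prob = 0.5)

# biganalytics computes column statistics without copying the data into R
colmean(x)
colsd(x)

The descriptor file allows the same on-disk matrix to be reattached in a later session, or shared across parallel R processes.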

An alternative to the bigmemory package is the ff package. This package allows R users to handle large vectors and matrices and to work with several large data files simultaneously. The big advantage of ff objects is that they behave like ordinary R vectors; however, the data is not stored in memory but resides on disk.
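The following brief sketch of the ff interface (lengths and dimensions are again arbitrary) shows that disk-resident ff vectors and matrices are created with the ff() constructor and then indexed just like their in-memory counterparts:

library(ff)

# Create a disk-resident vector; it behaves like an ordinary R vector,
# but its contents are memory-mapped from a file on disk
v <- ff(vmode = "double", length = 1e6)
v[1:5] <- c(1.5, 2.5, 3.5, 4.5, 5.5)
v[1:5]

# Matrices work the same way: only the chunks being accessed are read into RAM
m <- ff(vmode = "double", dim = c(1e4, 100))
m[1, ] <- rnorm(100)
dim(m)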

In this section, we will showcase how these packages help R users overcome the limitations of R and cope with very large datasets. Although the datasets we use here are modest in size, they effectively demonstrate the power of big data packages.
