Using optimized packages

Much of the functionality in base R has alternative implementations available in contributed packages. Quite often, these packages offer a faster or less memory-intensive substitute for the base R equivalent. For example, in addition to adding a ton of extra functionality, the glmnet package performs regression far faster than glm in my experience.
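
To get a feel for the difference, here is a minimal sketch on simulated data (not taken from the text); the exact timings will, of course, depend on your machine and your data:

# a rough, illustrative comparison on simulated data
library(glmnet)

set.seed(1)
n <- 10000; p <- 50
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x %*% rnorm(p)))

# base R logistic regression
system.time(fit_glm <- glm(y ~ x, family = binomial))

# glmnet fits an entire regularization path over its penalty parameter
system.time(fit_glmnet <- glmnet(x, y, family = "binomial"))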

For faster data import, you might be able to use fread from the data.table package or the read_* family of functions from the readr package. It is not uncommon for data import tasks that used to take several hours to take only a few minutes with these read functions.
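
For example, assuming a large CSV file (the file name below is just a placeholder), the three approaches look like this:

library(data.table)
library(readr)

df  <- read.csv("somebigfile.csv")    # base R, often the slowest
dt  <- fread("somebigfile.csv")       # data.table, returns a data.table
tbl <- read_csv("somebigfile.csv")    # readr, returns a tibble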

For common data manipulation tasks, like merging (joining), conditional selection, sorting, and so on, you will find that the data.table and dplyr packages offer incredible speed improvements. Both of these packages have a ton of useRs who swear by them, and the community support is solid. You'd be well advised to become proficient in one of these packages when you're ready.
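
Just to give a flavor of the syntax, here is one small, made-up example (using the built-in mtcars dataset) of the same filter, join, and sort written with each package:

library(dplyr)
library(data.table)

# a made-up lookup table for the example
lookup <- data.frame(cyl = c(4, 6, 8),
                     label = c("small", "medium", "large"))

# dplyr: conditional selection, merge (join), then sort
mtcars %>%
  filter(mpg > 20) %>%
  inner_join(lookup, by = "cyl") %>%
  arrange(desc(mpg))

# data.table: the same operations
dt <- as.data.table(mtcars)
lk <- as.data.table(lookup)
dt[mpg > 20][lk, on = "cyl", nomatch = 0][order(-mpg)]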

Note

As it turns out, the sqldf package that I mentioned in passing in Chapter 10, Sources of Data (the one that can perform SQL queries on data frames) can sometimes offer performance improvements for common data manipulation tasks, too. Behind the scenes, sqldf (by default) loads your data frame into a temporary SQLite database, performs the query in the database's SQL execution environment, returns the results from the database as a data frame, and destroys the temporary database. Since the queries run in the database, sqldf can (a) sometimes perform them faster than the equivalent native R code, and (b) somewhat relax the constraint that the data objects R uses be held completely in memory.
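
A quick, illustrative example of that round trip (the query itself is arbitrary) looks like this:

library(sqldf)

# mtcars is copied into a temporary SQLite database, the query runs
# there, and an ordinary data frame comes back
sqldf("SELECT cyl, AVG(mpg) AS mean_mpg
       FROM mtcars
       GROUP BY cyl
       ORDER BY mean_mpg DESC")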

The constraint that data objects in R must fit into memory can be a real obstacle for people who work with datasets that are rather large, but just shy of being big enough to necessitate special tools. Some work around this constraint by storing their data in a database and using only selected subsets that will fit in memory. Others get by with random samples of the available data instead of requiring the whole dataset to be held at once. If none of these options sound appealing, there are R packages that let you work with data larger than the available memory by referring to the data directly as it is stored on your hard disk. The most popular of these seem to be ff and bigmemory. There is a cost to this, however: not only are the operations slower than they would be in memory, but because the data is processed piecemeal, in chunks, many standard R functions won't work on these objects. Be that as it may, the ffbase and biganalytics packages restore some of the lost functionality for ff and bigmemory objects, respectively. Most notably, these packages allow ff and bigmemory objects to be used with the biglm package, which can build generalized linear models from data that is too big to fit in memory.
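
The following is only a sketch of what that workflow might look like with ff, ffbase, and biglm; the file name and column names are hypothetical, and you should check each package's documentation for the details:

library(ff)
library(ffbase)
library(biglm)

# "huge.csv" and the column names below are placeholders
big <- read.csv.ffdf(file = "huge.csv", header = TRUE)

# ffbase supplies a bigglm method for ffdf objects, so the model is
# fit in chunks rather than loading everything into memory
fit <- bigglm(response ~ predictor1 + predictor2, data = big,
              family = gaussian())
summary(fit)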

Note

biglm can also be used to build generalized linear models using data stored in a database!

Remember the CRAN Task Views we talked about in the last chapter? There is a whole Task View dedicated to High Performance Computing (https://cran.r-project.org/web/views/HighPerformanceComputing.html). If there is a particular statistical technique that you'd like to find an optimized alternative for, this is the first place I'd check.
