Chapter 9. Bayesian Modeling at Big Data Scale

When we learned the principles of Bayesian inference in Chapter 3, Introducing Bayesian Inference, we saw that as the amount of training data increases, the contribution of the data to parameter estimation outweighs that of the prior distribution, and the uncertainty in the parameter estimates decreases. You may therefore wonder why one needs Bayesian modeling in large-scale data analysis at all. To answer this question, let us look at one such problem: building recommendation systems for e-commerce products.

In a typical e-commerce store, there are millions of users and tens of thousands of products. However, each user will have purchased only a small fraction (less than 10%) of all the products in the store in their lifetime. Let us say the store collects users' feedback for each product sold as a rating on a scale of 1 to 5. The store can then create a user-product rating matrix to capture the ratings of all users. In this matrix, rows correspond to users and columns correspond to products; the entry in each cell is the rating given by the user (the row) to the product (the column). Now, although the overall size of this matrix is huge, fewer than 10% of its entries have values, since each user has bought fewer than 10% of the products in the store. So, this is a highly sparse dataset. In any machine learning task where the overall data size is huge but the data is highly sparse, overfitting can happen, and one should rely on Bayesian methods (reference 1 in the References section of this chapter). Moreover, many models, such as Bayesian networks, latent Dirichlet allocation, and deep belief networks, are built on the Bayesian inference paradigm.
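To get a feel for this kind of sparsity, the following R sketch builds a small user-product rating matrix using the Matrix package (a recommended package that ships with standard R distributions). The user, product, and rating counts here are made-up illustrative values, chosen much smaller than a real store's, so the matrix fits in memory:

```r
# Sketch: a sparse user-product rating matrix.
# Dimensions and counts are illustrative only.
library(Matrix)

set.seed(42)
n_users    <- 1000
n_products <- 500
n_ratings  <- 5000   # far fewer ratings than 1000 * 500 cells

# Pick 5000 distinct cells of the matrix and assign each a rating in 1..5
cells <- sample(n_users * n_products, n_ratings)
ratings <- sparseMatrix(
  i = ((cells - 1) %% n_users) + 1,    # user index (row)
  j = ((cells - 1) %/% n_users) + 1,   # product index (column)
  x = sample(1:5, n_ratings, replace = TRUE),
  dims = c(n_users, n_products)
)

# Fraction of cells that actually hold a rating: 1% here
density <- nnzero(ratings) / (n_users * n_products)
print(density)
```

Storing the matrix in a sparse format means only the observed ratings occupy memory; the remaining 99% of cells are implicit zeros.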

When these models are trained on large datasets, such as text corpora from Reuters, the underlying problem is one of large-scale Bayesian modeling. Bayesian modeling is already computationally intensive, since we have to estimate the whole posterior distribution of the parameters and also average the model's predictions over it. Large datasets make the situation even worse. So, what computing frameworks can we use to do Bayesian learning at a large scale using R? In the next two sections, we will discuss some of the latest developments in this area.

Distributed computing using Hadoop

In the last decade, tremendous progress was made in distributed computing when two research engineers from Google developed a computing paradigm called the MapReduce framework, together with an associated distributed filesystem called the Google File System (reference 2 in the References section of this chapter). Later, Yahoo led the development of an open source implementation of this paradigm, named Hadoop, which became the hallmark of Big Data computing. Hadoop is ideal for processing large amounts of data that cannot fit into the memory of a single large computer: the data is distributed across multiple machines, and the computation is done on each node locally from disk. An example is extracting relevant information from log files, where the data for a single month is typically on the order of terabytes.

To use Hadoop, one writes programs in the MapReduce framework to parallelize the computation. A Map operation transforms the input data into key-value pairs, which are distributed to different nodes, where a computation is performed on each pair. Then, a shuffle operation brings together all the pairs that share the same key. After this, a Reduce operation aggregates all the values corresponding to the same key from the previous step. Typically, these MapReduce operations are written in Pig, a high-level platform whose scripts compile down to MapReduce jobs. One can also write MapReduce programs in R using the RHadoop package, which we will describe in the next section.
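To make the map, shuffle, and reduce stages concrete, here is a toy word-count sketch in plain base R. No Hadoop is involved; the three stages are simply mimicked with ordinary R functions on a two-document corpus invented for illustration:

```r
# Toy illustration of the three MapReduce stages in plain R (no Hadoop).
docs <- c("big data needs big computing",
          "bayesian modeling at big data scale")

# Map: turn each document into (word, 1) key-value pairs
pairs <- unlist(
  lapply(docs, function(doc) {
    words <- strsplit(doc, " ")[[1]]
    lapply(words, function(w) list(key = w, value = 1))
  }),
  recursive = FALSE
)

# Shuffle: group the values by key, so each word's 1s end up together
keys    <- sapply(pairs, `[[`, "key")
values  <- sapply(pairs, `[[`, "value")
grouped <- split(values, keys)

# Reduce: aggregate the values for each key (here, a sum)
counts <- sapply(grouped, sum)
print(counts["big"])   # "big" occurs three times across the corpus
```

In real Hadoop jobs, the map and reduce functions run on different machines and the shuffle moves data across the network; here all three stages run in one R session, but the data flow is the same.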
