In Chapter 1, Introduction to Scala and Machine Learning, we gained some idea about Scala, Apache Spark, and machine learning. In this chapter, we will explore ways to compose a data processing pipeline using Scala. In particular, we will discuss:
And then hopefully, we will be able to compose different components of the processing pipeline.
In this chapter, we will base our discussion on a dataset that is well suited to recommendation engines. We have selected the Entree dataset for this chapter. This dataset can be found at: https://archive.ics.uci.edu/ml/machine-learning-databases/entree-mld/entree.data.html. We have selected this dataset because:
Let's take a look at the data that we are presented with. First download it:
$ wget -c https://archive.ics.uci.edu/ml/machine-learning-databases/entree-mld/entree_data.tar.gz
$ tar zxf entree_data.tar.gz
$ ls -F entree
data/ README session/
On a Windows machine, you have two choices: download and unzip the file manually, or use a Linux-based VM and follow the preceding steps.
So there are two folders, data/ and session/, along with the README file. Your first task is to explore the different files in the dataset and get a feel for its structure. If you read the entree/data/README file, you will find that there are restaurants in eight cities: Atlanta, Boston, Chicago, Los Angeles, New Orleans, New York, San Francisco, and Washington DC.
Each restaurant has some subset of a fixed list of features, such as bakeries, Burmese, Chinese, cafeterias, diners, early dining, and so on. The features of a particular restaurant can be encoded as Boolean values: if a restaurant offers early dining, for example, we represent that feature as true, and as false otherwise. Since there are 257 features in total, we can represent each restaurant as 257 Boolean values or, more simply, as an array of numbers (0 for false, 1 for true):
val r1features = Array(0, 1, 1, 1, 0, 0, 0,...)
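As a minimal sketch of this encoding, the snippet below builds such an array from the set of features a restaurant has. The object name FeatureEncoding, the helper encode, and the zero-based index convention are assumptions made for illustration; they are not part of the dataset or of this chapter's code.

```scala
object FeatureEncoding {
  // Total number of features, as stated in the dataset description.
  val NumFeatures = 257

  // Hypothetical helper: turns a set of zero-based feature indices into
  // an array of 0s and 1s (1 if the restaurant has the feature, 0 if not).
  def encode(present: Set[Int]): Array[Int] = {
    require(present.forall(i => i >= 0 && i < NumFeatures), "feature index out of range")
    Array.tabulate(NumFeatures)(i => if (present.contains(i)) 1 else 0)
  }
}
```

For example, FeatureEncoding.encode(Set(1, 2, 3)) yields an array of length 257 whose first entries are 0, 1, 1, 1, 0, matching the shape of r1features above.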
Note that the Entree system generates recommendations for Chicago restaurants only. Additionally, we have historical data for different users. These users are identified by their IP addresses or domain names. Their sessions are stored in session.YEAR-QUARTER text files in the entree/session/ folder. Each row of such a file has several fields, which are well described in the dataset description. Take some time to go through the description, and you will notice the following fields:
Of these fields, an entry point value of 0 and an end point value of -1 each indicate missing data. The end point is assumed to be a restaurant that the user actually liked. In that sense, it is a good indicator of a user's preference for a particular restaurant, and it will serve as a good data point for making sensible recommendations.
Now we need to answer the following questions:
Our focus for now should be on building a data processing pipeline. Therefore, for the sake of simplicity let us keep our discussion to recommendations based on popularity over some time period or a window. We will cover many different recommendation algorithms in the next chapters.
For the first question, the learning algorithm, we simply choose an algorithm based on the popularity of a restaurant.
For the second question, data representation, let's use the historical data to infer the popularity of a restaurant. For this, we can utilize the session history provided in the dataset. In a typical browsing session, a user starts from some restaurant, then moves on to the next, and so on, until he or she finally settles on one at the end (that is, in the Chicago area). The navigation sequence therefore gives us a basis for counting restaurant visits, and those counts are our data representation.
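The counting idea can be sketched as follows. This is only an illustration under stated assumptions: the Session case class, the Popularity object, and the choice to count a known end point as one more visit are all hypothetical, and the sketch works on sessions that have already been parsed rather than on the raw session files.

```scala
object Popularity {
  // Hypothetical, already-parsed session: the restaurants a user browsed,
  // plus the end point (-1 means the end point is missing).
  final case class Session(visited: Seq[Int], endPoint: Int)

  // Counts how often each restaurant appears across all sessions.
  // Assumption: a known end point counts as one additional visit.
  def visitCounts(sessions: Seq[Session]): Map[Int, Int] = {
    val ids = sessions.flatMap { s =>
      if (s.endPoint == -1) s.visited else s.visited :+ s.endPoint
    }
    ids.groupBy(identity).map { case (id, hits) => id -> hits.size }
  }

  // Returns the n most visited restaurants, most popular first.
  // Ties are broken by the smaller restaurant id.
  def topN(counts: Map[Int, Int], n: Int): Seq[(Int, Int)] =
    counts.toSeq.sortBy { case (id, count) => (-count, id) }.take(n)
}
```

Calling Popularity.topN(Popularity.visitCounts(sessions), 10) would then give the ten most visited restaurants in whatever set of sessions is passed in.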
For the third question, feeding data to the algorithm, we will consider a stream-based approach. If our data is static and won't change for a long time, we can simply process everything in bulk and store the results. However, for an online system like Entree, this is not the case, and we are presented with more challenges. How should we feed the data then?
Generally, there are two approaches we can take in this scenario:
Of course, we would not want the recommender system to wait indefinitely to accumulate as much data as it can. So, a window-based approach makes sense here. In this case, we wait for a fixed time period while new data arrives. Once this time period has elapsed, we process that chunk of data and store the results (or move them to some appropriate place).
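The window-based idea can be illustrated with a small sketch. It assumes fixed-size, non-overlapping windows and operates on an in-memory collection rather than a live stream; the Visit case class, the Windowing object, and the millisecond timestamps are hypothetical names chosen for this example.

```scala
object Windowing {
  // Hypothetical event: a restaurant visit observed at a given time.
  final case class Visit(timestampMillis: Long, restaurantId: Int)

  // Groups visits into fixed-size, non-overlapping windows and counts the
  // visits per restaurant inside each window. Each key of the outer map is
  // the start time of a window.
  def countsPerWindow(visits: Seq[Visit], windowMillis: Long): Map[Long, Map[Int, Int]] = {
    require(windowMillis > 0, "window size must be positive")
    visits
      .groupBy(v => Math.floorDiv(v.timestampMillis, windowMillis) * windowMillis)
      .map { case (windowStart, inWindow) =>
        windowStart -> inWindow.groupBy(_.restaurantId).map { case (id, vs) => id -> vs.size }
      }
  }
}
```

In a real streaming setup, each window would be processed as soon as its time period elapses instead of being computed after the fact; the grouping logic stays the same.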
In practical situations, there is often too much data for a single machine to handle. Typically, many machine nodes form a distributed system and process this large volume of data together as a cluster. We can assume that, for a small number of nodes in a distributed system, the push strategy performs better. Read this paper to convince yourselves: A Fair Comparison of Pull and Push Strategies in Large Distributed Networks at http://www.pats.ua.ac.be/content/publications/2014/Pull_push_RR_extended.pdf.
Next, we perform some simple analysis on the Entree dataset.
Knowing the dataset better goes a long way. So let's write some Scala code to find out which cities have what kinds of restaurants. First, we count the number of restaurants in each city. Next, we find the restaurant features most commonly found in each city.
Here is the output of the analysis:
$ sbt 'run-main chapter02.Stats /home/tuxdna/work/packt/dataset/entree'
[info] Running chapter02.Stats /home/tuxdna/work/packt/dataset/entree
Cities and their restaurant count
--------------------------------------------------
atlanta has 267 restaurants
boston has 438 restaurants
chicago has 676 restaurants
los_angeles has 447 restaurants
new_orleans has 327 restaurants
new_york has 1200 restaurants
san_francisco has 414 restaurants
washington_dc has 391 restaurants
City: atlanta
--------------------------------------------------
Parking/Valet at 225 restaurants
Wheelchair Access at 199 restaurants
Excellent Service at 164 restaurants
Weekend Dining at 158 restaurants
Private Parties at 155 restaurants
City: boston
--------------------------------------------------
Weekend Dining at 532 restaurants
Wheelchair Access at 238 restaurants
Excellent Service at 209 restaurants
Excellent Food at 202 restaurants
Private Parties at 181 restaurants
City: chicago
--------------------------------------------------
Weekend Brunch at 512 restaurants
Excellent Service at 371 restaurants
Excellent Food at 353 restaurants
Parking/Valet at 328 restaurants
Short Drive at 255 restaurants
City: los_angeles
--------------------------------------------------
Weekend Brunch at 342 restaurants
Weekend Lunch at 302 restaurants
Excellent Service at 249 restaurants
Weekend Jazz Brunch at 206 restaurants
Warm spots by the fire at 206 restaurants
City: new_orleans
--------------------------------------------------
Open on Mondays at 259 restaurants
Open on Sundays at 259 restaurants
Wheelchair Access at 151 restaurants
Excellent Service at 147 restaurants
Excellent Food at 143 restaurants
City: new_york
--------------------------------------------------
Excellent Food at 612 restaurants
Excellent Service at 576 restaurants
Good Decor at 406 restaurants
Excellent Decor at 404 restaurants
Good Service at 369 restaurants
City: san_francisco
--------------------------------------------------
Weekend Dining at 251 restaurants
Excellent Service at 243 restaurants
Wheelchair Access at 241 restaurants
Private Parties at 193 restaurants
Private Rooms Available at 193 restaurants
City: washington_dc
--------------------------------------------------
Weekend Dining at 318 restaurants
Parking/Valet at 279 restaurants
Weekend Lunch at 222 restaurants
Wheelchair Access at 216 restaurants
Excellent Service at 210 restaurants
Note that for restaurants in a city to be profitable, they must provide facilities and features that are relevant to people in the area. From the preceding output, we can glean a good deal of information about the different cities. For example, New York has almost double the number of restaurants that Chicago has. While people in New York seem to prefer excellent food and decor, people in Chicago seem to prefer weekend brunch as well as valet parking and a short drive. This would make sense given the kind of transport people can afford in their cities. By analyzing our dataset, we find important information that we can use to tune our system for some special cases. In fact, it could serve as a recommendation to restaurant owners: provide good valet parking when in Chicago.
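An analysis of this kind, counting restaurants per city and finding the most common features in each, might be sketched as follows. This is not the chapter02.Stats program that produced the output above; the CityStats object and the Restaurant case class are hypothetical, and the sketch assumes the restaurants have already been loaded into memory with their feature names resolved.

```scala
object CityStats {
  // Hypothetical in-memory record: a restaurant's city and its feature names.
  final case class Restaurant(city: String, features: Set[String])

  // Number of restaurants in each city.
  def restaurantCounts(restaurants: Seq[Restaurant]): Map[String, Int] =
    restaurants.groupBy(_.city).map { case (city, group) => city -> group.size }

  // For each city, the n features offered by the most restaurants,
  // most common first. Ties are broken alphabetically by feature name.
  def topFeatures(restaurants: Seq[Restaurant], n: Int): Map[String, Seq[(String, Int)]] =
    restaurants.groupBy(_.city).map { case (city, group) =>
      val counts = group
        .flatMap(_.features)
        .groupBy(identity)
        .map { case (feature, hits) => feature -> hits.size }
      city -> counts.toSeq.sortBy { case (feature, count) => (-count, feature) }.take(n)
    }
}
```

Because each restaurant's features are held in a Set, a feature is counted at most once per restaurant, so a feature count in this sketch can never exceed the number of restaurants in the city.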