Chapter 2. Data Processing Pipeline Using Scala

In Chapter 1, Introduction to Scala and Machine Learning, we gained some idea about Scala, Apache Spark, and machine learning. In this chapter, we will explore ways to compose a data processing pipeline using Scala. In particular, we will discuss:

  • Entree: a sample dataset for recommendation systems
  • ETL: extract, transform, load
  • Extraction and transformation for machine learning
  • Setting up MongoDB and Apache Kafka
  • Data processing pipeline for Entree

By the end of the chapter, we should be able to compose the different components of the processing pipeline.

Entree – a sample dataset for recommendation systems

In this chapter, we base our discussion on a dataset that suits recommendation engines: the Entree dataset. It can be found at: https://archive.ics.uci.edu/ml/machine-learning-databases/entree-mld/entree.data.html. We selected this dataset because:

  • It is one of the classic datasets for recommendation systems (specifically case-based recommendation systems)
  • The data is well formed, and it can be processed by any text processing tool
  • The data has missing values, which makes the problem even more interesting

Let's take a look at the data that we are presented with. First download it:

$ wget -c https://archive.ics.uci.edu/ml/machine-learning-databases/entree-mld/entree_data.tar.gz
$ tar zxf entree_data.tar.gz
$ ls -F entree
data/  README  session/

On a Windows machine, you have two choices: download and unzip the file manually, or use a Linux-based VM and follow the preceding steps.

So there are two folders, data/ and session/, along with the README file. Your first task is to explore the files in the dataset and get a feel for its structure. If you read the entree/data/README file, you will find there are restaurants in eight cities: Atlanta, Boston, Chicago, Los Angeles, New Orleans, New York, San Francisco, and Washington DC.

Each restaurant has a few features, such as bakeries, Burmese, Chinese, cafeterias, diners, early dining, and so on. The features of a particular restaurant can be encoded as Boolean values: if a restaurant offers early dining, we represent that feature as true, and as false otherwise. Since there are 257 features in total, we can represent each restaurant as 257 Boolean values or, more simply, as an array of numbers (0 for false, 1 for true):

val r1features = Array(0, 1, 1, 1, 0, 0, 0,...)
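
The encoding described above can be sketched as a small helper. This is only an illustration: the object name FeatureEncoding, the method encode, and the choice of 0-based feature indices are assumptions made for the example, not something the dataset prescribes.

```scala
object FeatureEncoding {
  // Total number of features, as stated in the text.
  val NumFeatures = 257

  // Hypothetical helper: turn the set of feature indices that a restaurant
  // has into a fixed-length array of 0s and 1s.
  // Assumption: indices are 0-based and smaller than `size`.
  def encode(present: Set[Int], size: Int = NumFeatures): Array[Int] = {
    require(present.forall(i => i >= 0 && i < size), "feature index out of range")
    Array.tabulate(size)(i => if (present.contains(i)) 1 else 0)
  }
}
```

For instance, FeatureEncoding.encode(Set(1, 2, 3)) yields an array of length 257 whose first elements are 0, 1, 1, 1, 0, 0, 0, in the same shape as r1features above.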

Note that the Entree system generates recommendations for Chicago restaurants only. Additionally, we have historical data for different users, who are identified by their IP addresses / domain names. This data is stored in session.YEAR-QUARTER text files in the entree/session/ folder. Each row of such a file has several fields, which are well described in the dataset description. Take some time to go through the description, and you will notice the following fields:

  • Date/time
  • IP address / domain name
  • Entry point (a restaurant code)
  • Some more restaurants navigated by the user during a session. These are all Chicago restaurants
  • End point (a restaurant code in Chicago)

Of these fields, an entry point value of 0 and an end point value of -1 mean missing data. The end point is assumed to be a restaurant that the user actually liked. In a sense, this is a good indicator of a user's preference for a particular restaurant, and it will serve as a good data point for making potentially sensible recommendations.
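
To make the missing-value convention concrete, here is a minimal parsing sketch. It assumes that the fields of a row are tab-separated and appear in the order listed above; check the dataset description before relying on that. The names SessionParser and Session are hypothetical, and the navigation codes are kept as raw strings rather than interpreted.

```scala
import scala.util.Try

object SessionParser {
  // Hypothetical record type; the field names are ours, not the dataset's.
  final case class Session(
      dateTime: String,
      client: String,            // IP address or domain name
      entryPoint: Option[Int],   // None when the raw value is 0 (missing)
      navigation: List[String],  // raw codes, left uninterpreted
      endPoint: Option[Int])     // None when the raw value is -1 (missing)

  // Assumption: tab-separated fields in the order
  // date/time, client, entry point, navigation..., end point.
  // Returns None for rows that do not fit this shape.
  def parse(line: String): Option[Session] =
    line.split("\t").map(_.trim).toList match {
      case date :: client :: entry :: rest if rest.nonEmpty =>
        for {
          e   <- Try(entry.toInt).toOption
          end <- Try(rest.last.toInt).toOption
        } yield Session(
          date,
          client,
          if (e == 0) None else Some(e),
          rest.init,
          if (end == -1) None else Some(end))
      case _ => None
    }
}
```

Modelling the two sentinel values as Option keeps the "missing" cases explicit, so later stages cannot mistake 0 or -1 for a real restaurant code.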

Now we need to answer the following questions:

  • What kind of learning algorithm should we run on this dataset to make good recommendations?
  • What data representation does this algorithm expect?
  • How will the data be fed into this algorithm over time?

Our focus for now should be on building a data processing pipeline. Therefore, for the sake of simplicity let us keep our discussion to recommendations based on popularity over some time period or a window. We will cover many different recommendation algorithms in the next chapters.

For the first question, the learning algorithm, we simply choose an algorithm based on the popularity of a restaurant.

For the second question, data representation, let's use the historical data to infer the popularity of a restaurant. For this, we can utilize the session history provided in the dataset. In a typical browsing session, a user starts from some restaurant, then moves on to the next, and so on, until they finally settle on one at the end (which is always in the Chicago area). Therefore, the navigation sequence gives us a basis for counting restaurant visits.
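
Counting visits from navigation sequences can be sketched as follows. This is an illustration under simple assumptions: each session is modelled as the ordered list of restaurant codes the user visited, every appearance counts as one visit, and the names Popularity, visitCounts, and topN are hypothetical.

```scala
object Popularity {
  // Count how often each restaurant code appears across all sessions.
  def visitCounts(sessions: Seq[Seq[String]]): Map[String, Int] =
    sessions.flatten
      .groupBy(identity)
      .map { case (code, hits) => code -> hits.size }

  // The n most visited restaurants, most popular first.
  // Ties are broken by restaurant code so the result is deterministic.
  def topN(counts: Map[String, Int], n: Int): Seq[(String, Int)] =
    counts.toSeq.sortBy { case (code, count) => (-count, code) }.take(n)
}
```

Other weightings are possible, for example giving the end point a higher weight than intermediate steps, since it is assumed to be the restaurant the user liked.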

For the third question, feeding data to the algorithm, we will consider a stream-based approach. If our data were static and unlikely to change for a long time, we could simply process everything in bulk and store the results. For an online system like Entree, however, this does not hold, and we face more challenges. How should we feed the data then?

Generally, there are two approaches we can take in this scenario:

  • Push: Someone, or some system, notifies our recommendation engine of a new set of data or new transactions
  • Pull: The recommendation engine asks for the data periodically
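
The difference between the two approaches can be sketched with two small interfaces. Everything here (PullSource, PushListener, InMemorySource) is a hypothetical, single-threaded illustration and not the API of any particular messaging system.

```scala
import scala.collection.mutable.ArrayBuffer

object Feeding {
  // Pull: the consumer asks the source for whatever is new.
  trait PullSource[A] { def poll(): Seq[A] }

  // Push: the producer calls the consumer whenever something new arrives.
  trait PushListener[A] { def onData(batch: Seq[A]): Unit }

  // Toy source that supports both styles, for comparison only.
  // Not thread-safe.
  final class InMemorySource[A] extends PullSource[A] {
    private val buffer = ArrayBuffer.empty[A]
    private var listeners = List.empty[PushListener[A]]

    def subscribe(listener: PushListener[A]): Unit =
      listeners = listener :: listeners

    // New data is buffered for pullers and pushed to subscribers immediately.
    def publish(batch: Seq[A]): Unit = {
      buffer ++= batch
      listeners.foreach(_.onData(batch))
    }

    // Return everything buffered since the last poll, then clear the buffer.
    def poll(): Seq[A] = {
      val out = buffer.toList
      buffer.clear()
      out
    }
  }
}
```

With push, the engine reacts as soon as publish is called; with pull, it sees the same data only the next time it calls poll.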

Of course, we would not want the recommender system to wait indefinitely to accumulate as much data as it can. So, a window-based approach makes sense here. In this case, we wait for some time period while new data is arriving. Once this time period has elapsed, we process this chunk of data and store the results (or move them to some appropriate place).
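
The window-based idea can be sketched with fixed-size, non-overlapping windows. This is a batch-style illustration only: the names Windowing, Visit, and countsPerWindow are hypothetical, timestamps are assumed to be non-negative milliseconds, and a real streaming system would emit each window as it closes rather than group everything at once.

```scala
object Windowing {
  // Hypothetical event: one restaurant visit at a given time.
  final case class Visit(timestampMillis: Long, restaurant: String)

  // Assign each visit to a fixed-size window (index = timestamp / window size)
  // and count visits per restaurant inside each window.
  // Assumption: timestamps are non-negative.
  def countsPerWindow(
      visits: Seq[Visit],
      windowMillis: Long): Map[Long, Map[String, Int]] = {
    require(windowMillis > 0, "window size must be positive")
    visits
      .groupBy(v => v.timestampMillis / windowMillis)
      .map { case (window, inWindow) =>
        window -> inWindow
          .groupBy(_.restaurant)
          .map { case (restaurant, hits) => restaurant -> hits.size }
      }
  }
}
```

The window size is the knob here: a shorter window gives fresher popularity counts at the cost of fewer data points per window.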

Now, in practical situations, there is often too much data for a single machine to handle. Typically, many machine nodes form a distributed system and process this huge volume of data together as a cluster. We can assume that, for a small number of nodes in a distributed system, the push strategy performs better. For a detailed comparison, read this paper: A Fair Comparison of Pull and Push Strategies in Large Distributed Networks at http://www.pats.ua.ac.be/content/publications/2014/Pull_push_RR_extended.pdf.

Next, we perform some simple analysis on the Entree dataset.

Data analysis of the Entree dataset

Knowing the dataset well goes a long way. So let's write some Scala code to find out which cities have what kind of restaurants. First, we count the number of restaurants in each city. Next, we find the restaurant features most commonly found in each city.
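
The two steps can be sketched on in-memory data as shown next. This is not the chapter's chapter02.Stats program, which takes the dataset directory as its argument; the names CityStats and Restaurant are hypothetical, and loading the files into this model is left out.

```scala
object CityStats {
  // Hypothetical in-memory model of one restaurant.
  final case class Restaurant(city: String, name: String, features: Set[String])

  // Step 1: number of restaurants in each city.
  def restaurantsPerCity(restaurants: Seq[Restaurant]): Map[String, Int] =
    restaurants.groupBy(_.city).map { case (city, group) => city -> group.size }

  // Step 2: for each city, the n features offered by the most restaurants.
  // Ties are broken by feature name so the result is deterministic.
  def topFeatures(
      restaurants: Seq[Restaurant],
      n: Int): Map[String, Seq[(String, Int)]] =
    restaurants.groupBy(_.city).map { case (city, group) =>
      val counts = group
        .flatMap(_.features)
        .groupBy(identity)
        .map { case (feature, hits) => feature -> hits.size }
      city -> counts.toSeq.sortBy { case (f, c) => (-c, f) }.take(n)
    }
}
```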

Here is the output of the analysis:

$ sbt 'run-main chapter02.Stats /home/tuxdna/work/packt/dataset/entree'
[info] Running chapter02.Stats /home/tuxdna/work/packt/dataset/entree
Cities and their restaurant count
--------------------------------------------------
             atlanta has    267 restaurants
              boston has    438 restaurants
             chicago has    676 restaurants
         los_angeles has    447 restaurants
         new_orleans has    327 restaurants
            new_york has   1200 restaurants
       san_francisco has    414 restaurants
       washington_dc has    391 restaurants

City: atlanta
--------------------------------------------------
            Parking/Valet at    225 restaurants
        Wheelchair Access at    199 restaurants
        Excellent Service at    164 restaurants
           Weekend Dining at    158 restaurants
          Private Parties at    155 restaurants

City: boston
--------------------------------------------------
           Weekend Dining at    532 restaurants
        Wheelchair Access at    238 restaurants
        Excellent Service at    209 restaurants
           Excellent Food at    202 restaurants
          Private Parties at    181 restaurants

City: chicago
--------------------------------------------------
           Weekend Brunch at    512 restaurants
        Excellent Service at    371 restaurants
           Excellent Food at    353 restaurants
            Parking/Valet at    328 restaurants
              Short Drive at    255 restaurants

City: los_angeles
--------------------------------------------------
           Weekend Brunch at    342 restaurants
            Weekend Lunch at    302 restaurants
        Excellent Service at    249 restaurants
      Weekend Jazz Brunch at    206 restaurants
   Warm spots by the fire at    206 restaurants

City: new_orleans
--------------------------------------------------
          Open on Mondays at    259 restaurants
          Open on Sundays at    259 restaurants
        Wheelchair Access at    151 restaurants
        Excellent Service at    147 restaurants
           Excellent Food at    143 restaurants

City: new_york
--------------------------------------------------
           Excellent Food at    612 restaurants
        Excellent Service at    576 restaurants
               Good Decor at    406 restaurants
          Excellent Decor at    404 restaurants
             Good Service at    369 restaurants

City: san_francisco
--------------------------------------------------
           Weekend Dining at    251 restaurants
        Excellent Service at    243 restaurants
        Wheelchair Access at    241 restaurants
          Private Parties at    193 restaurants
  Private Rooms Available at    193 restaurants

City: washington_dc
--------------------------------------------------
           Weekend Dining at    318 restaurants
            Parking/Valet at    279 restaurants
            Weekend Lunch at    222 restaurants
        Wheelchair Access at    216 restaurants
        Excellent Service at    210 restaurants

Note that, for restaurants in a city to be profitable, they must provide facilities and features that are relevant to people in the area. From the preceding output, we can glean much information about the different cities. For example, New York has almost double the number of restaurants that Chicago has. While people in New York seem to prefer excellent food and decor, people in Chicago seem to prefer weekend brunch, along with valet parking and a short drive. This could reflect the kind of transport people can afford in each city. By analyzing our dataset in this way, we find important information that we can use to tune our system for special cases. In fact, it could even serve as a recommendation to restaurant owners: provide good valet parking when in Chicago.
