Collaborative filtering with Apache Mahout

Collaborative filtering is a technique that can be used to discover relationships between people and items (for example, books and music). It works by examining the preferences of a set of users, such as the items they purchase, and then determining which users have similar preferences. Collaborative filtering can be used to build recommender systems, which are used by many companies including Amazon, LinkedIn, and Facebook.

In this recipe, we are going to use Apache Mahout to generate book recommendations based on a dataset containing people's book preferences.

Getting ready

You will need to download, compile, and install the following:

  • Apache Mahout
  • Apache Hadoop
  • The Book-Crossing Dataset

Once you have compiled Mahout, add the mahout binary to the system path. In addition, you must set the HADOOP_HOME environment variable to point to the root folder of your Hadoop installation. You can accomplish this in the bash shell by using the following commands:

$ export PATH=$PATH:/path/to/mahout/bin
$ export HADOOP_HOME=/opt/mapr/hadoop/hadoop-0.20.2

Next, extract the Book-Crossing Dataset into your current working directory. You should see three files named BX-Books.csv, BX-Book-Ratings.csv, and BX-Users.csv.

How to do it...

Carry out the following steps to perform collaborative filtering in Mahout:

  1. Run the clean_book_ratings.py script to transform the BX-Book-Ratings.csv file into a format the Mahout recommender can use.
    $ ./clean_book_ratings.py BX-Book-Ratings.csv cleaned_book_ratings.txt
  2. Run the clean_book_users.sh bash script to transform the BX-Users.csv file into a format the Mahout recommender can use. Note that the BX-Users.csv file should be in your current working directory:
    $ ./clean_book_users.sh
  3. Place both the cleaned_book_ratings.txt and the cleaned_book_users.txt files into HDFS:
    $ hadoop fs -mkdir /user/hadoop/books
    $ hadoop fs -put cleaned_book_ratings.txt /user/hadoop/books
    $ hadoop fs -put cleaned_book_users.txt /user/hadoop/books
  4. Run the Mahout recommender using the ratings and user information we just put into HDFS. Mahout will launch multiple MapReduce jobs to generate the book recommendations:
    $ mahout recommenditembased --input /user/hadoop/books/cleaned_book_ratings.txt --output /user/hadoop/books/recommended --usersFile /user/hadoop/books/cleaned_book_users.txt -s SIMILARITY_LOGLIKELIHOOD
  5. Examine the results, which are in the format of USERID [RECOMMENDED BOOK ISBN:SCORE,...]. The output should look similar to the following:
    $ hadoop fs -cat /user/hadoop/books/recommended/part* | head -n1
    17      [849911788:4.497727,807503193:4.497536,881030392:4.497536,761528547:4.497536,380724723:4.497536,807533424:4.497536,310203414:4.497536,590344153:4.497536,761536744:4.497536,531000265:4.497536]
  6. Examine the results in a more human-friendly way using print_user_summaries.py. To print the recommendations for the first 10 users, use 10 for the last argument to print_user_summaries.py:
    $ hadoop fs -cat /user/hadoop/books/recommended/part-r-00000 | ./print_user_summaries.py BX-Books.csv BX-Users.csv BX-Book-Ratings.csv 10
    ==========
    user id =  114073
    rated:
    Digital Fortress : A Thriller  with:  9
    
    Angels & Demons with:  10
    
    recommended:
    Morality for Beautiful Girls (No.1 Ladies Detective Agency)
    Q Is for Quarry
    The Last Juror
    The Da Vinci Code
    Deception Point
    A Walk in the Woods: Rediscovering America on the Appalachian Trail (Official Guides to the Appalachian Trail)
    Tears of the Giraffe (No.1 Ladies Detective Agency)
    The No. 1 Ladies' Detective Agency (Today Show Book Club #8)

The output from print_user_summaries.py shows which books each user rated, followed by the recommendations Mahout generated for that user.
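
The raw recommender output from step 5 is also easy to parse by hand. The following is a minimal Python sketch of that parsing step, not the recipe's actual print_user_summaries.py (which additionally joins each ISBN to its title using BX-Books.csv); the script name parse_recommendations.py is hypothetical:

#!/usr/bin/env python
# Minimal sketch: parse Mahout's "USERID<TAB>[ISBN:SCORE,...]" output
# read from standard input, printing one recommendation per line.
import re
import sys

for line in sys.stdin:
    user_id, recs = line.rstrip('\n').split('\t', 1)
    # recs looks like: [849911788:4.497727,807503193:4.497536,...]
    for isbn, score in re.findall(r'(\d+):([\d.]+)', recs):
        print('user %s: ISBN %s (score %s)' % (user_id, isbn, score))

It can be fed the same way as print_user_summaries.py, for example with hadoop fs -cat /user/hadoop/books/recommended/part* | ./parse_recommendations.py.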

How it works...

The first steps of this recipe required us to clean up the Book-Crossing dataset. The BX-Book-Ratings.csv file was in a semicolon-delimited format with the following columns:

  • USER_ID: The user ID assigned to a person
  • ISBN: The ISBN of the book the person rated
  • BOOK-RATING: The rating the person gave to the book

The Mahout recommendation engine expects the input dataset to be in the following comma-separated format (a conversion sketch follows the list):

  • USER_ID: The USER_ID must be an integer
  • ITEM_ID: The ITEM_ID must be an integer
  • RATING: The RATING must be an integer that increases in order of preference. For example, 1 would mean the user disliked the book intensely, while 10 would mean the user enjoyed it.
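
Putting the two formats side by side, the cleaning step boils down to re-delimiting the file and dropping rows that Mahout cannot use. The following is a minimal Python sketch of that conversion, not the recipe's actual clean_book_ratings.py; it assumes the dataset's ISO-8859-1 encoding and skips rows whose ISBN is not purely numeric, since Mahout requires integer item IDs:

#!/usr/bin/env python
# Hypothetical sketch of the ratings transformation; the recipe's
# clean_book_ratings.py may differ in detail. Usage mirrors step 1:
#   ./clean_book_ratings.py BX-Book-Ratings.csv cleaned_book_ratings.txt
import csv
import sys

with open(sys.argv[1], newline='', encoding='iso-8859-1') as src, \
        open(sys.argv[2], 'w') as dst:
    reader = csv.reader(src, delimiter=';')
    next(reader)                      # skip the header row
    for row in reader:
        if len(row) != 3:
            continue                  # ignore malformed lines
        user_id, isbn, rating = row
        if isbn.isdigit():            # Mahout item IDs must be integers
            dst.write('%s,%d,%s\n' % (user_id, int(isbn), rating))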

Once the transformation of the BX-Book-Ratings.csv file was complete, we performed a similar transformation on the BX-Users.csv file, stripping away everything except the USER_ID column.
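
A minimal sketch of that step in Python looks like the following (the recipe's clean_book_users.sh does the equivalent in bash; the ISO-8859-1 encoding is again an assumption):

#!/usr/bin/env python
# Hypothetical sketch: keep only the numeric USER_ID column from
# BX-Users.csv, producing cleaned_book_users.txt.
import csv

with open('BX-Users.csv', newline='', encoding='iso-8859-1') as src, \
        open('cleaned_book_users.txt', 'w') as dst:
    reader = csv.reader(src, delimiter=';')
    next(reader)                          # skip the header row
    for row in reader:
        if row and row[0].isdigit():      # USER_ID must be an integer
            dst.write(row[0] + '\n')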

Finally, we launched the Mahout recommendation engine. Mahout runs a series of MapReduce jobs to determine the book recommendations for a given set of users, specified with the --usersFile flag. In this example, we wanted Mahout to generate book recommendations for all of the users in the dataset, so we provided the complete USER_ID list to Mahout. In addition to providing an input path, an output path, and a user list as command-line arguments, we also specified a fourth parameter, -s SIMILARITY_LOGLIKELIHOOD. The -s flag specifies which similarity measure Mahout should use to compare book preferences across all users. This recipe used log likelihood because it is a simple and effective algorithm, but Mahout supports many other similarity functions. To see them for yourself, run the following command and examine the options for the -s flag:

$ mahout recommenditembased
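
For the curious, the log-likelihood similarity is based on Dunning's likelihood ratio test over a 2x2 contingency table of co-occurrence counts. The following Python sketch shows the idea; Mahout's actual implementation lives in its LogLikelihood class and may differ in detail:

#!/usr/bin/env python
# Sketch of the log-likelihood ratio (G^2) statistic behind
# SIMILARITY_LOGLIKELIHOOD. Higher scores mean two books are co-rated
# far more often than chance alone would predict.
from math import log

def x_log_x(x):
    return x * log(x) if x else 0.0

def entropy(*counts):
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    # k11: users who rated both books; k12, k21: users who rated only
    # one of the two; k22: users who rated neither.
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

print(llr(100, 10, 10, 10000))   # a strongly associated pair scores high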

See also

  • Clustering with Apache Mahout
  • Sentiment classification with Apache Mahout