Calculating the cosine similarity of artists in the Audioscrobbler dataset using Pig

Cosine similarity measures how similar two vectors are. In this recipe, it is used to find the similarity of artists based on the number of times Audioscrobbler users have added each artist to their playlist. The idea is to show how often users play both artist 1 and artist 2.
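
For reference, the standard definition of cosine similarity can be written as follows. The symbols are introduced here for illustration only: A and B stand for the play-count vectors of two artists, and a_i and b_i for the counts contributed by user i.

```latex
\[
\cos\theta
  = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}
  = \frac{\sum_{i} a_i b_i}{\sqrt{\sum_{i} a_i^{2}} \, \sqrt{\sum_{i} b_i^{2}}}
\]
```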

Getting ready

Download the Audioscrobbler dataset from http://www.packtpub.com/support.

How to do it...

Perform the following steps to calculate cosine similarity using Pig:

  1. Copy the artist_data.txt and user_artist_data.txt files into HDFS:
    hadoop fs -put artist_data.txt user_artist_data.txt /data/audioscrobbler/
  2. Load the data into Pig:
    plays = load '/data/audioscrobbler/user_artist_data.txt'
            using PigStorage(' ') as (user_id:long, artist_id:long, playcount:long);
    
    artist = load '/data/audioscrobbler/artist_data.txt' as (artist_id:long, artist_name:chararray);
  3. Sample the user_artist_data.txt file:
    plays = sample plays .01;
  4. Normalize the play counts to 100:
    user_total_grp = group plays by user_id;
    
    user_total = foreach user_total_grp generate group as user_id, SUM(plays.playcount) as totalplays;
    
    plays_user_total = join plays by user_id, user_total by user_id using 'replicated';
    
    norm_plays = foreach plays_user_total generate user_total::user_id as user_id, artist_id, ((double)playcount/(double)totalplays) * 100.0 as norm_play_cnt;
  5. Get artist pairs for each user:
    norm_plays2 = foreach norm_plays generate *;
    
    play_pairs = join norm_plays by user_id, norm_plays2 by user_id using 'replicated';
    
    play_pairs = filter play_pairs by norm_plays::plays::artist_id != norm_plays2::plays::artist_id;
  6. Calculate cosine similarity:
    cos_sim_step1 = foreach play_pairs generate norm_plays::plays::artist_id as artist_id1, norm_plays2::plays::artist_id as artist_id2, ((double)norm_plays::norm_play_cnt * (double)norm_plays2::norm_play_cnt) as dot_product_step1, ((double)norm_plays::norm_play_cnt * (double)norm_plays::norm_play_cnt) as play1_sq, ((double)norm_plays2::norm_play_cnt * (double)norm_plays2::norm_play_cnt) as play2_sq;
    
    cos_sim_grp = group cos_sim_step1 by (artist_id1, artist_id2);
    
    cos_sim_step2 = foreach cos_sim_grp generate flatten(group) as (artist_id1, artist_id2), COUNT(cos_sim_step1.dot_product_step1) as cnt, SUM(cos_sim_step1.dot_product_step1) as dot_product, SUM(cos_sim_step1.play1_sq) as tot_play_sq1, SUM(cos_sim_step1.play2_sq) as tot_play_sq2;
    
    -- cosine similarity = dot product / (length of vector 1 * length of vector 2)
    cos_sim = foreach cos_sim_step2 generate artist_id1, artist_id2, dot_product / (SQRT(tot_play_sq1) * SQRT(tot_play_sq2)) as cosine_similarity;
  7. Get the artists' names:
    art1 = join cos_sim by artist_id1, artist by artist_id using 'replicated';
    art2 = join art1 by artist_id2, artist by artist_id using 'replicated';
    art3 = foreach art2 generate artist_id1, art1::artist::artist_name as artist_name1, artist_id2, artist::artist_name as artist_name2, cosine_similarity;
  8. Output the top 25 records:
    top = order art3 by cosine_similarity DESC;
    top_25 = limit top 25;
    dump top_25;

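    The output will look similar to the following; the exact artists and scores vary from run to run because the input is randomly sampled: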

    (1000157,AC/DC,3418,Hole,0.9115799166673817)
    (829,Nas,1002216,The Darkness,0.9110152004952198)
    (1022845,Jessica Simpson,1002325,Mandy Moore,0.9097097460071537)
    (53,Wu-Tang Clan,78,Sublime,0.9096468367168238)
    (1001180,Godsmack,1234871,Devildriver,0.9093019011575069)
    (1001594,Adema,1007903,Maroon 5,0.909297052154195)
    (689,Bette Midler,1003904,Better Than Ezra,0.9089467492461345)
    (949,Ben Folds Five,2745,Ladytron,0.908736095810886)
    (1000388,Ben Folds,930,Eminem,0.9085664586931873)
    (1013654,Who Da Funk,5672,Nancy Sinatra,0.9084521262343653)
    (1005386,Stabbing Westward,30,Jane's Addiction,0.9075360259222892)
    (1252,Travis,1275996,R.E.M.,0.9071980963712077)
    (100,Phoenix,1278,Ryan Adams,0.9071754511713067)
    (2247,Four Tet,1009898,A Silver Mt. Zion,0.9069623744896833)
    (1037970,Kanye West,1000991,Alison Krauss,0.9058717234023009)
    (352,Beck,5672,Nancy Sinatra,0.9056851798338253)
    (831,Nine Inch Nails,1251,Morcheeba,0.9051453756031981)
    (1007004,Journey,1005479,Mr. Mister,0.9041311825160151)
    (1002470,Elton John,1000416,Ramones,0.9040551837635081)
    (1200,Faith No More,1007903,Maroon 5,0.9038274644717641)
    (1002850,Glassjaw,1016435,Senses Fail,0.9034604126636377)
    (1004294,Thursday,2439,HiM,0.902728300518356)
    (1003259,ABBA,1057704,Readymade,0.9026955950032872)
    (1001590,Hybrid,791,Beenie Man,0.9020872203833108)
    (1501,Wolfgang Amadeus Mozart,4569,Simon & Garfunkel,0.9018860912385024)

How it works...

The load statements tell Pig about the format and datatypes of the data being loaded. Pig loads data lazily. This means that the load statements at the beginning of this script will not do any work until another statement is entered that asks for output.

The user_artist_data.txt file is sampled so that a replicated join can be used when it is joined with itself. A replicated join holds the smaller relation in memory, so that relation must be small. Sampling significantly reduces the processing time at the cost of accuracy. The sample value of .01 is used, meaning that roughly one in a hundred rows of data will be kept.
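
As a rough illustration of what a sample value of .01 means, the following Python sketch keeps each row independently with probability 0.01. It is not Pig's implementation; the function name sample_rows, the seed parameter, and the toy input are assumptions made for this example.

```python
import random

def sample_rows(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # seed is only here to make the sketch repeatable
    return [row for row in rows if rng.random() < fraction]

# Hypothetical input: 100,000 placeholder rows.
rows = list(range(100_000))
sampled = sample_rows(rows, 0.01, seed=42)

# The sample size is only approximately 1% of the input.
print(len(sampled))
```

Because every row is kept or dropped at random, the number of rows that survive is close to, but usually not exactly, one percent of the input.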

A user selecting to play an artist is treated as a vote for that artist. The play counts are normalized to 100. This ensures that each user is given the same number of votes.
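
The normalization can be sketched in plain Python on a handful of made-up rows. The data, the function name normalize_plays, and the variable names are hypothetical; only the (user_id, artist_id, playcount) layout comes from the recipe.

```python
from collections import defaultdict

# Hypothetical toy rows in the (user_id, artist_id, playcount) layout.
plays = [
    (1, 10, 3),
    (1, 20, 1),
    (2, 10, 5),
    (2, 30, 5),
]

def normalize_plays(rows):
    """Scale each user's play counts so that they sum to 100."""
    totals = defaultdict(int)
    for user_id, _artist_id, playcount in rows:
        totals[user_id] += playcount
    return [
        (user_id, artist_id, playcount / totals[user_id] * 100.0)
        for user_id, artist_id, playcount in rows
    ]

norm_plays = normalize_plays(plays)
for row in norm_plays:
    print(row)
```

User 1 played four tracks in total, so three plays of artist 10 become 75 votes and one play of artist 20 becomes 25 votes. A heavy listener and a light listener therefore carry the same weight.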

A self join of the normalized user_artist_data.txt data by user_id generates all pairs of artists that each user has added to their playlist. The filter removes the rows in which the self join pairs an artist with itself.
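
A minimal Python sketch of the self join and filter, using made-up normalized rows, might look like this. The function name artist_pairs and the sample data are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical normalized rows: (user_id, artist_id, norm_play_cnt).
norm_plays = [
    (1, 10, 75.0),
    (1, 20, 25.0),
    (2, 10, 50.0),
    (2, 30, 50.0),
]

def artist_pairs(rows):
    """Join the rows with themselves on user_id and drop self-pairs."""
    by_user = defaultdict(list)
    for user_id, artist_id, cnt in rows:
        by_user[user_id].append((artist_id, cnt))

    pairs = []
    for user_id, items in by_user.items():
        for artist1, cnt1 in items:
            for artist2, cnt2 in items:
                if artist1 != artist2:  # same role as the Pig filter
                    pairs.append((user_id, artist1, cnt1, artist2, cnt2))
    return pairs

play_pairs = artist_pairs(norm_plays)
for row in play_pairs:
    print(row)
```

Note that this kind of filter keeps both orderings of a pair, so (10, 20) and (20, 10) each appear once per user.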

The next few statements calculate the cosine similarity. For each pair of artists that a user has added to their playlist, the script multiplies the normalized play count for artist 1 by the normalized play count for artist 2, and also outputs the square of each of the two play counts. These rows are then grouped by artist pair. Within each group, the products are summed to give the dot product, and the squared play counts for artist 1 and for artist 2 are summed separately. The cosine similarity is the dot product divided by the product of the square roots of these two sums. The idea is to show how often users play both artist 1 and artist 2.
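
The same calculation can be checked by hand on a tiny example. The Python sketch below is not part of the recipe; the pair rows, the function name cosine_similarities, and the numbers are made up for illustration.

```python
import math
from collections import defaultdict

# Hypothetical pair rows: (user_id, artist_id1, cnt1, artist_id2, cnt2),
# where cnt1 and cnt2 are normalized play counts.
play_pairs = [
    (1, 10, 60.0, 20, 40.0),
    (2, 10, 30.0, 20, 70.0),
]

def cosine_similarities(pairs):
    """Group by artist pair, then compute dot / (norm1 * norm2)."""
    # For each pair: [dot product, sum of squares 1, sum of squares 2].
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    for _user_id, artist1, cnt1, artist2, cnt2 in pairs:
        acc = sums[(artist1, artist2)]
        acc[0] += cnt1 * cnt2
        acc[1] += cnt1 * cnt1
        acc[2] += cnt2 * cnt2
    return {
        pair: dot / (math.sqrt(sq1) * math.sqrt(sq2))
        for pair, (dot, sq1, sq2) in sums.items()
    }

cos_sim = cosine_similarities(play_pairs)
print(cos_sim[(10, 20)])
```

Here the dot product is 60 * 40 + 30 * 70 = 4500, the sums of squares are 4500 and 6500, and the resulting similarity is about 0.83.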
