Building a recommendation engine

One thing I love to stumble upon is a really useful GitHub repository. You can find repositories that contain everything from hand-curated tutorials on machine learning to libraries that will save you dozens of lines of code when using Elasticsearch. The trouble is, finding these libraries is far more difficult than it should be. Fortunately, we now have the knowledge to leverage the GitHub API in a way that will help us to find these code gems.

We're going to be using the GitHub API to create a recommendation engine based on collaborative filtering. The plan is to get all of the repositories I've starred over time, then get all of the creators of those repositories and find what they've starred. Once that's done, we'll find which users are most similar to me (or to you, if you're running this against your own account, which I suggest). Once we have the most similar users, we can use the repositories they've starred, and that I haven't, to generate a set of recommendations.

Let's get started:

  1. We'll import the libraries we'll need:
import pandas as pd 
import numpy as np 
import requests 
import json 
  1. You'll need a GitHub account and to have starred a number of repositories for this to work with your own handle, but you won't actually need to sign up for the developer program. An authorization token from your profile is all it takes to use the API. The code will technically work without a token as well, but the unauthenticated rate limits are too restrictive to make it usable for our example; you can check the limits for yourself with the snippet that follows.
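The rate_limit endpoint reports how many calls you have left. Here's a minimal check, run without authentication so it only needs the requests import from the previous step:

# Check the current GitHub API rate limits. Unauthenticated calls are capped
# at 60 per hour, while a personal access token raises the core limit to 5,000.
limits = requests.get('https://api.github.com/rate_limit').json()
print(limits['resources']['core'])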
  1. To create a token for use with the API, go to https://github.com/settings/tokens. There, you will see a button in the upper-right corner like the following:

  1. You'll need to click on that Generate new token button. Once you've done that, you'll need to select the permissions; I chose just public_repo. Then, finally, copy the token it gives you for use in the following code. Be sure to enclose both your handle and the token in quotes:
myun = 'YOUR_GITHUB_HANDLE' 
mypw = 'YOUR_PERSONAL_TOKEN' 
  1. We'll create the function that will pull the names of every repository you've starred:
my_starred_repos = [] 
def get_starred_by_me(): 
    resp_list = [] 
    last_resp = '' 
    first_url_to_get = 'https://api.github.com/user/starred' 
    first_url_resp = requests.get(first_url_to_get, auth=(myun,mypw)) 
    last_resp = first_url_resp 
    resp_list.append(json.loads(first_url_resp.text)) 
     
    while last_resp.links.get('next'): 
        next_url_to_get = last_resp.links['next']['url'] 
        next_url_resp = requests.get(next_url_to_get, auth=(myun,mypw)) 
        last_resp = next_url_resp 
        resp_list.append(json.loads(next_url_resp.text)) 
         
    for i in resp_list: 
        for j in i: 
            msr = j['html_url'] 
            my_starred_repos.append(msr) 

There's a lot going on in there, but, essentially, we're querying the API to get our own starred repositories. GitHub uses pagination rather than returning everything in one call, so we'll need to check the .links that are returned with each response. As long as there is a next link to call, we'll continue to do so.
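If you're curious what those links look like, requests parses the Link header into the .links dictionary for you. A quick way to inspect it on the first page of results (the exact page URLs will differ for your account):

# Inspect the parsed Link header on a paginated GitHub response.
# It will look something like:
# {'next': {'url': 'https://api.github.com/user/starred?page=2', 'rel': 'next'},
#  'last': {'url': 'https://api.github.com/user/starred?page=6', 'rel': 'last'}}
resp = requests.get('https://api.github.com/user/starred', auth=(myun, mypw))
print(resp.links)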

  1. We just need to call that function we created:
get_starred_by_me() 
  1. Then, we can see the full list of starred repositories:
my_starred_repos 

The preceding code will result in output similar to the following:

  1. We need to parse out the user name for each of the repositories we starred so that we can retrieve the repositories those users starred:
my_starred_users = [] 
for ln in my_starred_repos: 
    right_split = ln.split('.com/')[1] 
    starred_usr = right_split.split('/')[0] 
    my_starred_users.append(starred_usr) 
 
my_starred_users 

This will result in output similar to the following:

  1. Now that we have the handles for all of the users we starred, we'll need to retrieve all of the repositories they starred. The following function will do just that:
starred_repos = {k:[] for k in set(my_starred_users)} 
def get_starred_by_user(user_name): 
    starred_resp_list = [] 
    last_resp = '' 
    first_url_to_get = 'https://api.github.com/users/'+ user_name +'/starred' 
    first_url_resp = requests.get(first_url_to_get, auth=(myun,mypw)) 
    last_resp = first_url_resp 
    starred_resp_list.append(json.loads(first_url_resp.text)) 
     
    while last_resp.links.get('next'): 
        next_url_to_get = last_resp.links['next']['url'] 
        next_url_resp = requests.get(next_url_to_get, auth=(myun,mypw)) 
        last_resp = next_url_resp 
        starred_resp_list.append(json.loads(next_url_resp.text)) 
         
    for i in starred_resp_list: 
        for j in i: 
            sr = j['html_url'] 
            starred_repos.get(user_name).append(sr) 

This function works in nearly the same way as the function we called earlier, but calls a different endpoint. It'll add their starred repositories to a dict we'll use later.

  1. Let's call it now. It may take a few minutes to run, depending on the number of repositories each user has starred. I actually had one that starred over 4,000 repositories:
for usr in list(set(my_starred_users)): 
    print(usr) 
    try: 
        get_starred_by_user(usr) 
    except Exception: 
        print('failed for user', usr) 

The preceding code will result in output similar to the following:

Notice that I turned the list of starred users into a set before iterating over it. I noticed some duplication that resulted from starring multiple repositories under one user handle, so it makes sense to deduplicate first to reduce extra API calls. Now, let's follow these steps to build our feature set:

  1. We now need to build a feature set that includes all of the starred repositories of everyone we have starred:
repo_vocab = [item for sl in list(starred_repos.values()) for item in sl] 
  1. We'll convert that into a set to remove duplicates that may be present from multiple users starring the same repositories:
repo_set = list(set(repo_vocab)) 
  1. Let's see how many unique repositories that produces:
len(repo_set) 

The preceding code should result in output similar to the following:

I had starred 170 repositories, and together the users of those repositories starred over 27,000 unique repositories. You can imagine if we went one degree further out how many we might see.

Now that we have the full feature set, or repository vocabulary, we need to go through every user and create a binary vector that contains a 1 for every repository they've starred and a 0 for every repository they haven't:

all_usr_vector = [] 
for k,v in starred_repos.items(): 
    usr_vector = [] 
    for url in repo_set: 
        if url in v: 
            usr_vector.extend([1]) 
        else: 
            usr_vector.extend([0]) 
    all_usr_vector.append(usr_vector) 

What we just did was check, for every user, whether they had starred each repository in our repository vocabulary. If they had, that position received a 1; if they hadn't, it received a 0.
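As an aside, the membership check above scans the user's full list of starred repositories for each of the roughly 27,000 URLs, which can get slow. Here is a sketch of an equivalent, drop-in alternative that converts each user's list to a set first, so the output is the same but the lookups are much faster:

# Equivalent to the loop above, but each user's starred list is converted to a
# set so that membership tests are constant time rather than a full list scan.
all_usr_vector = []
for k, v in starred_repos.items():
    v_set = set(v)
    all_usr_vector.append([1 if url in v_set else 0 for url in repo_set])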

At this point, we have a 27,098-item binary vector for each user, all 170 of them. Let's now put this into a DataFrame. The row index will be the user handles we starred, and the columns will be the repository vocabulary:

df = pd.DataFrame(all_usr_vector, columns=repo_set, index=starred_repos.keys()) 
 
df 

The preceding code will generate output similar to the following:

Next, in order to compare ourselves to the other users, we need to add our own row to this frame. Here, I label the row with the handle stored in myun (mine is acombs), which keeps it consistent with the code we'll run later:

my_repo_comp = [] 
for i in df.columns: 
    if i in my_starred_repos: 
        my_repo_comp.append(1) 
    else: 
        my_repo_comp.append(0) 
 
mrc = pd.Series(my_repo_comp).to_frame(myun).T 
 
mrc 

The preceding code will generate output similar to the following:

We now need to add the appropriate column names and concatenate this to our other DataFrame:

mrc.columns = df.columns 
 
fdf = pd.concat([df, mrc]) 
 
fdf 

The preceding code will result in output similar to the following:

You can see in the previous screenshot that I've been added into the DataFrame.

From here, we just need to calculate the similarity between ourselves and the other users we've starred. We'll do that now using the pearsonr function, which we'll need to import from SciPy:

from scipy.stats import pearsonr 
 
sim_score = {} 
for i in range(len(fdf)): 
    ss = pearsonr(fdf.iloc[-1,:], fdf.iloc[i,:]) 
    sim_score.update({i: ss[0]}) 
 
sf = pd.Series(sim_score).to_frame('similarity') 
sf 

The preceding code will generate output similar to the following:

What we've just done is compare our vector, the last one in the DataFrame, to every other user's vector to generate a centered cosine similarity (the Pearson correlation coefficient). Some values are necessarily NaN: if a user's call failed or returned no starred repositories, their vector is all zeros, and a zero-variance vector leads to a division by zero in the calculation.
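To see why, recall that the Pearson correlation subtracts each vector's mean and divides by its standard deviation, and a constant vector has zero variance. A quick illustration with made-up vectors:

# Pearson correlation is a centered cosine similarity: it divides by each
# vector's standard deviation, so a constant (all-zero) vector yields NaN.
a = [1, 0, 1, 0, 1]
b = [1, 0, 0, 0, 1]
c = [0, 0, 0, 0, 0]
print(pearsonr(a, b)[0])   # a real correlation between -1 and 1
print(pearsonr(a, c)[0])   # nan, because c has zero variance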

  1. Let's now sort these values to return the index of the users who are most similar:
sf.sort_values('similarity', ascending=False) 

The preceding code will result in output similar to the following:

So there we have it: those are the most similar users, and hence the ones we can use to recommend repositories we might enjoy. Let's take a look at these users and what they have starred that we might like.

  1. You can ignore that first user with a perfect similarity score; that's our own row. Going down the list, the three nearest matches are user 6, user 42, and user 116. Let's look at each:
fdf.index[6] 

The preceding code will result in output similar to the following:

  1. Let's take a look at who this is and what they've starred. From https://github.com/cchi, I can see that the handle belongs to the following user:

This is actually Charles Chi, a former colleague of mine from Bloomberg, so this is no surprise. Let's see what he has starred:

  1. There are a couple of ways to do this; we can either use our code, or just click on Stars under their profile picture on GitHub. Let's do both for this one, just to compare and make sure everything matches up. First, let's do it via code:
fdf.iloc[6,:][fdf.iloc[6,:]==1] 

This results in the following output:

  1. We see 30 starred repositories. Let's compare those to the ones from GitHub's site:

  1. Here we can see they're identical, and you'll notice you can identify the repositories that we've both starred: they are the ones labelled Unstar.
  2. Unfortunately, with just 30 starred repositories, there isn't much material to generate recommendations from.
  3. The next user in terms of similarity is 42, Artem Golubin:
fdf.index[42] 

The preceding code results in the following output:

And here is his GitHub profile:

Here we see the repositories he has starred:

  1. Artem has starred over 500 repositories, so there are definitely some recommendations to be found there.
  2. And finally, let's look at the third most similar user:
fdf.index[116] 

This results in the following output:

This user, Kevin Markham, has starred around 60 repositories:

We can see the starred repositories in the following image:

This is definitely fertile ground for generating recommendations. Let's now do just that; let's use the links from these three to produce some recommendations:

  1. We need to gather the links to the repositories they've starred and that I haven't. We'll create a DataFrame indexed by the repository vocabulary, with a column for myself and one for each of the three most similar users:
all_recs = fdf.iloc[[6,42,116,159],:].T 
all_recs 

The preceding code will produce the following output:

  1. Don't worry if it looks like it's all zeros; this is a sparse matrix so most will be 0. Let's see whether there are any repositories we've all starred:
all_recs[(all_recs==1).all(axis=1)] 

The preceding code will produce the following output:

  1. As you can see, we all seem to love scikit-learn and machine learning repositories, no surprise there. Let's see what they might have all starred that I missed. We'll start by creating a frame that excludes me, and then we'll query it for commonly starred repositories:
str_recs_tmp = all_recs[all_recs[myun]==0].copy() 
str_recs = str_recs_tmp.iloc[:,:-1].copy() 
str_recs[(str_recs==1).all(axis=1)] 

The preceding code produces the following output:

  1. Okay, so it looks like I haven't been missing anything super obvious. Let's see if there are any repositories that at least two out of the three users starred. To find this, we'll just sum across the rows:
str_recs.sum(axis=1).to_frame('total').sort_values(by='total', ascending=False) 

The preceding code will result in output similar to the following:

This looks promising. There are lots of good ML and AI repositories, and I'm honestly ashamed I never starred fuzzywuzzy, as I use it quite frequently.

At this point, I have to say I'm impressed with the results. These are definitely repositories that interest me, and I'll be checking them out.

So far, we've generated the recommendations using collaborative filtering and done some light additional filtering using aggregation. If we wanted to go further, we could order the recommendations based upon the total number of stars each repository has received. This could be achieved by making additional calls to the GitHub API; there's an endpoint that provides this information, as sketched below.
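The endpoint in question is the individual repository endpoint, which returns a stargazers_count field along with other metadata. Here is a rough sketch of a helper we might write for that purpose (get_star_count is purely illustrative, not something defined elsewhere in this chapter):

# Fetch the total star count for a repository given its html_url.
# https://api.github.com/repos/{owner}/{repo} returns a 'stargazers_count' field.
def get_star_count(html_url):
    owner_repo = html_url.split('.com/')[1]
    api_url = 'https://api.github.com/repos/' + owner_repo
    resp = requests.get(api_url, auth=(myun, mypw))
    return resp.json().get('stargazers_count', 0)

We could then, for example, compute get_star_count for each repository in str_recs.index and sort the recommendations by that number.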

Another thing we could do to improve the results is to add a layer of content-based filtering. This is the hybridization step we discussed earlier. We would need to create a set of features from our own starred repositories that is indicative of the types of things we're interested in. One way to do this would be to build a feature set by tokenizing the names of the repositories we have starred, along with their descriptions.

Here's a look at my starred repositories:

As you might imagine, this would generate a set of word features, full of terms such as Python, machine learning, and data science, that we could use to vet the repositories we found using collaborative filtering. This would ensure that even users who are less similar to us still provide recommendations based on our own interests. It would also reduce the serendipity of the recommendations, which is something to consider: perhaps there's something unlike anything I currently have that I would love to see. It's certainly a possibility.

What would that content-based filtering step look like in terms of a DataFrame? The columns would be word features (n-grams) and the rows would be the repositories generated from our collaborative filtering step. We would then run the similarity process once again, using a profile built from our own starred repositories for comparison. A rough sketch follows.
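Here is what that might look like, assuming we use scikit-learn's TfidfVectorizer for the word features and cosine similarity for the scoring. For simplicity, the text for each repository is derived only from its owner/name string; in practice you would append the description field returned by the API. The variable names here are illustrative, and candidate_repos assumes the str_recs frame built above:

# A rough content-based layer: build word/n-gram features from repository names
# and score each candidate against a profile of everything we've already starred.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def to_text(url):
    # turn 'https://github.com/owner/some-repo' into 'owner some repo'
    return url.split('.com/')[1].replace('/', ' ').replace('-', ' ').replace('_', ' ')

# candidate repositories surfaced by the collaborative filtering step
candidate_repos = list(str_recs.index)

vec = TfidfVectorizer(ngram_range=(1, 2))
doc_matrix = vec.fit_transform([to_text(u) for u in candidate_repos])

# our own profile: every repository we've starred, collapsed into one document
my_profile = vec.transform([' '.join(to_text(u) for u in my_starred_repos)])

content_scores = pd.Series(cosine_similarity(my_profile, doc_matrix).ravel(),
                           index=candidate_repos).sort_values(ascending=False)
content_scores.head(10)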
