Commands

The commands used in this application load the data into memory (the cache) to make the user experience fast. Although the movie database is the same one used in Chapter 4, Web Mining Techniques (that is, 603 movies rated more than 50 times by 942 users), each movie needs a description in order to set up an information retrieval system on the movies to rate. The first command we develop takes all the movie titles in the utility matrix used in Chapter 4, Web Mining Techniques and collects the corresponding descriptions from the Open Movie Database (OMDb) online service:

from django.core.management.base import BaseCommand
import os
import optparse
import numpy as np
import json
import pandas as pd
import requests
class Command(BaseCommand):

    option_list = BaseCommand.option_list + (
            optparse.make_option('-i', '--input', dest='umatrixfile',
                                 type='string', action='store',
                                 help=('Input utility matrix file')),
            optparse.make_option('-o', '--outputplots', dest='plotsfile',
                                 type='string', action='store',
                                 help=('Output plots file')),
            optparse.make_option('--om', '--outputumatrix', dest='umatrixoutfile',
                                 type='string', action='store',
                                 help=('Output utility matrix file')),
        )
        
        
    def getplotfromomdb(self,col,df_moviesplots,df_movies,df_utilitymatrix):
        # strip any extra info after ';' in the column name
        string = col.split(';')[0]
        # column names end with the year in parentheses, e.g. 'Toy Story (1995)'
        title = string[:-6].strip()
        year = string[-5:-1]
        # seed the plot text with the title
        plot = title.encode('ascii','ignore')+'. '
        
        url = "http://www.omdbapi.com/?t="+title+"&y="+year+"&plot=full&r=json"
        
        headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"}
        r = requests.get(url,headers=headers)
        jsondata =  json.loads(r.content)
        if 'Plot' in jsondata:
            #store plot + title
            plot += jsondata['Plot'].encode('ascii','ignore')

        if plot is not None and len(plot)>3:#at least a few characters to consider the movie
            df_moviesplots.loc[len(df_moviesplots)]=[string,plot]
            df_utilitymatrix[col] = df_movies[col]
            print len(df_utilitymatrix.columns) #progress: columns collected so far

        return df_moviesplots,df_utilitymatrix
    
    def handle(self, *args, **options):
        pathutilitymatrix = options['umatrixfile']
        df_movies = pd.read_csv(pathutilitymatrix)
        movieslist = list(df_movies.columns[1:])

        df_moviesplots = pd.DataFrame(columns=['title','plot'])
        df_utilitymatrix = pd.DataFrame()
        df_utilitymatrix['user'] = df_movies['user']

        for m in movieslist:
            df_moviesplots,df_utilitymatrix = self.getplotfromomdb(m,df_moviesplots,df_movies,df_utilitymatrix)

        outputfile = options['plotsfile']
        df_moviesplots.to_csv(outputfile, index=False)
        outumatrixfile = options['umatrixoutfile']
        df_utilitymatrix.to_csv(outumatrixfile, index=False)

The command syntax is:

python manage.py get_plotsfromtitles --input=utilitymatrix.csv --outputplots=plots.csv --outputumatrix=umatrix.csv

Each movie title contained in the utilitymatrix file is used by the getplotfromomdb function to retrieve the movie's description (plot) from the website http://www.omdbapi.com/ using the Python requests module. The descriptions (and titles) of the movies are then saved in a CSV file (outputplots), together with the corresponding utility matrix (outputumatrix).
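For reference, here is a standalone sketch of the same OMDb lookup for a single title (the example title is arbitrary, and note that the current version of the OMDb API also requires an apikey parameter, which the code above predates):

import json
import requests

params = {'t': 'Toy Story', 'y': '1995', 'plot': 'full', 'r': 'json'}
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('http://www.omdbapi.com/', params=params, headers=headers)
jsondata = json.loads(r.content)
# the 'Plot' field holds the full description when the title is found
print(jsondata.get('Plot', 'no plot found'))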

The other command takes the movies' descriptions and creates an information retrieval system (a term frequency-inverse document frequency (TF-IDF) model) that allows the user to find movies by typing some relevant words. This TF-IDF model is then saved in the Django cache together with the initial recommendation system models (item-based CF and log-likelihood ratio). The code is as follows:

from django.core.management.base import BaseCommand
import os
import optparse
import numpy as np
import pandas as pd
import math
import json
import copy
from BeautifulSoup import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
tknzr = WordPunctTokenizer()
#nltk.download('stopwords')
stoplist = stopwords.words('english')
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
from sklearn.feature_extraction.text import TfidfVectorizer
from books_recsys_app.models import MovieData
from django.core.cache import cache

class Command(BaseCommand):

    option_list = BaseCommand.option_list + (
            optparse.make_option('-i', '--input', dest='input',
                                 type='string', action='store',
                                 help=('Input plots file')),
            optparse.make_option('--nmaxwords', dest='nmaxwords',
                                 type='int', action='store',
                                 help=('Maximum number of words in the tf-idf vocabulary')),
            optparse.make_option('--umatrixfile', dest='umatrixfile',
                                 type='string', action='store',
                                 help=('Input utility matrix file')),
        )
        
    def PreprocessTfidf(self,texts,stoplist=[],stem=False):
        newtexts = []
        for i in xrange(len(texts)):
            text = texts[i]
            # remove stopwords and, if requested, stem each token
            if stem:
               tmp = [stemmer.stem(w) for w in tknzr.tokenize(text) if w not in stoplist]
            else:
               tmp = [w for w in tknzr.tokenize(text) if w not in stoplist]
            newtexts.append(' '.join(tmp))
        return newtexts
    
    def handle(self, *args, **options):
        input_file = options['input']
        
        df = pd.read_csv(input_file)
        tot_textplots = df['plot'].tolist()
        tot_titles = df['title'].tolist()
        nmaxwords=options['nmaxwords']
        vectorizer = TfidfVectorizer(min_df=0,max_features=nmaxwords)
        processed_plots = self.PreprocessTfidf(tot_textplots,stoplist,True)
        mod_tfidf = vectorizer.fit(processed_plots)
        vec_tfidf = mod_tfidf.transform(processed_plots)
        ndims = len(mod_tfidf.get_feature_names())
        nmovies = len(tot_titles[:])
        
        #delete all data
        MovieData.objects.all().delete()
        
        matr = np.empty([1,ndims])
        titles = []
        cnt=0
        for m in xrange(nmovies):
            moviedata = MovieData()
            moviedata.title=tot_titles[m]
            moviedata.description=tot_textplots[m]
            moviedata.ndim= ndims
            moviedata.array=json.dumps(vec_tfidf[m].toarray()[0].tolist())
            moviedata.save()
            newrow = vec_tfidf[m].toarray()[0] #dense tf-idf vector (not the JSON string)
            if cnt==0:
                matr[0]=newrow
            else:
                matr = np.vstack([matr, newrow])
            titles.append(moviedata.title)
            cnt+=1
        #cache the tf-idf matrix, titles, and model for the views
        cache.set('data', matr)
        cache.set('titles', titles)
        cache.set('model',mod_tfidf)

        
        #load the utility matrix
        umatrixfile = options['umatrixfile']
        df_umatrix = pd.read_csv(umatrixfile)
        Umatrix = df_umatrix.values[:,1:]
        cache.set('umatrix',Umatrix)
        #load rec methods... 
        cf_itembased = CF_itembased(Umatrix)
        cache.set('cf_itembased',cf_itembased)
        llr = LogLikelihood(Umatrix,titles)
        cache.set('loglikelihood',llr)
        
from scipy.stats import pearsonr
from scipy.spatial.distance import cosine 
#similarity between two rating vectors (cosine or Pearson correlation)
def sim(x,y,metric='cos'):
    if metric == 'cos':
       return 1.-cosine(x,y)
    else:#correlation
       return pearsonr(x,y)[0]
       
class CF_itembased(object):
...        
class LogLikelihood(object):
...

To run the command, the syntax is:

python manage.py load_data --input=plots.csv --nmaxwords=30000 --umatrixfile=umatrix.csv

The input parameter takes the movies' descriptions obtained with the get_plotsfromtitles command and creates a TF-IDF model (see Chapter 4, Web Mining Techniques) using at most the number of words specified by the nmaxwords parameter. The data of each movie is also saved in a MovieData object (title, TF-IDF representation, description, and ndim, the number of words in the TF-IDF vocabulary). Note that the first time the command is run, the stopwords from nltk.download('stopwords') (commented out in the preceding code) need to be downloaded.
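As an illustration of how the cached model can later serve a user's query (the following function is a hypothetical sketch, not part of the command), the query is vectorized with the stored TfidfVectorizer and matched against the stored matrix by cosine similarity:

import numpy as np
from scipy.spatial.distance import cosine
from django.core.cache import cache

def search_movies(query, nresults=5):
    mod_tfidf = cache.get('model')   # fitted TfidfVectorizer
    matr = cache.get('data')         # movies x words tf-idf matrix
    titles = cache.get('titles')
    # in the application the query should be preprocessed (tokenized,
    # stemmed) in the same way as the plots were
    vec = mod_tfidf.transform([query]).toarray()[0]
    if not vec.any():                # no query word is in the vocabulary
        return []
    sims = [1.-cosine(vec, row) for row in matr]
    best = np.argsort(sims)[::-1][:nresults]
    return [titles[i] for i in best]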

The TF-IDF model, the list of titles, and the matrix of the movies' TF-IDF representations are saved in the Django cache using the commands:

from django.core.cache import cache
...
cache.set('model',mod_tfidf)
cache.set('data', matr)
cache.set('titles', titles)

Note

Note that the Django cache module (django.core.cache) needs to be imported at the beginning of the file before it can be used.
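In addition, for the objects cached by the management command to be visible to the web server process, a shared cache backend must be configured in settings.py. A minimal sketch, assuming a file-based backend (the location path is a placeholder):

CACHES = {
    'default': {
        'BACKEND': 'django.core.cache.backends.filebased.FileBasedCache',
        'LOCATION': '/tmp/django_cache',  # placeholder path
        'TIMEOUT': None,                  # keep entries until overwritten
    }
}

With the default local-memory backend, data stored by the command would live only in the command's own process and the views would never see it.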

In the same way, the utility matrix (the umatrixfile parameter) is used to initialize the two recommendation systems used by the application: item-based collaborative filtering and the log-likelihood ratio method. Neither method is written out in the preceding code because they are essentially the same as the code described in Chapter 5, Recommendation Systems (as usual, the full code can be seen in the chapter_7 folder of the author's GitHub repository); an illustrative sketch of the item-based class is given after the following snippet. The methods and the utility matrix are then loaded into the Django cache, ready to use:

cache.set('umatrix',Umatrix)
cache.set('cf_itembased',cf_itembased)
cache.set('loglikelihood',llr)
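Since the class bodies are elided above, the following is a minimal illustrative stand-in for CF_itembased (not the actual Chapter 5 implementation), built on the sim function defined at the bottom of the file:

import numpy as np

class CF_itembased(object):
    def __init__(self, Umatrix):
        # precompute the item-item similarity matrix with sim
        Umatrix = Umatrix.astype(float)
        nitems = Umatrix.shape[1]
        self.simmatrix = np.zeros((nitems, nitems))
        for i in range(nitems):
            for j in range(i, nitems):
                s = sim(Umatrix[:, i], Umatrix[:, j])
                self.simmatrix[i, j] = self.simmatrix[j, i] = s

    def predict(self, user_ratings, item):
        # estimate: similarity-weighted average of the user's ratings
        rated = [i for i, r in enumerate(user_ratings) if r > 0 and i != item]
        den = sum(abs(self.simmatrix[item, i]) for i in rated)
        if den == 0.:
            return 0.
        return sum(self.simmatrix[item, i]*user_ratings[i] for i in rated)/den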

Now the data (and models) can be used in the web pages simply by retrieving the corresponding name from the cache, as we will see in the following sections.
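For example, a view can retrieve any of the cached objects by the name it was stored under (the view and template names here are hypothetical):

from django.core.cache import cache
from django.shortcuts import render

def home(request):
    titles = cache.get('titles')  # the list stored by the load_data command
    return render(request, 'home.html', {'titles': titles})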
