The commands used in this application are needed to load the data into memory (the cache) and make the user experience fast. Although the movie database is the same as the one used in Chapter 4, Web Mining Techniques (that is, 603 movies rated more than 50 times by 942 users), each movie needs a description in order to set up an information retrieval system on the movies to rate. The first command we develop takes all the movie titles in the utility matrix used in Chapter 4, Web Mining Techniques and collects the corresponding descriptions from the Open Movie Database (OMDb) online service:
from django.core.management.base import BaseCommand
import os
import optparse
import numpy as np
import json
import pandas as pd
import requests

class Command(BaseCommand):
    option_list = BaseCommand.option_list + (
        optparse.make_option('-i', '--input',
                             dest='umatrixfile', type='string',
                             action='store',
                             help=('Input utility matrix')),
        optparse.make_option('-o', '--outputplots',
                             dest='plotsfile', type='string',
                             action='store',
                             help=('output file')),
        optparse.make_option('--om', '--outputumatrix',
                             dest='umatrixoutfile', type='string',
                             action='store',
                             help=('output file')),
    )

    def getplotfromomdb(self, col, df_moviesplots, df_movies, df_utilitymatrix):
        # column label format: 'title (year);...'
        string = col.split(';')[0]
        title = string[:-6].strip()
        year = string[-5:-1]
        plot = ' '.join(title.split(' ')).encode('ascii', 'ignore') + '. '
        url = "http://www.omdbapi.com/?t=" + title + "&y=" + year + "&plot=full&r=json"
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) \
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2049.0 Safari/537.36"}
        r = requests.get(url, headers=headers)
        jsondata = json.loads(r.content)
        if 'Plot' in jsondata:
            # store plot + title
            plot += jsondata['Plot'].encode('ascii', 'ignore')

        if plot != None and plot != '' and len(plot) > 3:  # at least 3 letters to consider the movie
            df_moviesplots.loc[len(df_moviesplots)] = [string, plot]
            df_utilitymatrix[col] = df_movies[col]
            print len(df_utilitymatrix.columns)

        return df_moviesplots, df_utilitymatrix

    def handle(self, *args, **options):
        pathutilitymatrix = options['umatrixfile']
        df_movies = pd.read_csv(pathutilitymatrix)
        movieslist = list(df_movies.columns[1:])

        df_moviesplots = pd.DataFrame(columns=['title', 'plot'])
        df_utilitymatrix = pd.DataFrame()
        df_utilitymatrix['user'] = df_movies['user']

        for m in movieslist[:]:
            df_moviesplots, df_utilitymatrix = self.getplotfromomdb(
                m, df_moviesplots, df_movies, df_utilitymatrix)

        outputfile = options['plotsfile']
        df_moviesplots.to_csv(outputfile, index=False)
        outumatrixfile = options['umatrixoutfile']
        df_utilitymatrix.to_csv(outumatrixfile, index=False)
python manage.py get_plotsfromtitles --input=utilitymatrix.csv --outputplots=plots.csv --outputumatrix=umatrix.csv
Each movie title contained in the utilitymatrix file is used by the getplotfromomdb function to retrieve the movie's description (plot) from the http://www.omdbapi.com/ website, using the requests Python module. The descriptions (and titles) of the movies are then saved in a CSV file (outputplots), together with the corresponding utility matrix (outputumatrix).
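Note how getplotfromomdb extracts the title and the year: it assumes each column label of the utility matrix starts with the title followed by the release year in parentheses. A minimal sketch of the slicing, using a hypothetical column label:

col = 'Toy Story (1995);comedy'  # hypothetical column label for illustration
string = col.split(';')[0]       # 'Toy Story (1995)'
title = string[:-6].strip()      # 'Toy Story': drop the trailing ' (YYYY)'
year = string[-5:-1]             # '1995'
url = "http://www.omdbapi.com/?t=" + title + "&y=" + year + "&plot=full&r=json"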
The other command takes the movies' descriptions and creates an information retrieval system (a Term Frequency-Inverse Document Frequency (TF-IDF) model) that allows the user to find movies by typing some relevant words. The TF-IDF model is then saved in the Django cache, together with the initial recommendation system models (item-based CF and log-likelihood ratio). The code is as follows:
from django.core.management.base import BaseCommand
import os
import optparse
import numpy as np
import pandas as pd
import math
import json
import copy
from BeautifulSoup import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import WordPunctTokenizer
tknzr = WordPunctTokenizer()
#nltk.download('stopwords')
stoplist = stopwords.words('english')
from nltk.stem.porter import PorterStemmer
stemmer = PorterStemmer()
from sklearn.feature_extraction.text import TfidfVectorizer
from books_recsys_app.models import MovieData
from django.core.cache import cache

class Command(BaseCommand):
    option_list = BaseCommand.option_list + (
        optparse.make_option('-i', '--input',
                             dest='input', type='string',
                             action='store',
                             help=('Input plots file')),
        optparse.make_option('--nmaxwords', '--nmaxwords',
                             dest='nmaxwords', type='int',
                             action='store',
                             help=('nmaxwords')),
        optparse.make_option('--umatrixfile', '--umatrixfile',
                             dest='umatrixfile', type='string',
                             action='store',
                             help=('umatrixfile')),
    )

    def PreprocessTfidf(self, texts, stoplist=[], stem=False):
        newtexts = []
        for i in xrange(len(texts)):
            text = texts[i]
            if stem:
                # stem each token that is not a stopword
                tmp = [stemmer.stem(w) for w in
                       [w for w in tknzr.tokenize(text) if w not in stoplist]]
            else:
                tmp = [w for w in tknzr.tokenize(text) if w not in stoplist]
            newtexts.append(' '.join(tmp))
        return newtexts

    def handle(self, *args, **options):
        input_file = options['input']
        df = pd.read_csv(input_file)
        tot_textplots = df['plot'].tolist()
        tot_titles = df['title'].tolist()
        nmaxwords = options['nmaxwords']
        vectorizer = TfidfVectorizer(min_df=0, max_features=nmaxwords)
        processed_plots = self.PreprocessTfidf(tot_textplots, stoplist, True)
        mod_tfidf = vectorizer.fit(processed_plots)
        vec_tfidf = mod_tfidf.transform(processed_plots)
        ndims = len(mod_tfidf.get_feature_names())
        nmovies = len(tot_titles[:])

        # delete all data
        MovieData.objects.all().delete()

        matr = np.empty([1, ndims])
        titles = []
        cnt = 0
        for m in xrange(nmovies):
            moviedata = MovieData()
            moviedata.title = tot_titles[m]
            moviedata.description = tot_textplots[m]
            moviedata.ndim = ndims
            moviedata.array = json.dumps(vec_tfidf[m].toarray()[0].tolist())
            moviedata.save()
            newrow = vec_tfidf[m].toarray()[0]  # dense tf-idf row for the in-memory matrix
            if cnt == 0:
                matr[0] = newrow
            else:
                matr = np.vstack([matr, newrow])
            titles.append(moviedata.title)
            cnt += 1

        # cache the tf-idf matrix, titles, and model
        cache.set('data', matr)
        cache.set('titles', titles)
        cache.set('model', mod_tfidf)

        # load the utility matrix
        umatrixfile = options['umatrixfile']
        df_umatrix = pd.read_csv(umatrixfile)
        Umatrix = df_umatrix.values[:, 1:]
        cache.set('umatrix', Umatrix)

        # load rec methods...
        cf_itembased = CF_itembased(Umatrix)
        cache.set('cf_itembased', cf_itembased)
        llr = LogLikelihood(Umatrix, titles)
        cache.set('loglikelihood', llr)

from scipy.stats import pearsonr
from scipy.spatial.distance import cosine

def sim(x, y, metric='cos'):
    if metric == 'cos':
        return 1. - cosine(x, y)
    else:  # correlation
        return pearsonr(x, y)[0]

class CF_itembased(object):
    ...
class LogLikelihood(object):
    ...
To run the command, the syntax is:
python manage.py load_data --input=plots.csv --nmaxwords=30000 --umatrixfile=umatrix.csv
The input parameter takes the movies' descriptions obtained using the get_plotsfromtitles command and creates a tf-idf model (see Chapter 4, Web Mining Techniques) using at most the number of words specified by the nmaxwords parameter. The data of each movie is also saved in a MovieData object (title, tf-idf representation, description, and ndim, the number of words in the tf-idf vocabulary). Note that the first time the command is run, the stopwords from nltk.download('stopwords') (commented out in the preceding code) need to be downloaded, as shown in the snippet below.
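This is a one-off step; a minimal sketch, run from a Python shell before the first invocation of load_data:

import nltk
# one-off download of the English stopword list used by the load_data command
nltk.download('stopwords')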
The tf-idf model, the list of titles, and the matrix of tf-idf movie representations are saved in the Django cache using the following commands:
from django.core.cache import cache
...
cache.set('model', mod_tfidf)
cache.set('data', matr)
cache.set('titles', titles)
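To see how these cached objects fit together, the following is a minimal sketch of a free-text movie search; the query string is a hypothetical example, and in the application the query would be preprocessed like the plots were (tokenization, stopword removal, and stemming):

from django.core.cache import cache
from scipy.spatial.distance import cosine

mod_tfidf = cache.get('model')   # fitted tf-idf model
matr = cache.get('data')         # movies x words tf-idf matrix
titles = cache.get('titles')     # movie titles, row-aligned with matr

query = 'war spaceship'          # hypothetical user query
query_vec = mod_tfidf.transform([query]).toarray()[0]
sims = [1. - cosine(query_vec, row) for row in matr]
# top five titles by cosine similarity to the query
best = sorted(zip(titles, sims), key=lambda x: -x[1])[:5]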
In the same way, the utility matrix (the umatrixfile parameter) is used to initialize the two recommendation systems used by the application: item-based collaborative filtering and the log-likelihood ratio method. Neither method is written out in the preceding code because they are essentially the same as the code described in Chapter 5, Recommendation Systems (as usual, the full code can be found in the chapter_7 folder of the author's GitHub repository). The methods and the utility matrix are then loaded into the Django cache, ready to use:
cache.set('umatrix', Umatrix)
cache.set('cf_itembased', cf_itembased)
cache.set('loglikelihood', llr)
Now the data (and models) can be used in the web pages simply by retrieving them from the cache under the corresponding name, as we will see in the following sections.
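As a taste of what is to come, here is a minimal sketch of how a view might pull the cached objects back out; the view itself is a hypothetical placeholder, while the cache keys are the ones set by the load_data command:

from django.core.cache import cache
from django.http import HttpResponse

def home_sketch(request):  # hypothetical view, for illustration only
    titles = cache.get('titles')
    cf_itembased = cache.get('cf_itembased')
    llr = cache.get('loglikelihood')
    # ...rank titles for the current user with cf_itembased or llr...
    return HttpResponse('cache loaded with %d movies' % len(titles))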