The Cython programming language acts as glue between Python and C/C++. With the Cython tools, we can compile plain Python code into C code, which runs closer to the machine level. The following command will install Cython:
$ pip install cython
The cytoolz package contains utilities created by Cythonizing the handy Python toolz package. Install cytoolz as follows:
$ pip install cytoolz
$ pip freeze|grep cytoolz
cytoolz==0.7.0
Just as in cooking shows, we will show the results of Cythonizing before going through the process involved (deferred to the next section). The timeit Python module measures time. We will use this module to measure different functions. Define the following function, which accepts as arguments a short code snippet (such as a function call) and the number of times the code will run:
def time(code, n):
    times = min(timeit.Timer(code, setup=setup).repeat(3, n))

    return round(1000* np.array(times)/n, 3)
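As a minimal sketch of how this helper behaves, the following Python 3 example applies the same best-of-three pattern to a much smaller setup string. The name `time_ms`, the tiny `setup` string, and the `squares()` function are hypothetical and chosen only for illustration; plain division stands in for the NumPy call used above.

```python
import timeit

# Hypothetical setup string: it is executed once before each of the
# three timing repetitions, so squares() exists when the snippet runs.
setup = '''
def squares():
    return [i * i for i in range(1000)]
'''


def time_ms(code, n, setup=setup):
    """Return the best of 3 average times per loop, in milliseconds."""
    # repeat(3, n) returns three totals, each covering n executions.
    best = min(timeit.Timer(code, setup=setup).repeat(3, n))
    return round(1000 * best / n, 3)


if __name__ == '__main__':
    print("squares", time_ms("squares()", 100))
```

Taking the minimum of the repetitions, rather than the mean, reduces the influence of other processes that happen to slow down a single run.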
We predefine a large setup string containing all the code. The code is in the timeits.py file in this book's code bundle (the code uses cython_module built on your machine):
import timeit
import numpy as np

setup = '''
import nltk
import cython_module as cm
import collections
from nltk.corpus import stopwords
from nltk.corpus import movie_reviews
from nltk.corpus import names
import string
import pandas as pd
import cytoolz

sw = set(stopwords.words('english'))
punctuation = set(string.punctuation)
all_names = set([name.lower() for name in names.words()])
txt = movie_reviews.words(movie_reviews.fileids()[0])

def isStopWord(w):
    return w in sw or w in punctuation

def isStopWord2(w):
    return w in sw or w in punctuation or not w.isalpha()

def isStopWord3(w):
    return w in sw or len(w) == 1 or not w.isalpha() or w in all_names

def isStopWord4(w):
    return w in sw or len(w) == 1

def freq_dict(words):
    dd = collections.defaultdict(int)

    for word in words:
        dd[word] += 1

    return dd

def zero_init():
    features = {}

    for word in set(txt):
        features['count (%s)' % word] = (0)

def zero_init2():
    features = {}

    for word in set(txt):
        features[word] = (0)

keys = list(set(txt))

def zero_init3():
    features = dict.fromkeys(keys, 0)

zero_dict = dict.fromkeys(keys, 0)

def dict_copy():
    features = zero_dict.copy()
'''

def time(code, n):
    times = min(timeit.Timer(code, setup=setup).repeat(3, n))

    return round(1000* np.array(times)/n, 3)

if __name__ == '__main__':
    print "Best of 3 times per loop in milliseconds"
    n = 10
    print "zero_init ", time("zero_init()", n)
    print "zero_init2", time("zero_init2()", n)
    print "zero_init3", time("zero_init3()", n)
    print "dict_copy ", time("dict_copy()", n)
    print

    n = 10**2
    print "isStopWord ", time('[w.lower() for w in txt if not isStopWord(w.lower())]', n)
    print "isStopWord2", time('[w.lower() for w in txt if not isStopWord2(w.lower())]', n)
    print "isStopWord3", time('[w.lower() for w in txt if not isStopWord3(w.lower())]', n)
    print "isStopWord4", time('[w.lower() for w in txt if not isStopWord4(w.lower())]', n)
    print "Cythonized isStopWord", time('[w.lower() for w in txt if not cm.isStopWord(w.lower())]', n)
    print "Cythonized filter_sw()", time('cm.filter_sw(txt)', n)
    print
    print "FreqDist", time("nltk.FreqDist(txt)", n)
    print "Default dict", time('freq_dict(txt)', n)
    print "Counter", time('collections.Counter(txt)', n)
    print "Series", time('pd.Series(txt).value_counts()', n)
    print "Cytoolz", time('cytoolz.frequencies(txt)', n)
    print "Cythonized freq_dict", time('cm.freq_dict(txt)', n)
So, we have several isStopWord() function versions with the following running times in milliseconds:
isStopWord              0.843
isStopWord2             0.902
isStopWord3             0.963
isStopWord4             0.869
Cythonized isStopWord   0.924
Cythonized filter_sw()  0.887
For comparison, we also have the running time of a plain pass statement. The Cythonized isStopWord() is based on the isStopWord3() function (the most elaborate filter). If we look at the doc_features() function in prof_demo.py, it becomes obvious that we shouldn't go over each word feature. Instead, we should just intersect the set of words in a document and the words chosen as features. All the other word counts can be safely set to zero. In fact, it's best if we initialize all the values to zero once and copy this dictionary. For the corresponding functions, we get the following execution times:
zero_init   0.61
zero_init2  0.555
zero_init3  0.017
dict_copy   0.011
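The approach behind zero_init3() and dict_copy() can be sketched as follows: build the all-zero dictionary once, copy it for each document, and update only the words found in both the document and the feature words. This is a Python 3 sketch under stated assumptions: `feature_words`, `doc_features_fast()`, and the sample document are hypothetical and are not the doc_features() function from prof_demo.py.

```python
import collections

# Hypothetical feature words; in practice these would be chosen beforehand.
feature_words = ['good', 'bad', 'plot', 'actor', 'boring']

# Build the all-zero dictionary once, outside the per-document function.
zero_dict = dict.fromkeys(feature_words, 0)
feature_set = set(feature_words)


def doc_features_fast(doc):
    """Return a count for every feature word, zero if absent from doc."""
    counts = collections.Counter(doc)
    # Copying the prebuilt dictionary avoids re-initializing every key.
    features = zero_dict.copy()
    # Only words that are both in the document and among the features
    # need updating; every other key keeps its zero.
    for word in feature_set & set(counts):
        features[word] = counts[word]
    return features


if __name__ == '__main__':
    doc = ['the', 'plot', 'was', 'good', 'really', 'good']
    print(doc_features_fast(doc))
```

Because each call works on a copy, zero_dict itself is never modified and can be reused for every document.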
Another improvement is to use the Python defaultdict class instead of the NLTK FreqDist class. The related routines have the following run times:
FreqDist              2.206
Default dict          0.674
Counter               0.79
Series                7.006
Cytoolz               0.542
Cythonized freq_dict  0.616
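As we can see, the Cythonized versions are faster than their pure Python counterparts (0.924 ms versus 0.963 ms for isStopWord3(), and 0.616 ms versus 0.674 ms for the defaultdict version), although not by much.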
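To make the frequency-counting alternatives concrete, here is a small Python 3 sketch that checks the defaultdict routine from the setup string against collections.Counter on a tiny input. The sample word list is hypothetical, and the sketch is limited to the standard library, so the NLTK, pandas, cytoolz, and Cythonized variants are left out.

```python
import collections


def freq_dict(words):
    """Count word frequencies with a defaultdict, as in the setup code."""
    dd = collections.defaultdict(int)
    for word in words:
        dd[word] += 1
    return dd


if __name__ == '__main__':
    # Hypothetical sample; the timings above use a movie review from NLTK.
    words = ['a', 'b', 'a', 'c', 'b', 'a']
    print(dict(freq_dict(words)))
    # Counter yields the same counts through a richer interface.
    print(dict(freq_dict(words)) == dict(collections.Counter(words)))
```

Since both routines produce the same mapping, the choice between them can be made on speed and convenience alone.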