Installing Cython

The Cython programming language acts as glue between Python and C/C++. With the Cython tools, we can compile plain Python code to C, which brings it closer to the machine level. The following command will install Cython:

$ pip install cython
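
To verify the installation, you can ask the Cython compiler for its version:

$ cython --version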

The cytoolz package contains utilities created by Cythonizing the handy Python toolz package. Install cytoolz as follows:

$ pip install cytoolz
$ pip freeze|grep cytoolz
cytoolz==0.7.0
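
As a quick smoke test, cytoolz mirrors the toolz API; for instance, frequencies() counts the occurrences of each item in a sequence (the word list here is just an illustration):

import cytoolz

# frequencies() returns a dict mapping each item to its count
print(cytoolz.frequencies(['apple', 'pear', 'apple']))
# {'apple': 2, 'pear': 1}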

Just as in cooking shows, we will show the results of Cythonizing before going through the process involved (deferred to the next section). The timeit module from the Python standard library measures the execution time of small code snippets. We will use it to time the various functions. Define the following function, which accepts as arguments a short code snippet (a string containing a function call) and the number of times the code will run:

def time(code, n):
    # Best of three repeats; each repeat executes the snippet n times
    best = min(timeit.Timer(code, setup=setup).repeat(3, n))

    # Milliseconds per single call, rounded to three decimals
    return round(1000 * best / n, 3)

We predefine a large setup string containing all the required code. The full listing is in the timeits.py file in this book's code bundle (it uses the cython_module extension built on your machine):

import timeit

setup = '''
import nltk
import cython_module as cm
import collections
from nltk.corpus import stopwords
from nltk.corpus import movie_reviews
from nltk.corpus import names
import string
import pandas as pd
import cytoolz


sw = set(stopwords.words('english'))
punctuation = set(string.punctuation)
all_names = set([name.lower() for name in names.words()])
txt = movie_reviews.words(movie_reviews.fileids()[0])

def isStopWord(w):
    return w in sw or w in punctuation

def isStopWord2(w):
    return w in sw or w in punctuation or not w.isalpha()

def isStopWord3(w):
    return w in sw or len(w) == 1 or not w.isalpha() or w in all_names

def isStopWord4(w):
    return w in sw or len(w) == 1

def freq_dict(words):
    dd = collections.defaultdict(int)

    for word in words:
        dd[word] += 1

    return dd

def zero_init():
    features = {}

    for word in set(txt):
        features['count (%s)' % word] = 0

def zero_init2():
    features = {}
    for word in set(txt):
        features[word] = 0

keys = list(set(txt))

def zero_init3():
    features = dict.fromkeys(keys, 0)

zero_dict = dict.fromkeys(keys, 0)

def dict_copy():
    features = zero_dict.copy()
'''

def time(code, n):
    # Best of three repeats; each repeat executes the snippet n times
    best = min(timeit.Timer(code, setup=setup).repeat(3, n))

    # Milliseconds per single call, rounded to three decimals
    return round(1000 * best / n, 3)

if __name__ == '__main__':
    print("Best of 3 times per loop in milliseconds")
    n = 10
    print("zero_init ", time("zero_init()", n))
    print("zero_init2", time("zero_init2()", n))
    print("zero_init3", time("zero_init3()", n))
    print("dict_copy ", time("dict_copy()", n))
    print()

    n = 10**2
    print("isStopWord ", time('[w.lower() for w in txt if not isStopWord(w.lower())]', n))
    print("isStopWord2", time('[w.lower() for w in txt if not isStopWord2(w.lower())]', n))
    print("isStopWord3", time('[w.lower() for w in txt if not isStopWord3(w.lower())]', n))
    print("isStopWord4", time('[w.lower() for w in txt if not isStopWord4(w.lower())]', n))
    print("Cythonized isStopWord", time('[w.lower() for w in txt if not cm.isStopWord(w.lower())]', n))
    print("Cythonized filter_sw()", time('cm.filter_sw(txt)', n))
    print()
    print("FreqDist", time("nltk.FreqDist(txt)", n))
    print("Default dict", time('freq_dict(txt)', n))
    print("Counter", time('collections.Counter(txt)', n))
    print("Series", time('pd.Series(txt).value_counts()', n))
    print("Cytoolz", time('cytoolz.frequencies(txt)', n))
    print("Cythonized freq_dict", time('cm.freq_dict(txt)', n))

So, we have several versions of the isStopWord() function, with the following running times in milliseconds:

isStopWord  0.843
isStopWord2 0.902
isStopWord3 0.963
isStopWord4 0.869
Cythonized isStopWord 0.924
Cythonized filter_sw() 0.887

For comparison, we also measured the running time of a plain pass statement. The Cythonized isStopWord() is based on the isStopWord3() function (the most elaborate filter). If we look at the doc_features() function in prof_demo.py, it becomes obvious that we shouldn't go over each word feature. Instead, we should just intersect the set of words in a document with the set of words chosen as features. All the other word counts can be safely set to zero. In fact, it's best to initialize all the values to zero once and copy this dictionary. For the corresponding functions, we get the following execution times (a sketch of the idea follows the timings):

zero_init  0.61
zero_init2 0.555
zero_init3 0.017
dict_copy  0.011
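
To make the zero-initialize-and-copy idea concrete, here is a minimal sketch; the feature words and the doc_features_fast() name are illustrative, not part of the book's code bundle:

from collections import Counter

feature_words = {'great', 'terrible', 'plot'}    # hypothetical feature set
zero_features = dict.fromkeys(feature_words, 0)  # built once, up front

def doc_features_fast(doc_words):
    features = zero_features.copy()              # cheap copy instead of rebuilding
    counts = Counter(doc_words)
    # Only update the features that actually occur in the document
    for word in set(doc_words) & feature_words:
        features[word] = counts[word]
    return features

print(doc_features_fast(['great', 'plot', 'great', 'acting']))
# {'terrible': 0, 'great': 2, 'plot': 1} (key order may vary)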

Another improvement is to use the standard library's collections.defaultdict class instead of the NLTK FreqDist class. The related routines have the following run times (a standalone comparison follows the timings):

FreqDist 2.206
Default dict 0.674
Counter 0.79
Series 7.006
Cytoolz 0.542
Cythonized freq_dict 0.616
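
For reference, here is a minimal standalone version of the two pure-Python counting approaches timed above (the word list is illustrative):

import collections

words = ['the', 'movie', 'the', 'plot']  # illustrative input

# defaultdict(int) returns 0 for missing keys, so we can increment blindly
dd = collections.defaultdict(int)
for w in words:
    dd[w] += 1

# collections.Counter produces the same counts in a single call
print(collections.Counter(words) == dd)  # True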

As we can see, each Cythonized version is slightly faster than the Python function it is based on, although not by much; cytoolz.frequencies() remains the fastest frequency counter overall.
