We can call C functions from Cython. For example, the strlen() function from the C string library counts the characters in a NUL-terminated C string, making it the rough C counterpart of the Python len() function. Call it from a Cython .pyx file by importing it as follows:

```python
from libc.string cimport strlen
```
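For instance, a minimal wrapper (a hypothetical c_len() function, not part of the book's code bundle) makes the mechanism visible:

```python
from libc.string cimport strlen

def c_len(s):
    # strlen() expects a NUL-terminated char*; Cython converts the
    # Python byte string for us and counts bytes up to the first NUL
    return strlen(s)
```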
We can then call strlen() anywhere in the .pyx file, which can otherwise contain any regular Python code. Have a look at the cython_module.pyx file in this book's code bundle:
```python
from collections import defaultdict
from nltk.corpus import stopwords
from nltk.corpus import names
from libc.string cimport strlen

sw = set(stopwords.words('english'))
all_names = set([name.lower() for name in names.words()])

def isStopWord(w):
    return w in sw or strlen(w) == 1 or not w.isalpha() or w in all_names

def filter_sw(words):
    return [w.lower() for w in words if not isStopWord(w.lower())]

def freq_dict(words):
    dd = defaultdict(int)

    for word in words:
        dd[word] += 1

    return dd
```
To compile this code, we need a setup.py file with the following contents:
```python
from distutils.core import setup
from Cython.Build import cythonize

setup(
    ext_modules = cythonize("cython_module.pyx")
)
```
Compile the code with the following command:
```
$ python setup.py build_ext --inplace
```
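If the build succeeds, Cython compiles the module to a shared library (cython_module.so on Unix-like systems, .pyd on Windows) in the current directory, which is importable like any other Python module. As a quick sanity check, a hypothetical interactive session (the exact output depends on the NLTK corpora installed on your machine) might look like this:

```
$ python
>>> import cython_module as cm
>>> cm.filter_sw(['The', 'sentiment', 'analysis', 'demo'])
['sentiment', 'analysis', 'demo']
```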
We can now modify the sentiment analysis program to call the Cython functions, adding the improvements mentioned in the previous section as well. Since we are going to use some of the functions over and over again, they were extracted into the core.py file in this book's code bundle. Check out the cython_demo.py file in this book's code bundle (the code uses the cython_module built on your machine):
```python
# … NLTK imports omitted …
import cython_module as cm
import cytoolz
from core import label_docs
from core import filter_corpus
from core import split_data

def select_word_features(corpus):
    # keep the top 2% most frequent words as features
    words = cytoolz.frequencies(corpus)
    sorted_words = sorted(words, key=words.get)
    N = int(.02 * len(sorted_words))

    return sorted_words[-N:]

def match(a, b):
    return set(a.keys()).intersection(b)

def doc_features(doc):
    doc_words = cytoolz.frequencies(cm.filter_sw(doc))

    # initialize to 0
    features = zero_features.copy()
    word_matches = match(doc_words, word_features)

    for word in word_matches:
        features[word] = doc_words[word]

    return features

def make_features(docs):
    return [(doc_features(d), c) for (d, c) in docs]

if __name__ == "__main__":
    labeled_docs = label_docs()
    filtered = filter_corpus()
    word_features = select_word_features(filtered)
    zero_features = dict.fromkeys(word_features, 0)
    featuresets = make_features(labeled_docs)
    train_set, test_set = split_data(featuresets)
    classifier = NaiveBayesClassifier.train(train_set)
    print "Accuracy", accuracy(classifier, test_set)
    print classifier.show_most_informative_features()
```
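The three helpers imported from core.py are not shown above. The following is a plausible sketch inferred purely from how they are used in cython_demo.py; the authoritative version is the core.py file in this book's code bundle, and details such as the shuffle seed and the train/test split ratio are assumptions:

```python
# core.py (sketch) - inferred from usage; see the book's code bundle
import random
from nltk.corpus import movie_reviews
import cython_module as cm

def label_docs():
    # pair each movie review with its sentiment label and shuffle
    docs = [(list(movie_reviews.words(fid)), cat)
            for cat in movie_reviews.categories()
            for fid in movie_reviews.fileids(cat)]
    random.seed(42)  # assumed fixed seed for reproducibility
    random.shuffle(docs)

    return docs

def filter_corpus():
    # lowercase the whole corpus and drop stopwords via the Cython module
    return cm.filter_sw(movie_reviews.words())

def split_data(sets):
    # assumed 90/10 train/test split
    cutoff = int(0.9 * len(sets))

    return sets[:cutoff], sets[cutoff:]
```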
The following table summarizes the results of the time command (all values are in seconds; the lowest value in each row is shown in parentheses):
| Types of time | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| real | (9.974) | 9.995 | 10.024 |
| user | (9.618) | 9.682 | 9.713 |
| sys | 0.404 | 0.365 | (0.36) |