Calling C code

We can call C functions from Cython. The C strlen() function counts the bytes in a NUL-terminated C string, which makes it the rough equivalent of the Python len() function for byte strings. Call this function from a Cython .pyx file by importing it as follows:

from libc.string cimport strlen
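
As a minimal illustration (the c_len() helper below is hypothetical and not part of the book's code bundle), strlen() behaves like len() for byte strings:

from libc.string cimport strlen

def c_len(char *s):
    # strlen() counts the bytes before the terminating NUL character
    return strlen(s)

Here, c_len('cython') returns 6, just as len('cython') does.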

We can then call strlen() from anywhere else in the .pyx file; apart from such imports, the .pyx file can contain ordinary Python code. Have a look at the cython_module.pyx file in this book's code bundle:

from collections import defaultdict
from nltk.corpus import stopwords
from nltk.corpus import names
from libc.string cimport strlen

sw = set(stopwords.words('english'))
all_names = set([name.lower() for name in names.words()])

def isStopWord(w):
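    # treat stopwords, single characters, non-alphabetic tokens, and names as noise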
    return w in sw or strlen(w) == 1 or not w.isalpha() or w in all_names

def filter_sw(words):
    return [w.lower() for w in words if not isStopWord(w.lower())]

def freq_dict(words):
    dd = defaultdict(int)

    for word in words:
        dd[word] += 1

    return dd

To compile this code, we need a setup.py file with the following contents:

from distutils.core import setup
from Cython.Build import cythonize

setup(
    ext_modules = cythonize("cython_module.pyx")
)

Compile the code with the following command:

$ python setup.py build_ext --inplace
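
This produces a cython_module extension in the current directory, which imports like any other Python module. As a quick smoke test (an illustrative session; the exact output depends on the NLTK corpora installed on your machine):

$ python
>>> import cython_module as cm
>>> cm.filter_sw(['The', 'running', 'total'])
['running', 'total']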

We can now modify the sentiment analysis program to call the Cython functions. We will also add the improvements mentioned in the previous section. Since some of the functions are used over and over again, they were extracted into the core.py file in this book's code bundle.
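
core.py itself is not listed in this section. As a rough sketch only (the function bodies below are assumptions about what such helpers would do with the movie_reviews corpus, not the book's actual code), it provides something along these lines:

from random import shuffle
from nltk.corpus import movie_reviews
import cython_module as cm

def label_docs():
    # pair each review's word list with its pos/neg category
    docs = [(list(movie_reviews.words(fid)), cat)
            for cat in movie_reviews.categories()
            for fid in movie_reviews.fileids(cat)]
    shuffle(docs)
    return docs

def filter_corpus():
    # drop stopwords from the whole corpus with the Cython module
    return cm.filter_sw(movie_reviews.words())

def split_data(sets):
    # hold out the last 10 percent of the feature sets for testing
    cutoff = int(.9 * len(sets))
    return sets[:cutoff], sets[cutoff:]

Check out the cython_demo.py file in this book's code bundle (the code uses the cython_module built on your machine):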

… NLTK imports omitted …
import cython_module as cm
import cytoolz
from core import label_docs
from core import filter_corpus
from core import split_data


def select_word_features(corpus):
    # rank the words by frequency and keep only the top 2 percent
    words = cytoolz.frequencies(corpus)
    sorted_words = sorted(words, key=words.get)
    N = int(.02 * len(sorted_words))

    return sorted_words[-N:]

def match(a, b):
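    # e.g. match({'good': 2, 'bad': 1}, {'good', 'fine'}) returns {'good'}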
    return set(a.keys()).intersection(b)

def doc_features(doc):
    doc_words = cytoolz.frequencies(cm.filter_sw(doc))

    # initialize to 0
    features = zero_features.copy()

    word_matches = match(doc_words, word_features)

    for word in word_matches:
        features[word] = doc_words[word]

    return features

def make_features(docs):
    return [(doc_features(d), c) for (d,c) in docs]

if __name__ == "__main__":
    labeled_docs = label_docs()
    filtered = filter_corpus()
    word_features = select_word_features(filtered)
    zero_features = dict.fromkeys(word_features, 0)
    featuresets = make_features(labeled_docs)
    train_set, test_set = split_data(featuresets)
    classifier = NaiveBayesClassifier.train(train_set)
    print "Accuracy", accuracy(classifier, test_set)
    print classifier.show_most_informative_features()
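
Time the program with the Unix time command, for example:

$ time python cython_demo.py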

The following table summarizes the timings measured with the time command (the lowest value in each row appears in parentheses):

Types of time   Run 1     Run 2    Run 3
real            (9.974)   9.995    10.024
user            (9.618)   9.682    9.713
sys             0.404     0.365    (0.36)
