Touring powerful NLP libraries in Python

After a short list of real-world applications of NLP, we will spend the rest of this chapter touring the essential stack of Python NLP libraries. These packages handle a wide range of NLP tasks, including those mentioned above as well as many others, such as sentiment analysis, text classification, and named entity recognition.

The most famous NLP libraries in Python include the Natural Language Toolkit (NLTK), Gensim, and TextBlob. The scikit-learn library also has NLP-related features. NLTK (http://www.nltk.org/) was originally developed for educational purposes and is now widely used in industry as well. There is a saying that you can't talk about NLP without mentioning NLTK. It is the most famous and leading platform for building Python-based NLP applications. We can install it simply by running the sudo pip install -U nltk command in the terminal.

NLTK comes with over 50 collections of large and well-structured text datasets, which are called corpora in NLP. Corpora can be used as dictionaries for checking word occurrences and as training pools for learning and validating models. Some useful and interesting corpora include the Web text corpus, the Twitter samples, the Shakespeare corpus sample, the Sentiment Polarity corpus, the Names corpus (which contains lists of popular names and which we will explore very shortly), WordNet, and the Reuters benchmark corpus. The full list can be found at http://www.nltk.org/nltk_data. Before using any of these corpus resources, we first need to download them by running the following commands in the Python interpreter:

>>> import nltk
>>> nltk.download()

A new window will pop up asking which collection or specific corpus to download.
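
Alternatively, when the interactive downloader is inconvenient (for example, on a remote server), a single resource can be fetched by its identifier. Here is a minimal sketch that grabs just the Names corpus we are about to use:

>>> nltk.download('names')   # downloads only the names corpus; returns True on success
True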

Installing the whole collection (or at least the popular subset) is strongly recommended, since it contains all the important corpora needed for our current study and future research. Once the download is complete, we can take a look at the Names corpus:
First, import the corpus:

>>> from nltk.corpus import names

The first ten names in the list can be displayed with the following:

>>> print(names.words()[:10])
['Abagael', 'Abagail', 'Abbe', 'Abbey', 'Abbi', 'Abbie', 'Abby', 'Abigael', 'Abigail', 'Abigale']

There are 7,944 names in total:

>>> print(len(names.words()))
7944
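
Since corpora can also serve as lookup dictionaries, here is a minimal sketch of checking whether a given name appears in the corpus; the corpus ships as two files, and Alice is just an arbitrary test name:

>>> names.fileids()
['female.txt', 'male.txt']
>>> 'Alice' in names.words('female.txt')
True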

Other corpora are also fun to explore.

Besides this easy-to-use and abundant pool of corpora, and more importantly, NLTK excels at many NLP and text analysis tasks, including the following:

  • Tokenization: Given a text sequence, tokenization is the task of breaking it into fragments, typically at whitespace boundaries. Meanwhile, certain characters are usually removed, such as punctuation marks, digits, and emoticons. The resulting fragments are the so-called tokens used for further processing. Moreover, tokens consisting of a single word are also called unigrams in computational linguistics; bigrams are composed of two consecutive words, trigrams of three consecutive words, and n-grams of n consecutive words. A tokenization example is sketched right after this list.
  • POS tagging: We can apply an off-the-shelf tagger or combine multiple NLTK taggers to customize the tagging process. It is easy to directly use the built-in tagging function pos_tag, as in pos_tag(input_tokens), for instance. Behind the scenes, however, it is actually a prediction from a prebuilt supervised learning model, which is trained on a large corpus of correctly tagged words. The sketch after this list also applies pos_tag to the tokens it produces.
  • Named entity recognition: Given a text sequence, the task of named entity recognition is to locate and identify words or phrases that belong to definite categories, such as the names of persons, companies, and locations. We will briefly mention it again in the next chapter.
  • Stemming and lemmatization: Stemming is the process of reverting an inflected or derived word to its root form; for instance, machine is the root of machines, and learn is the root of learning and learned. Lemmatization is a more cautious version of stemming: it considers the POS of the word when reducing it. We will discuss these two text preprocessing techniques in further detail shortly. For now, let's quickly take a look at how tokenization, tagging, stemming, and lemmatization are carried out in NLTK.
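
First, a minimal sketch of tokenization and POS tagging, assuming the tokenizer and tagger models are already available (they come with the full collection installed earlier); the sample sentence is made up for illustration, and the exact tags may vary slightly across NLTK versions:

>>> from nltk.tokenize import word_tokenize
>>> from nltk import pos_tag
>>> tokens = word_tokenize('I am reading a book on Python machine learning.')
>>> tokens
['I', 'am', 'reading', 'a', 'book', 'on', 'Python', 'machine', 'learning', '.']
>>> pos_tag(tokens)[:3]
[('I', 'PRP'), ('am', 'VBP'), ('reading', 'VBG')]

Now, on to stemming and lemmatization.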

Import one of the three built-in stemmer algorithms (LancasterStemmer and SnowballStemmer are the other two), and initialize a stemmer:

>>> from nltk.stem.porter import PorterStemmer
>>> porter_stemmer = PorterStemmer()

Stem machines and learning:

>>> porter_stemmer.stem('machines')
'machin'
>>> porter_stemmer.stem('learning')
'learn'

Note that stemming sometimes involves chopping off letters when necessary, as we can see with machin.

Now, import a lemmatization algorithm based on the built-in WordNet corpus and initialize a lemmatizer:

>>> from nltk.stem import WordNetLemmatizer
>>> lemmatizer = WordNetLemmatizer()

Similarly, lemmatize machines and learning:

>>> lemmatizer.lemmatize('machines')
'machine'
>>> lemmatizer.lemmatize('learning')
'learning'

Why is learning unchanged? It turns out that this algorithm lemmatizes words as nouns by default.
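
To confirm this, we can pass the POS explicitly; here 'v' tells the lemmatizer to treat the word as a verb:

>>> lemmatizer.lemmatize('learning', pos='v')
'learn'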

Gensim (https://radimrehurek.com/gensim/), developed by Radim Rehurek, has gained popularity in recent years. It was initially designed in 2008 to generate a list of articles similar to a given article, hence the name of the library (generate similar, hence Gensim). It was later drastically improved by Radim Rehurek in terms of efficiency and scalability. Again, we can easily install it via pip by running the pip install --upgrade gensim command in the terminal. Just make sure the dependencies NumPy and SciPy are already installed.

Gensim is famous for its powerful semantic and topic modeling algorithms. Topic modeling is a typical text-mining task of discovering the hidden semantic structure in a document, where semantic structure, in plain English, means the distribution of word occurrences. It is obviously an unsupervised learning task: all we need to do is feed in plain text and let the model figure out the abstract topics.
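
As a taste of what this looks like in practice, here is a minimal sketch of fitting an LDA topic model with Gensim on a toy corpus; the tiny pre-tokenized documents and the choice of two topics are made up for illustration, and real use would involve far more text and preprocessing:

>>> from gensim import corpora, models
>>> docs = [['python', 'machine', 'learning', 'model'],
...         ['python', 'nlp', 'text', 'mining'],
...         ['stock', 'market', 'price', 'trading'],
...         ['market', 'trading', 'investment', 'price']]
>>> dictionary = corpora.Dictionary(docs)               # map each token to an integer id
>>> corpus = [dictionary.doc2bow(doc) for doc in docs]  # bag-of-words representation
>>> lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary)
>>> topics = lda.print_topics()  # each topic is a weighted mix of words; exact weights vary from run to run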

In addition to these robust semantic modeling methods, Gensim also provides the following functionalities:

  • Similarity querying, which retrieves objects that are similar to the given query object
  • Word vectorization, which is an innovative way to represent words while preserving word co-occurrence features (a sketch follows this list)
  • Distributed computing, which makes it feasible to efficiently learn from millions of documents
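
Word vectorization, for instance, is available through Gensim's Word2Vec implementation. The snippet below is a minimal sketch on a toy corpus; keyword arguments such as the vector dimensionality are left at their defaults (100 dimensions) because their names differ slightly across Gensim versions:

>>> from gensim.models import Word2Vec
>>> sentences = [['machine', 'learning', 'with', 'python'],
...              ['text', 'mining', 'with', 'python'],
...              ['machine', 'learning', 'for', 'text']]
>>> model = Word2Vec(sentences, min_count=1)  # train word vectors on the toy corpus
>>> len(model.wv['python'])                   # each word is now a dense 100-dimensional vector
100
>>> neighbors = model.wv.most_similar('python', topn=2)  # nearest neighbors; noisy on such a tiny corpus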

TextBlob (https://textblob.readthedocs.io/en/dev/) is a relatively new library built on top of NLTK. It simplifies NLP and text analysis with easy-to-use built-in functions and methods, as well as wrappers around common tasks. We can install TextBlob by running the pip install -U textblob command in the terminal.

Additionally, TextBlob has some useful features that are currently not available in NLTK, such as spell checking and correction, language detection, and translation.
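
As a quick taste, here is a minimal sketch of spelling correction and sentiment scoring with TextBlob; the misspelled sample sentence is made up for illustration, the corrected output shown is the typical result, and the underlying corpora need to be downloaded once via python -m textblob.download_corpora:

>>> from textblob import TextBlob
>>> blob = TextBlob('I havv goood speling!')
>>> print(blob.correct())               # spelling correction
I have good spelling!
>>> polarity = blob.sentiment.polarity  # sentiment analysis: polarity ranges from -1.0 to 1.0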

Last but not least, as mentioned in the first chapter, scikit-learn is the main package we use throughout this entire book. Luckily, it provides all the text processing features we need, such as tokenization, in addition to its comprehensive machine learning functionalities. Plus, it comes with a built-in loader for the 20 newsgroups dataset.
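
For example, scikit-learn's CountVectorizer performs tokenization and builds a term-count matrix in one step; the two toy documents below are made up for illustration:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> docs = ['Machine learning with Python', 'Python for text mining']
>>> vectorizer = CountVectorizer()
>>> counts = vectorizer.fit_transform(docs)  # tokenize and count in one step
>>> sorted(vectorizer.vocabulary_)           # the learned vocabulary, lowercased by default
['for', 'learning', 'machine', 'mining', 'python', 'text', 'with']
>>> counts.toarray()                         # one row per document, one column per vocabulary word
array([[0, 1, 1, 0, 1, 0, 1],
       [1, 0, 0, 1, 1, 1, 0]])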

Now that the tools are available and properly installed, what about the data?
