Chapter 1. Getting Started

Introduction

Let’s begin by talking about the structure of this book. In the first part, we’ll go over the technologies and techniques we’ll be using with Spark NLP throughout this book. After that we’ll talk about the building blocks of Natural Language Processing (NLP). Finally, we’ll talk about NLP applications and systems.

In this book I’ll cover how to use Spark NLP, as well as fundamental natural language processing topics. By the end of this book, you should have a new software tool for working with natural language, Spark NLP, as well as a suite of techniques and some understanding of why these techniques work.

When working on an application that requires NLP, there are three perspectives you should keep in mind: the software developer’s perspective, the linguist’s perspective, and the data scientist’s perspective. The software developer’s perspective helps you focus on what your application needs to do; this grounds the work in terms of the product you want to create. The linguist’s perspective helps you understand what it is in the data that you want to extract. The data scientist’s perspective helps you see how you can extract the information you need from your data.

Here is a more detailed overview of the book:

  • PART I: INTRODUCTION
    • Getting Started covers setting up your environment so you can follow along with the examples and exercises in the book.
    • Natural Language Basics is a survey of some of the linguistic concepts that help in understanding why NLP techniques work, and how to use NLP techniques to get the information you need from language.
    • NLP on Apache Spark is an introduction to Apache Spark and, most germane, the Spark NLP library.
    • Deep Learning Basics is a survey of some of the deep learning concepts that we’ll be using in this book. This book is not a tutorial on deep learning, but I’ll try to explain these techniques when necessary.
  • PART II: BUILDING BLOCKS
    • Processing Words covers the classic text processing techniques. Since NLP applications generally require a pipeline of transformations, understanding the early steps well is a necessity.
    • Information Retrieval covers the basic concepts of search engines. Not only is this a classic example of an application that uses text, but many NLP techniques used in other kinds of applications ultimately come from information retrieval.
    • Classification and Regression covers some well established techniques of using text features for classification and regression tasks.
    • Sequence Modeling introduces techniques used in modeling natural language text data as sequences. Since natural language is a sequence, these techniques are fundamental.
    • Information Extraction shows how we can extract facts and relationships from text.
    • Topic Modeling demonstrates techniques for finding topics in documents. Topic modeling is a great way to explore text.
    • Embeddings discusses one of the most popular modern techniques for creating features from text.
  • PART III: APPLICATIONS
    • Sentiment Analysis & Emotion Detection covers some basic applications that require identifying the sentiment of the author of a text, e.g., whether a movie review is positive or negative.
    • Building Knowledge Graphs explores creating an ontology, a collection of facts and relationships organized in a graph-like manner, from a corpus.
    • Semantic Search goes deeper into what can be done to improve a search engine. Improving a search engine is not just about improving the ranker; it’s also about helping the user with features like facets.
    • Conversational Chatbots demonstrates how to create a chatbot, a fun and interesting kind of application that is becoming more and more popular.
    • De-identification covers some basic techniques for removing personal information from text. With more governments and institutions looking at laws and regulations to protect people’s privacy, this is something that is likely to be relevant to any application working with personally identifying information (PII).
    • Optical Character Recognition (OCR) introduces converting text stored as images into text data. Not all texts are stored as text data; handwriting and old texts are examples of texts we may receive as images. Sometimes we also have to deal with non-handwritten text stored in images, like PDF images and scans of printed documents.
    • Briefs on Other NLP Tasks is a potpourri of topics that are too complex to introduce in a chapter, or are too arcane to cover in much depth in this book.
  • PART IV: BUILDING NLP SYSTEMS
    • Supporting Multiple Languages explores topics that an application creator should consider when preparing to work with multiple languages.
    • Human Labeling: Collecting Quality Training Data covers ways to use humans to gather data about texts. Being able to efficiently use humans to augment data can make an otherwise impossible project feasible.
    • Training and Publishing NLP Models covers creating models, Spark NLP pipelines and TensorFlow graphs, and publishing them for use in production.
    • Scaling and Performance Optimization discusses some of the performance concerns developers should keep in mind when designing a system that uses text.
    • Operating NLP Systems in Production covers quality and monitoring concerns that are unique to NLP applications.

Other Tools

In addition to Spark NLP, Apache Spark, and TensorFlow, we’ll make use of a number of other tools:

  • Anaconda is an open source distribution of Python (and R, which we are not using). It is maintained by Anaconda, Inc., which also offers an enterprise platform and training courses. We’ll use the Anaconda package manager conda to create our environment. Learn more at https://www.anaconda.com/
  • Jupyter Notebook is a tool for executing code in the browser. Jupyter Notebook also allows you to write Markdown and display visualizations, all in the browser. In fact, this book was written as Jupyter Notebooks before being converted to a publishable format. Jupyter Notebook is maintained by Project Jupyter, a non-profit dedicated to supporting interactive data science tools. Learn more at https://jupyter.org/
  • Docker is a tool for easily creating containers, which are lightweight virtual-machine-like environments. We’ll use Docker as an alternative to setting up the environment with conda. It is maintained by Docker, Inc. Learn more at https://www.docker.com/

Setting up your environment

In this book, almost every chapter has exercises, so it is useful to make sure your environment is working from the beginning. We’ll use Jupyter notebooks in this book, and the kernel we’ll use is the baseline Python 3.6 kernel. The instructions here use Anaconda to set up a Python virtual environment.

You can also use the `johnsnowlabs/spark-nlp-workshop` docker image for the necessary environment.

These instructions were created from the setup process for Ubuntu. There are additional setup instructions online at the project’s GitHub page.

Prerequisites

  1. Anaconda
  2. Apache Spark
    • To set up Apache Spark, follow the instructions at https://spark.apache.org/docs/latest/
    • Make sure that SPARK_HOME is set to the location of your Apache Spark installation (see the quick check below)
    • These examples were written against Apache Spark 2.4
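
If you want to verify from inside a notebook that SPARK_HOME is visible to your Python process, a minimal check like the following is enough; it only reads the environment and prints a warning if the variable is missing.

import os

# Quick sanity check: make sure SPARK_HOME is visible to this process.
spark_home = os.environ.get('SPARK_HOME')
if spark_home is None:
    print('SPARK_HOME is not set; the Spark-based examples will not work')
else:
    print('Using the Spark installation at', spark_home)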

Optional: Set up a password for your Jupyter notebook server - https://jupyter-notebook.readthedocs.io/en/stable/public_server.html#automatic-password-setup

Starting Apache Spark

$ echo $SPARK_HOME
/path/to/your/spark/installation
$ spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
...
Spark context Web UI available at localhost:4040
Spark context available as 'sc' (master = local[*], app id = ...).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/

Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_102)
Type in expressions to have them evaluated.
Type :help for more information.

scala>

Checking out the code

  1. Go to the GitHub repo for this project: https://github.com/alexander-n-thomas/spark-nlp-book
  2. To check out the code and set up the environment, run the following commands in a terminal
    1. Clone the repo
      git clone https://github.com/alexander-n-thomas/spark-nlp-book.git
      
    2. Create the conda environment - this will take a while
      conda env create -f environment.yml
      
    3. Activate the new environment
      source activate spark-nlp-book
      
    4. Create the kernel for this environment
      ipython kernel install --user --name=sparknlpbook
      
    5. Launch the notebook server
      jupyter notebook
      
    6. Go to your notebook page at localhost:8888

Getting Familiar with Apache Spark

Now that we’re all set up, let’s start using Spark NLP! We will be using the 20 Newsgroups data set from the University of California, Irvine Machine Learning Repository. The data can be found at https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups. For this first example, we’ll use the mini_newsgroups data set found at https://archive.ics.uci.edu/ml/machine-learning-databases/20newsgroups-mld/mini_newsgroups.tar.gz. Download this tar file and extract it into the data folder for this project.
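
If you’d rather script the download and extraction than do it by hand, a minimal sketch using only the Python standard library works; the destination folder data is the project’s data directory mentioned above.

import os
import tarfile
import urllib.request

# Download the mini_newsgroups archive and extract it into the project's
# data folder, producing data/mini_newsgroups.
url = ('https://archive.ics.uci.edu/ml/machine-learning-databases/'
       '20newsgroups-mld/mini_newsgroups.tar.gz')
tar_path = os.path.join('data', 'mini_newsgroups.tar.gz')
os.makedirs('data', exist_ok=True)
urllib.request.urlretrieve(url, tar_path)
with tarfile.open(tar_path, 'r:gz') as tar:
    tar.extractall('data')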

! ls ./data/mini_newsgroups
alt.atheism		  rec.autos	      sci.space
comp.graphics		  rec.motorcycles     soc.religion.christian
comp.os.ms-windows.misc   rec.sport.baseball  talk.politics.guns
comp.sys.ibm.pc.hardware  rec.sport.hockey    talk.politics.mideast
comp.sys.mac.hardware	  sci.crypt	      talk.politics.misc
comp.windows.x		  sci.electronics     talk.religion.misc
misc.forsale		  sci.med

Starting Apache Spark with Spark NLP

There are many ways we can use Apache Spark from Jupyter notebooks. We could use a specialized kernel, but I generally prefer using a simple kernel. Fortunately, Spark NLP gives us an easy way to start up.

import sparknlp
import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as fun
from pyspark.sql.types import *

%matplotlib inline
import matplotlib.pyplot as plt

# the Spark NLP package to load from Spark Packages
packages = ','.join([
    "JohnSnowLabs:spark-nlp:1.6.3",
])

spark_conf = SparkConf()
spark_conf = spark_conf.setAppName('spark-nlp-book-p1c1')
spark_conf = spark_conf.setMaster('local[*]')
spark_conf = spark_conf.set("spark.jars.packages", packages)

# build the SparkSession we'll use throughout the chapter
spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
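
As an aside, newer releases of Spark NLP also ship a convenience helper that builds a suitably configured SparkSession for you. If your installed version provides it, the setup above can be reduced to a couple of lines; this is a sketch assuming sparknlp.start() exists in your version.

import sparknlp

# returns a SparkSession already configured with the Spark NLP package
spark = sparknlp.start()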

Loading & viewing data in Apache Spark

Let’s look at how we can load data with Apache Spark, and then some ways we can view that data.

import os

mini_newsgroups_path = os.path.join('data', 'mini_newsgroups')
# read every file in every newsgroup directory as a (filename, text) pair;
# wholeTextFiles accepts a glob, so we point it at the subdirectories
texts = spark.sparkContext.wholeTextFiles(
    os.path.join(mini_newsgroups_path, '*'))

schema = StructType([
    StructField('filename', StringType()),
    StructField('text', StringType()),
])
texts_df = spark.createDataFrame(texts, schema)
texts_df.show()
texts_df.show()
+--------------------+--------------------+
|            filename|                text|
+--------------------+--------------------+
|file:/home/alext/...|Path: cantaloupe....|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Xref: cantaloupe....|
|file:/home/alext/...|Path: cantaloupe....|
|file:/home/alext/...|Xref: cantaloupe....|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Xref: cantaloupe....|
|file:/home/alext/...|Path: cantaloupe....|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Path: cantaloupe....|
|file:/home/alext/...|Path: cantaloupe....|
|file:/home/alext/...|Path: cantaloupe....|
|file:/home/alext/...|Xref: cantaloupe....|
|file:/home/alext/...|Path: cantaloupe....|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Newsgroups: sci.e...|
+--------------------+--------------------+
only showing top 20 rows

Looking at the data is important in any data science project. When working with structured data, especially numerical data, it is common to explore data with aggregates. This is necessary because data sets are large, and looking at a small number of examples can easily misrepresent the data. Natural language data complicates this. On one hand, humans are really good at interpreting language; on the other, humans are also really good at jumping to conclusions and making hasty generalizations. So we still have the problem of creating a representative summary for large data sets. We’ll talk about some techniques for doing this in the chapters discussing topic modeling and embeddings.

For now, let’s talk about ways we can look at a small amount of data in DataFrames. As you can see in the code example above, we can show the output of a DataFrame using .show(). Let’s look at its arguments:

  1. n Number of rows to show.
  2. truncate If set to True, truncates strings longer than 20 characters by default. If set to a number greater than one, truncates long strings to length truncate and aligns cells right.
  3. vertical If set to True, print output rows vertically (one line per column value).

Let’s try using some of these arguments:

texts_df.show(n=5, truncate=100, vertical=True)
-RECORD 0--------------------------------------------------------------------------------------------------------
 filename | file:/home/alext/projects/spark-nlp-book/data/mini_newsgroups/sci.electronics/54165                  
 text     | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zap... 
-RECORD 1--------------------------------------------------------------------------------------------------------
 filename | file:/home/alext/projects/spark-nlp-book/data/mini_newsgroups/sci.electronics/54057                  
 text     | Newsgroups: sci.electronics
Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cm... 
-RECORD 2--------------------------------------------------------------------------------------------------------
 filename | file:/home/alext/projects/spark-nlp-book/data/mini_newsgroups/sci.electronics/53712                  
 text     | Newsgroups: sci.electronics
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!how... 
-RECORD 3--------------------------------------------------------------------------------------------------------
 filename | file:/home/alext/projects/spark-nlp-book/data/mini_newsgroups/sci.electronics/53529                  
 text     | Newsgroups: sci.electronics
Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.c... 
-RECORD 4--------------------------------------------------------------------------------------------------------
 filename | file:/home/alext/projects/spark-nlp-book/data/mini_newsgroups/sci.electronics/54042                  
 text     | Xref: cantaloupe.srv.cs.cmu.edu comp.os.msdos.programmer:23292 alt.msdos.programmer:6797 sci.elec... 
only showing top 5 rows 

The .show() method is good for a quick view of data, but it doesn’t work as well when the data is complicated. In the Jupyter environment, there are some special integrations with Pandas, so Pandas DataFrames are displayed a little more nicely. Here’s an example.

texts_df.limit(5).toPandas()
Table 1-1. pandas DataFrame output in jupyter notebook
  filename text
0 file:/home/alext/projects/spark-nlp-book/data... Path: cantaloupe.srv.cs.cmu.edu!magne...
1 file:/home/alext/projects/spark-nlp-book/data... Newsgroups: sci.electronics Path: cant...
2 file:/home/alext/projects/spark-nlp-book/data... Newsgroups: sci.electronics Path: cant...
3 file:/home/alext/projects/spark-nlp-book/data... Newsgroups: sci.electronics Path: cant...
4 file:/home/alext/projects/spark-nlp-book/data... Xref: cantaloupe.srv.cs.cmu.edu comp.o...

Notice the use of .limit(). The .toPandas() method pulls the Spark DataFrame into memory to create a Pandas DataFrame. Converting to Pandas can also be useful when you want to use other tools available in Python, since the Pandas DataFrame is widely supported in the Python ecosystem.

For other types of visualizations, we’ll primarily use the Python libraries matplotlib and seaborn. In order to use these libraries we will need to create Pandas DataFrames, so we will either aggregate or sample Spark DataFrames down to a manageable size first.
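
For example, here is a minimal sketch of the aggregate-first pattern: compute a small summary in Spark, and only bring that summary into Pandas. The text_length column name is just an illustrative choice, not something defined elsewhere in the chapter.

# compute document lengths in Spark, then collect only the small
# summary table into Pandas for display or plotting
length_df = texts_df.select(fun.length('text').alias('text_length'))
length_summary = length_df.describe().toPandas()
print(length_summary)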

Hello World with Spark NLP

We have some data, so let’s use Spark NLP to process it. First, let’s extract the newsgroup name from the filename. The newsgroup is the name of the last folder containing the file; we’ll grab it by position here, and sketch a more robust alternative after the table below.

texts_df = texts_df.withColumn(
    'newsgroup', 
    fun.split('filename', '/').getItem(7)
)
texts_df.limit(5).toPandas()
Table 1-2. table with newsgroup column
  filename text newsgroup
0 file:/home/alext/projects/spark... Path: cantaloupe.srv.cs.cmu.edu!mag... sci.electronics
1 file:/home/alext/projects/spark... Newsgroups: sci.electronics Path: ca... sci.electronics
2 file:/home/alext/projects/spark... Newsgroups: sci.electronics Path: ca... sci.electronics
3 file:/home/alext/projects/spark... Newsgroups: sci.electronics Path: ca... sci.electronics
4 file:/home/alext/projects/spark... Xref: cantaloupe.srv.cs.cmu.edu comp... sci.electronics
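
Note that .getItem(7) depends on the exact depth of the paths on my machine. If you are running Spark 2.4 or later, a slightly more robust sketch is to take the second-to-last path element with element_at; this is an alternative for illustration, not part of the pipeline used in the rest of the chapter.

# alternative (Spark 2.4+): take the second-to-last path element, which is
# the newsgroup directory, regardless of how deep the data folder sits
texts_df = texts_df.withColumn(
    'newsgroup',
    fun.element_at(fun.split('filename', '/'), -2)
)

Either way, we can now count the documents per newsgroup and plot the result.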
newsgroup_counts = texts_df.groupBy('newsgroup').count().toPandas()
newsgroup_counts.plot(kind='bar', figsize=(10, 5))
plt.xticks(
    ticks=range(len(newsgroup_counts)), 
    labels=newsgroup_counts['newsgroup']
)
plt.show()
Figure 1-1. mini-newsgroups counts

Because the mini_newsgroups data set is a balanced subset of the 20 Newsgroups data set, we have the same number of documents in each newsgroup. Now, let’s use a pretrained Spark NLP pipeline.

from sparknlp.pretrained import PretrainedPipeline

The explain_document_ml is a pretrained pipeline that performs basic text processing steps. In order to understand what explain_document_ml is doing, we need a brief description of what annotators are. An annotator is a representation of a specific NLP technique. We will go into more depth in the NLP on Apache Spark chapter.

  • sentence (segment) - sentence detector or sentence segmenter or sentence tokenizer
  • token - tokenizer
  • spell - spell checker
  • lemma - lemmatizer
  • stem - stemmer
  • POS tag - POS tagger

The annotators work on a document, which is the text, its associated metadata, and any previously discovered annotations. This design lets annotators reuse the work of previous annotators. The downside is that it is more complex than libraries like NLTK, which are loosely coupled collections of NLP functions.

The explain_document_ml has one Transformer and six annotators:

  1. DocumentAssembler: a Transformer that creates a column that contains documents
  2. Sentence Segmenter: an annotator that produces the sentences of the document
  3. Tokenizer: an annotator that produces the tokens of the sentences
  4. SpellChecker: an annotator that produces the spelling-corrected tokens
  5. Stemmer: an annotator that produces the stems of the tokens
  6. Lemmatizer: an annotator that produces the lemmas of the tokens.
  7. POS Tagger: an annotator that produces the parts of speech of the associated tokens

There are some new terms introduced here that we’ll discuss more in upcoming chapters.

pipeline = PretrainedPipeline('explain_document_ml', lang='en')
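
If you are curious which stages ended up in the downloaded pipeline, recent versions of Spark NLP expose the underlying Spark ML PipelineModel on the PretrainedPipeline object; the following sketch assumes your version provides the .model attribute.

# hedged: assumes this Spark NLP version exposes the underlying
# PipelineModel as pipeline.model
for stage in pipeline.model.stages:
    print(stage)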

The .annotate() method of the pretrained pipeline can be used to annotate single strings as well as DataFrames. Let’s look at what it produces.

pipeline.annotate('Hellu wrold!')
{'document': ['Hellu wrold!'],
'spell': ['Hello', 'world', '!'],
'pos': ['UH', 'NN', '.'],
'lemmas': ['Hello', 'world', '!'],
'token': ['Hellu', 'wrold', '!'],
'stems': ['hello', 'world', '!'],
'sentence': ['Hellu wrold!']}

This is a good amount of additional information. It brings up something that you will want to keep in mind: annotations can produce a lot of extra data. Let’s look at the schema of the raw data.

texts_df.printSchema()
root
 |-- filename: string (nullable = true)
 |-- text: string (nullable = true)
 |-- newsgroup: string (nullable = true)

Now, let’s annotate our DataFrame and look at the new schema.

procd_texts_df = pipeline.annotate(texts_df, 'text')
procd_texts_df.printSchema()
root
 |-- filename: string (nullable = true)
 |-- text: string (nullable = true)
 |-- newsgroup: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |    |    |-- sentence_embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
...

That is quite complex! To break it down, let’s look at the token column. It is an Array-type column, and each element is a Struct with the following fields (see the example after this list):

  • annotatorType: the type of annotation
  • begin: the starting character position of the annotation
  • end: the character position after the end of the annotation
  • result: the output of the annotator
  • metadata: a Map from String to String containing additional, potentially helpful, information about the annotation
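
Because these are ordinary Spark Struct fields, we can pull out individual pieces with plain DataFrame operations. For instance, here is a minimal sketch that extracts just the token strings, i.e. the result field of the token column.

# selecting a field of an array-of-structs column yields an array of that
# field, so token.result is the array of token strings for each document
procd_texts_df.select('token.result').show(n=2, truncate=100)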

Now let’s look at some full rows of the data using .show():

procd_texts_df.show(n=2)
+--------------------+--------------------+---------------+---------
-----------+--------------------+--------------------+--------------
------+--------------------+--------------------+-------------------
-+
|            filename|                text|      newsgroup|         
   document|            sentence|               token|              
 spell|              lemmas|               stems|                 po
s|
+--------------------+--------------------+---------------+---------
-----------+--------------------+--------------------+--------------
------+--------------------+--------------------+-------------------
-+
|file:/home/alext/...|Path: cantaloupe....|sci.electronics|[[documen
t, 0, 90...|[[document, 0, 46...|[[token, 0, 3, Pa...|[[token, 0, 3,
 Pa...|[[token, 0, 3, Pa...|[[token, 0, 3, pa...|[[pos, 0, 3, NNP,..
.|
|file:/home/alext/...|Newsgroups: sci.e...|sci.electronics|[[documen
t, 0, 19...|[[document, 0, 40...|[[token, 0, 9, Ne...|[[token, 0, 9,
 Ne...|[[token, 0, 9, Ne...|[[token, 0, 9, ne...|[[pos, 0, 9, NNP,..
.|
+--------------------+--------------------+---------------+---------
-----------+--------------------+--------------------+--------------
------+--------------------+--------------------+-------------------
-+
only showing top 2 rows

This is not very readable. Not only does the automatic formatting handle this data poorly, but we can hardly see our annotations. Let’s try using some other arguments.

procd_texts_df.show(n=2, truncate=100, vertical=True)
-RECORD 0-----------------------------------------------------------
----------------------------------------------
 filename  | file:/home/alext/projects/spark-nlp-book/data/mini_news
groups/sci.electronics/54165                  
 text      | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.e
du!news.sei.cmu.edu!cis.ohio-state.edu!zap... 
 newsgroup | sci.electronics                                        
                                              
 document  | [[document, 0, 903, Path: cantaloupe.srv.cs.cmu.edu!mag
nesium.club.cc.cmu.edu!news.sei.cmu.edu!ci... 
 sentence  | [[document, 0, 468, Path: cantaloupe.srv.cs.cmu.edu!mag
nesium.club.cc.cmu.edu!news.sei.cmu.edu!ci... 
 token     | [[token, 0, 3, Path, [sentence -> 0], [], []], [token, 
4, 4, :, [sentence -> 0], [], []], [token,... 
 spell     | [[token, 0, 3, Path, [sentence -> 0], [], []], [token, 
4, 4, :, [sentence -> 0], [], []], [token,... 
 lemmas    | [[token, 0, 3, Path, [sentence -> 0], [], []], [token, 
4, 4, :, [sentence -> 0], [], []], [token,... 
 stems     | [[token, 0, 3, path, [sentence -> 0], [], []], [token, 
4, 4, :, [sentence -> 0], [], []], [token,... 
 pos       | [[pos, 0, 3, NNP, [word -> Path], [], []], [pos, 4, 4, 
:, [word -> :], [], []], [pos, 6, 157, JJ,... 
-RECORD 1-----------------------------------------------------------
----------------------------------------------
 filename  | file:/home/alext/projects/spark-nlp-book/data/mini_news
groups/sci.electronics/54057                  
 text      | Newsgroups: sci.electronics
Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.c
m... 
 newsgroup | sci.electronics                                        
                                              
 document  | [[document, 0, 1944, Newsgroups: sci.electronics Path: 
cantaloupe.srv.cs.cmu.edu!magnesium.club.c... 
 sentence  | [[document, 0, 408, Newsgroups: sci.electronics Path: c
antaloupe.srv.cs.cmu.edu!magnesium.club.cc... 
 token     | [[token, 0, 9, Newsgroups, [sentence -> 0], [], []], [t
oken, 10, 10, :, [sentence -> 0], [], []],... 
 spell     | [[token, 0, 9, Newsgroups, [sentence -> 0], [], []], [t
oken, 10, 10, :, [sentence -> 0], [], []],... 
 lemmas    | [[token, 0, 9, Newsgroups, [sentence -> 0], [], []], [t
oken, 10, 10, :, [sentence -> 0], [], []],... 
 stems     | [[token, 0, 9, newsgroup, [sentence -> 0], [], []], [to
ken, 10, 10, :, [sentence -> 0], [], []], ... 
 pos       | [[pos, 0, 9, NNP, [word -> Newsgroups], [], []], [pos, 
10, 10, :, [word -> :], [], []], [pos, 12,... 
only showing top 2 rows

Better, but this is still not useful for getting a general understanding of our corpus. We at least have a glimpse of what our pipeline is doing.

Now we need to pull out the information we might want to use in other processes; this is what the Finisher Transformer is for. The Finisher takes annotations and extracts the pieces of data that we will be using in downstream processes. For now, let’s pull out all the lemmas and put them into a String separated by spaces.

from sparknlp import Finisher

finisher = Finisher()
# take the lemma annotations as input
finisher = finisher.setInputCols(['lemmas'])
# separate the lemmas by a single space in the output string
finisher = finisher.setAnnotationSplitSymbol(' ')
finished_texts_df = finisher.transform(procd_texts_df)
finished_texts_df.show(n=1, truncate=100, vertical=True)
-RECORD 0-----------------------------------------------------------
----------------------------------------------------
filename        | file:/home/alext/projects/spark-nlp-book/data/mini
_newsgroups/sci.electronics/54165                  
text            | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.
cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zap...
newsgroup       | sci.electronics                                   
                                                   
finished_lemmas | [Path, :, cantaloupe.srv.cs.cmu.edu!magnesium.club
.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu...
only showing top 1 row

Normally, we’ll be using the .setOutputAsArray(True) option so that the output is an Array instead of a String, as in the sketch below.
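
Here is a minimal sketch of that configuration, using the same input column and just switching the output type; the array_finisher name is only for illustration.

# configure the Finisher to emit arrays of strings rather than one
# space-separated string per document
array_finisher = Finisher() \
    .setInputCols(['lemmas']) \
    .setOutputAsArray(True)
finished_array_df = array_finisher.transform(procd_texts_df)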

Let’s look at the final result on the first document.

finished_texts_df.select('finished_lemmas').take(1)
[Row(finished_lemmas=['Path', ':', 'cantaloupe.srv.cs.cmu.edu!magnes
ium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.o
hio-state.edu!news.acns.nwu.edu!uicvm.uic.edu!u19250', 'Organization
', ':', 'University', 'of', 'Illinois', 'at', 'Chicago', ',', 'acade
mic', 'Computer', 'Center', 'Date', ':', 'Sat', ',', '24', 'Apr', '1
993', '14:28:35', 'CDT', 'From', ':', '<[email protected]>', 'Mes
sage-ID', ':', '<[email protected]>', 'Newsgroups', '
:', 'sci.electronics', 'Subject', ':', 'multiple', 'input', 'for', '
PC', 'Lines', ':', '8', 'Can', 'anyone', 'offer', 'a', 'suggestion',
 'on', 'a', 'problem', 'I', 'be', 'have', '?', 'I', 'have', 'several
', 'board', 'whose', 'sole', 'purpose', 'be', 'to', 'decode', 'DTMF'
, 'tone', 'and', 'send', 'the', 'resultant', 'in', 'ASCII', 'to', 'a
', 'PC', '.', 'These', 'board', 'run', 'on', 'the', 'serial', 'inter
face', '.', 'I', 'need', 'to', 'run', '*', 'of', 'the', 'board', 'so
mewhat', 'simultaneously', '.', 'I', 'need', 'to', 'be', 'able', 'to
', 'ho', 'ok', 'they', 'up', 'to', 'a', 'PC', '>', 'The', 'problem',
 'be', ',', 'how', 'do', 'I', 'hook', 'up', '8', '+', 'serial', 'dev
ice', 'to', 'one', 'PC', 'inexpensively', ',', 'so', 'that', 'all', 
'can', 'send', 'data', 'simultaneously', '(', 'or', 'close', 'to', '
it', ')', '?', 'Any', 'help', 'would', 'be', 'greatly', 'appreciate'
, '!', 'Achin', 'Single'])]

It doesn’t look like much has been done here, but there is still a lot to unpack. In the next chapter, we will explain some basics of linguistics that will help us understand what these annotators are doing.
