Let’s begin by talking about the structure of this book. In the first part, we’ll go over the technologies and techniques we’ll be using with Spark NLP throughout this book. After that we’ll talk about the building blocks of Natural Language Processing (NLP). Finally, we’ll talk about NLP applications and systems.
In this book I’ll cover how to use Spark NLP, as well as fundamental natural language processing topics. Hopefully, at the end of this book you’ll have a new software tool for working with natural language, Spark NLP, along with a suite of techniques and some understanding of why these techniques work.
When working on an application that requires NLP, there are three perspectives you should keep in mind: the software developer’s perspective, the linguist’s perspective, and the data scientist’s perspective. The software developer’s perspective helps you focus on what your application needs to do; this grounds the work in terms of the product you want to create. The linguist’s perspective helps you understand what it is in the data that you want to extract. The data scientist’s perspective helps you see how you can extract the information you need from your data.
Here is a more detailed overview of the book
In addition to Spark NLP, Apache Spark, and TensorFlow, we’ll make use of a number of other tools, such as `conda` to create our environment. Learn more at https://www.anaconda.com/
In this book, almost every chapter has exercises, so it is useful to make sure that the environment is working at the beginning. We’ll use Jupyter notebooks in this book, and the kernel we’ll use is the baseline Python 3.6 kernel. The instructions here use Continuum’s Anaconda to set up a Python virtual environment.
You can also use the `johnsnowlabs/spark-nlp-workshop` docker image for the necessary environment.
These instructions were created from the setup process for Ubuntu. There are additional setup instructions online at the project’s GitHub page.
Make sure that `SPARK_HOME` is set to the location of your Apache Spark installation.
Optional: Set up a password for your Jupyter notebook server: https://jupyter-notebook.readthedocs.io/en/stable/public_server.html#automatic-password-setup
$ echo $SPARK_HOME
/path/to/your/spark/installation
$ spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
...
Spark context Web UI available at localhost:4040
Spark context available as 'sc' (master = local[*], app id = ...).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.2
      /_/
Using Scala version 2.11.8 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_102)
Type in expressions to have them evaluated.
Type :help for more information.
scala>
git clone https://github.com/alexander-n-thomas/spark-nlp-book.git
conda env create -f environment.yml
source activate spark-nlp-book
ipython kernel install --user --name=sparknlpbook
jupyter notebook
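The `SPARK_HOME` check from the shell can also be done from Python, which is handy inside a notebook. The following is a minimal sketch, not part of the book's setup scripts; the helper names `find_spark_home` and `describe_spark_home` are hypothetical, and the check for `bin/spark-shell` assumes a standard Spark distribution layout.

```python
import os


def find_spark_home(environ=None):
    """Return the SPARK_HOME path if it is set and non-empty, else None."""
    environ = os.environ if environ is None else environ
    value = environ.get("SPARK_HOME", "").strip()
    return value or None


def describe_spark_home(environ=None):
    """Build a short human-readable status message about SPARK_HOME."""
    spark_home = find_spark_home(environ)
    if spark_home is None:
        return "SPARK_HOME is not set"
    # Assumption: a standard Spark distribution keeps its launcher in bin/
    launcher = os.path.join(spark_home, "bin", "spark-shell")
    if os.path.exists(launcher):
        return "SPARK_HOME looks valid: " + spark_home
    return "SPARK_HOME is set, but no bin/spark-shell was found in " + spark_home


print(describe_spark_home())
```

Passing a dictionary instead of the real environment makes the helpers easy to try out without changing any shell settings.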
Now that we’re all set up, let’s start using Spark NLP! We will be using the 20 Newsgroups data set from the University of California Irvine Machine Learning Repository. The data can be found at https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups. For this first example we use the mini_newsgroups data set found at https://archive.ics.uci.edu/ml/machine-learning-databases/20newsgroups-mld/mini_newsgroups.tar.gz. Download this tar file and extract it into the data folder for this project.
!ls ./data/mini_newsgroups
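If you prefer to extract the downloaded archive from Python rather than the command line, something like the following should work. This is a sketch under assumptions: `extract_archive` is a hypothetical helper, and it expects the tar file to have been downloaded already.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and return the entries now in dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    dest_root = os.path.realpath(dest_dir)
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            target = os.path.realpath(os.path.join(dest_dir, member.name))
            # Refuse members that would be written outside dest_dir
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError("unsafe path in archive: " + member.name)
        archive.extractall(dest_dir)
    return sorted(os.listdir(dest_dir))


# Example call, assuming the archive sits next to the notebook:
# extract_archive("mini_newsgroups.tar.gz", "data")
```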
There are many ways we can use Apache Spark from Jupyter notebooks. We could use a specialized kernel, but I generally prefer using a simple kernel. Fortunately, Spark NLP gives us an easy way to start up.
import sparknlp

import pyspark
from pyspark import SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as fun
from pyspark.sql.types import *

%matplotlib inline
import matplotlib.pyplot as plt

packages = ','.join([
    "JohnSnowLabs:spark-nlp:1.6.3",
])

spark_conf = SparkConf()
spark_conf = spark_conf.setAppName('spark-nlp-book-p1c1')
# run Spark locally, using all available cores
spark_conf = spark_conf.setMaster('local[*]')
spark_conf = spark_conf.set("spark.jars.packages", packages)

# create the session that later cells refer to as `spark`
spark = SparkSession.builder.config(conf=spark_conf).getOrCreate()
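The `spark.jars.packages` setting takes a comma-separated list of Maven coordinates, each written as `group:artifact:version`, which is why the configuration code joins a list with commas. As a small illustration, the sketch below validates coordinates before joining them; `join_packages` is a hypothetical helper, not part of Spark or Spark NLP.

```python
def join_packages(coordinates):
    """Validate Maven-style coordinates and join them with commas."""
    for coordinate in coordinates:
        parts = coordinate.split(":")
        # Each coordinate needs three non-empty parts: group, artifact, version
        if len(parts) != 3 or not all(parts):
            raise ValueError("expected group:artifact:version, got " + repr(coordinate))
    return ",".join(coordinates)


packages = join_packages(["JohnSnowLabs:spark-nlp:1.6.3"])
print(packages)
```

Catching a malformed coordinate here is cheaper than waiting for the Spark session to fail while resolving dependencies.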
Let’s look at how we can load data with Apache Spark, and then some ways we can view the data.
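import os

mini_newsgroups_path = os.path.join('data', 'mini_newsgroups')
# read every file as a (filename, text) pair; the '*' matches the newsgroup folders
texts = spark.sparkContext.wholeTextFiles(os.path.join(mini_newsgroups_path, '*'))

schema = StructType([
    StructField('filename', StringType()),
    StructField('text', StringType()),
])

texts_df = spark.createDataFrame(texts, schema)
texts_df.show()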
+--------------------+--------------------+
|            filename|                text|
+--------------------+--------------------+
|file:/home/alext/...|Path: cantaloupe....|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Xref: cantaloupe....|
|file:/home/alext/...|Path: cantaloupe....|
|file:/home/alext/...|Xref: cantaloupe....|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Xref: cantaloupe....|
|file:/home/alext/...|Path: cantaloupe....|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Path: cantaloupe....|
|file:/home/alext/...|Path: cantaloupe....|
|file:/home/alext/...|Path: cantaloupe....|
|file:/home/alext/...|Xref: cantaloupe....|
|file:/home/alext/...|Path: cantaloupe....|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Newsgroups: sci.e...|
|file:/home/alext/...|Newsgroups: sci.e...|
+--------------------+--------------------+
only showing top 20 rows
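The two-column schema above expects each row to be a (filename, text) pair. To make that shape concrete without Spark, here is a plain-Python sketch that builds the same kind of pairs from a directory tree. `read_text_files` is a hypothetical helper meant for small samples only, and the `latin-1` encoding is an assumption chosen so that decoding never fails on unusual bytes.

```python
import os


def read_text_files(root_dir):
    """Return a sorted list of (filename, text) pairs for every file under root_dir."""
    rows = []
    for dir_path, _dir_names, file_names in os.walk(root_dir):
        for file_name in file_names:
            path = os.path.join(dir_path, file_name)
            # latin-1 maps every byte to a character, so decoding cannot fail
            with open(path, encoding="latin-1") as handle:
                rows.append((path, handle.read()))
    return sorted(rows)


# Example call, assuming the data set was extracted into ./data:
# pairs = read_text_files(os.path.join("data", "mini_newsgroups"))
```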
Looking at the data is important in any data science project. When working with structured data, especially numerical data, it is common to explore data with aggregates. This is necessary since data sets are large, and looking at a small number of examples can easily misrepresent the data. Natural language data complicates this. On one hand, humans are really good at interpreting language; on the other, humans are also really good at jumping to conclusions and making hasty generalizations. So we still have the problem of creating a representative summary for large data sets. We’ll talk about some techniques to do this in the chapters discussing topic modeling and embeddings.
For now, let’s talk about ways we can look at a small amount of data in `DataFrame`s. As you can see in the above code example, we can show the output of a `DataFrame` using `.show()`. Let’s look at the arguments:
`n`: Number of rows to show.
`truncate`: If set to True, truncate strings longer than 20 chars by default. If set to a number greater than one, truncates long strings to length `truncate` and align cells right.
`vertical`: If set to True, print output rows vertically (one line per column value).
Let’s try using some of these arguments.
texts_df.show(n=5, truncate=100, vertical=True)
-RECORD 0--------------------------------------------------------------------------------------------------------
 filename | file:/home/alext/projects/spark-nlp-book/data/mini_newsgroups/sci.electronics/54165
 text     | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zap...
-RECORD 1--------------------------------------------------------------------------------------------------------
 filename | file:/home/alext/projects/spark-nlp-book/data/mini_newsgroups/sci.electronics/54057
 text     | Newsgroups: sci.electronics Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cm...
-RECORD 2--------------------------------------------------------------------------------------------------------
 filename | file:/home/alext/projects/spark-nlp-book/data/mini_newsgroups/sci.electronics/53712
 text     | Newsgroups: sci.electronics Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!noc.near.net!how...
-RECORD 3--------------------------------------------------------------------------------------------------------
 filename | file:/home/alext/projects/spark-nlp-book/data/mini_newsgroups/sci.electronics/53529
 text     | Newsgroups: sci.electronics Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.cs.cmu.edu!bb3.andrew.c...
-RECORD 4--------------------------------------------------------------------------------------------------------
 filename | file:/home/alext/projects/spark-nlp-book/data/mini_newsgroups/sci.electronics/54042
 text     | Xref: cantaloupe.srv.cs.cmu.edu comp.os.msdos.programmer:23292 alt.msdos.programmer:6797 sci.elec...
only showing top 5 rows
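Judging from the outputs above, a truncated cell is exactly `truncate` characters long and ends in three dots, which suggests that the last three characters are reserved for the ellipsis. The sketch below mimics that behavior in plain Python; `truncate_cell` is a hypothetical function for illustration, not Spark's own code.

```python
def truncate_cell(value, truncate=20):
    """Mimic how .show() appears to shorten a cell; an illustration only."""
    text = str(value)
    if truncate <= 0 or len(text) <= truncate:
        return text
    if truncate < 4:
        # Too short to fit an ellipsis, so just cut
        return text[:truncate]
    # Reserve three characters for the trailing "..."
    return text[: truncate - 3] + "..."


print(truncate_cell("file:/home/alext/projects/spark-nlp-book/data"))
# file:/home/alext/...
```

This also explains why every `filename` cell in the default output looks identical: the first 17 characters of each path are the same.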
The `.show()` method is good for a quick view of data, but it doesn’t work as well when the data is complicated. In the Jupyter environment, there are some special integrations with Pandas, and Pandas `DataFrame`s are displayed a little more nicely. Here’s an example.
texts_df.limit(5).toPandas()
| | filename | text |
|---|---|---|
| 0 | file:/home/alext/projects/spark-nlp-book/data... | Path: cantaloupe.srv.cs.cmu.edu!magne... |
| 1 | file:/home/alext/projects/spark-nlp-book/data... | Newsgroups: sci.electronics Path: cant... |
| 2 | file:/home/alext/projects/spark-nlp-book/data... | Newsgroups: sci.electronics Path: cant... |
| 3 | file:/home/alext/projects/spark-nlp-book/data... | Newsgroups: sci.electronics Path: cant... |
| 4 | file:/home/alext/projects/spark-nlp-book/data... | Xref: cantaloupe.srv.cs.cmu.edu comp.o... |
Notice the use of `.limit()`. The `.toPandas()` method pulls the Spark `DataFrame` into memory to create a Pandas `DataFrame`, so we limit the number of rows first. Converting to Pandas can also be useful for using tools available in Python, since the Pandas `DataFrame` is widely supported in the Python ecosystem.
For other types of visualizations, we’ll primarily use the Python libraries matplotlib and seaborn. In order to use these libraries we will need to create Pandas `DataFrame`s, so we will either aggregate or sample Spark `DataFrame`s into a manageable size.
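To make the two strategies concrete, here is a plain-Python sketch of aggregating and sampling a collection of rows. The rows and the helper names `count_by_key` and `sample_rows` are hypothetical stand-ins; with Spark the same ideas would be applied to the `DataFrame` before converting it to Pandas.

```python
import random
from collections import Counter


def count_by_key(rows, key):
    """Aggregate: count how many rows share each value of `key`."""
    return Counter(row[key] for row in rows)


def sample_rows(rows, k, seed=0):
    """Sample: pick at most k rows, reproducibly for a given seed."""
    rng = random.Random(seed)
    return rng.sample(rows, min(k, len(rows)))


# Hypothetical rows standing in for a much larger data set
rows = [{"newsgroup": "sci.electronics", "doc_id": i} for i in range(100)]
rows += [{"newsgroup": "sci.space", "doc_id": i} for i in range(100, 150)]

print(count_by_key(rows, "newsgroup"))
print(len(sample_rows(rows, 5)))
```

Aggregation keeps one number per group regardless of how many rows there are, while sampling keeps a fixed number of full rows; either way the result is small enough to plot.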
We have some data, so let’s use Spark NLP to process it. First, let’s extract the newsgroup name from the filename. The newsgroup is the last folder in the filename.
texts_df = texts_df.withColumn(
    'newsgroup',
    fun.split('filename', '/').getItem(7)
)
texts_df.limit(5).toPandas()
| | filename | text | newsgroup |
|---|---|---|---|
| 0 | file:/home/alext/projects/spark... | Path: cantaloupe.srv.cs.cmu.edu!mag... | sci.electronics |
| 1 | file:/home/alext/projects/spark... | Newsgroups: sci.electronics Path: ca... | sci.electronics |
| 2 | file:/home/alext/projects/spark... | Newsgroups: sci.electronics Path: ca... | sci.electronics |
| 3 | file:/home/alext/projects/spark... | Newsgroups: sci.electronics Path: ca... | sci.electronics |
| 4 | file:/home/alext/projects/spark... | Xref: cantaloupe.srv.cs.cmu.edu comp... | sci.electronics |
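Note that the index 7 in `getItem(7)` depends on how deep the project folder sits on the author's machine, so it will likely differ on yours. One way to avoid the fixed index is to take the folder that directly contains the file. The sketch below shows that idea in plain Python; `newsgroup_from_filename` is a hypothetical helper, not something the book's code defines.

```python
import posixpath


def newsgroup_from_filename(filename):
    """Return the name of the folder that directly contains the file."""
    parent_dir = posixpath.dirname(filename.rstrip("/"))
    return posixpath.basename(parent_dir)


example = ("file:/home/alext/projects/spark-nlp-book/data/"
           "mini_newsgroups/sci.electronics/54165")
print(newsgroup_from_filename(example))  # sci.electronics
```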
newsgroup_counts = texts_df.groupBy('newsgroup').count().toPandas()
newsgroup_counts.plot(kind='bar', figsize=(10, 5))
plt.xticks(
    ticks=range(len(newsgroup_counts)),
    labels=newsgroup_counts['newsgroup']
)
plt.show()
Because the mini_newsgroups data set is a subset of the 20 Newsgroups data set, we have the same number of documents in each newsgroup. Now, let’s use the `explain_document_ml` pretrained pipeline.
from sparknlp.pretrained import PretrainedPipeline
The `explain_document_ml` pipeline is a pretrained pipeline that performs basic text processing steps. In order to understand what `explain_document_ml` is doing, it is necessary to give a brief description of what annotators are. An annotator is a representation of a specific NLP technique. We will go more in depth when we get to the NLP on Apache Spark chapter.
The annotators work on a document, which is the text, associated metadata, and any previously discovered annotations. This design helps annotators reuse the work of previous annotators. The downside is that it is more complex than libraries like NLTK, which are uncoupled collections of NLP functions.
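To illustrate the idea of annotators building on each other's work, here is a toy sketch in plain Python. It is not how Spark NLP is implemented; the dictionary layout, the function names, and the regular expressions are all assumptions made for the example. The point is only that the token annotator reads the sentence annotations instead of re-processing the raw text.

```python
import re


def sentence_annotator(document):
    """Add sentence annotations as (begin, end, text) with an inclusive end offset."""
    sentences = []
    for match in re.finditer(r"[^.!?]+[.!?]*", document["text"]):
        raw = match.group()
        stripped = raw.strip()
        if stripped:
            # Skip leading whitespace so begin points at the first real character
            begin = match.start() + (len(raw) - len(raw.lstrip()))
            sentences.append((begin, begin + len(stripped) - 1, stripped))
    document["annotations"]["sentence"] = sentences
    return document


def token_annotator(document):
    """Add token annotations by re-using the sentence annotations."""
    tokens = []
    for sent_begin, _sent_end, sentence in document["annotations"]["sentence"]:
        for match in re.finditer(r"\w+|[^\w\s]", sentence):
            begin = sent_begin + match.start()
            tokens.append((begin, begin + len(match.group()) - 1, match.group()))
    document["annotations"]["token"] = tokens
    return document


document = {"text": "Hellu wrold! How are you?", "metadata": {}, "annotations": {}}
for annotator in (sentence_annotator, token_annotator):
    document = annotator(document)

print(document["annotations"]["token"])
```

Swapping the order of the two annotators would fail, which is the coupling the paragraph above describes.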
The `explain_document_ml` pipeline has one `Transformer` and six annotators:
`DocumentAssembler`: a `Transformer` that creates a column that contains documents
`Sentence Segmenter`: an annotator that produces the sentences of the document
`Tokenizer`: an annotator that produces the tokens of the sentences
`SpellChecker`: an annotator that produces the spelling-corrected tokens
`Stemmer`: an annotator that produces the stems of the tokens
`Lemmatizer`: an annotator that produces the lemmas of the tokens
`POS Tagger`: an annotator that produces the parts of speech of the associated tokens
There are some new terms introduced here that we’ll discuss more in upcoming chapters.
pipeline = PretrainedPipeline('explain_document_ml', lang='en')
The `.annotate()` method of the `PretrainedPipeline` can be used to annotate single strings, as well as `DataFrame`s. Let’s look at what it produces.
pipeline.annotate('Hellu wrold!')
{'document': ['Hellu wrold!'], 'spell': ['Hello', 'world', '!'], 'pos': ['UH', 'NN', '.'], 'lemmas': ['Hello', 'world', '!'], 'token': ['Hellu', 'wrold', '!'], 'stems': ['hello', 'world', '!'], 'sentence': ['Hellu wrold!']}
This is a good amount of additional information, which brings up something that you will want to keep in mind: annotations can produce a lot of extra data. Let’s look at the schema of the raw data.
texts_df.printSchema()
root
 |-- filename: string (nullable = true)
 |-- text: string (nullable = true)
 |-- newsgroup: string (nullable = true)
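Going back to the dictionary that `.annotate()` returned for the string example, the token-level lists can be lined up to see what each annotator did to each token. The sketch below copies that output verbatim and zips the lists together. It assumes the token-level lists are aligned one-to-one, which holds for this output but is not something to rely on in general.

```python
# Output copied from the pipeline.annotate('Hellu wrold!') example
annotations = {
    'document': ['Hellu wrold!'],
    'spell': ['Hello', 'world', '!'],
    'pos': ['UH', 'NN', '.'],
    'lemmas': ['Hello', 'world', '!'],
    'token': ['Hellu', 'wrold', '!'],
    'stems': ['hello', 'world', '!'],
    'sentence': ['Hellu wrold!'],
}

token_level_keys = ['token', 'spell', 'lemmas', 'stems', 'pos']

# One tuple per token, in the order given by token_level_keys
aligned = list(zip(*(annotations[key] for key in token_level_keys)))
for row in aligned:
    print(row)
```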
Now, let’s annotate our `DataFrame` and look at the new schema.
procd_texts_df = pipeline.annotate(texts_df, 'text')
procd_texts_df.printSchema()
root
 |-- filename: string (nullable = true)
 |-- text: string (nullable = true)
 |-- newsgroup: string (nullable = true)
 |-- document: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
 |    |    |-- begin: integer (nullable = false)
 |    |    |-- end: integer (nullable = false)
 |    |    |-- result: string (nullable = true)
 |    |    |-- metadata: map (nullable = true)
 |    |    |    |-- key: string
 |    |    |    |-- value: string (valueContainsNull = true)
 |    |    |-- embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |    |    |-- sentence_embeddings: array (nullable = true)
 |    |    |    |-- element: float (containsNull = false)
 |-- sentence: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- annotatorType: string (nullable = true)
...
That is quite complex! To break it down, let’s look at the token column. It is an `Array` type column, and each element is a `Struct`. Each element has the following fields:
`annotatorType`: the type of annotation
`begin`: the starting character position of the annotation
`end`: the character position of the last character of the annotation
`result`: the output of the annotator
`metadata`: a `Map` from `String` to `String` containing additional, potentially helpful, information about the annotation
Let’s look at some of the data using `.show()`.
procd_texts_df.show(n=2)
+--------------------+--------------------+---------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|            filename|                text|      newsgroup|            document|            sentence|               token|               spell|              lemmas|               stems|                 pos|
+--------------------+--------------------+---------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|file:/home/alext/...|Path: cantaloupe....|sci.electronics|[[document, 0, 90...|[[document, 0, 46...|[[token, 0, 3, Pa...|[[token, 0, 3, Pa...|[[token, 0, 3, Pa...|[[token, 0, 3, pa...|[[pos, 0, 3, NNP,...|
|file:/home/alext/...|Newsgroups: sci.e...|sci.electronics|[[document, 0, 19...|[[document, 0, 40...|[[token, 0, 9, Ne...|[[token, 0, 9, Ne...|[[token, 0, 9, Ne...|[[token, 0, 9, ne...|[[pos, 0, 9, NNP,...|
+--------------------+--------------------+---------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
only showing top 2 rows
This is not very readable. Not only is the automatic formatting doing poorly with this data, but we can hardly see our annotations. Let’s try using some other arguments.
procd_texts_df.show(n=2, truncate=100, vertical=True)
-RECORD 0---------------------------------------------------------------------------------------------------------
 filename  | file:/home/alext/projects/spark-nlp-book/data/mini_newsgroups/sci.electronics/54165
 text      | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zap...
 newsgroup | sci.electronics
 document  | [[document, 0, 903, Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!ci...
 sentence  | [[document, 0, 468, Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!ci...
 token     | [[token, 0, 3, Path, [sentence -> 0], [], []], [token, 4, 4, :, [sentence -> 0], [], []], [token,...
 spell     | [[token, 0, 3, Path, [sentence -> 0], [], []], [token, 4, 4, :, [sentence -> 0], [], []], [token,...
 lemmas    | [[token, 0, 3, Path, [sentence -> 0], [], []], [token, 4, 4, :, [sentence -> 0], [], []], [token,...
 stems     | [[token, 0, 3, path, [sentence -> 0], [], []], [token, 4, 4, :, [sentence -> 0], [], []], [token,...
 pos       | [[pos, 0, 3, NNP, [word -> Path], [], []], [pos, 4, 4, :, [word -> :], [], []], [pos, 6, 157, JJ,...
-RECORD 1---------------------------------------------------------------------------------------------------------
 filename  | file:/home/alext/projects/spark-nlp-book/data/mini_newsgroups/sci.electronics/54057
 text      | Newsgroups: sci.electronics Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cm...
 newsgroup | sci.electronics
 document  | [[document, 0, 1944, Newsgroups: sci.electronics Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.c...
 sentence  | [[document, 0, 408, Newsgroups: sci.electronics Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc...
 token     | [[token, 0, 9, Newsgroups, [sentence -> 0], [], []], [token, 10, 10, :, [sentence -> 0], [], []],...
 spell     | [[token, 0, 9, Newsgroups, [sentence -> 0], [], []], [token, 10, 10, :, [sentence -> 0], [], []],...
 lemmas    | [[token, 0, 9, Newsgroups, [sentence -> 0], [], []], [token, 10, 10, :, [sentence -> 0], [], []],...
 stems     | [[token, 0, 9, newsgroup, [sentence -> 0], [], []], [token, 10, 10, :, [sentence -> 0], [], []], ...
 pos       | [[pos, 0, 9, NNP, [word -> Newsgroups], [], []], [pos, 10, 10, :, [word -> :], [], []], [pos, 12,...
only showing top 2 rows
Better, but this is still not useful for getting a general understanding of our corpus. We at least have a glimpse of what our pipeline is doing.
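The vertical output also lets us check how the `begin` and `end` fields relate to the text. The token `Path` is reported with begin 0 and end 3, and the colon with begin 4 and end 4, which suggests that `end` is the index of the last character rather than one past it. The sketch below mirrors the annotation fields in a plain Python dataclass to make that explicit; the `Annotation` class and its `covered_text` method are hypothetical, written only for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Annotation:
    """Plain-Python mirror of the fields in one annotation struct."""
    annotatorType: str
    begin: int
    end: int
    result: str
    metadata: Dict[str, str] = field(default_factory=dict)

    def covered_text(self, text):
        """Slice the source text; `end` is treated as inclusive."""
        return text[self.begin:self.end + 1]


text = "Path: cantaloupe.srv.cs.cmu.edu"
tokens = [
    Annotation("token", 0, 3, "Path", {"sentence": "0"}),
    Annotation("token", 4, 4, ":", {"sentence": "0"}),
]
for token in tokens:
    print(token.result, "->", repr(token.covered_text(text)))
```

Python slices exclude the upper bound, so the `+ 1` is needed when slicing with these offsets.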
Now we need to pull out the information we might want to use in other processes; that is why there is the `Finisher` `Transformer`. The `Finisher` takes annotations and pulls out the pieces of data that we will be using in downstream processes. For now, let’s pull out all the lemmas and put them into a `String` separated by spaces.
from sparknlp import Finisher
finisher = Finisher()
# taking the lemma column
finisher = finisher.setInputCols(['lemmas'])
# separating lemmas by a single space
finisher = finisher.setAnnotationSplitSymbol(' ')
finished_texts_df = finisher.transform(procd_texts_df)
finished_texts_df.show(n=1, truncate=100, vertical=True)
-RECORD 0---------------------------------------------------------------------------------------------------------------
 filename        | file:/home/alext/projects/spark-nlp-book/data/mini_newsgroups/sci.electronics/54165
 text            | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zap...
 newsgroup       | sci.electronics
 finished_lemmas | [Path, :, cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu...
only showing top 1 row
Normally, we’ll be using the `.setOutputAsArray(True)` option so that the output is an `Array` instead of a `String`.
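Conceptually, finishing means keeping only the `result` of each annotation and then deciding whether to return the values as a list or as one joined string. The following is a toy sketch of that idea in plain Python, not Spark NLP's implementation; the `finish` function, its parameter names, and its defaults are assumptions made for illustration.

```python
def finish(annotations, output_as_array=True, split_symbol=" "):
    """Keep only the `result` of each annotation; a sketch of the idea only."""
    results = [annotation["result"] for annotation in annotations]
    if output_as_array:
        return results
    return split_symbol.join(results)


# Two lemma annotations shaped like the structs shown earlier
lemmas = [
    {"annotatorType": "token", "begin": 0, "end": 3, "result": "Path"},
    {"annotatorType": "token", "begin": 4, "end": 4, "result": ":"},
]

print(finish(lemmas))
print(finish(lemmas, output_as_array=False))
```

The list form keeps token boundaries intact, which matters when a token itself contains the split symbol.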
Let’s look at the final result on the first document.
finished_texts_df.select('finished_lemmas').take(1)
[Row(finished_lemmas=['Path', ':', 'cantaloupe.srv.cs.cmu.edu!magnesium.club.cc.cmu.edu!news.sei.cmu.edu!cis.ohio-state.edu!zaphod.mps.ohio-state.edu!news.acns.nwu.edu!uicvm.uic.edu!u19250', 'Organization', ':', 'University', 'of', 'Illinois', 'at', 'Chicago', ',', 'academic', 'Computer', 'Center', 'Date', ':', 'Sat', ',', '24', 'Apr', '1993', '14:28:35', 'CDT', 'From', ':', '<[email protected]>', 'Message-ID', ':', '<[email protected]>', 'Newsgroups', ':', 'sci.electronics', 'Subject', ':', 'multiple', 'input', 'for', 'PC', 'Lines', ':', '8', 'Can', 'anyone', 'offer', 'a', 'suggestion', 'on', 'a', 'problem', 'I', 'be', 'have', '?', 'I', 'have', 'several', 'board', 'whose', 'sole', 'purpose', 'be', 'to', 'decode', 'DTMF', 'tone', 'and', 'send', 'the', 'resultant', 'in', 'ASCII', 'to', 'a', 'PC', '.', 'These', 'board', 'run', 'on', 'the', 'serial', 'interface', '.', 'I', 'need', 'to', 'run', '*', 'of', 'the', 'board', 'somewhat', 'simultaneously', '.', 'I', 'need', 'to', 'be', 'able', 'to', 'ho', 'ok', 'they', 'up', 'to', 'a', 'PC', '>', 'The', 'problem', 'be', ',', 'how', 'do', 'I', 'hook', 'up', '8', '+', 'serial', 'device', 'to', 'one', 'PC', 'inexpensively', ',', 'so', 'that', 'all', 'can', 'send', 'data', 'simultaneously', '(', 'or', 'close', 'to', 'it', ')', '?', 'Any', 'help', 'would', 'be', 'greatly', 'appreciate', '!', 'Achin', 'Single'])]
It doesn’t look like much has been done here, but there is still a lot to unpack. In the next chapter, we will explain some basics of linguistics that will help us understand what these annotators are doing.