Front Matter

Akshay Kulkarni and Adarsha Shivananda

Natural Language Processing Recipes: Unlocking Text Data with Machine Learning and Deep Learning Using Python

Akshay Kulkarni

Bangalore, Karnataka, India

Adarsha Shivananda

Bangalore, Karnataka, India

ISBN 978-1-4842-4266-7 e-ISBN 978-1-4842-4267-4

https://doi.org/10.1007/978-1-4842-4267-4

Library of Congress Control Number: 2019931849

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

To our family

Introduction

According to industry estimates, more than 80% of the data being generated is in an unstructured format, such as text, images, audio, or video. Data is generated as we speak, as we write, as we tweet, as we use social media platforms, as we send messages on various messaging platforms, as we shop on e-commerce sites, and in various other activities. The majority of this data exists in textual form.

So, what is unstructured data? Unstructured data is the information that doesn’t reside in a traditional relational database. Examples include documents, blogs, social media feeds, pictures, and videos.

Most insight is locked within different types of unstructured data. Unlocking this unstructured data plays a vital role in helping every organization make better decisions. In this book, let us unlock the potential of text data.

Text data is the most common type and accounts for more than 50% of unstructured data. A few examples include tweets and posts on social media, chat conversations, news, blogs and articles, product or service reviews, and patient records in the health care sector. More recent examples include voice-driven bots like Siri and Alexa.

To produce significant and actionable insights from text data, and so unlock its potential, we use Natural Language Processing coupled with machine learning and deep learning.

But what is Natural Language Processing, popularly known as NLP? Machines and algorithms cannot work directly with raw text or characters, so the text must first be converted into a machine-readable format (such as numbers or binary) before any analysis can be performed on it. The ability to make machines understand and interpret human language (text data) is termed natural language processing.
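To make the idea of converting text into numbers concrete, here is a minimal sketch that uses only the Python standard library. It is not taken from the book's recipes: the function names (tokenize, build_vocabulary, to_count_vector) and the two sample sentences are hypothetical, and the whitespace tokenizer is deliberately naive. Each distinct word gets an integer ID, and each sentence becomes a vector of word counts.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_vocabulary(documents):
    """Assign each distinct token an integer ID, in order of first appearance."""
    vocabulary = {}
    for document in documents:
        for token in tokenize(document):
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def to_count_vector(text, vocabulary):
    """Return a list whose position i holds the count of the token with ID i."""
    counts = Counter(tokenize(text))
    vector = [0] * len(vocabulary)
    for token, index in vocabulary.items():
        vector[index] = counts[token]  # a Counter returns 0 for absent tokens
    return vector


# Hypothetical sample sentences, used only for illustration.
documents = ["I like NLP", "I like deep learning"]
vocabulary = build_vocabulary(documents)
vectors = [to_count_vector(document, vocabulary) for document in documents]

print(vocabulary)  # {'i': 0, 'like': 1, 'nlp': 2, 'deep': 3, 'learning': 4}
print(vectors)     # [[1, 1, 1, 0, 0], [1, 1, 0, 1, 1]]
```

Words that are not in the vocabulary are simply ignored by this sketch; real feature extraction usually needs a more careful tokenizer and some policy for unseen words.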

So, if you want to use the power of unstructured text, this book is the right starting point. This book unearths the concepts and implementation of natural language processing and its applications in the real world. Natural Language Processing (NLP) offers unbounded opportunities for solving interesting problems in artificial intelligence, making it the latest frontier for developing intelligent, deep learning-based applications.

What This Book Covers

Natural Language Processing Recipes is your handy problem-solution reference for learning and implementing NLP solutions using Python. The book is packed with code examples and approaches that help you quickly learn and implement basic and advanced Natural Language Processing techniques. You will learn how to efficiently use a wide range of NLP packages and implement text classification, part-of-speech tagging, topic modeling, text summarization, text generation, sentiment analysis, and many more applications of NLP.

This book starts with ways of extracting text data, including web scraping. You will also learn how to clean and preprocess text data and how to analyze it with advanced algorithms. During the course of the book, you will explore the semantic as well as syntactic analysis of text. We will cover complex NLP solutions that involve text normalization, various advanced preprocessing methods, POS tagging, text similarity, text summarization, sentiment analysis, topic modeling, NER, word2vec, seq2seq, and much more. We will also cover the fundamentals necessary for applying machine learning and deep learning to natural language processing, along with other state-of-the-art techniques. Finally, we close with some advanced industrial applications of NLP, with the solution approach and implementation, leveraging the power of deep learning techniques for Natural Language Processing and Natural Language Generation problems. This includes employing state-of-the-art advanced RNNs, like long short-term memory, to solve complex text generation tasks. We also explore word embeddings.

Each chapter includes several code examples and illustrations.

By the end of the book, the reader will have a clear understanding of how to implement natural language processing and will have worked on multiple examples that apply NLP techniques in the real world. The reader will be comfortable with various NLP techniques coupled with machine learning and deep learning, and with their industrial applications, which makes the NLP journey much more interesting and will help improve Python coding skills as well. You will learn about all the ingredients you need to become successful in the NLP space.

Who This Book Is For

Fundamental Python skills are assumed, as well as some knowledge of machine learning. If you are an NLP or machine learning enthusiast and an intermediate Python programmer who wants to quickly master natural language processing, then this learning path will do you a lot of good. All you need are the basics of machine learning and Python to enjoy this book.

What you will learn:

1) Core NLP concepts and their implementation, along with various approaches to natural language processing using Python libraries such as NLTK, TextBlob, SpaCy, Stanford CoreNLP, and so on.
2) Implementing text preprocessing and feature engineering in NLP, along with advanced methods of feature engineering like word embeddings.
3) Understanding and implementing the concepts of information retrieval, text summarization, sentiment analysis, text classification, text generation, and other advanced NLP techniques by leveraging machine learning and deep learning.
4) The problems faced by different industries and how to solve them using NLP techniques.
5) Implementing an end-to-end pipeline of the NLP life cycle, which includes framing the problem, finding the data, collecting and preprocessing the data, and solving the problem using state-of-the-art techniques.

What You Need For This Book

To perform all the recipes of this book successfully, you will need Python 3.x or higher running on any Windows- or Unix-based operating system with a processor of 2.0 GHz or higher and a minimum of 4 GB RAM. You can download Python from Anaconda and leverage Jupyter notebook for all coding purposes. This book assumes you know Keras’s basics and how to install the basic libraries of machine learning and deep learning.

Please make sure you upgrade or install the latest version of all the libraries.

Python is the most popular and widely used tool for building NLP applications. It has a huge number of sophisticated libraries to perform NLP tasks starting from basic preprocessing to advanced techniques.

To install any library from within a Jupyter notebook, use “!” before the pip install command.

NLTK: The Natural Language Toolkit, commonly called the mother of all NLP libraries. It is one of the most mature primary resources when it comes to Python and NLP.

!pip install nltk

import nltk

nltk.download()

SpaCy: SpaCy is a recently trending library, as it comes with the added flavors of a deep learning framework. While SpaCy doesn’t cover all NLP functionality, the things that it does do, it does really well.

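!pip install spacy

# If the above doesn't work, try this in your terminal/command prompt

conda install -c conda-forge spacy

# Download an English model (on spaCy 1.x the command was: python -m spacy.en.download all)

python -m spacy download en_core_web_sm

# Then load the model via

import spacy

nlp = spacy.load('en_core_web_sm')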

TextBlob: This is one of data scientists’ favorite libraries when it comes to implementing NLP tasks. It is based on both NLTK and Pattern. However, TextBlob certainly isn’t the fastest or most complete library.

!pip install textblob

CoreNLP: A Python wrapper for Stanford CoreNLP. The toolkit provides very robust, accurate, and optimized techniques for tagging, parsing, and analyzing text in various languages.

!pip install CoreNLP

These are not the only ones; there are hundreds of NLP libraries. But we have covered the most widely used and important ones.
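Before running the recipes, it can help to check which of these libraries are already available in your environment. The following is a minimal sketch using only the Python standard library. The helper names (is_installed, pip_install_command) are hypothetical, and the import names in the list are assumptions: "corenlp" in particular may differ depending on which CoreNLP wrapper you install. The sketch only prints a suggested command and does not install anything.

```python
import importlib.util
import sys


def is_installed(module_name):
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(module_name) is not None


def pip_install_command(package_name):
    """Build (but do not run) a pip command for the current interpreter."""
    return [sys.executable, "-m", "pip", "install", "--upgrade", package_name]


# Assumed import names; adjust them to match the packages you actually use.
libraries = ["nltk", "spacy", "textblob", "corenlp"]

for name in libraries:
    if is_installed(name):
        print(f"{name}: installed")
    else:
        print(f"{name}: missing, try: {' '.join(pip_install_command(name))}")
```

Building the command from sys.executable targets the same Python interpreter that runs the notebook or script, which avoids installing a package into a different environment by accident.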

Motivation: There is an immense number of industrial applications of NLP that are leveraged to uncover insights. By the end of the book, you will have implemented most of these use cases end to end, right from framing the business problem to building applications and drawing business insights.

Sentiment analysis: Customers’ emotions toward products offered by the business.
Topic modeling: Extract the unique topics from the group of documents.
Complaint classifications/Email classifications/E-commerce product classification, etc.
Document categorization/management using different clustering techniques.
Resume shortlisting and job description matching using similarity methods.
Advanced feature engineering techniques (word2vec and fastText) to capture context.
Information/Document Retrieval Systems, for example, search engine.
Chatbot, Q & A, and Voice-to-Text applications like Siri and Alexa.
Language detection and translation using neural networks.
Text summarization using graph methods and advanced techniques.
Text generation/predicting the next sequence of words using deep learning algorithms.

Acknowledgments

We are grateful to our mother, father, and loving brother and sister. We thank all of them for their motivation and constant support.

We would like to express our gratitude to our mentors and friends for their inputs, inspiration, and support. A special thanks to Anoosh R. Kulkarni, Data Scientist at Awok.com, for all his support in writing this book and his technical inputs. Big thanks to the Apress team for their constant support and help.

Finally, we would like to thank you, the reader, for showing an interest in this book. We believe it can make your natural language processing journey more interesting and exciting.

Note that the views expressed in this book are the authors’ personal ones.

Table of Contents

Chapter 1: Extracting the Data
Introduction
Recipe 1-1. Collecting Data
Problem
Solution
How It Works
Recipe 1-2. Collecting Data from PDFs
Problem
Solution
How It Works
Recipe 1-3. Collecting Data from Word Files
Problem
Solution
How It Works
Recipe 1-4. Collecting Data from JSON
Problem
Solution
How It Works
Recipe 1-5. Collecting Data from HTML
Problem
Solution
How It Works
Recipe 1-6. Parsing Text Using Regular Expressions
Problem
Solution
How It Works
Recipe 1-7. Handling Strings
Problem
Solution
How It Works
Recipe 1-8. Scraping Text from the Web
Problem
Solution
How It Works

Chapter 2: Exploring and Processing Text Data
Recipe 2-1. Converting Text Data to Lowercase
Problem
Solution
How It Works
Recipe 2-2. Removing Punctuation
Problem
Solution
How It Works
Recipe 2-3. Removing Stop Words
Problem
Solution
How It Works
Recipe 2-4. Standardizing Text
Problem
Solution
How It Works
Recipe 2-5. Correcting Spelling
Problem
Solution
How It Works
Recipe 2-6. Tokenizing Text
Problem
Solution
How It Works
Recipe 2-7. Stemming
Problem
Solution
How It Works
Recipe 2-8. Lemmatizing
Problem
Solution
How It Works
Recipe 2-9. Exploring Text Data
Problem
Solution
How It Works
Recipe 2-10. Building a Text Preprocessing Pipeline
Problem
Solution
How It Works

Chapter 3: Converting Text to Features
Recipe 3-1. Converting Text to Features Using One Hot Encoding
Problem
Solution
How It Works
Recipe 3-2. Converting Text to Features Using Count Vectorizing
Problem
Solution
How It Works
Recipe 3-3. Generating N-grams
Problem
Solution
How It Works
Recipe 3-4. Generating Co-occurrence Matrix
Problem
Solution
How It Works
Recipe 3-5. Hash Vectorizing
Problem
Solution
How It Works
Recipe 3-6. Converting Text to Features Using TF-IDF
Problem
Solution
How It Works
Recipe 3-7. Implementing Word Embeddings
Problem
Solution
How It Works
Recipe 3-8. Implementing fastText
Problem
Solution
How It Works

Chapter 4: Advanced Natural Language Processing
Recipe 4-1. Extracting Noun Phrases
Problem
Solution
How It Works
Recipe 4-2. Finding Similarity Between Texts
Problem
Solution
How It Works
Recipe 4-3. Tagging Part of Speech
Problem
Solution
How It Works
Recipe 4-4. Extract Entities from Text
Problem
Solution
How It Works
Recipe 4-5. Extracting Topics from Text
Problem
Solution
How It Works
Recipe 4-6. Classifying Text
Problem
Solution
How It Works
Recipe 4-7. Carrying Out Sentiment Analysis
Problem
Solution
How It Works
Recipe 4-8. Disambiguating Text
Problem
Solution
How It Works
Recipe 4-9. Converting Speech to Text
Problem
Solution
How It Works
Recipe 4-10. Converting Text to Speech
Problem
Solution
How It Works
Recipe 4-11. Translating Speech
Problem
Solution
How It Works

Chapter 5: Implementing Industry Applications
Recipe 5-1. Implementing Multiclass Classification
Problem
Solution
How It Works
Recipe 5-2. Implementing Sentiment Analysis
Problem
Solution
How It Works
Recipe 5-3. Applying Text Similarity Functions
Problem
Solution
How It Works
Recipe 5-4. Summarizing Text Data
Problem
Solution
How It Works
Recipe 5-5. Clustering Documents
Problem
Solution
How It Works
Recipe 5-6. NLP in a Search Engine
Problem
Solution
How It Works

Chapter 6: Deep Learning for NLP
Introduction to Deep Learning
Convolutional Neural Networks
Recurrent Neural Networks
Recipe 6-1. Retrieving Information
Problem
Solution
How It Works
Recipe 6-2. Classifying Text with Deep Learning
Problem
Solution
How It Works
Recipe 6-3. Next Word Prediction
Problem
Solution
How It Works

Index

About the Authors and About the Technical Reviewers

About the Authors

Akshay Kulkarni is an Artificial Intelligence and Machine Learning evangelist. Akshay has rich experience in building and scaling AI and Machine Learning businesses and creating significant client impact. He is currently the Senior Data Scientist on SapientRazorfish’s core data science team, where he is part of strategy and transformation interventions through AI and works on various Machine Learning, Deep Learning, and Artificial Intelligence engagements by applying state-of-the-art techniques in this space. Previously he was part of Gartner and Accenture, where he scaled the analytics and data science business.

Akshay is a regular speaker at major data science conferences. He is a visiting faculty member at a few of the top graduate institutes in India. In his spare time, he enjoys reading, writing, coding, and helping aspiring data scientists. He lives in Bangalore with his family.

Adarsha Shivananda is a Senior Data Scientist on Indegene’s Product and Technology team, where he is working on building Machine Learning and AI capabilities for pharma products. He aims to build a pool of exceptional data scientists within and outside of the organization to solve greater problems through brilliant training programs, and he always wants to stay ahead of the curve. Previously he worked with Tredence Analytics and IQVIA. Adarsha has worked extensively in the pharma, health care, retail, and marketing domains.

He lives in Bangalore and loves to read, ride, and teach data science.

About the Technical Reviewers

Dikshant Shahi is a Software Architect with expertise in Search Engines, Semantic Technologies, and Natural Language Processing. He is currently focusing on building semantic search platforms and related enterprise applications. He has been building search engines for more than a decade and is also the author of the book Apache Solr: A Practical Approach to Enterprise Search (Apress, 2015).

Dikshant lives in Bangalore, India. When not at work, you can find him backpacking.

Krishnendu Dasgupta is a Senior Consultant with 8 years of experience. He has worked on different cloud platforms and has designed data mining architectures. He is working and contributing toward NLP and Artificial Intelligence through his work. He has worked with major consulting firms and has experience in supply chain and banking domains.

Krishnendu is accredited by the Global Innovation and Entrepreneurship Bootcamp – Class of 2018, held by the Massachusetts Institute of Technology.
