Akshay Kulkarni and Adarsha Shivananda
Natural Language Processing RecipesUnlocking Text Data with Machine Learning and Deep Learning using Python
Akshay Kulkarni
Bangalore, Karnataka, India
Adarsha Shivananda
Bangalore, Karnataka, India
ISBN 978-1-4842-4266-7e-ISBN 978-1-4842-4267-4
Library of Congress Control Number: 2019931849
© Akshay Kulkarni and Adarsha Shivananda 2019
Standard Apress
Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.
Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail [email protected], or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

To our family

Introduction
According to industry estimates, more than 80% of the data being generated is in an unstructured format, maybe in the form of text, image, audio, video, etc. Data is getting generated as we speak, as we write, as we tweet, as we use social media platforms, as we send messages on various messaging platforms, as we use e-commerce for shopping and in various other activities. The majority of this data exists in the textual form.
../images/475440_1_En_BookFrontmatter_Figf_HTML.png

So, what is unstructured data? Unstructured data is the information that doesn’t reside in a traditional relational database. Examples include documents, blogs, social media feeds, pictures, and videos.

Most of the insight is locked within different types of unstructured data. Unlocking all these unstructured data plays a vital role in every organization to make improved and better decisions. In this book, let us unlock the potential of text data.

Text data is most common and covers more than 50% of the unstructured data. A few examples include – tweets/posts on social media, chat conversations, news, blogs and articles, product or services reviews, and patient records in the health care sector. A few more recent ones include voice-driven bots like Siri, Alexa, etc.

In order to produce significant and actionable insights from text data, to unlock the potential of text data, we use Natural Language Processing coupled with machine learning and deep learning.

But what is Natural Language Processing - popularly known as NLP? We all know that machines/algorithms cannot understand texts or characters, so it is very important to convert these text data into machine understandable format (like numbers or binary) to perform any kind of analysis on text data. The ability to make machines understand and interpret the human language (text data) is termed as natural language processing.

So, if you want to use the power of unstructured text, this book is the right starting point. This book unearths the concepts and implementation of natural language processing and its applications in the real world. Natural Language Processing (NLP) offers unbounded opportunities for solving interesting problems in artificial intelligence, making it the latest frontier for developing intelligent, deep learning-based applications.

What This Book Covers

Natural Language Processing Recipes is your handy problem-solution reference for learning and implementing NLP solutions using Python. The book is packed with thousands of code and approaches that help you to quickly learn and implement the basic and advanced Natural Language Processing techniques. You will learn how to efficiently use a wide range of NLP packages and implement text classification, identify parts of speech, topic modeling, text summarization, text generation, sentiment analysis, and many more applications of NLP.

This book starts off by ways of extracting text data along with web scraping. You will also learn how to clean and preprocess text data and ways to analyze them with advanced algorithms. During the course of the book, you will explore the semantic as well as syntactic analysis of the text. We will be covering complex NLP solutions that will involve text normalization, various advanced preprocessing methods, POS tagging, text similarity, text summarization, sentiment analysis, topic modeling, NER, word2vec, seq2seq, and much more. In this book, we will cover the various fundamentals necessary for applications of machine learning and deep learning in natural language processing, and the other state-of-the-art techniques. Finally, we close it with some of the advanced industrial applications of NLP with the solution approach and implementation, also leveraging the power of deep learning techniques for Natural Language Processing and Natural Language Generation problems. Employing state-of-the-art advanced RNNs, like long short-term memory, to solve complex text generation tasks. Also, we explore word embeddings.

Each chapter includes several code examples and illustrations.

By the end of the book, the reader will have a clear understanding of implementing natural language processing and will have worked on multiple examples that implement NLP techniques in the real world. The reader will be comfortable with various NLP techniques coupled with machine learning and deep learning and its industrial applications, which make the NLP journey much more interesting and will definitely help improve Python coding skills as well. You will learn about all the ingredients that you need to, to become successful in the NLP space.

Who This Book Is For

Fundamental Python skills are assumed, as well as some knowledge of machine learning. If you are an NLP or machine learning enthusiast and an intermediate Python programmer who wants to quickly master natural language processing, then this learning path will do you a lot of good. All you need are the basics of machine learning and Python to enjoy this book.

What you will learn:
  1. 1)

    Core concepts implementation of NLP and various approaches to natural language processing, NLP using Python libraries such as NLTK, TextBlob, SpaCy, Stanford CoreNLP, and so on.

     
  2. 2)

    Learn about implementing text preprocessing and feature engineering in NLP, along with advanced methods of feature engineering like word embeddings.

     
  3. 3)

    Understand and implement the concepts of information retrieval, text summarization, sentiment analysis, text classification, text generation, and other advanced NLP techniques solved by leveraging machine learning and deep learning.

     
  4. 4)

    After reading this book, the reader should get a good hold of the problems faced by different industries and how to implement them using NLP techniques.

     
  5. 5)

    Implementing an end-to-end pipeline of the NLP life cycle, which includes framing the problem, finding the data, collecting, preprocessing the data, and solving it using state-of-the-art techniques .

     

What You Need For This Book

To perform all the recipes of this book successfully, you will need Python 3.x or higher running on any Windows- or Unix-based operating system with a processor of 2.0 GHz or higher and a minimum of 4 GB RAM. You can download Python from Anaconda and leverage Jupyter notebook for all coding purposes. This book assumes you know Keras’s basics and how to install the basic libraries of machine learning and deep learning.

Please make sure you upgrade or install the latest version of all the libraries.

Python is the most popular and widely used tool for building NLP applications. It has a huge number of sophisticated libraries to perform NLP tasks starting from basic preprocessing to advanced techniques.

To install any library in Python Jupyter notebook. use “!” before the pip install.

NLTK : Natural language toolkit and commonly called the mother of all NLP libraries. It is one of the mature primary resources when it comes to Python and NLP.

!pip install nltk
nltk.download()

SpaCy : SpaCy is recently a trending library, as it comes with the added flavors of a deep learning framework. While SpaCy doesn’t cover all of the NLP functionalities, the things that it does do, it does really well.

!pip install spacy
#if above doesn't work, try this in your terminal/ command prompt
conda install spacy
python -m spacy.en.download all
#then load model via
spacy.load('en')

TextBlob : This is one of the data scientist’s favorite library when it comes to implementing NLP tasks. It is based on both NLTK and Pattern. However, TextBlob certainly isn’t the fastest or most complete library.

!pip install textblob

CoreNLP : It is a Python wrapper for Stanford CoreNLP . The toolkit provides very robust, accurate, and optimized techniques for tagging, parsing, and analyzing text in various languages.

!pip install CoreNLP

These are not the only ones; there are hundreds of NLP libraries. But we have covered widely used and important ones.

Motivation : There is an immense number of industrial applications of NLP that are leveraged to uncover insights. By the end of the book, you will have implemented most of these use cases end to end, right from framing the business problem to building applications and drawing business insights.
  • Sentiment analysis: Customer’s emotions toward products offered by the business.

  • Topic modeling: Extract the unique topics from the group of documents.

  • Complaint classifications/Email classifications/E-commerce product classification, etc.

  • Document categorization/management using different clustering techniques.

  • Resume shortlisting and job description matching using similarity methods.

  • Advanced feature engineering techniques (word2vec and fastText) to capture context.

  • Information/Document Retrieval Systems, for example, search engine.

  • Chatbot, Q & A, and Voice-to-Text applications like Siri and Alexa.

  • Language detection and translation using neural networks.

  • Text summarization using graph methods and advanced techniques.

  • Text generation/predicting the next sequence of words using deep learning algorithms .

Acknowledgments

We are grateful to our mother, father, and loving brother and sister. We thank all of them for their motivation and constant support.

We would like to express our gratitude to mentors and friends for their inputs, inspiration, and support. A special thanks to Anoosh R. Kulkarni, Data Scientist at Awok.com for all his support in writing this book and his technical inputs. Big thanks to the Apress team for their constant support and help.

Finally, we would like to thank you, the reader, for showing an interest in this book and believe that you can make your natural language processing journey more interesting and exciting.

Note that the views expressed in this book are the authors’ personal ones.

Table of Contents

Index 229

About the Authors and About the Technical Reviewers

About the Authors

Akshay Kulkarni
../images/475440_1_En_BookFrontmatter_Figb_HTML.png
is an Artificial Intelligence and Machine learning evangelist. Akshay has a rich experience of building and scaling AI and Machine Learning businesses and creating significant client impact. He is currently the Senior Data Scientist at SapientRazorfish’s core data science team where he is part of strategy and transformation interventions through AI and works on various Machine Learning, Deep Learning, and Artificial Intelligence engagements by applying state-of-the-art techniques in this space. Previously he was part of Gartner and Accenture, where he scaled the analytics and data science business.

Akshay is a regular speaker at major data science conferences. He is a visiting faculty at few of the top graduate institutes in India. In his spare time, he enjoys reading, writing, coding, and helping aspiring data scientists. He lives in Bangalore with his family.

 
Adarsha Shivananda
../images/475440_1_En_BookFrontmatter_Figc_HTML.png
is a Senior Data Scientist at Indegene’s Product and Technology team where he is working on building Machine Learning and AI capabilities to pharma products. He is aiming to build a pool of exceptional data scientists within and outside of the organization to solve greater problems through brilliant training programs; and he always wants to stay ahead of the curve. Previously he was working with Tredence Analytics and IQVIA. Adarsha has extensively worked on pharma, health care, retail, and marketing domains.

He lives in Bangalore and loves to read, ride, and teach data science.

 

About the Technical Reviewers

Dikshant Shahi
../images/475440_1_En_BookFrontmatter_Figd_HTML.png
is a Software Architect with expertise in Search Engines, Semantic Technologies, and Natural Language Processing. He is currently focusing on building semantic search platforms and related enterprise applications. He has been building search engines for more than a decade and is also the author of the book Apache Solr: A Practical Approach to Enterprise Search (Apress, 2015).

Dikshant lives in Bangalore, India. When not at work, you can find him backpacking.

 
Krishnendu Dasgupta
../images/475440_1_En_BookFrontmatter_Fige_HTML.png
is a Senior Consultant with 8 years of experience. He has worked on different cloud platforms and has designed data mining architectures. He is working and contributing toward NLP and Artificial Intelligence through his work. He has worked with major consulting firms and has experience in supply chain and banking domains.

Krishnendu is accredited by the Global Innovation and Entrepreneurship Bootcamp – Class of 2018, held by the Massachusetts Institute of Technology.

 
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.116.15.84