Book Description

If you want to build an enterprise-quality application that uses natural language text, but aren’t sure where to begin or what tools to use, this practical guide will help get you started. You’ll explore special concerns for developing text-based applications, such as performance.

Alex Thomas, data scientist at Indeed, shows software engineers and data scientists how to build scalable NLP applications using deep learning and the Apache Spark NLP library. Through concrete examples, practical and theoretical explanations, and hands-on exercises for using NLP on the Spark processing framework, this book teaches you everything from NLP basics to applications of powerful modern techniques.

  • Process text in a distributed environment using Spark NLP, a production-ready library for NLP built on Spark
  • Create, tune, and deploy your own word embeddings
  • Adapt your NLP applications to multiple languages
  • Use text in machine learning and deep learning
  • Learn why these techniques work from machine learning, linguistic, and practical points of view

Table of Contents

  1. Preface
    1. Why Natural Language Processing can be difficult
    2. Background
  2. I. Part 1: Basics
  3. 1. Getting Started
    1. Introduction
    2. Setting up your environment
      1. Prerequisites
      2. Starting Apache Spark
      3. Checking out the code
    3. Getting Familiar with Apache Spark
      1. Starting Apache Spark with Spark NLP
      2. Loading & viewing data in Apache Spark
    4. Hello World with Spark NLP
  4. 2. Natural Language Basics
    1. What is natural language?
      1. Origins of language
      2. Spoken language vs. written language
    2. Linguistics
      1. Phonetics & Phonology
      2. Morphology
      3. Syntax
      4. Semantics
    3. Sociolinguistics: Dialects, Registers, and Other Varieties
      1. Formality
      2. Context
    4. Pragmatics
      1. Roman Jakobson
      2. How to use Pragmatics
    5. Writing Systems
      1. Origins
      2. Alphabets
      3. Abjads
      4. Abugidas
      5. Syllabaries
      6. Logographs
    6. Encodings
      1. ASCII
      2. Unicode
      3. UTF-8
    7. Exercise: Tokenizing
    8. Resources
  5. 3. NLP on Apache Spark
    1. Parallelism, concurrency, distributing computation
      1. Parallelization before Apache Hadoop
      2. MapReduce and Apache Hadoop
      3. Apache Spark
    2. Architecture of Apache Spark
      1. Physical architecture
      2. Logical architecture
    3. SparkSQL and Spark MLlib
      1. Transformers
      2. Estimators and Models
      3. Evaluators
    4. NLP libraries
      1. Functionality Libraries
      2. Annotation Libraries
      3. NLP in other libraries
    5. Spark NLP
      1. An annotation library
      2. Stages
      3. Pretrained pipelines
      4. Finisher
      5. Exercises
  6. 4. Deep Learning Basics
    1. Gradient Descent
    2. Backpropagation
    3. Convolutional Neural Networks
    4. Recurrent Neural Networks
    5. Exercises
    6. Resources
  7. II. Building Blocks
  8. 5. Processing Words
    1. Tokenization
    2. Vocabulary reduction
    3. Bag-of-Words
    4. n-Grams
    5. Visualizing: Word and Document Distributions
    6. Exercises
    7. Resources
  9. 6. Information Retrieval
    1. Inverted Indices
    2. Vector Space Model
      1. Stop word removal
      2. Inverse Document Frequency
      3. In Spark
      4. Exercises
      5. Resources
  10. 7. Classification and Regression
    1. Bag-of-Words Features
    2. Regular Expression Features
    3. Feature Selection
    4. Modeling
    5. Iteration
    6. Exercises
  11. 8. Sequence Modeling with Keras
    1. Sentence segmentation
    2. Section segmentation
    3. Part-of-speech tagging
    4. Chunking and Syntactic Parsing
    5. Language models
    6. Recurrent Neural Networks
    7. Exercises
    8. Resources
  12. 9. Information Extraction
    1. Named Entity Recognition
    2. Coreference Resolution
    3. Assertion Status Detection
    4. Relationship Extraction
    5. Summary
    6. Exercises
  13. 10. Topic Modeling
    1. K-Means
    2. Exercises
  14. 11. Embeddings
    1. word2vec
    2. GloVe
    3. fastText
    4. Transformer
    5. ELMo, BERT, and XLNet
    6. doc2vec
    7. Exercise
  15. III. Applications
  16. 12. Sentiment Analysis & Emotion Detection
    1. Problem statement & Constraints
    2. Plan the project
    3. Design the solution
    4. Implement the solution
    5. Test & Measure the solution
    6. Review
  17. 13. Building Knowledge Bases
    1. Problem statement & Constraints
    2. Plan the project
    3. Design the solution
    4. Implement the solution
    5. Test & Measure the solution
      1. Business metrics
      2. Model-centric metrics
      3. Infrastructure metrics
      4. Process metrics
  18. 14. Semantic Search
    1. Problem statement & Constraints
    2. Plan the project
    3. Design the solution
    4. Implement the solution