Book Description

Turning text into valuable information is essential for many businesses looking to gain a competitive advantage. Natural language processing has improved considerably, and practitioners now have many options when approaching a problem. However, it’s not always clear which NLP tools or libraries would work for a given business use case, or which techniques you should use and in what order.

This practical book provides theoretical background and real-world case studies with detailed code examples to help developers and data scientists obtain insight from text online. Authors Jens Albrecht, Sidharth Ramachandran, and Christian Winkler use blueprints for text-related problems that apply state-of-the-art machine learning methods in Python.

If you have a fundamental understanding of statistics and machine learning along with basic programming experience in Python, you’re ready to get started. You’ll learn how to:

  • Crawl and clean, then explore and visualize textual data in different formats
  • Preprocess and vectorize text for machine learning
  • Apply methods for classification, topic analysis, summarization, and knowledge extraction
  • Use semantic word embeddings and deep learning approaches for complex problems
  • Work with Python NLP libraries like spaCy, NLTK, and Gensim in combination with scikit-learn, Pandas, and PyTorch

Table of Contents

  1. Gaining Early Insights from Textual Data
    1. Exploratory Data Analysis
    2. Introducing the Dataset
    3. Blueprint: Building a Simple Text Preprocessing Pipeline
    4. Blueprints for Word Frequency Analysis
    5. Blueprint: Finding a Keyword in Context (KWIC)
    6. Blueprint: Analyzing N-Grams
    7. Blueprint: Comparing Frequencies across Time-Intervals and Categories
    8. Closing Remarks
  2. Scraping Websites and Extracting Data
    1. What You’ll Learn and What We Will Build
    2. Scraping and Data Extraction
    3. Introducing the Reuters News Archive
    4. URL Generation
    5. Downloading Data
    6. Extracting Semi-structured Data
    7. Blueprint: Spidering
    8. Density-based Text Extraction
    9. All-in-one Approach
    10. Possible Problems with Scraping
    11. Closing Remarks and Recommendation
  3. How to use text classification algorithms to identify and classify text into multiple categories
    1. Introducing the Java Development Tools Bug Dataset
    2. Blueprint: Building a Text Classification System
    3. Final Blueprint for Text Classification
    4. Cross-Validation
    5. Hyperparameter Tuning with Grid Search
    6. Blueprint Recap and Conclusion
    7. Closing Remarks
    8. Further Reading