Overview
This chapter describes the use of Topic Modeling to understand common themes in a document set by analyzing documents with Amazon Comprehend. You will learn the fundamentals of the algorithm used for Topic Modeling, Latent Dirichlet Allocation (LDA), which will allow you to apply Topic Modeling to a multitude of unique business cases. You will then perform Topic Modeling on two documents with a known topic structure. By the end of this chapter, you will be able to describe the basics of Topic Modeling analysis, extract common themes from a set of documents with Amazon Comprehend, and analyze the results.
Topic Modeling is an important capability for business systems that need to make sense of unstructured information, ranging from support tickets to customer feedback, complaints, and business documents. Topic Modeling supports process automation for routing customer feedback and mail; it enables a business to categorize, and then respond effectively to, social media posts, reviews, and other user-generated content across channels. By understanding the topics and themes in incoming omnichannel interactions, businesses can respond faster to critical items and route materials to the most appropriate teams. Two other areas where Topic Modeling helps are knowledge management and brand monitoring.
The subjects or common themes of a set of documents can be determined with Amazon Comprehend. For example, suppose you have a movie review website with two message boards, and you want to determine which message board is discussing each of two newly released movies (one about sports and the other about a political topic). You can provide the message board text data to Amazon Comprehend to discover the most prominent topics discussed on each board.
The machine learning algorithm that Amazon Comprehend uses to perform Topic Modeling is called Latent Dirichlet Allocation (LDA). LDA is an unsupervised, learning-based model that's used to determine the most important topics in a collection of documents.
LDA works by treating every document as a mixture of topics, with each word in the document associated with one of these topics.
For example, if the first paragraph of a document consists of words such as eat, chicken, restaurant, and cook, then you can conclude that the topic can be generalized to Food. Similarly, if the second paragraph contains words such as ticket, train, kilometer, and vacation, then you can conclude that the topic is Travel.
LDA has lots of math behind it—concepts such as Expectation Maximization, Gibbs sampling, priors, and a probability distribution over a "bag of words". If you want to understand the mathematical underpinnings, a good start is the Amazon SageMaker documentation (https://docs.aws.amazon.com/sagemaker/latest/dg/lda-how-it-works.html). Let's look at LDA more pragmatically and understand it empirically through an example.
Say you have one document with six sentences, and you want to infer two common topics.
The sentences are as follows:
When you feed these sentences into an LDA algorithm, specifying the number of topics as two, it will discover the following:
Sentence-Topics
Sentence 1: Topic 0
Sentence 2: Topic 0
Sentence 3: Topic 1
Sentence 4: Topic 1
Sentence 5: Topic 0
Sentence 6: Topic 0
Topic terms
Topic 0: life 12%, people 8%, experience 8%, love 5%, and so forth
Topic 1: revolution 62%, war 23%, and the rest 15%
Of course, knowing that the sentences are from the novel Doctor Zhivago by the famous Russian author Boris Pasternak, the topics war and life/love seem reasonable.
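If you would like to experiment with LDA yourself, the following is a minimal sketch using scikit-learn (this is not what Comprehend uses internally, but it implements the same algorithm). The six sentences are invented placeholders with a similar vocabulary, not the actual passages used above:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Placeholder sentences -- not the actual excerpts from the example above
sentences = [
    "Life and love give people a shared experience.",
    "People find meaning in love and in everyday life.",
    "The revolution brought years of war to the country.",
    "War and revolution changed every family.",
    "Experience teaches people how to love life.",
    "A quiet life among the people you love.",
]

# Build a bag-of-words matrix over the sentences
vectorizer = CountVectorizer(stop_words='english')
X = vectorizer.fit_transform(sentences)

# Fit LDA, specifying the number of topics as two
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)

# Most probable topic for each sentence
for i, dist in enumerate(doc_topic):
    print(f"Sentence {i + 1}: Topic {dist.argmax()}")

# Top terms per topic, analogous to the topic-term weights above
terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]
    print(f"Topic {k}:", ", ".join(terms[j] for j in top))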
While this example is a simplistic depiction of a complex algorithm, it gives you an idea of how LDA behaves. As discussed in this chapter, in various business situations, an indication of what a document, an email, or a social media post is about is very valuable for downstream systems, and the ability to perform that classification automatically is priceless.
LDA is useful when you want to group a set of documents based on common topics without having to read the documents themselves: it infers general topics by analyzing the words in the documents. This is commonly applied in recommendation systems, document classification, and document summarization. For example, say you have 30,000 user emails and want to determine the most common topics in order to provide group-specific recommended content. Manually reading 30,000 emails, or even outsourcing that reading, would take an excessive investment of time and money, and the accuracy would be difficult to confirm. However, Amazon Comprehend can surface the most common topics in 30,000 emails in a few steps with impressive accuracy. First, convert the emails to text files, upload them to an S3 bucket, and then initiate a Topic Modeling job with Amazon Comprehend. The output is two CSV files with the corresponding topics and terms.
The most accurate results are obtained if you provide Comprehend with the largest possible corpus. More specifically:
Currently, Topic Modeling is limited to two document languages: English and Spanish.
A Topic Modeling job allows two format types for input data (refer to the following Figure 3.1). This allows users to process both collections of large documents (for example, newspaper articles or scientific journals), and short documents (for example, tweets or social media posts).
Input Format Options: one document per file (suited to collections of large documents), or one document per line (suited to short documents such as tweets or social media posts).
Output Format Options: regardless of the input format, the results are delivered as two CSV files, described in the following section.
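If you prefer to start a job from code rather than from the console, the same input format choice appears as a parameter of the StartTopicsDetectionJob API. The following boto3 sketch assumes an input bucket, an output bucket, and an IAM role that you have already created; the bucket names and role ARN below are placeholders:

import boto3

comprehend = boto3.client('comprehend', region_name='us-west-2')

# Placeholder bucket names and role ARN -- substitute your own
response = comprehend.start_topics_detection_job(
    NumberOfTopics=2,
    InputDataConfig={
        'S3Uri': 's3://aws-ml-input-for-topic-modeling-20200301/',
        # ONE_DOC_PER_FILE: each file is one document (large documents)
        # ONE_DOC_PER_LINE: each line is one document (tweets, posts)
        'InputFormat': 'ONE_DOC_PER_FILE'
    },
    OutputDataConfig={'S3Uri': 's3://aws-ml-output-for-topic-modeling-20200301/'},
    DataAccessRoleArn='arn:aws:iam::123456789012:role/ComprehendS3Access',
    JobName='topic-modeling-example'
)
print(response['JobId'])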
After Amazon Comprehend processes your document collection, the modeling outputs two CSV files: topic-terms.csv (see Figure 3.2) and doc-topics.csv.
The topic-terms.csv file provides a list of topics in the document collection with the terms, respective topics, and their weights. For example, if you gave Amazon Comprehend two hypothetical documents, learning to garden and investment strategies, it might return the following to describe the two topics in the collection:
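The exact rows below are illustrative only (the real terms and weights depend entirely on the corpus), but topic-terms.csv follows this shape:

topic,term,weight
000,garden,0.118
000,soil,0.093
000,plant,0.087
001,investment,0.104
001,stock,0.089
001,portfolio,0.073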
The doc-topics.csv file provides a list of the documents provided for the Topic Modeling job, and the respective topics and their proportions in each document. Given two hypothetical documents, learning_to_garden.txt and investment_strategies.txt, you can expect the following output:
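Again with purely illustrative proportions, doc-topics.csv follows this shape:

docname,topic,proportion
learning_to_garden.txt,000,0.97
learning_to_garden.txt,001,0.03
investment_strategies.txt,001,0.98
investment_strategies.txt,000,0.02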
In this exercise, we will use two documents (Romeo and Juliet and The War of the Worlds) to better understand LDA. We will use Amazon Comprehend to discover the main topics in the two documents. Before proceeding to the exercise, let's look at an overview of the data pipeline architecture: the text files are stored in S3, we direct Comprehend to the files in the input bucket, and Comprehend analyzes the documents and puts the results back into the output bucket in S3:
Complete the following steps to perform Topic Modeling on documents with a known topic structure:
Note
Bucket names in AWS have to be globally unique, so you might get an error saying, "Bucket name already exists." One easy way to get a unique name is to append the bucket name with today's date (plus the time, if required); say, YYYYMMDDHHMM. While writing this chapter, we created a bucket named aws-ml-input-for-topic-modeling-20200301.
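If you are scripting the setup, here is one quick sketch of that naming scheme in Python:

from datetime import datetime

# Append a YYYYMMDDHHMM timestamp to make the bucket name unique
suffix = datetime.now().strftime('%Y%m%d%H%M')
bucket_name = 'aws-ml-input-for-topic-modeling-' + suffix
print(bucket_name)  # for example, aws-ml-input-for-topic-modeling-202003011430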
Clicking Create in the following window uses all the default settings for properties and permissions, while clicking Next allows you to adjust these settings according to your needs.
By way of an example, we have downloaded the files to the Documents/aws-book/The-Applied-AI-and-Natural-Language-Processing-with-AWS directory. Navigate to Upload and select the following two text files from your local disk. As you may have guessed, the files for this exercise are located in the Exercise3.01 subdirectory:
Once the files have been selected, click on the Open button to upload the files:
The following figure shows the uploading of text files:
Now you have two buckets, one for input with two text files, and an empty output bucket. Let's now proceed to Amazon Comprehend.
Check to make sure that Input and Output S3 buckets is listed under Permissions to access:
Note
Bear in mind that clicking Create job starts the job as well. There is no separate "start a job" button. Also, if you want to redo the job, you will have to use the Copy button.
This will take you directly to the S3 bucket:
Note
Your topic-terms.csv and doc-topics.csv results should be the same as the following results. If your results are NOT the same, use the output files for the remainder of the chapter, which are located at Chapter03/Exercise3.01/topic-terms.csv https://packt.live/3iHlH5y and Chapter03/Exercise3.01/doc-topics.csv https://packt.live/2ZMTaTw.
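Note that Comprehend delivers the job results as a compressed archive (output.tar.gz) containing the two CSV files. If you download the archive itself rather than the extracted files, a quick way to unpack it locally is:

import tarfile

# Unpack the downloaded Comprehend output archive into the current directory
with tarfile.open('output.tar.gz') as archive:
    archive.extractall('.')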
The following is the output generated. As we had indicated that we want two topics, Comprehend has segregated the relevant words into two groups/topics, along with their weights. It doesn't know what the topics are, but it has inferred the affinity of each word to one of the two topics:
The doc-topics.csv file shows the affinity of the documents to the topics. In this case, it is very deterministic, but if we had more topics, the proportions would show the strength of each topic in each of the documents:
In this exercise, we used Amazon Comprehend to infer the topics embedded in a set of documents. While this is easy enough to do by hand with two documents, Amazon Comprehend becomes very effective when we have hundreds of documents spanning multiple topics and we want to perform process automation.
While it is easy to look at one or two outputs, when we want to scale and analyze hundreds of documents with different topics, we need to use Comprehend programmatically. That is what we will do in this exercise.
In this exercise, we will programmatically upload the CSV files (doc-topics.csv and topic-terms.csv) to S3, merge the CSV files on the topic column, and print the output to the console. The following are the steps for performing known structure analysis:
Note
For this step, you will be using Jupyter Notebook. You may either follow along with the exercise and type in the code or obtain it from the source code folder, local_csv_to_s3_for_analysis.ipynb, and paste it into the editor. The source code is available on GitHub in the following repository: https://packt.live/2BOqjWT. As explained in Chapter 1, An Introduction to AWS, you should have downloaded the repository to your local disk.
import boto3
import pandas as pd
# Setup a region
region = 'us-west-2'
# Create an S3 client
s3 = boto3.client('s3',region_name = region)
# Creates a variable with the bucket name
#'<insert a unique bucket name>'
bucket_name = 'known-tm-analysis-20200302'
# Create a location Constraint
location = {'LocationConstraint': region}
# Creates a new bucket
s3.create_bucket(Bucket=bucket_name,
                 CreateBucketConfiguration=location)
filenames_list = ['doc-topics.csv', 'topic-terms.csv']
Note
Ensure that the two CSV files listed in the preceding step are stored in the same location in which you're running the Jupyter Notebook code. Alternatively, specify the exact path as it exists on your local system.
for filename in filenames_list:
    s3.upload_file(filename, bucket_name, filename)
Note
Do not execute steps 7 and 8 yet. We will show the code for the entire for block in step 9.
    if filename == 'doc-topics.csv':
        obj = s3.get_object(Bucket=bucket_name, Key=filename)
for filename in filenames_list:
    # Uploads each CSV to the created bucket
    s3.upload_file(filename, bucket_name, filename)
    # Checks whether the filename is 'doc-topics.csv'
    if filename == 'doc-topics.csv':
        # Gets the 'doc-topics.csv' file as an object
        obj = s3.get_object(Bucket=bucket_name, Key=filename)
        # Reads the CSV and assigns it to doc_topics
        doc_topics = pd.read_csv(obj['Body'])
    else:
        # Gets the 'topic-terms.csv' file as an object
        obj = s3.get_object(Bucket=bucket_name, Key=filename)
        # Reads the CSV and assigns it to topic_terms
        topic_terms = pd.read_csv(obj['Body'])

# Merges the two DataFrames on the shared 'topic' column
merged_df = pd.merge(doc_topics, topic_terms, on='topic')
# Prints merged_df to the console
print(merged_df)
There will be two CSV files in the bucket – doc-topics.csv and topic-terms.csv:
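As a quick sanity check, you can list the bucket contents from the same notebook (a sketch reusing the s3 client and bucket_name from the preceding steps):

# List the bucket contents to confirm both CSV files were uploaded
response = s3.list_objects_v2(Bucket=bucket_name)
for item in response.get('Contents', []):
    print(item['Key'])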
In this exercise, we learned how to use Comprehend programmatically. We programmatically uploaded two CSV files to S3, merged them on a column, and printed the output to the console.
In this activity, we will perform Topic Modeling on a set of documents with unknown topics. Suppose your employer wants you to build a data pipeline to analyze negative movie reviews that are stored in individual text files, each named with a unique ID. You need to perform Topic Modeling to determine which files correspond to which topics. Negative reviews represent a loss to the company, so the company is prioritizing negative reviews over positive ones; its end goal is to incorporate the data into a feedback chatbot application. To ensure that this happens correctly, you need a file that contains the negative comments. The expected outcome of this activity is the Topic Modeling results from the negative movie review files.
Performing Topic Modeling:
Analysis of Unknown Topics:
This is a long activity, yet you were able to take 1,000 files, upload them to S3, perform Topic Modeling using Amazon Comprehend, and then merge the results into a table with more than 40,000 rows. In real-world situations, you will be handling thousands of documents, not just one or two. That is the reason we did this activity using Jupyter Notebook and Python.
However, this is only the first step in a multi-step automation process: the essential step of running inference on the unstructured documents. While Comprehend analyzed the documents and gave us a list of topics, it is still our job to figure out what to do with them.
Note
The solution for this activity can be found on page 291.
In this chapter, we learned how to analyze Topic Modeling results from Amazon Comprehend. You are now able to use S3 to store data and draw on that data to perform analysis. We also learned how to analyze documents whose topics we know before performing Topic Modeling, as well as documents whose topics are unknown; the latter requires additional analysis to determine the relevant topics.
We did not build the downstream systems that analyze the topic lists and then route the document appropriately. For example, you might have a mapping of the topics to a SharePoint folder for knowledge management or a workflow to route the files via email to appropriate persons depending on the topics detected. While the broader topic of Robotic Process Automation (RPA) is beyond the scope of this book, you have learned how to use Amazon Comprehend to implement the Topic and Theme detection steps for process automation.
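As a purely hypothetical sketch of such a routing step (the topic-to-team mapping and the addresses below are invented for illustration):

# Hypothetical mapping from a detected topic number to a team inbox
routing_table = {0: 'claims-team@example.com', 1: 'billing-team@example.com'}

def route_document(topic, default='triage@example.com'):
    # Return the inbox responsible for the document's dominant topic
    return routing_table.get(topic, default)

print(route_document(1))  # billing-team@example.com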
Another application of what you learned in this chapter is document clustering for knowledge management. In this case, we would restrict the number of topics to 10 and then segregate the documents based on their major topics. For example, if these documents were news articles, this process would divide the articles into 10 subjects, which are easier to handle in downstream systems such as a news recommendation engine.
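A sketch of that segregation step, assuming a doc-topics style DataFrame with docname, topic, and proportion columns, as produced earlier in this chapter (the rows below are invented):

import pandas as pd

# Hypothetical doc-topics data; in practice, read Comprehend's doc-topics.csv
doc_topics = pd.DataFrame({
    'docname': ['a.txt', 'a.txt', 'b.txt', 'b.txt'],
    'topic': [0, 1, 0, 1],
    'proportion': [0.8, 0.2, 0.1, 0.9],
})

# Keep, for each document, the row for its strongest topic
strongest = doc_topics.loc[doc_topics.groupby('docname')['proportion'].idxmax()]

# Group the document names by their major topic
clusters = strongest.groupby('topic')['docname'].apply(list)
print(clusters)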
As you can see, Topic Modeling can be applied in a variety of applications and systems. Now you have the skills required to perform Topic Modeling using Amazon Comprehend.
In the next chapter, we will dive into the concept of chatbots and their use of natural language processing.