Amazon Comprehend is a fully managed web service that provides access to a deep-learning–based Natural Language Processing (NLP) and topic-modeling engine that you can use in your projects to analyze the content of text documents and implement features such as topic-based classification, content-based search, and customer sentiment analysis.
The machine learning models used by Amazon Comprehend are pre-trained and continually improved by Amazon using inputs from a variety of real-world sources, including Amazon product reviews. Training a deep-learning model is a complex and lengthy task. Using Amazon Comprehend for NLP tasks lets you focus on solving a business problem without having to worry about building, deploying, and maintaining a complex machine learning model.
In this section you will learn some of the key concepts you will encounter while working with Amazon Comprehend.
Natural Language Processing (NLP) is a discipline within artificial intelligence that focuses on creating algorithms that allow computers to analyze, understand, and derive meaning from text. NLP algorithms can be used to extract meaningful and useful information from text such as:
Topic modeling uses NLP algorithms to examine the content of a text document and determine a set of topics (such as entertainment, sports, politics, etc.) that best describe the content. Early attempts at topic modeling were based purely on using preset dictionaries of keywords associated with each topic. Topic modeling is generally used to organize a collection of documents. A database table could be populated using the results of topic modeling and used by an application to retrieve documents.
Amazon Comprehend uses a Latent Dirichlet Allocation (LDA)–based model and is able to infer the context behind the occurrence of a keyword. LDA is a statistical topic modeling approach and assumes that a document is a mixture of a small set of topics and that each topic contains a small set of keywords that are frequently associated with it.
Keywords can be associated with different topics in different documents. For example, the use of the word “drill” in a document that is talking about dentistry will result in the word being associated with the topics of “dentistry” or “medicine.” The use of the same word in a document about offshore oil rigs will result in the word being associated with topics such as “energy,” “petroleum,” and so on.
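The “drill” example can be made concrete with a few lines of Python. This is only a toy illustration of the mixture assumption described above, not Amazon Comprehend's implementation: the two topics, their keywords, and every probability are invented for illustration. The sketch weighs how strongly a word belongs to each topic against how much of the document each topic accounts for.

```python
# Toy illustration of the LDA mixture assumption. All numbers are invented.

# P(word | topic): each topic is a distribution over a few keywords.
TOPIC_WORDS = {
    "dentistry": {"drill": 0.10, "tooth": 0.30, "cavity": 0.20},
    "energy": {"drill": 0.12, "rig": 0.25, "petroleum": 0.30},
}


def topic_posterior(word, doc_topic_mix):
    """Return P(topic | word, document) for each topic in the document mixture."""
    weights = {
        topic: TOPIC_WORDS[topic].get(word, 0.0) * share
        for topic, share in doc_topic_mix.items()
    }
    total = sum(weights.values())
    if total == 0:
        return {topic: 0.0 for topic in weights}
    return {topic: weight / total for topic, weight in weights.items()}


# P(topic | document): hypothetical mixtures for two documents.
dental_doc = {"dentistry": 0.9, "energy": 0.1}
oil_rig_doc = {"dentistry": 0.05, "energy": 0.95}

dental = topic_posterior("drill", dental_doc)
oil = topic_posterior("drill", oil_rig_doc)
print(max(dental, key=dental.get))  # dentistry
print(max(oil, key=oil.get))        # energy
```

The same keyword ends up credited to a different topic in each document because the document's topic mixture differs, even though the word's weight within each topic is unchanged.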
Amazon Comprehend's text analysis features support six languages at the time this book was written:
Amazon Comprehend's language detection feature supports over a hundred languages (but you can only use the text analysis features on six languages). You can get information on the list of languages supported by the language detection APIs at https://docs.aws.amazon.com/comprehend/latest/dg/how-languages.html.
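Language detection can also be called from code. The following is a minimal sketch, assuming the boto3 SDK for Python and configured AWS credentials; the helper names and the sample response are hypothetical, and the region is simply the one used elsewhere in this chapter.

```python
def dominant_language(response):
    """Return (language_code, score) for the highest-scoring language."""
    best = max(response["Languages"], key=lambda lang: lang["Score"])
    return best["LanguageCode"], best["Score"]


def detect_language(text, region="eu-west-1"):
    """Call DetectDominantLanguage; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so dominant_language works without boto3

    client = boto3.client("comprehend", region_name=region)
    return dominant_language(client.detect_dominant_language(Text=text))


# Hypothetical response, shaped like the DetectDominantLanguage output.
sample = {
    "Languages": [
        {"LanguageCode": "fr", "Score": 0.12},
        {"LanguageCode": "en", "Score": 0.87},
    ]
}
print(dominant_language(sample))  # ('en', 0.87)
```

The returned language code could then be checked against the languages supported by the text analysis features before calling them.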
Amazon Comprehend is available on a pay-per-use model. You will be charged based on the amount of text you process on a monthly basis. This service is included in the AWS free-tier account. You can get more details on the pricing model at https://aws.amazon.com/comprehend/pricing/.
Amazon Comprehend is not available in all regions. You can get information on the regions in which it is available from the following URL: https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/.
In this section you will use the Amazon Comprehend management console to perform entity, key phrase, dominant language, syntax, and sentiment analysis on a short text document.
Amazon Comprehend can be used in both interactive and asynchronous mode. In interactive mode, you paste a small amount of text on a web page, click a button to begin the analysis, and view the results of the analysis on the same page within a few seconds.
Using Amazon Comprehend in asynchronous mode involves creating analysis jobs that run asynchronously, reading documents from an S3 bucket, and writing the results of the analysis to a text file in another bucket. Asynchronous mode is not covered in this book. You can find more information on using Amazon Comprehend in asynchronous mode at https://docs.aws.amazon.com/comprehend/latest/dg/how-async.html.
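For orientation only, the following sketch shows roughly what starting an asynchronous entity detection job could look like with the boto3 SDK for Python. The bucket URIs, job name, and role ARN are hypothetical placeholders; the role must be one that allows Amazon Comprehend to read the input bucket and write to the output bucket.

```python
def entities_job_request(input_s3_uri, output_s3_uri, role_arn,
                         job_name="entities-job", language_code="en"):
    """Build the keyword arguments for start_entities_detection_job."""
    return {
        "JobName": job_name,
        "LanguageCode": language_code,
        "DataAccessRoleArn": role_arn,
        # Treat every file under the input prefix as one document.
        "InputDataConfig": {"S3Uri": input_s3_uri,
                            "InputFormat": "ONE_DOC_PER_FILE"},
        "OutputDataConfig": {"S3Uri": output_s3_uri},
    }


def start_job(request, region="eu-west-1"):
    """Submit the job and return its identifier; needs boto3 and credentials."""
    import boto3  # imported lazily so the request builder works without boto3

    client = boto3.client("comprehend", region_name=region)
    return client.start_entities_detection_job(**request)["JobId"]


# Hypothetical placeholder values, not resources used in this chapter.
request = entities_job_request(
    "s3://example-input-bucket/docs/",
    "s3://example-output-bucket/results/",
    "arn:aws:iam::123456789012:role/ExampleComprehendDataAccess",
)
print(request["InputDataConfig"])
```

The job identifier returned by start_job can be passed to the describe_entities_detection_job operation to poll the job status.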
Log in to the AWS management console using the dedicated sign-in link for your development IAM user account. Use the region selector to select a region where the Amazon Comprehend service is available. The screenshots in this section assume that the console is connected to the EU (Ireland) region. Click the Services menu and access the Amazon Comprehend service home page (Figure 13.1).
Click the Try Amazon Comprehend link on the page (Figure 13.2).
Replace the contents of the Input text field with the following paragraph and click the Analyze button (Figure 13.3):
Machine Learning is a discipline within Artificial Intelligence that deals with creating algorithms that learn from data. Machine learning traces its roots to a computer program created in 1959 by a computer scientist Arthur Samuel while working for IBM. Samuel's program could play a game of checkers and was based on assigning each position on the board a score that indicated the likelihood of leading towards winning the game. The positional scores were refined by having the program play against itself, and with each iteration the performance of the program improved. The program was in effect, learning from experience, and the field of machine learning was born.
Machine learning specifically deals with the problem of creating computer programs that can generalize and predict information reliably, quickly, and with accuracy resembling what a human would do with similar information. Machine learning algorithms require a lot of processing and storage space, and until recently were only possible to deploy in very large companies, or in academic institutions. Recent advances in storage, processor, GPU technology and the ability to rapidly create new virtual computing resources in the cloud have finally provided the processing power required to build and deploy machine learning systems at scale, and get results in real-time.
Amazon Comprehend will take a few seconds to analyze the text. The results of the analysis will be provided as insights on the same page, below the original text that was analyzed. Scroll down to locate the Insights section of the page. The Insights section will list the results of entity, key phrase, language, sentiment, and syntax analysis (Figure 13.4).
Behind the scenes, the user-friendly web-based management console makes use of AWS APIs for the analysis. The following APIs are used to provide the results that you see listed in the Insights section:
You can find more information on these APIs at the following URL: https://docs.aws.amazon.com/comprehend/latest/dg/API_Operations.html.
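The five kinds of insight shown in the console line up with five synchronous Amazon Comprehend operations. The sketch below, which assumes the boto3 SDK for Python, calls each of them on the same text. The function name analyze and the shape of the returned dictionary are illustrative choices, and this is not necessarily how the console itself issues the calls.

```python
def analyze(client, text, language_code="en"):
    """Run the five synchronous analyses on one piece of text."""
    return {
        "language": client.detect_dominant_language(Text=text)["Languages"],
        "entities": client.detect_entities(
            Text=text, LanguageCode=language_code)["Entities"],
        "key_phrases": client.detect_key_phrases(
            Text=text, LanguageCode=language_code)["KeyPhrases"],
        "sentiment": client.detect_sentiment(
            Text=text, LanguageCode=language_code)["Sentiment"],
        "syntax": client.detect_syntax(
            Text=text, LanguageCode=language_code)["SyntaxTokens"],
    }


# Usage, assuming boto3 is installed and credentials are configured:
#   import boto3
#   client = boto3.client("comprehend", region_name="eu-west-1")
#   insights = analyze(client, "Arthur Samuel worked for IBM.")
```

Passing the client in as a parameter keeps the function easy to exercise with a stand-in object when no AWS account is at hand.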
You can use the AWS CLI to access the underlying Amazon Comprehend APIs and perform text analysis over the command line. This section assumes you have installed and configured the AWS CLI to use your development IAM credentials.
To perform entity detection using the AWS CLI, launch a Terminal window on your Mac or a Command Prompt window on Windows, and type the following command:
$ aws comprehend detect-entities \
    --region "eu-west-1" \
    --language-code "en" \
    --text "Machine learning traces its roots to a computer program created in 1959 by a computer scientist Arthur Samuel while working for IBM"
This command uses the DetectEntities API (identified on the command line using the detect-entities identifier), and specifies the AWS region to be used, the language of the text, and the text to be analyzed.
Press the Enter key on your keyboard to execute the command. After a few seconds, the output on your computer should resemble the following:
{
    "Entities": [
        {
            "Score": 0.9976276755332947,
            "Type": "DATE",
            "Text": "1959",
            "BeginOffset": 67,
            "EndOffset": 71
        },
        {
            "Score": 0.9960350394248962,
            "Type": "PERSON",
            "Text": "Arthur Samuel",
            "BeginOffset": 96,
            "EndOffset": 109
        },
        {
            "Score": 0.9668458700180054,
            "Type": "ORGANIZATION",
            "Text": "IBM",
            "BeginOffset": 128,
            "EndOffset": 131
        }
    ]
}
Amazon Comprehend has found three entities in the text, and the result of the DetectEntities API is a JSON document containing details on each entity: the type of entity, the text of the entity, the character offsets within the input text where the entity begins and ends, and a confidence score. A confidence score is a number between 0.0 and 1.0; higher numbers imply that Amazon Comprehend has more confidence in its analysis.
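The confidence score is what an application would typically filter on. As a small sketch, the following Python function keeps only the entities at or above a chosen threshold; the function name and the 0.99 threshold are arbitrary choices for illustration, and the response is a trimmed copy of the output shown above.

```python
def confident_entities(response, threshold=0.99):
    """Return (type, text) pairs whose score is at or above the threshold."""
    return [
        (entity["Type"], entity["Text"])
        for entity in response["Entities"]
        if entity["Score"] >= threshold
    ]


# Trimmed copy of the DetectEntities output shown above.
response = {
    "Entities": [
        {"Score": 0.9976276755332947, "Type": "DATE", "Text": "1959"},
        {"Score": 0.9960350394248962, "Type": "PERSON", "Text": "Arthur Samuel"},
        {"Score": 0.9668458700180054, "Type": "ORGANIZATION", "Text": "IBM"},
    ]
}

print(confident_entities(response))
# [('DATE', '1959'), ('PERSON', 'Arthur Samuel')]
```

With this threshold the IBM entity is dropped because its score of roughly 0.967 falls below 0.99; an empty Entities array simply yields an empty list.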
If no entities are found, the DetectEntities API returns a JSON object with an empty Entities array. This can be observed if you type the following command in your Terminal window and press the Enter key:
$ aws comprehend detect-entities \
    --region "eu-west-1" \
    --language-code "en" \
    --text "Machine learning specifically deals with the problem of creating computer programs that can generalize and predict information reliably, quickly, and with accuracy resembling what a human would do with similar information."
The result of executing the command should contain an empty Entities array, implying that Amazon Comprehend could not find any entities:
{
    "Entities": []
}
To detect key phrases in a snippet of text, you will make use of the DetectKeyPhrases API. Type the following command in your Terminal window (or Command Prompt window):
$ aws comprehend detect-key-phrases \
    --region "eu-west-1" \
    --language-code "en" \
    --text "Machine learning traces its roots to a computer program created in 1959 by a computer scientist Arthur Samuel while working for IBM"
This command uses the DetectKeyPhrases API (identified on the command line using the detect-key-phrases identifier), and specifies the AWS region to be used, the language of the text, and the text to be analyzed.
Press the Enter key on your keyboard to execute the command. After a few seconds, the output on your computer should resemble the following:
{
    "KeyPhrases": [
        {
            "Score": 0.5423930883407593,
            "Text": "Machine",
            "BeginOffset": 0,
            "EndOffset": 7
        },
        {
            "Score": 0.6729782819747925,
            "Text": "traces",
            "BeginOffset": 17,
            "EndOffset": 23
        },
        {
            "Score": 0.9950423836708069,
            "Text": "its roots",
            "BeginOffset": 24,
            "EndOffset": 33
        },
        {
            "Score": 0.9978877902030945,
            "Text": "a computer program",
            "BeginOffset": 37,
            "EndOffset": 55
        },
        {
            "Score": 0.9986926913261414,
            "Text": "1959",
            "BeginOffset": 67,
            "EndOffset": 71
        },
        {
            "Score": 0.9789828062057495,
            "Text": "a computer scientist Arthur Samuel",
            "BeginOffset": 75,
            "EndOffset": 109
        }
    ]
}
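The BeginOffset and EndOffset values index into the original input text, with the end offset exclusive. The following Python sketch checks this for the output above by slicing the input string; the helper name mismatched_offsets is hypothetical, and the check relies on the text being plain ASCII, as it is here.

```python
TEXT = ("Machine learning traces its roots to a computer program created in "
        "1959 by a computer scientist Arthur Samuel while working for IBM")


def mismatched_offsets(text, items):
    """Return the items whose Text differs from text[BeginOffset:EndOffset]."""
    return [
        item for item in items
        if text[item["BeginOffset"]:item["EndOffset"]] != item["Text"]
    ]


# Key phrases copied from the DetectKeyPhrases output shown above.
key_phrases = [
    {"Text": "Machine", "BeginOffset": 0, "EndOffset": 7},
    {"Text": "traces", "BeginOffset": 17, "EndOffset": 23},
    {"Text": "its roots", "BeginOffset": 24, "EndOffset": 33},
    {"Text": "a computer program", "BeginOffset": 37, "EndOffset": 55},
    {"Text": "1959", "BeginOffset": 67, "EndOffset": 71},
    {"Text": "a computer scientist Arthur Samuel",
     "BeginOffset": 75, "EndOffset": 109},
]

print(mismatched_offsets(TEXT, key_phrases))  # []
```

Offsets like these are useful for highlighting the detected phrases in the original document rather than searching for the phrase text again.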
To get an indication of the overall sentiment of a snippet of text, you will make use of the DetectSentiment API. Type the following command in your Terminal window (or Command Prompt window):
$ aws comprehend detect-sentiment \
    --region "eu-west-1" \
    --language-code "en" \
    --text "Machine learning traces its roots to a computer program created in 1959 by a computer scientist Arthur Samuel while working for IBM"
This command uses the DetectSentiment API (identified on the command line using the detect-sentiment identifier), and specifies the AWS region to be used, the language of the text, and the text to be analyzed.
Press the Enter key on your keyboard to execute the command. After a few seconds, the output on your computer should resemble the following:
{
    "Sentiment": "NEUTRAL",
    "SentimentScore": {
        "Positive": 0.0015290803276002407,
        "Negative": 0.0024455683305859566,
        "Neutral": 0.9954049587249756,
        "Mixed": 0.0006204072269611061
    }
}
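The Sentiment field reports the label with the highest value in SentimentScore. The short Python sketch below reproduces that selection from the scores in the output above; the function name top_sentiment is hypothetical.

```python
def top_sentiment(response):
    """Return (label, score) for the highest-scoring sentiment."""
    scores = response["SentimentScore"]
    label = max(scores, key=scores.get)
    # The score keys are capitalized ("Neutral"); Sentiment is upper case.
    return label.upper(), scores[label]


# Copy of the DetectSentiment output shown above.
response = {
    "Sentiment": "NEUTRAL",
    "SentimentScore": {
        "Positive": 0.0015290803276002407,
        "Negative": 0.0024455683305859566,
        "Neutral": 0.9954049587249756,
        "Mixed": 0.0006204072269611061,
    },
}

print(top_sentiment(response))  # ('NEUTRAL', 0.9954049587249756)
```

Working with the individual scores rather than the single label lets an application apply its own rules, for example treating a text as negative only when the Negative score exceeds a chosen cutoff.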
In the previous sections of this chapter you learned to use Amazon Comprehend through the management console and the AWS CLI. While using these APIs interactively certainly provides results, you cannot integrate Amazon Comprehend with your own projects in this way.
To integrate Amazon Comprehend in a real-world project, you will most likely pick one of two approaches:
In this section you will create an AWS Lambda function that will be triggered when a document is uploaded to an S3 bucket. Once triggered, the AWS Lambda function will read the uploaded document from S3 and use Amazon Comprehend's entity detection APIs to create a file in a different S3 bucket that contains a list of entities discovered in the document.
In real-world scenarios, there may be several other events that you could use to trigger the AWS Lambda function, such as a message being posted to an SNS topic or an HTTP request being received by an API Gateway. Triggering an AWS Lambda function in the scenarios just described is not covered in this book.
You can download the code files for this chapter from www.wiley.com/go/machinelearningawscloud or from GitHub using the following URL:
To get started, log in to the AWS management console using the dedicated sign-in link for your development IAM user account and use the region selector to select a region where the Amazon Comprehend service is available. The screenshots in this section assume that the console is connected to the EU (Ireland) region. Click the Services menu and access the AWS Lambda service home page.
If you are using Lambda for the first time, you will be presented with the AWS Lambda splash screen (Figure 13.5). Expand the menu on the left side of the page and click the Dashboard menu item to access the AWS Lambda dashboard.
If you have used Lambda in the past, you will arrive at the AWS Lambda dashboard (Figure 13.6). You can click the Create Function button to start the process of creating a new AWS Lambda function.
After clicking the Create Function button, you will be asked to select a template for the function (Figure 13.7). You can create a Lambda function from scratch, use a predefined template (known as a blueprint) as a starting point, or create a ready-to-use function by picking code from the AWS Serverless Application Repository. Select the Author From Scratch option.
Name the function DetectEntities, use the Runtime drop-down to select the Python 3.6 runtime, and select the Create A New Role With Basic Lambda Permissions option from the Execution Role drop-down (Figure 13.8).
AWS will create a new IAM role for your function with a minimal set of permissions that will allow your function to write logs to AWS CloudWatch. The name of this IAM role is displayed below the Execution Role drop-down in Figure 13.8 and will be similar to DetectEntities-role-xxxxx. Make a note of this name as you will need to modify the role to allow access to Amazon S3 and Amazon Comprehend. Click on the Create Function button at the bottom of the page to create the AWS Lambda function and the IAM role.
After the AWS Lambda function is created, use the Services menu to switch to the IAM management console and navigate to the IAM role that was created for you along with the Lambda function. Locate the permissions policy document associated with the role and click on the Edit Policy button (Figure 13.9).
You will be taken to the policy editor screen; click on the JSON tab to view the policy document as a JSON file (Figure 13.10).
Add the following objects to the Statement array:
{
    "Action": [
        "comprehend:DetectEntities"
    ],
    "Effect": "Allow",
    "Resource": "*"
},
{
    "Action": [
        "s3:GetObject",
        "s3:PutObject"
    ],
    "Effect": "Allow",
    "Resource": "arn:aws:s3:::*"
}
Your final policy document should resemble this snippet:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:eu-west-1:5083XXXX:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:eu-west-1:508XXXX13:log-group:/aws/lambda/DetectEntities:*"
            ]
        },
        {
            "Action": [
                "comprehend:DetectEntities"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::*"
        }
    ]
}
This policy document allows AWS Lambda to write logs to CloudWatch, call the Amazon Comprehend DetectEntities API, and read and write objects in any S3 bucket in your account. You can allow access to additional Amazon Comprehend APIs by adding the relevant actions after "comprehend:DetectEntities". You can get a list of available Amazon Comprehend actions that can be used in policy documents at https://docs.aws.amazon.com/comprehend/latest/dg/comprehend-api-permissions-ref.html.
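The wildcard resource arn:aws:s3:::* grants access to every bucket in the account. One possible tightening, offered as a sketch rather than as the policy used in this chapter, is to allow reads only from the source bucket and writes only to the result bucket. The Python helper below (a hypothetical name) generates the two statements using the bucket names from this example.

```python
import json


def scoped_s3_statements(source_bucket, result_bucket):
    """Build IAM statements limited to reading one bucket and writing another."""
    return [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            # Object-level actions apply to object ARNs: bucket name + "/*".
            "Resource": f"arn:aws:s3:::{source_bucket}/*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": f"arn:aws:s3:::{result_bucket}/*",
        },
    ]


statements = scoped_s3_statements(
    "awsml-comprehend-entitydetection-source",
    "awsml-comprehend-entitydetection-result",
)
print(json.dumps(statements, indent=4))
```

The printed objects could be pasted into the Statement array in place of the single wildcard S3 statement.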
Click on the Review Policy button at the bottom of the page to go to the Review Policy screen (Figure 13.11).
Click on the Save Changes button to finish updating the IAM policy. After the policy changes have been saved, use the Services menu to switch back to the AWS Lambda management console and navigate to the DetectEntities Lambda function (Figure 13.12). The function designer section of the page should now list Amazon CloudWatch Logs, Amazon S3, and Amazon Comprehend as the resources that can be accessed by the function.
Locate the function designer section of the page and add the Amazon S3 trigger to the function (Figure 13.13).
Scroll down to the Configure Triggers section and choose an S3 bucket that will serve as the event source. In this example, the source bucket is called awsml-comprehend-entitydetection-source. Ensure the Event Type is set to Object Created (All) and click the Add button to finish configuring the S3 event trigger (Figure 13.14).
Click the Save button at the top of the page to save your changes. By creating the trigger, you have set up the Lambda function to be executed every time a new file is uploaded to the source S3 bucket.
To update the Lambda function code, scroll down to the function designer and click the function name to reveal the code editor (Figure 13.15).
Replace the boilerplate code in the code editor with the contents of Listing 13.1.
Click the Save button below the function editor to finish creating your Lambda function. To test this function, use the Services drop-down menu to switch to Amazon S3 and navigate to the bucket that you have associated with the AWS Lambda function trigger. Upload a UTF-8 text file into the bucket. In a few seconds, you should see a new file created in the output bucket with the results of the entity analysis.
The output bucket is hardcoded in Listing 13.1. You should change it to the appropriate value if you are using a different bucket name. The bucket names and files used in this example are:
Source bucket: awsml-comprehend-entitydetection-source
Result bucket: awsml-comprehend-entitydetection-result
Input file: comprehend_entitydetection_input1.txt
Output file: entityresult-comprehend_entitydetection_input1.txt
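A handler matching the description in this section could look something like the following. This is a minimal sketch and not the contents of Listing 13.1: the helper name process_object, the JSON output format, and the fixed "en" language code are assumptions, while the result bucket name and the entityresult- prefix follow the names used in this example.

```python
import json
import urllib.parse

# Result bucket from this example; change it to match your own bucket.
OUTPUT_BUCKET = "awsml-comprehend-entitydetection-result"


def process_object(s3, comprehend, bucket, key, output_bucket=OUTPUT_BUCKET):
    """Read a UTF-8 text object, detect entities, and store them as JSON."""
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
    entities = comprehend.detect_entities(
        Text=body, LanguageCode="en")["Entities"]
    output_key = "entityresult-" + key
    s3.put_object(
        Bucket=output_bucket,
        Key=output_key,
        Body=json.dumps(entities, indent=2).encode("utf-8"),
    )
    return output_key


def lambda_handler(event, context):
    """Entry point invoked by AWS Lambda for each S3 event notification."""
    import boto3  # provided by the AWS Lambda Python runtime

    s3 = boto3.client("s3")
    comprehend = boto3.client("comprehend")
    written = []
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        written.append(process_object(s3, comprehend, bucket, key))
    return {"written": written}
```

The synchronous DetectEntities call limits the size of the input text, so large files would need to be split into smaller pieces or sent through an asynchronous job instead.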