Chapter 13
Amazon Comprehend

WHAT'S IN THIS CHAPTER

  • Introduction to Natural Language Processing (NLP) concepts
  • Introduction to the Amazon Comprehend NLP service
  • Using the Amazon Comprehend management console to analyze text
  • Using the AWS CLI to analyze text
  • Calling Amazon Comprehend APIs from AWS Lambda

Amazon Comprehend is a fully managed web service that provides access to a deep-learning–based Natural Language Processing (NLP) and topic-modeling engine that you can use in your projects to analyze the content of text documents and implement features such as topic-based classification, content-based search, and customer sentiment analysis.

The machine learning models used by Amazon Comprehend are pre-trained and continually improved by Amazon using inputs from a variety of real-world sources, including Amazon product reviews. Training a deep-learning model is a complex and lengthy task. Using Amazon Comprehend for NLP tasks lets you focus on solving a business problem without having to worry about building, deploying, and maintaining a complex machine learning model.

Key Concepts

In this section you will learn some of the key concepts you will encounter while working with Amazon Comprehend.

Natural Language Processing

Natural Language Processing (NLP) is a discipline within artificial intelligence that focuses on creating algorithms that allow computers to analyze, understand, and derive meaning from text. NLP algorithms can be used to extract meaningful and useful information from text such as:

  • Entities: Amazon Comprehend can analyze text and return a list of named entities such as people and places, along with a confidence score. Each entity has an associated type. Amazon Comprehend supports the following entity types:
    1. COMMERCIAL_ITEM
    2. DATE
    3. EVENT
    4. LOCATION
    5. ORGANIZATION
    6. PERSON
    7. QUANTITY
    8. TITLE
    9. OTHER
  • Key phrases: Amazon Comprehend can analyze text and return a list of key phrases (or talking points) as well as confidence scores. A key phrase is defined as a noun phrase that talks about a particular thing—such as “my new camera.” This could be useful if you were using Amazon Comprehend to analyze blog posts and wanted information on what people were talking about.
  • Sentiment: Amazon Comprehend can analyze text and return an indicator of the overall sentiment (Positive, Negative, Neutral, or Mixed) along with confidence scores for each sentiment. This could be useful if you were using Amazon Comprehend to analyze customer comments and product reviews.
  • Syntax: Amazon Comprehend can analyze text to identify word boundaries and label each word with its part of speech, such as noun, pronoun, adjective, or verb. Amazon Comprehend can identify the following syntax elements:
    • Nouns
    • Verbs
    • Numerals
    • Particles
    • Pronouns
    • Proper nouns
    • Punctuation
    • Symbols
    • Subordinating conjunctions
    • Adjectives
    • Adpositions
    • Adverbs
    • Auxiliaries
    • Coordinating conjunctions
    • Determiners
    • Interjections
  • Dominant language: Amazon Comprehend can be used to identify the dominant language in a document, as shown in the short code sketch after this list. The identified language is represented using codes described in RFC 5646. Amazon Comprehend also provides a confidence score that indicates the level of confidence in the analysis. The confidence score does not indicate what percentage of the document is written in the identified language.
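As an illustration of how these analyses can be called from code rather than from the console or CLI, the following minimal sketch uses the boto3 Python SDK to detect the dominant language of a string. It assumes that boto3 is installed and that your AWS credentials and a region where Amazon Comprehend is available are configured; the variable names are illustrative.

import boto3

# Create an Amazon Comprehend client; the region must be one in which the
# service is available (eu-west-1 is used throughout this chapter).
comprehend = boto3.client('comprehend', region_name='eu-west-1')

text = "Machine Learning is a discipline within Artificial Intelligence."

# DetectDominantLanguage does not require a language code; it returns a
# list of candidate languages, each with a confidence score.
response = comprehend.detect_dominant_language(Text=text)

for language in response['Languages']:
    # LanguageCode is an RFC 5646 code such as "en"; Score is between 0.0 and 1.0.
    print(language['LanguageCode'], language['Score'])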

Topic Modeling

Topic modeling uses NLP algorithms to examine the content of a text document and determine a set of topics (such as entertainment, sports, or politics) that best describe the content. Early attempts at topic modeling relied purely on preset dictionaries of keywords associated with each topic. Topic modeling is generally used to organize a collection of documents; for example, an application could store the results of topic modeling in a database table and use the table to retrieve documents by topic.

Amazon Comprehend uses a Latent Dirichlet Allocation (LDA)–based model and is able to infer the context behind the occurrence of a keyword. LDA is a statistical topic modeling approach and assumes that a document is a mixture of a small set of topics and that each topic contains a small set of keywords that are frequently associated with it.

Keywords can be associated with different topics in different documents. For example, the use of the word “drill” in a document that is talking about dentistry will result in the word being associated with the topics of “dentistry” or “medicine.” The use of the same word in a document about offshore oil rigs will result in the word being associated with topics such as “energy,” “petroleum,” and so on.
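Topic modeling in Amazon Comprehend is exposed as an asynchronous job that reads documents from an S3 bucket and writes its results to another bucket. Asynchronous jobs are not covered in detail in this book, but as an illustration, the following minimal boto3 sketch shows roughly what starting a topic detection job looks like. The S3 locations, IAM role ARN, and job name are placeholders that you would replace with your own values.

import boto3

comprehend = boto3.client('comprehend', region_name='eu-west-1')

# Start an asynchronous topic detection job. The IAM role must allow
# Amazon Comprehend to read the input bucket and write to the output bucket.
response = comprehend.start_topics_detection_job(
    JobName='example-topics-job',
    NumberOfTopics=10,
    InputDataConfig={
        'S3Uri': 's3://my-input-bucket/documents/',
        'InputFormat': 'ONE_DOC_PER_FILE'
    },
    OutputDataConfig={
        'S3Uri': 's3://my-output-bucket/topics/'
    },
    DataAccessRoleArn='arn:aws:iam::111111111111:role/ComprehendDataAccessRole'
)

# The job runs asynchronously; you can poll its status using the returned JobId.
print(response['JobId'], response['JobStatus'])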

Language Support

Amazon Comprehend's text analysis features supported the following six languages at the time this book was written:

  • English
  • French
  • German
  • Italian
  • Portuguese
  • Spanish

Amazon Comprehend's language detection feature supports over a hundred languages (although the text analysis features can only be used with the six languages listed earlier). You can get information on the list of languages supported by the language detection APIs at https://docs.aws.amazon.com/comprehend/latest/dg/how-languages.html.

Pricing and Availability

Amazon Comprehend is available on a pay-per-use model: you are charged based on the amount of text you process each month. The service is included in the AWS Free Tier. You can get more details on the pricing model at https://aws.amazon.com/comprehend/pricing/.

Amazon Comprehend is not available in all regions. You can get information on the regions in which it is available from the following URL: https://aws.amazon.com/about-aws/global-infrastructure/regional-product-services/.

Text Analysis Using the Amazon Comprehend Management Console

In this section you will use the Amazon Comprehend management console to perform entity, key phrase, dominant language, syntax, and sentiment analysis on a short text document.

Amazon Comprehend can be used in both interactive and asynchronous modes. In interactive mode, you paste a small amount of text into a field on a web page, click a button to begin the analysis, and view the results on the same page within a few seconds.

Using Amazon Comprehend in asynchronous mode involves creating analysis jobs that read documents from an S3 bucket and write the results of the analysis to a text file in another bucket. Asynchronous mode is not covered in this book. You can find more information on using Amazon Comprehend in asynchronous mode at https://docs.aws.amazon.com/comprehend/latest/dg/how-async.html.

Log in to the AWS management console using the dedicated sign-in link for your development IAM user account. Use the region selector to select a region where the Amazon Comprehend service is available. The screenshots in this section assume that the console is connected to the EU (Ireland) region. Click the Services menu and access the Amazon Comprehend service home page (Figure 13.1).

Screenshot of accessing the Amazon Comprehend service home page.

FIGURE 13.1 Accessing the Amazon Comprehend service home page

Click the Try Amazon Comprehend link on the page (Figure 13.2).

Screenshot of testing the capabilities of Amazon Comprehend.

FIGURE 13.2 Testing the capabilities of Amazon Comprehend

Replace the contents of the Input text field with the following text and click the Analyze button (Figure 13.3):

Screenshot of Analyzing text with Amazon Comprehend.

FIGURE 13.3 Analyzing text with Amazon Comprehend

Machine Learning is a discipline within Artificial Intelligence that deals with creating algorithms that learn from data. Machine learning traces its roots to a computer program created in 1959 by a computer scientist Arthur Samuel while working for IBM. Samuel's program could play a game of checkers and was based on assigning each position on the board a score that indicated the likelihood of leading towards winning the game. The positional scores were refined by having the program play against itself, and with each iteration the performance of the program improved. The program was in effect, learning from experience, and the field of machine learning was born.

Machine learning specifically deals with the problem of creating computer programs that can generalize and predict information reliably, quickly, and with accuracy resembling what a human would do with similar information. Machine learning algorithms require a lot of processing and storage space, and until recently were only possible to deploy in very large companies, or in academic institutions. Recent advances in storage, processor, GPU technology and the ability to rapidly create new virtual computing resources in the cloud have finally provided the processing power required to build and deploy machine learning systems at scale, and get results in real-time.

Amazon Comprehend will take a few seconds to analyze the text. The results of the analysis will be provided as insights on the same page, below the original text that was analyzed. Scroll down to locate the Insights section of the page. The Insights section will list the results of entity, key phrase, language, sentiment, and syntax analysis (Figure 13.4).

Screenshot of Amazon Comprehend presents analysis results as insights.

FIGURE 13.4 Amazon Comprehend presents analysis results as insights.

Behind the scenes, the user-friendly web-based management console makes use of the Amazon Comprehend APIs for the analysis. The following APIs are used to provide the results that you see listed in the Insights section:

  • Entity detection: DetectEntities
  • Key phrase detection: DetectKeyPhrases
  • Language detection: DetectDominantLanguage
  • Sentiment analysis: DetectSentiment
  • Syntax analysis: DetectSyntax

You can find more information on these APIs at the following URL: https://docs.aws.amazon.com/comprehend/latest/dg/API_Operations.html.
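If you are calling these APIs from your own code instead of the console, the boto3 Python SDK exposes them with matching method names. As a minimal, illustrative sketch (assuming boto3 is installed and your credentials and region are configured), the following calls DetectSyntax, one of the analyses in this list that is not demonstrated with the CLI later in the chapter:

import boto3

comprehend = boto3.client('comprehend', region_name='eu-west-1')

text = "Samuel's program could play a game of checkers."

# DetectSyntax returns one token per word, each labeled with a
# part-of-speech tag and a confidence score.
response = comprehend.detect_syntax(Text=text, LanguageCode='en')

for token in response['SyntaxTokens']:
    print(token['Text'], token['PartOfSpeech']['Tag'], token['PartOfSpeech']['Score'])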

Interactive Text Analysis with the AWS CLI

You can use the AWS CLI to access the underlying Amazon Comprehend APIs and perform text analysis over the command line. This section assumes you have installed and configured the AWS CLI to use your development IAM credentials.

Entity Detection with the AWS CLI

To perform entity detection using the AWS CLI, launch a Terminal window on your Mac or a Command Prompt window on Windows and type the following command. The command is shown here split across multiple lines with backslash continuation characters; on Windows, enter it as a single line without the backslashes:

$ aws comprehend detect-entities \
      --region "eu-west-1" \
      --language-code "en" \
      --text "Machine learning traces its roots to a computer program created in 1959 by a computer scientist Arthur Samuel while working for IBM"

This command uses the DetectEntities API (identified on the command line by the detect-entities operation name) and specifies the AWS region, the language of the text, and the text to be analyzed.

Press the Enter key on your keyboard to execute the command. After a few seconds, the output on your computer should resemble the following:

{
    "Entities": [
        {
            "Score": 0.9976276755332947,
            "Type": "DATE",
            "Text": "1959",
            "BeginOffset": 67,
            "EndOffset": 71
        },
        {
            "Score": 0.9960350394248962,
            "Type": "PERSON",
            "Text": "Arthur Samuel",
            "BeginOffset": 96,
            "EndOffset": 109
        },
        {
            "Score": 0.9668458700180054,
            "Type": "ORGANIZATION",
            "Text": "IBM",
            "BeginOffset": 128,
            "EndOffset": 131
        }
    ]
}

Amazon Comprehend has found three entities in the text, and the result of the DetectEntities API is a JSON document containing details on those entities. The details include the type of entity, the value of the entity, the character position within the text where the entity was located, and the confidence score. A confidence score is a number between 0.0 and 1.0. Higher numbers imply that Amazon Comprehend has more confidence in its analysis.

If no entities are found, the DetectEntities API returns a JSON object with an empty Entities array. You can observe this by typing the following command in your Terminal window and pressing the Enter key:

$ aws comprehend detect-entities \
       --region "eu-west-1" \
       --language-code "en" \
       --text "Machine learning specifically deals with the problem of creating computer programs that can generalize and predict information reliably, quickly, and with accuracy resembling what a human would do with similar information."

The result of executing the command should contain an empty Entities array, indicating that Amazon Comprehend could not find any entities:

{
    "Entities": []
}
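When you call DetectEntities from code, it is a good idea to handle the empty-result case and to use the confidence scores rather than trusting every match. The following short boto3 sketch filters the detected entities by score; the threshold of 0.9 is an arbitrary example, not a value recommended by the service.

import boto3

comprehend = boto3.client('comprehend', region_name='eu-west-1')

text = ("Machine learning traces its roots to a computer program created in "
        "1959 by a computer scientist Arthur Samuel while working for IBM")

response = comprehend.detect_entities(Text=text, LanguageCode='en')

# Keep only entities that Amazon Comprehend is reasonably confident about.
entities = [e for e in response['Entities'] if e['Score'] >= 0.9]

if not entities:
    print('No entities detected.')
else:
    for entity in entities:
        print(entity['Type'], entity['Text'], entity['Score'])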

Key Phrase Detection with the AWS CLI

To detect key phrases in a snippet of text, you will make use of the DetectKeyPhrases API. Type the following command in your Terminal window (or Command Prompt window):

$ aws comprehend detect-key-phrases \
    --region "eu-west-1" \
    --language-code "en" \
    --text "Machine learning traces its roots to a computer program created in 1959 by a computer scientist Arthur Samuel while working for IBM"

This command uses the DetectKeyPhrases API (identified on the command line by the detect-key-phrases operation name) and specifies the AWS region, the language of the text, and the text to be analyzed.

Press the Enter key on your keyboard to execute the command. After a few seconds, the output on your computer should resemble the following:

{
    "KeyPhrases": [
        {
            "Score": 0.5423930883407593,
            "Text": "Machine",
            "BeginOffset": 0,
            "EndOffset": 7
        },
        {
            "Score": 0.6729782819747925,
            "Text": "traces",
            "BeginOffset": 17,
            "EndOffset": 23
        },
        {
            "Score": 0.9950423836708069,
            "Text": "its roots",
            "BeginOffset": 24,
            "EndOffset": 33
        },
        {
            "Score": 0.9978877902030945,
            "Text": "a computer program",
            "BeginOffset": 37,
            "EndOffset": 55
        },
        {
            "Score": 0.9986926913261414,
            "Text": "1959",
            "BeginOffset": 67,
            "EndOffset": 71
        },
        {
            "Score": 0.9789828062057495,
            "Text": "a computer scientist Arthur Samuel",
            "BeginOffset": 75,
            "EndOffset": 109
        }
    ]
}
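The CLI command shown above analyzes a single document per call. If you need to analyze several short documents at once from code, Amazon Comprehend also provides batch variants such as BatchDetectKeyPhrases, which accepts a list of documents (up to 25 per request at the time of writing). The following minimal boto3 sketch uses illustrative input strings:

import boto3

comprehend = boto3.client('comprehend', region_name='eu-west-1')

documents = [
    "Machine learning traces its roots to a computer program created in 1959.",
    "Amazon Comprehend is a fully managed natural language processing service."
]

# BatchDetectKeyPhrases analyzes several documents in a single request.
response = comprehend.batch_detect_key_phrases(TextList=documents, LanguageCode='en')

# Each result carries the index of the document it belongs to.
for result in response['ResultList']:
    for phrase in result['KeyPhrases']:
        print(result['Index'], phrase['Text'], phrase['Score'])

# Documents that could not be processed are reported separately.
for error in response['ErrorList']:
    print('Error:', error['Index'], error['ErrorMessage'])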

Sentiment Analysis with the AWS CLI

To get an indication of the overall sentiment of a snippet of text, you will make use of the DetectSentiment API. Type the following command in your Terminal window (or Command Prompt window):

$ aws comprehend detect-sentiment \
    --region "eu-west-1" \
    --language-code "en" \
    --text "Machine learning traces its roots to a computer program created in 1959 by a computer scientist Arthur Samuel while working for IBM"

This command uses the DetectSentiment API (identified on the command line by the detect-sentiment operation name) and specifies the AWS region, the language of the text, and the text to be analyzed.

Press the Enter key on your keyboard to execute the command. After a few seconds, the output on your computer should resemble the following:

{
    "Sentiment": "NEUTRAL",
    "SentimentScore": {
        "Positive": 0.0015290803276002407,
        "Negative": 0.0024455683305859566,
        "Neutral": 0.9954049587249756,
        "Mixed": 0.0006204072269611061
    }
}
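In an application such as the customer-review scenario mentioned earlier in this chapter, you would typically call DetectSentiment from code and act on the result. The following boto3 sketch flags a review for follow-up when the overall sentiment is negative or the negative score exceeds a threshold; the threshold of 0.5 is an illustrative choice, not a recommendation from the service.

import boto3

comprehend = boto3.client('comprehend', region_name='eu-west-1')

def needs_follow_up(review_text):
    """Return True if a customer review appears to be negative."""
    response = comprehend.detect_sentiment(Text=review_text, LanguageCode='en')
    sentiment = response['Sentiment']            # POSITIVE, NEGATIVE, NEUTRAL, or MIXED
    negative_score = response['SentimentScore']['Negative']
    return sentiment == 'NEGATIVE' or negative_score > 0.5

print(needs_follow_up("The camera stopped working after two days."))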

Using Amazon Comprehend with AWS Lambda

In the previous sections of this chapter you learned to use Amazon Comprehend through the management console and the AWS CLI. While using the service interactively is a convenient way to explore its capabilities, it is not how you would integrate Amazon Comprehend into your own projects.

To integrate Amazon Comprehend into a real-world project, you will most likely pick one of two approaches:

  • Use one of the language-specific AWS SDKs and call the Amazon Comprehend APIs directly from your code.
  • Create an AWS Lambda function that will call the Amazon Comprehend APIs when triggered.

In this section you will create an AWS Lambda function that will be triggered when a document is uploaded to an S3 bucket. Once triggered, the AWS Lambda function will read the uploaded document from S3 and use Amazon Comprehend's entity detection APIs to create a file in a different S3 bucket that contains a list of entities discovered in the document.

In real-world scenarios, there are several other events that you could use to trigger the AWS Lambda function, such as a message being posted to an SNS topic or an HTTP request being received by Amazon API Gateway. Triggering an AWS Lambda function from these sources is not covered in this book.

To get started, log in to the AWS management console using the dedicated sign-in link for your development IAM user account and use the region selector to select a region where the Amazon Comprehend service is available. The screenshots in this section assume that the console is connected to the EU (Ireland) region. Click the Services menu and access the AWS Lambda service home page.

If you are using Lambda for the first time, you will be presented with the AWS Lambda splash screen (Figure 13.5). Expand the menu on the left side of the page and click the Dashboard menu item to access the AWS Lambda dashboard.

Screenshot of AWS Lambda splash screen.

FIGURE 13.5 AWS Lambda splash screen

If you have used Lambda in the past, you will arrive at the AWS Lambda dashboard (Figure 13.6). You can click the Create Function button to start the process of creating a new AWS Lambda function.

Screenshot of AWS Lambda dashboard.

FIGURE 13.6 AWS Lambda dashboard

After clicking the Create function button, you will be asked to select a template for the function (Figure 13.7). You can either create a Lambda function from scratch, use a predefined template (known as a blueprint) as a starting point, or create a ready-to-use function by picking code from the AWS Serverless Application Repository. Select the Author From Scratch option.

Screenshot of creating an AWS Lambda function from scratch.

FIGURE 13.7 Creating an AWS Lambda function from scratch

Name the function DetectEntities, use the Runtime drop-down to select the Python 3.6 runtime, and select the Create A New Role With Basic Lambda Permissions option from the Execution Role drop-down (Figure 13.8).

Screenshot of Lambda Function Name and Runtime settings.

FIGURE 13.8 Lambda Function Name and Runtime settings

AWS will create a new IAM role for your function with a minimal set of permissions that allow your function to write logs to Amazon CloudWatch. The name of this IAM role is displayed below the Execution Role drop-down in Figure 13.8 and will be similar to DetectEntities-role-xxxxx. Make a note of this name, as you will need to modify the role to allow access to Amazon S3 and Amazon Comprehend. Click the Create Function button at the bottom of the page to create the AWS Lambda function and the IAM role.

After the AWS Lambda function is created, use the Services menu to switch to the IAM management console and navigate to the new IAM role that was created for you along with the Lambda function. Locate the permissions policy document associated with the role and click the Edit Policy button (Figure 13.9).

Screenshot of viewing the default policy document associated with the IAM role created by AWS Lambda.

FIGURE 13.9 Viewing the default policy document associated with the IAM role created by AWS Lambda

You will be taken to the policy editor screen; click on the JSON tab to view the policy document as a JSON file (Figure 13.10).

Screenshot of updating the default policy document associated with the IAM role created by AWS Lambda.

FIGURE 13.10 Updating the default policy document associated with the IAM role created by AWS Lambda

Add the following objects to the Statement array:

        {
            "Action": [
                "comprehend:DetectEntities"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::*"
        }

Your final policy document should resemble this snippet:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:eu-west-1:5083XXXX:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:eu-west-1:508XXXX13:log-group:/aws/lambda/DetectEntities:*"
            ]
        },        
        {
            "Action": [
                "comprehend:DetectEntities"
            ],
            "Effect": "Allow",
            "Resource": "*"
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Effect": "Allow",
            "Resource": "arn:aws:s3:::*"
        }
    ]
}

This policy document allows AWS Lambda to write logs to CloudWatch, call the Amazon Comprehend DetectEntities API, and read and write objects in any S3 bucket in your account. You can allow access to additional Amazon Comprehend APIs by adding the relevant actions after "comprehend:DetectEntities". You can get a list of the Amazon Comprehend actions that can be used in policy documents at https://docs.aws.amazon.com/comprehend/latest/dg/comprehend-api-permissions-ref.html.

Click on the Review Policy button at the bottom of the page to go to the Review Policy screen (Figure 13.11).

Screenshot of Review Policy Screen.

FIGURE 13.11 Review Policy screen

Click the Save changes button to finish updating the IAM policy. After the policy changes have been saved, use the Services menu to switch back to the AWS Lambda management console and navigate to the DetectEntities Lambda function (Figure 13.12). The function designer section of the page should now list Amazon CloudWatch Logs, Amazon S3, and Amazon Comprehend as the resources that can be accessed by the function.

Screenshot of AWS Lambda function designer.

FIGURE 13.12 AWS Lambda function designer

Locate the function designer section of the page and add the Amazon S3 trigger to the function (Figure 13.13).

Screenshot of adding the Amazon S3 trigger to the AWS Lambda function.

FIGURE 13.13 Adding the Amazon S3 trigger to the AWS Lambda function

Scroll down to the Configure Triggers section and choose an S3 bucket that will serve as the event source. In this example, the source bucket is called awsml-comprehend-entitydetection-source. Ensure the Event Type is set to Object Created (All) and click the Add button to finish configuring the S3 event trigger (Figure 13.14).

Screenshot of configuring the Amazon S3 event trigger.

FIGURE 13.14 Configuring the Amazon S3 event trigger

Click the Save button at the top of the page to save your changes. By creating the trigger, you have set up the Lambda function to be executed every time a new file is uploaded to the source S3 bucket.

To update the Lambda function code, scroll down to the function designer and click the function name to reveal the code editor (Figure 13.15).

Screenshot of accessing the function code editor.

FIGURE 13.15 Accessing the function code editor

Replace the boilerplate code in the code editor with the contents of Listing 13.1.

Click the Save button below the function editor to finish creating your Lambda function. To test this function, use the Services drop-down menu to switch to Amazon S3 and navigate to the bucket that you have associated with the AWS Lambda function trigger. Upload a UTF-8 text file into the bucket. In a few seconds, you should see a new file created in the output bucket with the results of the entity analysis.

The output bucket is hardcoded in Listing 13.1. You should change it to the appropriate value if you are using a different bucket name. The bucket names and files used in this example are:

  • Source bucket: awsml-comprehend-entitydetection-source
  • Destination bucket: awsml-comprehend-entitydetection-result
  • Input file: comprehend_entitydetection_input1.txt
  • Output file: entityresult-comprehend_entitydetection_input1.txt
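For reference, the following is a minimal sketch of what the DetectEntities Lambda handler described above could look like. It is an illustration written for the Python runtime and may differ from Listing 13.1; it assumes the bucket names listed above, English-language input, and documents small enough for the synchronous DetectEntities API.

import json
import urllib.parse
import boto3

s3 = boto3.client('s3')
comprehend = boto3.client('comprehend')

# Destination bucket for the entity detection results (hardcoded, as noted above).
OUTPUT_BUCKET = 'awsml-comprehend-entitydetection-result'

def lambda_handler(event, context):
    # The S3 trigger supplies the source bucket and object key in the event.
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(record['s3']['object']['key'])

    # Read the uploaded UTF-8 text document.
    document = s3.get_object(Bucket=bucket, Key=key)
    text = document['Body'].read().decode('utf-8')

    # Detect entities; note that the synchronous API limits the amount of
    # text that can be analyzed in a single request.
    response = comprehend.detect_entities(Text=text, LanguageCode='en')

    # Write the list of detected entities to the destination bucket.
    result_key = 'entityresult-' + key
    s3.put_object(
        Bucket=OUTPUT_BUCKET,
        Key=result_key,
        Body=json.dumps(response['Entities'], indent=2).encode('utf-8')
    )

    return {'statusCode': 200, 'body': 'Results written to ' + result_key}

Because Amazon S3 URL-encodes object keys in the event it sends to Lambda, the handler decodes the key with urllib.parse.unquote_plus before reading the object.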

Summary

  • Amazon Comprehend is a fully managed web service that provides access to a deep-learning–based Natural Language Processing (NLP) and topic modeling engine.
  • Amazon Comprehend can be used to detect entities, key phrases, syntax, and the dominant language of a document.
  • Amazon Comprehend can be used to analyze a set of documents and create a list of topics associated with each document.
  • Amazon Comprehend can be used interactively through the management console as well as the AWS CLI.
  • The underlying Amazon Comprehend APIs can be used from a number of SDKs for popular programming languages.
  • You can create AWS Lambda functions that make use of Amazon Comprehend APIs and trigger the AWS Lambda function from a variety of events.