One of the first stages of an Intelligent Document Processing (IDP) pipeline is to collect your documents and store them in a highly available, reliable, and secure data store. Data is our gold mine, and to extract insights from our documents, we need to understand our data and pre-process it as needed. Most of the time, organizations receive a package of documents that are not labeled. To understand the documents, you need to manually scan these documents and label them into the right category, which is known as the document classification stage of the IDP pipeline. Thus, we are looking for an automated process for data collection and document classification.
In this chapter, we will be covering the following topics:
To do the hands-on labs in the book, you will need access to an AWS account at https://aws.amazon.com/console/. Please refer to the Signing up for an AWS account subsection within the Setting up your AWS environment section for detailed instructions on how you can sign up for an AWS account and sign in to the AWS Management Console. The Python code and sample datasets for the solution discussed in this chapter are available at https://github.com/PacktPublishing/Intelligent-Document-Processing-with-AWS-AI-ML-/tree/main/chapter-2.
In this chapter and all the subsequent chapters in which we will run code examples, you will need access to an AWS account. Before getting started, we recommend that you create an AWS account.
Important note
Please use the AWS Free Tier, which enables you to try services free of charge based on certain time limits or service usage limits. For more details, please see https://aws.amazon.com/free.
Please go through the following steps to create an AWS account:
In the next section, we will show you how to create an S3 bucket and upload your documents.
In this section, we will see how to create a notebook instance in Amazon SageMaker. This is an important step, as most of our solution examples are run using notebooks. After the notebook is created, please follow the instructions to use the notebook in the specific chapters based on the solution being built.
Follow these steps to create an Amazon SageMaker Jupyter notebook instance:
Your notebook instance will take a few minutes to be provisioned; once it’s ready, the status will change to In Service. In the next few sections, we will walk through the steps required to modify the IAM role we attached to the notebook.
In AWS, security is “job zero,” and each service needs explicit role level access to call another service – for example, if we want to call a Textract service inside a SageMaker notebook instance, we need to give required permission to SageMaker Notebook, using Identity and Access Management (IAM) to call the Textract service. In this section, we will walk through the steps needed to add additional permission policies to our Amazon SageMaker Jupyter notebook role:
Important note
To run all the hands-on labs in this book, Full list of required IAM Roles can be found in https://github.com/PacktPublishing/Intelligent-Document-Processing-with-AWS-AI-ML-/blob/main/IAM_Roles.
{ "Version": "2012-10-17", "Statement": [ {
"Action": [
"iam:PassRole"
],
"Effect": "Allow",
"Resource": "<IAM ARN of your current SageMaker
notebook execution role>"
}
]
}
{ "Version": "2012-10-17", "Statement": [
{ "Effect": "Allow",
"Principal":
{ "Service":
[ "sagemaker.amazonaws.com",
"s3.amazonaws.com",
"comprehend.amazonaws.com" ]
},
"Action": "sts:AssumeRole"
}
]
}
Important note
You cannot attach more than 10 managed policies to an IAM role. If your IAM role already has a managed policy from a previous chapter, please detach this policy before adding a new policy as per the requirements of your current chapter. When we create an Amazon SageMaker Jupyter notebook instance (as we did in the previous section), the default role creation step includes permissions for either an S3 bucket that you specify or any S3 bucket in your AWS account.
Now, you are all set.
Go to the GitHub link https://github.com/PacktPublishing/Intelligent-Document-Processing-with-AWS-AI-ML- and then the chapter-2.pynb notebook.
Now that we are done with the technical requirements, let’s look at the components involved in the data capture stage.
Document capture or ingestion is a process to aggregate all our data in a secure, centralized, scalable data store. While building a data capture stage for your IDP pipeline, you have to take data sources, data format, and a data store into consideration.
The first step is to store our documents for transformation. To store documents, we can use any type of document store, such as a local filesystem or Amazon S3. For this IDP pipeline, we will be leveraging AWS AI services, and we recommend, for an easier, more secure, and more scalable document store, to leverage Amazon S3, an object storage service that offers industry-leading scalability, data availability, security, and performance. Amazon S3 has 11 9s of durability, and millions of customers all around the world leverage Amazon S3 for their data store.
Many regulatory industries, such as GE Healthcare, use Amazon S3 for data storage during their digital transformation journeys. GE Health Cloud leveraged Amazon S3 to build a single portal for healthcare data access. Similarly, customers across industries such as the financial sector, public sectors, and so on have leveraged Amazon S3 for their secure data store. You can leverage Amazon S3 to build a data lake. You can run big data analytics and AI/ML applications to unlock insights from your documents. But, at times, you will see companies having their data source centralized on-premises, not wanting to store documents redundantly anywhere else for security reasons. We will discuss how to process documents from such data sources by leveraging AWS IDP in a later section. Moreover, you will see document processing requirements when data sources are in Salesforce, Workday, Box, SAP, and so on, and we can use a custom connector to process documents from these data sources, but connector implementation details are out of the scope of this book.
For this book, we will leverage Amazon S3 as a data store for IDP with AWS AI services. This offers ease of access and implementation for document processing with AWS AI services.
Data can come from various different sources. Data capture can be synchronous where we need to process and get a result instantaneously. For example, we might want to submit receipts for reimbursement. We would need an instantaneous acknowledgment that our receipts are in the right resolution and have been uploaded successfully. We would not expect to receive an instant approval or denial result, which may at times need human reviews, but we expect an immediate acknowledgment.
I was speaking to one of our customers, who was working on building a smart portal for the online submission of university applications. As a process, you would expect the applicant needs to submit ID proof along with other details relevant to the application and the university enrollment application forms itself at the least. Moreover, different universities have their own university-specific application requirements. This process used to be manual and take a long time, with multiple iterations needed just to collect all required documents with the right resolutions. With the building of the smart portal, applicants could now immediately get an acknowledgment if all required documents with the right quality had been uploaded successfully or not. This can eliminate 2 weeks of additional time just to collect all required documents needed for the application. You can imagine this requirement with your use cases, where there is a need to process fully or partially documents to give instantaneous results. In Figure 2.1, you can see how various types of documents from different channels can be centralized in an Amazon S3 data store.
Figure 2.1 – The IDP pipeline – data capture
Let’s see how to upload and process those documents for synchronous extraction using code. We will use the syncdensetext.png file for the following hands-on code.
We will be using the AWS Boto3 library to call into our AI service APIs. You may need to upgrade your boto3 libraries to get the latest updated access to APIs.
Refer to your notebook (https://github.com/PacktPublishing/Intelligent-Document-Processing-with-AWS-AI-ML-/blob/main/chapter-2/Chapter-2-datacapture.ipynb for full hands-on lab) section to import the boto3 library:
import boto3 from IPython.display import Image, display !pip install textract-trp from trp import Document from PIL import Image as PImage, ImageDraw import time from IPython.display import IFrame
Follow these instructions to create an Amazon S3 bucket:
# Create a Unique S3 bucket
s3BucketName = <YOUR_BUCKET_NAME>
print(s3BucketName)
Important note
Bucket names are globally unique, so be creative and give a unique name to your Amazon S3 bucket in <YOUR_BUCKET_NAME>. You might want to add a random hash number to your bucket name, as bucket names are globally unique.
# Upload document to your S3 Bucket
print(s3BucketName)
!aws s3api create-bucket --bucket {s3BucketName}
!aws s3 cp syncdensetext.png s3://{s3BucketName}/syncdensetext.png
chapter2_syncdensedoc = "syncdensetext.png"
display(Image(url=s3.generate_presigned_url('get_object', Params={'Bucket': s3BucketName, 'Key': chapter2_syncdensedoc})))
The output for the preceding code looks like the following:
Figure 2.2 – Sample document content
# Amazon Textract client
textract = boto3.client('textract')
# Document Extraction with Amazon Textract
response = textract.detect_document_text(
Document={
'S3Object': {
'Bucket': s3BucketName,
'Name': chapter2_syncdensedoc
}
})
# Print All the lines
for line in response["Blocks"]:
if line["BlockType"] == "LINE":
print (line["Text"])
Important note
We will look into the API details of Amazon Textract in Chapter 3, Accurate Document Extraction with Amazon Textract.
Check the following output for all the extracted lines from the document.
Figure 2.3 – The LINES output of the synchronous API
Other data sources can be just simple fax or mail attachments or a bulk upload of documents. These documents need to be processed periodically or ad hoc to get insights from documents. Any asynchronous or batch processing operations can fit these use cases.
For example, for real estate insurance, the reviewer needs to check and validate real estate properties for accurate assessments. For that reason, they collect and upload insurance title documents from various sources and web and bulk upload to their data store. They manually extract the data elements from the documents such as address, lot number, and so on, and do the data entry in their system so that a user can easily find those data elements easily, which helps in making the right assessment. As you can see, in this use case, there is no need to process documents instantaneously, but you can asynchronously upload and extract elements from your documents.
Let’s see how to upload and process those documents for asynchronous extraction in code.
Follow the steps mentioned previously to create your Amazon S3 bucket and upload the sync_densetext.png document to your S3 bucket.
This is the method to start a Textract asynchronous operation to extract elements from your Amazon S3 bucket:
def startasyncJob(s3BucketName, filename):
response = None
response = textract.start_document_text_detection(
DocumentLocation={
'S3Object': {
'Bucket': s3BucketName,
'Name': filename
}
})
return response["JobId"]
The following is the method to check for the completion of Amazon Textract’s asyncronous API extraction. The Textract asyncronous API takes time to process all our documents. Constant polling to check the status of the Textract processing may cause unnecessary CPU overhead, so we will use a sleep and pool mechanism to get the status of our asynchronous operation. There are additional options to pool status of an asynchronous API job. We will discuss this in detail in Chapter 3, Accurate Document Extraction with Amazon Textract:
def isAsyncJobComplete(jobId):
response = textract.get_document_text_detection(JobId=jobId)
status = response["JobStatus"]
print("Job status: {}".format(status))
while(status == "IN_PROGRESS"):
time.sleep(10)
response = textract.get_document_text_detection(JobId=jobId)
status = response["JobStatus"]
print("Job status: {}".format(status))
return status
This function returns all page blocks as pagearrays:
def getAsyncJobResult(jobId):
pages = []
response = textract.get_document_text_detection(JobId=jobId)
pages.append(response)
ntoken = None
if('NextToken' in response):
ntoken = response['NextToken']
while(ntoken):
response = textract.get_document_text_detection(JobId=jobId, NextToken=ntoken)
pages.append(response)
print("Resultset page recieved: {}".format(len(pages)))
nextToken = None
if('NextToken' in response):
ntoken = response['NextToken']
return pages
Let’s see how we can call into previously mentioned process documents asynchronously:
jobId = startasyncJob(s3BucketName, chapter2_syncdensedoc)
print("Started job with id: {}".format(jobId))
if(isAsyncJobComplete(jobId)):
response = getAsyncJobResult(jobId)
# Print detected text
for resultPage in response:
for item in resultPage["Blocks"]:
if item["BlockType"] == "LINE":
print (' 33[94m' + item["Text"] + ' 33[0m')
You can check the output here:
Figure 2.4 – LINES output of the asynchronous API
Let’s move on to sensitive document processing next.
At times, some companies have their own internal security and regulations and do not want to store documents in the cloud. Other times, a customer already has a data store on-premises and does not want to store documents in two different data stores. That adds work to maintain two different data stores, one in the cloud and the other on-premises. Can they process documents without sending them to the cloud?
So, let’s check the solution in code:
# Document Name
chapter2_sensitivedoc = "sensitive-doc.png"
print(chapter2_sensitivedoc)
display(Image(filename=chapter2_sensitivedoc))
Let’s check the following output:
Figure 2.5 – Sensitive Document content display
# Read sensitive document content
with open(chapter2_sensitivedoc, 'rb') as document:
imgbytes = bytearray(document.read())
# Call Amazon Textract by sending bytes
res = textract.detect_document_text(Document={'Bytes': imgbytes})
# Print detected text
for line in res["Blocks"]:
if line["BlockType"] == "LINE":
print (line["Text"])
Let’s check the following output:
Figure 2.6 – The lines output from the sensitive document
Documents come in different formats, layouts, types, and sizes. Some documents can just be a single page type, but others can be of multiple pages and can be of type PDF, JPEG, PNG, TIFF, Word, and so on. Examples of single-page documents include driver’s licenses and insurance cards, and examples of multipage documents include legal contract documents and policy documents.
These documents can have diverse layouts, such as a structured layout (such as a table) or unstructured (such as dense paragraphs of text). Alternatively, they can be semi-structured, such as form type of layouts in documents. Sometimes, we get documents that can have multiple different formats, such as dense text interweaved with forms and tables in single or multiple pages. We will look into the accurate extraction of these different formats in Chapter 3, Accurate Document Extraction with Amazon Textract.
Our ML models may require a certain quality and prerequisite before the accurate extraction of data elements from documents. Some of the prerequisites might include a minimum font size requirement or a maximum document size for accurate document processing. For example, Amazon Textract can process documents with a maximum size of 10 MB and a recommended resolution of 150 dpi. You can find all the quotas and limits at the following service limit page link: https://docs.aws.amazon.com/textract/latest/dg/limits.html.
For those instances, you will have to take additional steps to preprocess those documents. We will not be covering preprocessing for IDP in detail, but we highly recommend you to check out the blog post here: https://aws.amazon.com/blogs/machine-learning/improve-newspaper-digitalization-efficacy-with-a-generic-document-segmentation-tool-using-amazon-textract/.
Sometimes, we receive many documents as a single package, and we need to process each document individually to derive insights from it as per business requirements. To achieve this, one of the major tasks is to categorize and index different types of documents. This later helps in the accurate extraction of information that meets business-specific requirements. This process of categorizing documents is known as document classification.
Let’s look into a claims processing use case in the insurance industry. Claims processing is very much a transactional use case, with millions of claims getting processed every single year in the United States. A manual submission of a claims package is a combination of multiple documents, such as an insurance form, receipts, invoices, ID documents, and some unstructured documents such as doctor’s notes and discharge summaries. A package is a combination of scanned pages, where each document can be a single or multiple pages long. You want to categorize these documents accurately for the accurate extraction of information. For example, for an invoice document, you may want to extract the amount and invoice ID, but for an insurance form, you may want to extract a diagnosis code and a person’s insurance ID number. So, the extraction requirements depend on the type of document you are processing. For that, we need to classify documents before processing. With a manual workforce, you can preview the documents to categorize them into their specified folders. But as with any other manual processing, it can be expensive, error-prone, and time-consuming, so we need an automated way to classify these documents. The following diagram shows a process to classify documents.
Figure 2.7 – The IDP pipeline – data classification
This process is crucial when we try to automate our document extraction process, where we receive multiple documents and don’t have a clear way to identify each document type. If you are dealing with a single document type or have an identifiable way to locate a document, then you can skip this step in the IDP pipeline. Alternatively, classify those documents correctly before proceeding in the IDP pipeline.
We will use the Amazon Comprehend custom classifier feature for automating our document classification process. Amazon Comprehend provides a capability to bring in your labeled documents to leverage Comprehend’s AutoML feature to accurately classify documents. The auto-algorithm selection and model tuning significantly helps you spend less time building your classification model. Moreover, the ML infrastructure is already taken care of by Amazon Comprehend. As it is a managed service, it automatically takes care of your ML infrastructure to build, train, and deploy your model. You do not need to spend time and effort building your ML platform to train your model. After Amazon Comprehend builds a classification model from your dataset, it gives you the evaluation metrics for your model performance.
Custom classification is a two-step process:
In this architecture, we will walk you through the following steps:
Let’s now dive into the details of training a Comprehend custom classifier.
First, we will walk you through the following architecture on how you can train a custom classification model and run an analysis job for inferencing or classifying documents.
Figure 2.8 – The IDP pipeline – data classification architecture
This architecture walks you through the following steps:
Important note
Amazon Comprehend also offers a real-time endpoint mode for inference of documents. Comprehend charges per resource endpoint, which can be expensive, so it is recommended to delete your endpoint when not in use. Also, if your workload can be processed asynchronously, I recommend using a Comprehend analysis job instead of creating a real-time endpoint.
Now, let’s look at a hands-on example. For the dataset, we will be leveraging invoice- and receipt-type documents. Also, for full notebook walkthrough check: https://github.com/PacktPublishing/Intelligent-Document-Processing-with-AWS-AI-ML-/blob/main/chapter-2/Chapter-2-comprehend-classification.ipynb.
# Upload images to S3 bucket:
!aws s3 cp classification-training-dataset s3://{data_bucket}/idp/textract --recursive --only-show-errors
The documents are scanned images, and the text needs to be extracted to be a trained classification model. So, let’s extract text from the documents.
def textract_extract(table, bucket=data_bucket):
try:
response = textract.detect_document_text(
Document={
'S3Object': {
'Bucket': bucket,
'Name': table
}
})
The sample output is shown as follows:
Figure 2.9 – Labeled sample documents
# Upload Comprehend training data to S3
key='idp/comprehend/comprehend_train_data.csv'
data_compre.to_csv("comprehend_train_data.csv", index=False, header=False)
s3.upload_file(Filename='comprehend_train_data.csv',
Bucket=data_bucket,
Key=key)
create_response = comprehend.create_document_classifier(
InputDataConfig={
'DataFormat': 'COMPREHEND_CSV',
'S3Uri': f's3://{data_bucket}/{key}'
},
DataAccessRoleArn=role,
DocumentClassifierName=document_classifier_name,
VersionName=document_classifier_version,
LanguageCode='en',
Mode='MULTI_CLASS'
)
document_classifier_arn = create_response['DocumentClassifierArn']
print(f"Comprehend Custom Classifier created with ARN: {document_classifier_arn}")
After the Comprehend classifier model is trained, you can check the performance of the Comprehend classification model as follows:
Figure 2.10 – Amazon Comprehend – Custom classification
Figure 2.11 – Comprehend custom classification model performance
It is recommended to check the Comprehend training performance metrics at the following link: https://docs.aws.amazon.com/comprehend/latest/dg/cer-doc-class.html.
Let’s next move on to cover inference with the Comprehend classifier.
We will submit a sample document to our trained custom classifier and leverage a Comprehend analysis job for inference.
Amazon Comprehend requires a text document with UTF-8 encoding. We have already processed the sample document and stored it in https://github.com/PacktPublishing/Intelligent-Document-Processing-with-AWS-AI-ML-/blob/main/chapter-2/rawText-invoice-for-inference.txt.
If you want to use your document, which can be of any scanned image type, we recommend calling Amazon Textract to extract text from it, as follows:
response = textract.detect_document_text( Document={ 'S3Object': { 'Bucket': bucket, 'Name': table } })
After data is extracted from the document by Amazon Textract, go to Amazon Comprehend in the AWS Management Console and follow these steps:
Figure 2.12 – Creating an Amazon Comprehend classification job
Figure 2.13 – Filling in classification job information
Let’s look at the result next.
After the analysis job is completed, go to the Amazon S3 bucket that you have specified as the output location. Check the result in the output.tgz file.
One of the common use cases across regulatory industries is classifying documents based on the presence of sensitive data in them.
Figure 2.14 – Data classification for sensitive versus non-sensitive documents
We will dive into a hands-on solution in Chapter 4, Accurate Extraction with Amazon Comprehend, in the How to leverage document enrichment for document classification section.
In document preprocessing, you will come across use cases where documents are classified based on a branded logo on the document or having tables. These are visual clues that we want to use to classify documents before processing. We can use Amazon Rekognition, computer vision software with deep learning-powered image recognition, to detect visual clues such as objects, scenes, and text from any scanned images for document classification.
You can leverage an Amazon Rekognition Custom Label to detect a logo from any document and classify the document based on the logo. For example, a healthcare provider supports multiple insurance providers. Patients when visiting a doctor’s office submit insurance cards. These insurance cards can be processed automatically to detect the logos from them, and the documents can be classified according to their corresponding categories.
Now, let’s look at a hands-on example, where we have input documents with an AWS logo as well as a non-AWS logo:
Figure 2.15 – Data classification of structural elements
Figure 2.16 – Amazon Rekognition in the AWS Management Console
The data preparation instructions are as follows:
Figure 2.17 – An Amazon Rekognition project
Important note
We recommend using a more versatile and equal distribution of samples for accurate training of the model.
For labeling, we are using the Amazon Rekognition console to tag the logo appropriately, as shown here:
Figure 2.18 – Amazon Rekognition labeling
Next, follow these Rekognition model training instructions:
Figure 2.19 – Amazon Rekognition model training
It will take some time to train your Rekognition model.
Figure 2.20 – Amazon Rekognition Custom Label model performance
import boto3
import io
import logging
import argparse
from PIL import Image, ImageDraw, ImageFont
from botocore.exceptions import ClientError
def callrek_cl(document):
model="arn:aws:rekognition:us-east-1:<MY_ACCOUNT>:project/book-rek-class-1/version/book-rek-class-1.2022-02-25T13.12.07/1645812727191"
image=Image.open(document)
image_type=Image.MIME[image.format]
# get images bytes for call to classify
image_bytes = io.BytesIO()
image.save(image_bytes, format=image.format)
image_bytes = image_bytes.getvalue()
response = rek_client.detect_custom_labels(Image={'Bytes': image_bytes}, ProjectVersionArn=model)
return response
response = callrek_cl("chapter-2-rekcl.png")
print(response)
response1 = callrek_cl("cha15train.png")
print(response1)
Check the output response shown here.
Figure 2.21 – Amazon Rekognition classification output
In this chapter, we discussed how to build a data capture stage for the IDP pipeline. Data is your gold mine, and you need a secure, scalable, and reliable data store. We introduced Amazon S3 and how you can leverage an object store to aggregate and store data in a scalable and highly available manner. We also described the data capture stage, with documents of varying layouts, formats, and types.
We then reviewed the need for document classification and categorization, with examples including mortgage processing and insurance claims processing. We discussed Amazon Comprehend and its custom classification feature. This chapter also gave you a hands-on experience in how to classify documents as invoice and receipt types. Moreover, we also looked at Amazon Rekognition, and how we can use its Custom Label feature to classify documents on its structural formats. You also had hands-on experience in classifying documents with the presence of the AWS logo or a non-AWS logo.
In the next chapter, we will go through the details of document extraction. We will look into the details of the extraction stage in the IDP pipeline. We will also dive deep into one of the AI services, Amazon Textract, to accurately extract information from any type of document.
3.22.181.209