In the last chapter, we covered how you can use Amazon Comprehend Custom Entity to extract business entities from your documents, and we showed you how you can use humans in the loop with Amazon Augmented AI (A2I) to augment or improve entity predictions. Lastly, we showed you how you can retrain the Comprehend custom entity model with an augmented dataset to improve accuracy using Amazon A2I.
In this chapter, we will talk about how you can use Amazon Comprehend custom classification to classify documents and then how you can set up active learning feedback with your custom classification model using Amazon A2I.
We will be covering the following topics in this chapter:
For this chapter, you will need access to an AWS account. Please make sure to follow the instructions specified in the Technical requirements section in Chapter 2, Introducing Amazon Textract, to create your AWS account, and log in to the AWS Management Console before trying the steps in this chapter.
The Python code and sample datasets for setting up Comprehend custom classification with a human-in-the-loop solution are in the following link: https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/tree/main/Chapter%2015.
Check out the following video to see the Code in Action at https://bit.ly/3BiOjKt.
Please use the instructions in the following sections along with the code in the repository to build the solution.
Amazon Comprehend provides the capability to classify data using Amazon Comprehend AutoML with your own custom training dataset. You can accomplish a lot with the Amazon Comprehend custom classification feature, as it requires fewer documents to train the AutoML models: you spend less time labeling the dataset, and you do not need to worry about setting up infrastructure or choosing the right algorithm.
You can use Amazon Comprehend custom classification for a variety of use cases, such as classifying documents based on type, classifying news articles, or classifying movies based on type.
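Under the hood, a custom classification training set is simply a headerless, two-column CSV: the class label in the first column and the document text in the second, one document per line. The following sketch builds such a file in memory (the labels and document snippets are illustrative, not from the chapter's dataset):

```python
import csv
import io

# illustrative (label, document) pairs; real documents come from Textract output
rows = [
    ("BANK_STATEMENT", "Account summary for March: opening balance 1,200.00 ..."),
    ("PAY_STUB", "Employee net pay for period ending 03/15: 2,450.10 ..."),
]

buf = io.StringIO()
writer = csv.writer(buf)
for label, text in rows:
    # newlines inside a document must be removed so each document stays on one CSV line
    writer.writerow([label, text.replace("\n", " ")])

print(buf.getvalue())
```

We will build exactly this kind of file from the extracted document text later in the chapter.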
The fictitious company LiveRight Pvt Ltd wants to classify documents submitted by its customers, for example, to determine whether a submitted document is an ID or a bank statement, even before analyzing the data inside the document. Moreover, if you are using a classification model to classify documents by type, you will also want to improve the accuracy of the predicted outcome in real time, based on the confidence score returned by the Comprehend custom classification model. This is where a human-in-the-loop workflow with Amazon Augmented AI helps.
We covered Amazon A2I in Chapter 13, Improving the Accuracy of Document Processing Workflows. In this chapter, we will walk you through some reference architecture on how you can easily set up a custom classification model using Amazon Comprehend and have a feedback loop set up with Amazon A2I for active learning on your Comprehend custom model.
First, we will walk you through the following architecture on how you can train a custom classification model and create a real-time endpoint for inferencing or classifying documents in near real time.
This architecture walks through the following steps:
We are going to walk you through the preceding conceptual architecture using Jupyter Notebook and a few lines of Python code in the Setting up to solve the use case section.
Now, we have a near real-time document classification endpoint. We will show you how you can set up humans in the loop with this Amazon Comprehend custom classification endpoint and set up a model retraining or active-learning loop to improve your model accuracy using the following architecture:
In this architecture, we will walk you through the following steps:
We will walk you through steps 1 to 6 using Jupyter Notebook in the Setting up to solve the use case section. Feel free to combine the augmented classified labels with the original dataset and try retraining for your understanding. You can automate this architecture using AWS Step Functions and Lambda functions. In the Further reading section, we will share blogs that can help you set up this architecture using Lambda functions.
In this section, we covered the architecture for both model training and retraining or active learning. Now, let's move on to the next section to see these concepts with code.
In this section, we will get right down to action and start executing the tasks to build our solution. But first, there are prerequisites we will have to take care of.
If you have not done so in the previous chapters, you will first have to create an Amazon SageMaker Jupyter notebook and set up Identity and Access Management (IAM) permissions for that notebook role to access the AWS services we will use in this notebook. After that, you will need to clone the GitHub repository (https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services), go to the Chapter 15 folder, and open the chapter15 classify documents with human in the loop.ipynb notebook.
Now, let's move to the next section to show you how you can set up the libraries and upload training data to Amazon S3 using this notebook.
In this step, we will follow instructions to set up an S3 bucket and upload documents:
import os
import boto3

data_bucket = "doc-processing-bucket-MMDD"
region = boto3.session.Session().region_name
os.environ["BUCKET"] = data_bucket
os.environ["REGION"] = region
if region == 'us-east-1':
    !aws s3api create-bucket --bucket $BUCKET
else:
    !aws s3api create-bucket --bucket $BUCKET --create-bucket-configuration LocationConstraint=$REGION
!aws s3 cp documents/train s3://{data_bucket}/train --recursive
word_prefix = os.getcwd()+'/SAMPLE8/WORDS/'
box_prefix = os.getcwd()+'/SAMPLE8/BBOX/'
We've covered how to create an S3 bucket and we have loaded training data. Now, let's move on to the next section to extract text.
Go to the notebook and run the cells in Step 2: Extract text from sample documents using Amazon Textract to define a function using Amazon Textract to extract data from the sample images in Amazon S3. We are using the DetectDocumentText sync API to do this extraction; you can also use the Textract asynchronous or batch APIs to perform data extraction. Refer to Chapter 4, Automating Document Processing Workflows, to dive deep into these APIs:
def data_retriever_from_path(path):
    mapping = {}
    for i in names:
        if os.path.isdir(path+i):
            mapping[i] = sorted(os.listdir(path+i))
    label_compre = []
    text_compre = []
    for i, j in mapping.items():
        for k in j:
            label_compre.append(i)
            text_compre.append(open(path+i+"/"+k, encoding="utf-8").read().replace('\n', ' '))
    return label_compre, text_compre
This function takes the image's path and returns the text and labels for the images.
Let's call this function by passing the scanned document's images by running the following cell in the notebook:
tic = time.time()
pool = mp.Pool(mp.cpu_count())
pool.map(textract_store_train_LM, [image for image in images])
print("--- %s seconds for extracting ---" % (time.time() - tic))
pool.close()
The preceding function extracts the data and saves it in the local directory structure you defined in the Set up and Upload Sample Documents step. The following is the output:
We have now extracted the text and associated labels, for example, 0 for a bank statement and 1 for a pay stub. Now, let's move on to the next section for Comprehend training.
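The `textract_store_train_LM` helper invoked in the preceding cell is defined in the notebook; conceptually, it calls the Textract `DetectDocumentText` API on each image and saves the extracted text to the local directory structure. A minimal sketch of that idea follows (the function names are our own, and the boto3 call requires AWS credentials, so the offline check below uses a mocked response shape):

```python
def extract_lines(textract_response):
    """Collect the LINE blocks from a DetectDocumentText response into one string."""
    lines = [b["Text"] for b in textract_response.get("Blocks", [])
             if b["BlockType"] == "LINE"]
    return " ".join(lines)

def textract_extract(bucket, key):
    # hypothetical helper: calls the synchronous DetectDocumentText API
    import boto3
    textract = boto3.client("textract")
    resp = textract.detect_document_text(
        Document={"S3Object": {"Bucket": bucket, "Name": key}})
    return extract_lines(resp)

# offline check with a dictionary shaped like a real Textract response
mock = {"Blocks": [
    {"BlockType": "PAGE"},
    {"BlockType": "LINE", "Text": "Pay period: 03/01 - 03/15"},
    {"BlockType": "LINE", "Text": "Net pay: 2,450.10"},
]}
print(extract_lines(mock))
```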
We have extracted the data and labels in the previous step from our sample of scanned documents in Amazon S3. Now, let's understand how to set up a Comprehend classification training job using Step 3: Create Amazon Comprehend Classification training job in the notebook:
def data_retriever_from_path(path):
    mapping = {}
    for i in names:
        if os.path.isdir(path+i):
            mapping[i] = sorted(os.listdir(path+i))
    # label or class or target list
    label_compre = []
    # text file data list
    text_compre = []
    # unpacking and iterating through dictionary
    for i, j in mapping.items():
        # iterating through list of files for each class
        for k in j:
            # appending labels/class/target
            label_compre.append(i)
            # reading the file and appending to data list
            text_compre.append(open(path+i+"/"+k, encoding="utf-8").read().replace('\n', ' '))
    return label_compre, text_compre
label_compre, text_compre = [], []
path = word_prefix+'train/'
label_compre_train, text_compre_train = data_retriever_from_path(path)
label_compre.append(label_compre_train)
text_compre.append(text_compre_train)
if type(label_compre[0]) is list:
    # flatten the nested lists into single label and text lists
    label_compre = [item for sublist in label_compre for item in sublist]
    text_compre = [item for sublist in text_compre for item in sublist]
data_compre = pd.DataFrame()
data_compre["label"] = label_compre
data_compre["document"] = text_compre
data_compre
You will get a pandas DataFrame with labels and documents, shown as follows:
csv_compre = io.StringIO()
data_compre.to_csv(csv_compre, index=False, header=False)
key = 'comprehend_train_data.csv'
input_bucket = data_bucket
output_bucket = data_bucket
response2 = s3.put_object(
    Body=csv_compre.getvalue(),
    Bucket=input_bucket,
    Key=key)
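With the training CSV in S3, the training job itself is started with the `CreateDocumentClassifier` API. The following sketch assembles the request parameters so they can be inspected offline; the classifier name, account ID, and role ARN are placeholders, and the actual API call (shown commented out) requires AWS credentials and an IAM role that Comprehend can assume to read the bucket:

```python
def build_classifier_request(name, role_arn, bucket, key, language="en"):
    # assembles keyword arguments for comprehend.create_document_classifier(**request)
    return {
        "DocumentClassifierName": name,
        "DataAccessRoleArn": role_arn,
        "InputDataConfig": {"S3Uri": f"s3://{bucket}/{key}"},
        "LanguageCode": language,
    }

request = build_classifier_request(
    "doc-classifier",
    "arn:aws:iam::123456789012:role/ComprehendDataAccessRole",  # placeholder role
    "doc-processing-bucket-MMDD",
    "comprehend_train_data.csv")
print(request["InputDataConfig"]["S3Uri"])

# to actually start the training job (requires AWS credentials):
# import boto3
# comprehend = boto3.client("comprehend")
# response = comprehend.create_document_classifier(**request)
```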
Important Note
We have the choice to add versions for Amazon Comprehend custom models. To learn more about this feature, refer to this link: https://docs.aws.amazon.com/comprehend/latest/dg/model-versioning.html.
Important Note
This training will take 30 minutes to complete as we have a large number of documents to train with in this chapter. You can use this time to set up a private workforce for setting up humans in the loop, which we did in Chapter 13, Improving the Accuracy of Document Processing Workflows.
Once your job is completed, move on to the next step.
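Rather than watching the console, you can also poll the training status from the notebook. The sketch below separates the polling loop from the AWS call so it can be checked offline; with a real client, the status callable would wrap `describe_document_classifier` as shown in the comment:

```python
import time

def wait_for_classifier(get_status, poll_seconds=60, max_polls=120):
    """Polls a status callable until the classifier is TRAINED or fails."""
    for _ in range(max_polls):
        status = get_status()
        if status in ("TRAINED", "IN_ERROR"):
            return status
        time.sleep(poll_seconds)
    return "TIMED_OUT"

# with a real boto3 client, get_status would be:
# lambda: comprehend.describe_document_classifier(
#     DocumentClassifierArn=classifier_arn
# )["DocumentClassifierProperties"]["Status"]

# offline check with a scripted sequence of statuses
statuses = iter(["SUBMITTED", "TRAINING", "TRAINED"])
print(wait_for_classifier(lambda: next(statuses), poll_seconds=0))
```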
In this section, we will show you how you can create a real-time endpoint with the trained model in the AWS Management Console. Comprehend measures an endpoint's throughput in inference units (IUs); each IU represents a throughput of 100 characters per second for real-time analysis. You can adjust the IUs of an endpoint at any time. After creating the endpoint, we will show you how you can call this endpoint to test a sample document using the Jupyter notebook:
Delete this endpoint in the cleanup section of the notebook to avoid incurring costs.
ENDPOINT_ARN='your endpoint arn paste here'
documentName = "paystubsample.png"
display(Image(filename=documentName))
You will get the following output:
response = comprehend.classify_document(
Text= page_string,
EndpointArn=ENDPOINT_ARN
)
print(response)
You will get the following response:
As per the response, the model endpoint has classified the document as a pay stub with 99% confidence. We tested this endpoint, so now let's move on to the next section to set up a human loop.
In this section, we are going to show you a custom integration with a Comprehend classifier endpoint, which you can invoke using the A2I StartHumanLoop API. You can pass any type of AI/ML prediction response to this API to trigger a human loop. In Chapter 13, Improving the Accuracy of Document Processing Workflows, we showed you a native integration with the Textract Analyze document API by passing a human loop workflow ARN to the AnalyzeDocument API. Setting up a custom workflow includes the following steps:
To get started, you need to create a private workforce and copy the private ARN in the Environment setup step in the Jupyter Notebook:
REGION = 'enter your region'
WORKTEAM_ARN= "enter your private workforce arn "
BUCKET = data_bucket
ENDPOINT_ARN= ENDPOINT_ARN
role = sagemaker.get_execution_role()
region = boto3.session.Session().region_name
prefix = "custom-classify" + str(uuid.uuid1())
Important Note
You can create a custom UI HTML template based on what type of data you want to show to your labelers. For example, you can show the actual document on the right and entities highlighted on the left using custom UIs.
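For a classification task like ours, the template typically wraps the A2I `crowd-classifier` element. The following is a minimal example of such a template held as a Python string (the category names are assumptions for this use case; the Liquid variable `task.input.taskObject` must match the key we later send in `start_human_loop`):

```python
# minimal crowd-classifier template; categories are assumptions for this use case
template = r"""
<script src="https://assets.crowd.aws/crowd-html-elements.js"></script>
<crowd-form>
  <crowd-classifier
    name="category"
    categories="['Pay Stubs', 'Bank Statements']"
    header="Select the correct document type">
    <classification-target>{{ task.input.taskObject }}</classification-target>
    <full-instructions header="Instructions">
      Read the text and pick the matching document type.
    </full-instructions>
    <short-instructions>Classify the text</short-instructions>
  </crowd-classifier>
</crowd-form>
"""
print("crowd-classifier" in template)
```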
def create_task_ui():
    response = sagemaker.create_human_task_ui(
        HumanTaskUiName=taskUIName,
        UiTemplate={'Content': template})
    return response
create_workflow_definition_response = sagemaker.create_flow_definition(
    FlowDefinitionName=flowDefinitionName,
    RoleArn=role,
    HumanLoopConfig={
        "WorkteamArn": WORKTEAM_ARN,
        "HumanTaskUiArn": humanTaskUiArn,
        "TaskCount": 1,
        "TaskDescription": "Read the instructions",
        "TaskTitle": "Classify the text"
    },
    OutputConfig={
        "S3OutputPath": "s3://"+BUCKET+"/output"
    }
)
flowDefinitionArn = create_workflow_definition_response['FlowDefinitionArn']
response = comprehend.classify_document(
    Text=page_string,
    EndpointArn=ENDPOINT_ARN
)
print(response)
p = response['Classes'][0]['Name']
score = response['Classes'][0]['Score']
#print(f"S:{sentence}, Score:{score}")
response = {}
response['utterance'] = page_string
response['prediction'] = p
response['confidence'] = score
print(response)
human_loops_started = []
CONFIDENCE_SCORE_THRESHOLD = .90
if response['confidence'] > CONFIDENCE_SCORE_THRESHOLD:
    humanLoopName = str(uuid.uuid4())
    human_loop_input = {}
    human_loop_input['taskObject'] = response['utterance']
    start_loop_response = a2i_runtime_client.start_human_loop(
        HumanLoopName=humanLoopName,
        FlowDefinitionArn=flowDefinitionArn,
        HumanLoopInput={
            "InputContent": json.dumps(human_loop_input)
        }
    )
    print(human_loop_input)
    human_loops_started.append(humanLoopName)
    print(f'Score is above the threshold of {CONFIDENCE_SCORE_THRESHOLD}')
    print(f'Starting human loop with name: {humanLoopName}')
else:
    print('No human loop created.')
Important Note
The preceding condition triggers a human loop for any prediction with greater than 90% confidence. This inverted threshold is for demo purposes only, so that our high-confidence test document creates a review task. In real use cases, you would change the condition so that a human loop starts when confidence falls below the threshold.
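A production-style version of that check can be sketched as a small helper, so the routing rule is easy to test and reuse (the threshold value is the same demo value from the notebook):

```python
CONFIDENCE_SCORE_THRESHOLD = .90

def needs_human_review(confidence, threshold=CONFIDENCE_SCORE_THRESHOLD):
    # production-style check: low-confidence predictions go to human review,
    # high-confidence predictions are accepted automatically
    return confidence < threshold

print(needs_human_review(0.85))  # low confidence -> send to reviewers
print(needs_human_review(0.99))  # high confidence -> auto-accept
```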
workteamName = WORKTEAM_ARN[WORKTEAM_ARN.rfind('/') + 1:]
print("Navigate to the private worker portal and do the tasks. Make sure you've invited yourself to your workteam!")
print('https://' + sagemaker.describe_workteam(WorkteamName=workteamName)['Workteam']['SubDomain'])
Review the data on the left in the previous screenshot and classify it by selecting the Pay Stubs category, and then click Submit.
completed_human_loops = []
resp = a2i_runtime_client.describe_human_loop(HumanLoopName=humanLoopName)
# collect the loop only once its status is Completed
if resp['HumanLoopStatus'] == 'Completed':
    completed_human_loops.append(resp)
for resp in completed_human_loops:
    splitted_string = re.split('s3://' + data_bucket + '/', resp['HumanLoopOutput']['OutputS3Uri'])
    output_bucket_key = splitted_string[1]
    response = s3.get_object(Bucket=data_bucket, Key=output_bucket_key)
    content = response["Body"].read()
    json_output = json.loads(content)
    pp.pprint(json_output)
You get the following response:
Using this data, you can augment or enrich your existing dataset used for training. Try combining this data with the Comprehend training data we created and try retraining your model to improve accuracy. We will point you to some blogs to accomplish this step in the Further reading section.
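As a starting point, the A2I output JSON can be flattened back into the same label/document shape as our training DataFrame. The sketch below assumes the task UI used a `crowd-classifier` named `category`, so each human answer carries `answerContent['category']['label']`; the sample record is illustrative:

```python
import pandas as pd

def augmented_rows(a2i_outputs):
    """Turn A2I output records into (label, document) rows for retraining.

    Assumes the crowd-classifier in the task template was named 'category'.
    """
    rows = []
    for out in a2i_outputs:
        text = out["inputContent"]["taskObject"]
        for answer in out["humanAnswers"]:
            label = answer["answerContent"]["category"]["label"]
            rows.append({"label": label, "document": text})
    return pd.DataFrame(rows)

# offline check with a record shaped like the A2I output JSON
sample = [{
    "inputContent": {"taskObject": "Net pay for period ending 03/15 ..."},
    "humanAnswers": [{"answerContent": {"category": {"label": "Pay Stubs"}}}],
}]
df = augmented_rows(sample)
print(df.iloc[0]["label"])
```

The resulting frame can be concatenated with `data_compre` and written back out as a new training CSV.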
Important Note
Please delete the model and the Comprehend endpoints created for the steps we did in this notebook.
In this chapter, we covered two things using a reference architecture as well as a code walkthrough. Firstly, we covered how you can extract data from various types of documents, such as pay stubs, bank statements, or identification cards using Amazon Textract. Then, we learned how you can perform some post-processing to create a labeled training file for Amazon Comprehend custom classification training.
We showed you that even with 36 bank statement documents and 24 pay stubs as a training sample, you can achieve really good accuracy using Amazon Comprehend transfer-learning capabilities and AutoML with document or text classification. Obviously, the accuracy improves with more data.
Then, you learned how to set up a training job in the AWS Management Console and how to set up a real-time classification endpoint using the AWS Management Console.
Secondly, you learned how you can set up humans in the loop with the real-time classification endpoint to review/verify and validate what the model has classified. We then also discussed how you can retrain your existing model by adding this data with your existing training data and set up a retraining or active-learning loop. Please refer to the Further reading section to automate this workflow using Lambda functions.
In the next chapter, we will cover how you can improve the accuracy of PDF batch processing with Amazon Textract and humans in the loop. So, stay tuned!