As we have seen so far in this book, AI, and specifically NLP, has a wide range of uses in areas hitherto considered traditional IT, spurred on by the rapid proliferation of data and the democratization of machine learning (ML) through cloud computing. In the previous chapter, we saw how you can bring color to social media reviews and other forms of textual data by running voice-of-the-customer analytics with sentiment detection. We used AWS Glue to crawl raw data from Amazon S3, Amazon Athena to query this data interactively, and PySpark (http://spark.apache.org/docs/latest/api/python/index.html) in an AWS Glue job to call Amazon Comprehend APIs (which provide ready-made intelligence with pre-trained NLP models) for sentiment analysis on each review. We then converted the data into Parquet and partitioned it (https://docs.aws.amazon.com/athena/latest/ug/partitions.html) by sentiment to optimize analytics queries. In this chapter, we will change gears and look at a use case that has gained tremendous popularity with the increased adoption of streaming media content: how to monetize content.
The gap between online advertising and print media advertising is ever widening. According to this article, https://www.marketingcharts.com/advertising-trends-114887, quoting a PwC outlook report on global entertainment and media (https://www.pwc.com/outlook), online advertising spend in 2020 was estimated to be approximately $58 billion higher than TV advertising and $100 billion higher than magazine and newspaper advertising, even accounting for the COVID-19 pandemic.
This, of course, is also driven by the increased usage of smart consumer devices and the explosion of internet-age consumer trends. Google Ads is one of the most popular ad-serving platforms today, accounting for 80% of the revenues of Alphabet (the public holding company that owns Google) and raking in $147 billion in 2020, according to this article: https://www.cnbc.com/2021/05/18/how-does-google-make-money-advertising-business-breakdown-.html. You read that right: online advertisements are indeed a big deal. So, the next time you are thinking of posting that cool travel video or your recipe for an awesome chili con carne, remember that you could actually be making money from your content. You may ask, this is all great, but how does NLP help in this case? Read on to find out!
The answer, as you probably already guessed, is context-based ad serving. Suppose you have an intelligent solution that could listen to the audio/text in your content, understand what is being discussed, identify topics that represent the context of the content, look up ads related to the topic, and stitch these ads back into your content seamlessly without having to train any ML models: wouldn't that be swell? Yes, that's exactly what we will be building now.
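Before we dive into the AWS services involved, the workflow just described can be summarized as a simple pipeline. The following sketch uses placeholder functions and made-up data to show the shape of what we will build; none of these function names come from the AWS services themselves, which we introduce later in the chapter:

```python
# A high-level sketch of the content monetization pipeline.
# Every function here is a stand-in for a service used later in the chapter.

def transcribe(media_file):
    # stands in for Amazon Transcribe: audio/video -> text
    return "this demo shows intelligent document processing for a bank"

def detect_topics(transcript):
    # stands in for an Amazon Comprehend topic modeling job
    return ["bank"]

def lookup_ad(topic, ad_index):
    # look up the ad content IDs registered for the detected topic
    return ad_index[topic]

def build_ad_tag(base_url, cmsid, vid):
    # stitch the IDs into a VAST ad decision server URL
    return f"{base_url}&cmsid={cmsid}&vid={vid}"

ad_index = {"bank": {"cmsid": "176", "vid": "short_tencue"}}
transcript = transcribe("bank-demo.mp4")
topic = detect_topics(transcript)[0]
ad = lookup_ad(topic, ad_index)
ad_tag = build_ad_tag("https://ads.example.com/gampad/ads?sz=640x480",
                      ad["cmsid"], ad["vid"])
print(ad_tag)
```

The real solution replaces each placeholder with a managed service call, but the data flow stays exactly this: transcript, topic, ad lookup, ad tag URL.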
We will navigate through the following sections:
For this chapter, you will need access to an AWS account. Please make sure to follow the instructions specified in the Technical requirements section in Chapter 2, Introducing Amazon Textract, to create your AWS account, and log in to the AWS Management Console before trying the steps in the Building the NLP solution for content monetization section.
The Python code and sample datasets for our solution can be found here: https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/tree/main/Chapter%2008. Please use the instructions in the following sections along with the code in the repository to build the solution.
Check out the following video to see the Code in Action at https://bit.ly/317mcSh.
We know NLP can help enhance the customer service experience and better understand what our customers are telling us. We will now use NLP to determine the context of our media content and stitch ads relevant to that context into the content. To illustrate our example, let's go back to our fictitious banking corporation, LiveRight Holdings Private Limited. LiveRight's management has decided to expand to more geographies, as they are seeing a lot of demand for their model of no-frills banking, which cuts their operational costs and transfers the savings back to their customers. They have decided to hire you as their marketing technology architect, putting you in charge of all their content creation, but, in keeping with their low-cost policies, they challenge you to devise a way for the content to pay for itself. You come up with the idea of creating fun educational videos that show the latest trends at the intersection of banking and technology. There is a lot of demand for such videos since they are free to watch; you can intersperse them with ads to get monetary returns, and they raise awareness of the bank in the process.
You have thought through the solution design and decide to use the following:
The components of the solution we will build are as shown in the following figure:
We will be walking through this solution using the AWS Management Console (https://aws.amazon.com/console/) and an Amazon SageMaker Jupyter notebook (https://docs.aws.amazon.com/sagemaker/latest/dg/nbi.html), which will allow us to review the code and results as we execute it step by step. If you do not have access to the AWS Management Console, please follow the detailed instructions in the Technical requirements section in Chapter 2, Introducing Amazon Textract, of this book.
As a first step, we will look at the sample video file provided in the GitHub repository (https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/blob/main/Chapter%2008/media-content/bank-demo-prem-ranga.mp4). The sample video is from a demonstration of AWS AI services for document processing. For a full version of this video, please refer to https://www.youtube.com/watch?v=vBtxjXjr_HA. We will upload this sample video to an S3 bucket:
a) Durations in the video content to play the ads
b) A content source ID, referred to by the tag 'cmsid', and a video content ID, referred to by the tag 'vid', which we will populate to stitch in the ads specific to the topic we detected from the transcribed text in the previous step
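To make the structure of such an ad tag concrete, here is a short sketch that parses the cmsid and vid parameters out of a VAST-style ad tag with Python's standard library. The URL is a truncated, illustrative example, not the full tag we build later in the chapter:

```python
from urllib.parse import urlparse, parse_qs

# A truncated example of a VAST-style ad tag URL (illustrative values only)
ad_tag = ("https://pubads.g.doubleclick.net/gampad/ads"
          "?sz=640x480&output=vmap&cmsid=176&vid=short_tencue")

# parse_qs returns a dict of parameter name -> list of values
params = parse_qs(urlparse(ad_tag).query)
print(params['cmsid'][0], params['vid'][0])
```

Ad decision servers read these query parameters to decide which ad creative to return for a given content item.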
In this section, we introduced the content monetization requirement we are trying to build with our NLP solution, reviewed the challenges faced by LiveRight, and looked at an overview of the solution we will build. In the next section, we will walk through the building of a solution step by step.
In the previous section, we introduced a requirement for content monetization, covered the architecture of the solution we will be building, and briefly walked through the solution components and workflow steps. In this section, we will start executing the tasks to build our solution. But first, there are prerequisites we will have to take care of.
If you have not done so in the previous chapters, you will as a prerequisite have to create an Amazon SageMaker Jupyter notebook instance and set up Identity and Access Management (IAM) permissions for that notebook role to access the AWS services we will use in this notebook. After that, you will need to clone the GitHub repository (https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services), create an Amazon S3 (https://aws.amazon.com/s3/) bucket, and provide the bucket name in the notebook to start execution. Please follow the next steps to complete these tasks before we can execute the cells from our notebook:
Note
Please ensure you have completed the tasks mentioned in the Technical requirements section. If you have already created an Amazon SageMaker notebook instance and cloned the GitHub repository for the book in a previous chapter, you can skip some of these steps. Please go directly to the step where you open the notebook folder corresponding to this chapter.
IAM role permissions while creating Amazon SageMaker Jupyter notebooks
Accept the default option for the IAM role at notebook creation time to allow access to any S3 bucket.
This will take you to the home folder of your notebook instance.
Now that we have set up our notebook and cloned the repository, let's add the permissions policies we need to successfully run our code sample.
To run the notebook, we have to enable additional policies and also update the trust relationships for our SageMaker notebook role. Please complete the following steps to do this:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:GetObject",
        "s3:ListBucket",
        "s3:PutObject"
      ],
      "Resource": ["*"],
      "Effect": "Allow"
    }
  ]
}
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": [
          "sagemaker.amazonaws.com",
          "s3.amazonaws.com",
          "transcribe.amazonaws.com",
          "comprehend.amazonaws.com"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
Now that we have created our notebook and configured the IAM role needed to run the walk-through notebook, in the next section we will start by creating broadcast versions of our sample video.
In this section, we will create two S3 buckets and upload the sample video for processing. Please execute the following steps:
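If you prefer to script the bucket creation and upload rather than use the console, a minimal boto3 sketch might look like the following. The bucket name and region are placeholder assumptions you must change (S3 bucket names are globally unique), and running it requires AWS credentials:

```python
def video_key(prefix, filename):
    """Build the S3 key under which the raw video is stored."""
    return f"{prefix}/rawvideo/{filename}"

def upload_sample_video(bucket, prefix, local_path, region='us-east-1'):
    """Create the S3 bucket (if needed) and upload the sample video.

    Sketch only: assumes AWS credentials are configured for boto3.
    """
    import boto3
    s3 = boto3.client('s3', region_name=region)
    if region == 'us-east-1':
        # us-east-1 does not accept a LocationConstraint
        s3.create_bucket(Bucket=bucket)
    else:
        s3.create_bucket(Bucket=bucket,
                         CreateBucketConfiguration={'LocationConstraint': region})
    key = video_key(prefix, 'bank-demo-prem-ranga.mp4')
    s3.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"

# Example (requires AWS credentials; bucket name is a placeholder):
# uri = upload_sample_video('<your-s3-bucket>', 'chapter8',
#                           'media-content/bank-demo-prem-ranga.mp4')
```

The returned S3 URI matches the file_uri format we pass to Amazon Transcribe later in the notebook.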
We have now successfully completed the steps required to convert our sample video file into broadcast-enabled output files, which is required for us to insert ads into the video. In the next section, we will run a transcription of the audio content from our video, run topic modeling, create the VAST ad tag URL required for ad insertion, and show how we can perform content monetization.
Open the notebook you cloned from the GitHub repository (https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/blob/main/Chapter%2008/contextual-ad-marking-for-content-monetization-with-nlp-github.ipynb) in the Setting up to solve the use case section and execute the cells step by step, as follows:
Note
Please ensure you have executed the steps in the Technical requirements, Setting up to solve the use case, and Uploading the sample video and converting it for broadcast sections before you execute the cells in the notebook.
import boto3
import time
bucket = '<your-s3-bucket>'
prefix = 'chapter8'
s3 = boto3.client('s3')

def transcribe_file(job_name, file_uri, transcribe_client):
    transcribe_client.start_transcription_job(
        TranscriptionJobName=job_name,
        Media={'MediaFileUri': file_uri},
        MediaFormat='mp4',
        LanguageCode='en-US'
    )
job_name = 'media-monetization-transcribe'
transcribe_client = boto3.client('transcribe')
file_uri = 's3://'+bucket+'/'+prefix+'/'+'rawvideo/bank-demo-prem-ranga.mp4'
transcribe_file(job_name, file_uri, transcribe_client)
while True:
    job = transcribe_client.get_transcription_job(TranscriptionJobName=job_name)
    job_status = job['TranscriptionJob']['TranscriptionJobStatus']
    if job_status in ['COMPLETED', 'FAILED']:
        print(f"Job {job_name} is {job_status}.")
        if job_status == 'COMPLETED':
            print(f"Download the transcript from "
                  f"{job['TranscriptionJob']['Transcript']['TranscriptFileUri']}")
        break
    time.sleep(10)  # poll every 10 seconds until the job finishes
import pandas as pd
raw_df = pd.read_json(job['TranscriptionJob']['Transcript']['TranscriptFileUri'])
raw_df = pd.DataFrame(raw_df.at['transcripts','results'].copy())
raw_df.to_csv('topic-modeling/raw/transcript.csv', header=False, index=False)
import csv
import os
folderpath = r"topic-modeling/raw" # make sure to put the 'r' in front and provide the folder where your files are
filepaths = [os.path.join(folderpath, name) for name in os.listdir(folderpath) if not name.startswith('.')] # do not select hidden files
fnfull = "topic-modeling/job-input/transcript_formatted.csv"
for path in filepaths:
    print(path)
    with open(path, 'r') as f:
        content = f.read() # read the whole file
    lines = content.split('.') # a list of all sentences
    with open(fnfull, "w", encoding='utf-8') as ff:
        csv_writer = csv.writer(ff, delimiter=',', quotechar='"')
        for num, line in enumerate(lines): # write each sentence on its own row
            csv_writer.writerow([line])
s3.upload_file('topic-modeling/job-input/transcript_formatted.csv', bucket, prefix+'/topic-modeling/job-input/tm-input.csv')
Click the Launch Amazon Comprehend button.
Click on Analysis jobs in the left pane and click on Create job on the right, as follows:
Type a name for your analysis job and set the analysis type as Topic modeling from the built-in jobs list:
Provide the location of the CSV file (the transcript_formatted.csv file that we uploaded to S3 in preceding steps) in your S3 bucket in the Input Data section with the data source as My documents and the number of topics as 2, as shown:
Provide the Output data S3 location, as shown (you can use the same S3 bucket you used for input), and then type a name suffix and click on Create job.
You should see the job in a submitted status once the IAM role propagation is completed. After about 30 minutes, the job status should change to Completed. Now click on the job name, copy the S3 link provided in the Output data location field, and go back to your notebook. We will continue the steps in the notebook.
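As an aside, the same topic modeling job can be started programmatically instead of through the console. The following is a sketch using the boto3 Comprehend API; the job name is our choice, and the role ARN and bucket name are placeholders you must supply (the role needs access to the input and output S3 locations):

```python
def job_input_uri(bucket, prefix):
    """S3 URI of the folder holding the formatted transcript CSV."""
    return f's3://{bucket}/{prefix}/topic-modeling/job-input/'

def start_topic_modeling(bucket, prefix, data_access_role_arn, num_topics=2):
    """Start an Amazon Comprehend topics detection job.

    Sketch only: assumes AWS credentials and a valid data access role ARN.
    """
    import boto3
    comprehend = boto3.client('comprehend')
    response = comprehend.start_topics_detection_job(
        JobName='media-monetization-topics',  # name is our choice
        NumberOfTopics=num_topics,
        InputDataConfig={
            'S3Uri': job_input_uri(bucket, prefix),
            'InputFormat': 'ONE_DOC_PER_LINE'  # one sentence per line in our CSV
        },
        OutputDataConfig={
            'S3Uri': f's3://{bucket}/{prefix}/topic-modeling/'
        },
        DataAccessRoleArn=data_access_role_arn
    )
    return response['JobId']

# Example (requires AWS credentials; ARN is a placeholder):
# job_id = start_topic_modeling(bucket, prefix,
#                               'arn:aws:iam::<aws-account-nr>:role/<role-name>')
```

Whichever route you take, the job writes an output.tar.gz to the output S3 location, which we download in the next step.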
a.) To download the results of the topic modeling job, we need the output data location S3 URI that we copied in the previous step. In the first cell in this section of the notebook, replace the <aws-account-nr>-TOPICS-<long-hash-nr> placeholder in the tpprefix variable with the corresponding portion of the S3 URI you copied, as shown in the following code block.
Note
The output data location S3 URI you copied in the preceding step is s3://<your-s3-bucket>/chapter8/topic-modeling/<aws-account-nr>-TOPICS-<long-hash-nr>/output/output.tar.gz
b.) The revised code should look as follows and when executed will download the output.tar.gz file locally and extract it:
directory = "results"
parent_dir = os.getcwd()+'/topic-modeling'
path = os.path.join(parent_dir, directory)
os.makedirs(path, exist_ok=True)
print("Directory '%s' created successfully" % directory)
tpprefix = prefix+'/topic-modeling/<aws-account-nr>-TOPICS-<long-hash-nr>/output/output.tar.gz'
s3.download_file(bucket, tpprefix, 'topic-modeling/results/output.tar.gz')
!tar -xzvf topic-modeling/results/output.tar.gz
c.) Now, load each of the resulting CSV files to their own pandas DataFrames:
tt_df = pd.read_csv('topic-terms.csv')
dt_df = pd.read_csv('doc-topics.csv')
d.) The topic terms DataFrame contains the topic number, the term corresponding to the topic, and the weight this term contributes to the topic. Execute the code shown in the following code block to review the contents of the topic terms DataFrame:
for i, x in tt_df.iterrows():
    print(str(x['topic'])+":"+x['term']+":"+str(x['weight']))
e.) The doc-topics DataFrame may contain multiple topic entries for the same document, but for this solution, we are not interested in these duplicates, so we will drop them:
dt_df = dt_df.drop_duplicates(subset=['docname'])
f.) Let's now filter the topics such that we select the topic with the maximum weight distribution for the text it refers to:
ttdf_max = tt_df.groupby(['topic'], sort=False)['weight'].max()
g.) Load these into their own DataFrame and display it:
newtt_df = pd.DataFrame()
for x in ttdf_max:
    # DataFrame.append was removed in pandas 2.0; use pd.concat instead
    newtt_df = pd.concat([newtt_df, tt_df.query('weight == @x')])
newtt_df = newtt_df.reset_index(drop=True)
newtt_df
h.) We will select the content topic term as it has the highest weight and assign this to a variable for use in the subsequent steps:
adtopic = newtt_df.at[1,'term']
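The selection logic in steps f) and g) can also be captured in one small helper function. The following self-contained sketch uses a toy DataFrame whose topic and term values are made up for illustration (they are not the output of our Comprehend job):

```python
import pandas as pd

def top_term_per_topic(tt_df):
    """Return one row per topic: the term with the highest weight."""
    idx = tt_df.groupby('topic')['weight'].idxmax()
    return tt_df.loc[idx].reset_index(drop=True)

# Toy topic-terms data (illustrative values only)
tt_df = pd.DataFrame({
    'topic':  [0, 0, 1, 1],
    'term':   ['document', 'textract', 'bank', 'account'],
    'weight': [0.12, 0.08, 0.25, 0.10],
})
best = top_term_per_topic(tt_df)
print(best)
```

Using groupby with idxmax keeps the whole winning row (topic, term, and weight) in one step, which is equivalent to the max-then-query approach in the notebook.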
We will then substitute these in the VAST ad marker URL before creating the AWS Elemental MediaTailor configuration. Execute the code cells as shown in the following code block:
adindex_df = pd.read_csv('media-content/ad-index.csv', header=None, index_col=0)
adindex_df
a.) Please note this is from the sample ad-index.csv file that the authors created for this demo. When you use this solution for your use case, you will need to create a Google Ads account to get the cmsid and vid values. For more details, please see this link: https://support.google.com/admanager/topic/1184139?hl=en&ref_topic=7506089.
b.) Run the code in the following snippet to select the cmsid and vid values based on our topic:
advalue = adindex_df.loc[adtopic]
advalue
c.) We get the following response:
1 cmsid=176
2 vid=short_tencue
d.) Now we will create the ad server URL to use with AWS Elemental MediaTailor. Let's first copy the placeholder URL available in our GitHub repo (https://github.com/PacktPublishing/Natural-Language-Processing-with-AWS-AI-Services/blob/main/Chapter%2008/media-content/adserver.csv), which has pre-roll, mid-roll, and post-roll segments filled in:
ad_rawurl = pd.read_csv('media-content/adserver.csv', header=None).at[0,0].split('&')
ad_rawurl
e.) We get the following response:
['https://pubads.g.doubleclick.net/gampad/ads?sz=640x480',
'iu=/124319096/external/ad_rule_samples',
'ciu_szs=300x250',
'ad_rule=1',
'impl=s',
'gdfp_req=1',
'env=vp',
'output=vmap',
'unviewed_position_start=1',
'cust_params=deployment%3Ddevsite%26sample_ar%3Dpremidpost',
'cmsid=',
'vid=',
'correlator=[avail.random]']
f.) We will now replace the cmsid and vid values highlighted in the preceding response with the values corresponding to our topic and reformat the URL:
ad_formattedurl = ''
for x in ad_rawurl:
    if 'cmsid' in x:
        x = advalue[1]
    if 'vid' in x:
        x = advalue[2]
    ad_formattedurl += x + '&'
ad_formattedurl = ad_formattedurl.rstrip('&')
ad_formattedurl
ad_formattedurl
g.) We get the following response. Copy the contents of the following URL:
'https://pubads.g.doubleclick.net/gampad/ads?sz=640x480&iu=/124319096/external/ad_rule_samples&ciu_szs=300x250&ad_rule=1&impl=s&gdfp_req=1&env=vp&output=vmap&unviewed_position_start=1&cust_params=deployment%3Ddevsite%26sample_ar%3Dpremidpost&cmsid=176&vid=short_tencue&correlator=[avail.random]'
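The replacement logic in step f) can be wrapped in a reusable, testable helper. The following sketch is pure Python; the URL here is a shortened example with the same empty cmsid= and vid= parameters, not the full ad tag from the repository:

```python
def fill_ad_params(raw_url, cmsid, vid):
    """Replace the empty cmsid= and vid= parameters in an ad tag URL."""
    parts = []
    for p in raw_url.split('&'):
        if p.startswith('cmsid='):
            p = f'cmsid={cmsid}'
        elif p.startswith('vid='):
            p = f'vid={vid}'
        parts.append(p)
    return '&'.join(parts)

# Shortened example URL (illustrative; the real tag has more parameters)
url = ('https://ads.example.com/gampad/ads?sz=640x480&output=vmap'
       '&cmsid=&vid=&correlator=[avail.random]')
print(fill_ad_params(url, '176', 'short_tencue'))
```

Matching on the parameter prefix (cmsid=, vid=) rather than a substring avoids accidental matches in other parameter values.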
Alright, that brings us to the end of this section. We successfully transcribed our sample video file using Amazon Transcribe, ran an Amazon Comprehend Topic Modeling job on the transcript, selected a topic, and stitched together an ad server VAST tag URL with the ad content ID corresponding to the topic. In the next section, we will use AWS Elemental MediaTailor to create new video output with the ad segments inserted, and we will test it by playing the video.
Before we can proceed, we need to create an Amazon CloudFront (https://aws.amazon.com/cloudfront/) content delivery distribution for the video output files we transcoded with AWS Elemental MediaConvert in the Uploading the sample video and converting it for broadcast section.
Amazon CloudFront is a managed content delivery network that can be used for site hosting, APIs, and image, media, and video file delivery, with live or on-demand streaming formats, configured for global distribution or based on the selected price class. Please follow the next steps to set up the CloudFront distribution for your transcoded video files:
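For reference, the AWS Elemental MediaTailor configuration we set up through the console in this section can also be created with the boto3 API. The following is a sketch; the configuration name, CloudFront domain, and ad decision server URL are placeholders you must supply, and running it requires AWS credentials:

```python
def playback_url(manifest_prefix, path):
    """Join a MediaTailor manifest endpoint prefix with a manifest path."""
    return manifest_prefix.rstrip('/') + '/' + path.lstrip('/')

def create_mediatailor_config(name, video_source_url, ad_decision_server_url):
    """Create (or update) an AWS Elemental MediaTailor playback configuration.

    Sketch only: assumes AWS credentials are configured for boto3.
    """
    import boto3
    mediatailor = boto3.client('mediatailor')
    response = mediatailor.put_playback_configuration(
        Name=name,
        VideoContentSourceUrl=video_source_url,      # CloudFront path to the HLS output
        AdDecisionServerUrl=ad_decision_server_url   # the VAST ad tag URL built earlier
    )
    # The HLS manifest endpoint prefix is what a player uses for playback
    return response.get('HLSConfiguration', {}).get('ManifestEndpointPrefix')

# Example (requires AWS credentials; names and domain are placeholders):
# manifest_prefix = create_mediatailor_config(
#     'media-monetization',
#     'https://<cloudfront-domain>/bankdemo',
#     ad_formattedurl)
```

Appending the HLS manifest filename to the returned prefix with playback_url gives the URL you would load into a video player to watch the ad-stitched stream.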
And that concludes the solution build for this chapter. Please refer to the Further reading section for more details on media content monetization with AWS AI and media services.
In this chapter, we learned how to build an intelligent solution for media content monetization from a sample MP4 video file, using the AWS AI services Amazon Transcribe and Amazon Comprehend, the AWS media services Elemental MediaConvert and Elemental MediaTailor, and the Amazon CloudFront content delivery network. We first transcoded the video into Apple HLS output files using MediaConvert, created a transcription from the MP4 file using Amazon Transcribe, analyzed the transcript and detected topics using Amazon Comprehend topic modeling, and created a VAST ad decision server URL. We then created a distribution for the transcoded video content using Amazon CloudFront and used this distribution along with the ad decision server URL to insert ads into the transcoded video using MediaTailor.
For our solution, we started by introducing the content monetization use case for LiveRight: the requirement for a cost-effective expansion in which the content pays for its own creation. We then designed an architecture that used AWS AI services, media services, and the content delivery network to assemble an end-to-end walk-through of how to monetize content in video files. We assumed that you, the reader, are the architect assigned to this project, and we reviewed an overview of the solution components along with an architectural illustration in Figure 8.1.
We then went through the prerequisites for the solution build, set up an Amazon SageMaker notebook instance, cloned our GitHub repository, and started executing the steps using the AWS Management Console and the code in the notebook based on instructions from this chapter.
In the next chapter, we will look at an important use case, metadata extraction, using named entity recognition. We will, as before, introduce the use case, discuss how to design the architecture, establish the prerequisites, and walk through in detail the various steps required to build the solution.