Using AWS Glue and Amazon Athena

In this section, we will use AWS Glue to create a crawler, an ETL job, and a job that runs the KMeans clustering algorithm on the input data.

We use a publicly available dataset on students' knowledge status of a subject. The dataset and the field descriptions are available for download from the UCI site: https://archive.ics.uci.edu/ml/datasets/User+Knowledge+Modeling

  1. Log in to the AWS Management Console and go to the Glue console. Click on the Add crawler button.
  2. Specify the Crawler name as User Modeling Data Crawler as shown here. Click on the Next button:
  3. In the Add a data store screen, select S3 as the Data store, and select the Specified path in my account option. Specify the path for the S3 bucket containing the input data. Click on the Next button:
  4. Select No on the Add another data store screen and click on the Next button.
  5. Next, we create an IAM role for the crawler. On the IAM console, select the Glue service and click on the Next: Permissions button:
  6. Next, we attach the appropriate policies to the role. Filter for the AWSGlueServiceRole policy and select it:
  7. Filter for the AmazonS3FullAccess policy and select it:
  8. Review the role for the Glue service and click on the Create role button:
  9. Return to the crawler wizard and select the newly created IAM role. Click on the Next button:
  10. Select the Frequency parameter as Run on demand and click on the Next button.
  11. Next, we configure a database for the crawler's output. As we do not have a database configured, click on the Add database button.
  12. Specify the name of the database as usermodelingdb and click on the Create button:
  13. Specify a prefix for the tables (users) and click on the Next button:
  14. Review all the details for the crawler and click on the Finish button.
  15. After the crawler is created, you should see the message shown here. Click on the Run it now? hyperlink to execute the crawler:
  16. After the crawler finishes running, click on the Tables link to see the tables listed. Click on the newly created table (usersinput). You should see the following details. Notice that the header line is skipped and 553 records from our input file are inserted. Additionally, verify that the inferred schema is accurate:
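The same crawler setup can also be scripted. The following is a minimal boto3 sketch of the steps above; the input S3 path is a placeholder for your own bucket, and the role name assumes the AWSGlueServiceRole-GlueRole created in the IAM steps:

import boto3

glue = boto3.client('glue')

# Create the database that will hold the crawler's output tables
glue.create_database(DatabaseInput={'Name': 'usermodelingdb'})

# Create the crawler with the same settings chosen in the console wizard
# (the S3 path below is a placeholder for the bucket containing the input data)
glue.create_crawler(
    Name='User Modeling Data Crawler',
    Role='AWSGlueServiceRole-GlueRole',
    DatabaseName='usermodelingdb',
    TablePrefix='users',
    Targets={'S3Targets': [{'Path': 's3://your-input-bucket/user-modeling-data/'}]})

# Run the crawler on demand, as the Run it now? link does
glue.start_crawler(Name='User Modeling Data Crawler')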
  17. Next, we create an ETL job using Glue. Click on the Jobs link and then click on the Add job button.
  18. Next, we specify the properties of the ETL job: the Name (User Modeling ETL Job), the IAM role (select the AWSGlueServiceRole-GlueRole we created earlier), the option for using the proposed script generated by AWS Glue, Python as the language, a name for the script file, and the S3 locations for the script and the temporary directory (create these folders from the S3 console). Click on the Next button:
  19. Choose the usersinput table as the data source and click on the Next button.
  20. Select Amazon S3 as the Data store, Parquet as the Format, and the S3 Target path (create the S3 folder from the S3 console). Click on the Next button:
  21. On this screen, you can map the source columns to the target columns. Additionally, you can drop columns in the target if they are not relevant to your analysis. We will leave the mapping as-is, as we want all of the input data in the target Parquet file. Click on the Next button:
  22. Review the Job properties and click on the Finish button.
  23. You should see the code generated by Glue. In the left pane, a diagram shows the flow visually. We can make changes to the generated script, if required; a sketch of what the generated script typically looks like follows this step. Click on the Run job button in the top menu:
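For reference, the script Glue proposes for a catalog-to-Parquet job generally follows the pattern sketched below. This is not the exact generated code: the column types in the mappings are assumptions based on the crawler's inferred schema, and the output path is a placeholder for the S3 Target path chosen earlier.

import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the source table from the Glue Data Catalog
datasource0 = glueContext.create_dynamic_frame.from_catalog(
    database="usermodelingdb", table_name="usersinput")

# Map source columns to target columns (filled in from the mapping screen;
# the types shown here are assumptions based on the inferred schema)
applymapping1 = ApplyMapping.apply(
    frame=datasource0,
    mappings=[("stg", "double", "stg", "double"),
              ("scg", "double", "scg", "double"),
              ("str", "double", "str", "double"),
              ("lpr", "double", "lpr", "double"),
              ("peg", "double", "peg", "double"),
              ("uns", "string", "uns", "string")])

# Write the mapped data to the S3 target path as Parquet (placeholder path)
glueContext.write_dynamic_frame.from_options(
    frame=applymapping1,
    connection_type="s3",
    connection_options={"path": "s3://gluesparkemrtestdata/datasets/parquet/"},
    format="parquet")

job.commit()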
  24. Click on the Run job button. We do not have any further parameters to specify for our example:
  25. You should see the Run job button disabled as the job has started:
  26. Switch to the Jobs console to see your submitted job listed. Select the job to expand the display. You should see the Run status as Running. Upon completion of the job, the status changes to Succeeded.
  27. Go to the S3 console to confirm that the Parquet files have been created in the target folder:
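The run status and the output files can also be checked from code. The sketch below uses boto3; the bucket name and prefix are placeholders standing in for the target folder selected earlier.

import boto3

glue = boto3.client('glue')
s3 = boto3.client('s3')

# Check the state of the runs for the ETL job (RUNNING, SUCCEEDED, FAILED, ...)
for run in glue.get_job_runs(JobName='User Modeling ETL Job')['JobRuns']:
    print(run['Id'], run['JobRunState'])

# List the Parquet files written to the target folder (placeholder bucket/prefix)
listing = s3.list_objects_v2(Bucket='gluesparkemrtestdata', Prefix='datasets/parquet/')
for obj in listing.get('Contents', []):
    print(obj['Key'])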
  28. Next, we use Athena to check the contents of the source table. Start the Athena console and click on the Query Editor link.
  29. Select the database and click on the three dots next to the source table to display the menu. Click on the Preview table option:
  30. The SQL for querying the contents of the table is executed and the rows are displayed as shown in the following screenshot:
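The same preview can be submitted programmatically with the Athena API. The following sketch mirrors the query that Preview table generates; the OutputLocation bucket is a placeholder, as Athena needs an S3 location for its query results.

import time
import boto3

athena = boto3.client('athena')

# Submit the preview query (placeholder OutputLocation)
execution = athena.start_query_execution(
    QueryString='SELECT * FROM "usermodelingdb"."usersinput" LIMIT 10;',
    QueryExecutionContext={'Database': 'usermodelingdb'},
    ResultConfiguration={'OutputLocation': 's3://your-athena-results-bucket/'})
query_id = execution['QueryExecutionId']

# Wait for the query to finish, then print the returned rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state not in ('QUEUED', 'RUNNING'):
        break
    time.sleep(1)

for row in athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']:
    print([field.get('VarCharValue') for field in row['Data']])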
  31. Create a Python-Spark-Glue script file (testscript.py) containing the following code:
import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import SelectFields
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
destination = "s3://gluesparkemrtestdata/datasets/parquet/results/"
namespace = "usermodelingdb"
tablename = "usersinput"

sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

#Load table and select fields
datasource0 = glueContext.create_dynamic_frame.from_catalog(name_space = namespace, table_name = tablename)
SelectFields0 = SelectFields.apply(frame = datasource0, paths=["stg","scg","str","lpr","peg","uns" ])
DataFrame0 = DynamicFrame.toDF(SelectFields0)

#Select features and convert to Spark ML required format
features = ["stg","scg","str","lpr","peg"]
assembler = VectorAssembler(inputCols=features,outputCol='features')
assembled_df = assembler.transform(DataFrame0)

#Fit and Run Kmeans
kmeans = KMeans(k=2, seed=1)
model = kmeans.fit(assembled_df)
transformed = model.transform(assembled_df)

#Save data to destination
transformed.write.mode('overwrite').parquet(destination)
job.commit()
  32. Upload the script file to the S3 script folder using the S3 console:
  33. Create a new job and specify the Name as Spark KMeans Job. This time, select the An existing script that you provide option. Furthermore, select Python as the language, and specify the S3 paths for the testscript.py file and the temporary directory. Click on Next:
  34. Click on the Next button on the Connections screen.
  35. Review the Job properties and click on the Finish button:
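As with the earlier job, this job can also be defined and started from code. The sketch below uses boto3; the script and temporary-directory paths are placeholders for the S3 locations specified in the wizard, and the role name assumes the AWSGlueServiceRole-GlueRole created earlier.

import boto3

glue = boto3.client('glue')

# Define the job around the uploaded script (placeholder S3 paths)
glue.create_job(
    Name='Spark KMeans Job',
    Role='AWSGlueServiceRole-GlueRole',
    Command={'Name': 'glueetl',
             'ScriptLocation': 's3://gluesparkemrtestdata/scripts/testscript.py'},
    DefaultArguments={'--TempDir': 's3://gluesparkemrtestdata/temp/'})

# Start a run of the job, as the Run job button does
glue.start_job_run(JobName='Spark KMeans Job')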
  36. Follow the steps for running the job as shown in the previous steps of this exercise.
  37. After the job has completed successfully, check the S3 results folder to confirm that the output files are present:
  38. Next, we will create another crawler to parse the results files. Follow the steps listed for creating a crawler as shown earlier in this exercise. We display the final review screen for the crawler for reference purposes:
  39. After the crawler finishes processing, you should see another table added to your database (usersresults).
  40. Verify the contents of the newly created table (usersresults) using Athena, as shown previously. Note that a new column called prediction has been added by the KMeans algorithm.
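To see how the records are split between the two clusters, you can run a quick aggregation against the results table from the Athena Query Editor, or submit it with boto3 using the same pattern as before. The query below is a sketch; it assumes the crawler registered the table as usersresults, and the OutputLocation is again a placeholder.

import boto3

athena = boto3.client('athena')

# Count the records assigned to each of the two KMeans clusters
athena.start_query_execution(
    QueryString='SELECT prediction, COUNT(*) AS num_records '
                'FROM "usermodelingdb"."usersresults" GROUP BY prediction;',
    QueryExecutionContext={'Database': 'usermodelingdb'},
    ResultConfiguration={'OutputLocation': 's3://your-athena-results-bucket/'})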