Best practices for distributed machine learning and predictive analytics

Enterprise predictive analytics applications need to be highly scalable and must support agility in feature development and deployment. A plethora of open source tools and solutions is available; however, using managed services eliminates infrastructure-related heavy lifting and increases agility.

Such applications use a variety of supporting AWS services, including the following:

  • Amazon S3: A scalable, elastic object store that can hold anything, with high durability, effectively unlimited inbound bandwidth, and very low cost; it serves as the data layer for virtually all AWS services. Amazon S3 is a great choice for a cloud-based data lake because of its durability (designed for eleven nines, 99.999999999%, of durability), availability (designed for 99.99% availability), high performance (features such as multipart upload and Range GET), ease of use (simple REST APIs, AWS SDKs, read-after-write consistency for new objects, event notifications, and lifecycle policies), and scalability (store as much data as you need and scale storage and compute independently, with no minimum usage commitments). It is also integrated with other services such as Amazon EMR, Amazon Redshift, and Amazon DynamoDB; a short boto3 sketch of basic S3 usage follows this list.
  • AWS Glue: This provides a managed ETL engine and a job scheduler. It is built on Apache Spark and integrates with Amazon S3, Amazon RDS, Amazon Redshift, and any JDBC-accessible data store; a sketch of triggering a Glue job appears after this list.
  • Glue Data Catalog: This manages table metadata and exposes it through a Hive Metastore-compatible API and Hive SQL, so it can be used by tools such as Hive, Presto, and Spark. Beyond the core metastore, it adds functionality such as metadata search for data discovery, connection information (JDBC URLs and credentials), classification for identifying and parsing files, and versioning of table metadata as schemas evolve and other metadata is updated. You can populate the catalog using Hive DDL, bulk import, or automatically through crawlers.
  • Glue Crawlers: Crawlers auto-populate the Data Catalog. They come with built-in classifiers that detect a file's type and extract its record schema (record structures and data types). They can also auto-detect Hive-style partitions and group similar files into one table. You can run crawlers on a schedule, and because the service is serverless, you pay only while a crawl is actually running; a sketch of creating and running a crawler, then reading the resulting catalog entries, follows this list.
  • Amazon EMR: This provides scalable Hadoop clusters as a service, with distributions of Hadoop, Hive, Spark, Presto, HBase, and more. It is a fully managed service that is easy to use. You can run clusters on On-Demand, Reserved, or Spot Instances, and choose among HDFS, Amazon S3 (via EMRFS), and Amazon EBS for storage. It is integrated with AWS Glue, and you can configure end-to-end security for data in flight and data at rest; a sketch of launching a cluster follows this list.
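
The following is a minimal boto3 sketch of the S3 usage mentioned in the Amazon S3 item: uploading an object and reading part of it back with a Range GET. The bucket and key names are placeholders chosen for illustration.

import boto3

# Create an S3 client; credentials and region come from the standard
# AWS configuration chain.
s3 = boto3.client("s3")

# Upload a local file; boto3 transparently switches to multipart upload
# for large objects.
s3.upload_file("events.csv", "my-datalake-bucket", "raw/events/2018/01/events.csv")

# Range GET: fetch only the first 1 KB of the object.
resp = s3.get_object(
    Bucket="my-datalake-bucket",
    Key="raw/events/2018/01/events.csv",
    Range="bytes=0-1023",
)
print(resp["Body"].read().decode("utf-8"))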
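
The next sketch shows one way an existing AWS Glue ETL job might be triggered and monitored from code. It assumes a job has already been defined in Glue; the job name and job argument are placeholders.

import boto3

glue = boto3.client("glue")

# Start an existing Glue ETL job, passing a job argument, then poll its state.
run = glue.start_job_run(
    JobName="my-etl-job",
    Arguments={"--source_path": "s3://my-datalake-bucket/raw/"},
)
status = glue.get_job_run(JobName="my-etl-job", RunId=run["JobRunId"])
print(status["JobRun"]["JobRunState"])  # for example RUNNING, SUCCEEDED, or FAILED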
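
The following sketch creates a Glue crawler over an S3 prefix, runs it, and then lists the tables it writes into the Data Catalog. The crawler name, IAM role, database name, S3 path, and schedule are all placeholder values.

import boto3

glue = boto3.client("glue")

# Define a crawler that scans an S3 prefix and writes table definitions
# (schema, partitions, classification) into a Data Catalog database.
glue.create_crawler(
    Name="events-crawler",
    Role="GlueServiceRole",
    DatabaseName="datalake_db",
    Targets={"S3Targets": [{"Path": "s3://my-datalake-bucket/raw/events/"}]},
    Schedule="cron(0 2 * * ? *)",  # run daily at 02:00 UTC
)
glue.start_crawler(Name="events-crawler")

# After the crawl completes, the inferred tables can be read from the catalog.
tables = glue.get_tables(DatabaseName="datalake_db")
for table in tables["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])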
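
Finally, the sketch below launches a small EMR cluster with Spark and Hive installed, as described in the Amazon EMR item. The cluster name, release label, instance types and counts, log location, and IAM role names are placeholders and should be adapted to your account.

import boto3

emr = boto3.client("emr")

# Launch a three-node cluster that stays alive after any steps complete,
# writing its logs to S3.
response = emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-5.20.0",
    Applications=[{"Name": "Spark"}, {"Name": "Hive"}],
    LogUri="s3://my-datalake-bucket/emr-logs/",
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)
print(response["JobFlowId"])  # the new cluster's ID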