Understanding Amazon SageMaker algorithms and features 

In this section, we present some of the key algorithm design choices and features in Amazon SageMaker that make it an excellent option for your machine learning applications. The following are some of its key features:

  • Based on streaming: Amazon SageMaker algorithms are all streaming algorithms: each data point is seen only once. As data streams in, the algorithm maintains a state of fixed size, so the underlying data structure does not grow no matter how much data you have streamed. Hence, the memory footprint of the algorithm is fixed, and the run time (and therefore the cost) stays linear in the size of the data.
  • Support for incremental training: These algorithms are designed to support incremental training. Suppose you train on two days' worth of data at a time: without incremental training, you would train on the data from days 1 and 2, and later retrain on the data from days 2 and 3. With these algorithms, you can instead serialize and persist the state after processing days 1 and 2; when day 3 data becomes available, you deserialize that state (which puts you exactly at the end of day 2) and then process only the day 3 data. You save on compute because you don't have to retrain on day 2 data. Another significant advantage is that you no longer have to choose how far back to go when assembling your training set: in this example, you effectively end up training on all of the data from days 1, 2, and 3. So, overall, training becomes faster, cheaper, and more accurate (the first sketch after this list illustrates this point and the previous one).
  • Support for GPUs: All of the algorithms run on both CPUs and GPUs.
  • Distributed architecture: The algorithms are linearly scalable: if each of three nodes holds one third of the data, training runs in roughly one third of the time. Because each node's local state may differ once training completes, there is also a shared global state that all of the nodes synchronize their local states with (the second sketch after this list illustrates this).
  • Support for efficient model selection: Amazon Kinesis streams can be consumed as input streams. Because the trained state is persisted, you can explore different models after training, and not only before it: for model selection, hyperparameters can be tuned against the saved state to generate multiple models, instead of retraining on the same data (the third sketch after this list illustrates this).
  • Support for abstraction and containerization: You can build on a desktop (using a CPU) and then deploy in a distributed GPU environment. These features also give you superior production readiness: the same deployed solution runs unchanged on 1 GB or 1 TB of data.
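
To make the first two points concrete, the following is a minimal sketch, in plain Python, of a streaming algorithm with a fixed-size state that supports incremental training. It uses Welford's algorithm for a running mean and variance as a stand-in for a real SageMaker algorithm, and pickle for persistence; both are illustrative assumptions, not SageMaker's internal mechanism:

```python
import pickle

class StreamingMeanVar:
    """Fixed-size state: three numbers, however many points are streamed."""
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of squared deviations (Welford's algorithm)

    def update(self, x):
        # Each data point is seen exactly once; the state never grows.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / self.n if self.n > 0 else 0.0

# Days 1 and 2: stream the data, then persist the fixed-size state.
model = StreamingMeanVar()
for x in [1.0, 2.0, 3.0, 4.0]:       # days 1-2 data
    model.update(x)
state = pickle.dumps(model)          # serialize and persist the state

# Day 3: restore the state and continue; days 1-2 are never reprocessed.
model = pickle.loads(state)
for x in [5.0, 6.0]:                 # day 3 data only
    model.update(x)
print(model.mean, model.variance())  # as if trained on all three days
```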
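
The local/global state synchronization can be illustrated with mergeable states. The following toy sketch (our own simplification, not SageMaker's actual implementation) shows why splitting the data across nodes preserves the result:

```python
from functools import reduce

# Each node builds a small local state on its own shard. Because the
# states are mergeable, the merged global state matches what a single
# node would have computed over all of the data.
def local_state(shard):
    return (len(shard), sum(shard))   # fixed-size (count, sum) state

def merge(a, b):
    return (a[0] + b[0], a[1] + b[1])

shards = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]    # data split across 3 nodes
count, total = reduce(merge, (local_state(s) for s in shards))
print("global mean:", total / count)             # same result as on one node
```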
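
To see how hyperparameters can be tuned from a saved state without retraining, consider ridge regression: the fixed-size sufficient statistics X^T X and X^T y are accumulated in a single streaming pass, and any number of regularization values can then be tried against that state alone. This is a generic illustration of the idea, not SageMaker's internal approach:

```python
import numpy as np

d = 3
XtX = np.zeros((d, d))   # fixed-size state: sufficient statistics
Xty = np.zeros(d)

rng = np.random.default_rng(0)
for _ in range(1000):                # one streaming pass over (x, y) pairs
    x = rng.normal(size=d)
    y = x @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1)
    XtX += np.outer(x, x)
    Xty += x * y

# Post-training model selection: solve for several regularization values
# using only the state; no second pass over the data is needed.
for lam in [0.01, 0.1, 1.0, 10.0]:
    w = np.linalg.solve(XtX + lam * np.eye(d), Xty)
    print(lam, np.round(w, 3))
```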

Amazon SageMaker algorithms can be used from the command line (by specifying the algorithm, the input data, and the hardware), and also from Amazon SageMaker notebooks. You can also deploy directly from the notebook itself, as shown in Chapter 9, Implementing a Big Data Application.
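
As an example, the following minimal sketch uses the SageMaker Python SDK to train a built-in algorithm and deploy it to an endpoint directly from a notebook. The role ARN, S3 paths, and hyperparameter values are placeholders that you would replace with your own:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name

# Placeholder execution role and S3 locations; substitute your own.
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"
image = image_uris.retrieve("linear-learner", region)  # a built-in algorithm

estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=1,                     # scale out by raising this count
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(predictor_type="regressor", mini_batch_size=100)

# Specify the input data and start training.
estimator.fit({"train": "s3://my-bucket/train"})  # placeholder training data

# Deploy the trained model to a real-time endpoint from the notebook.
predictor = estimator.deploy(initial_instance_count=1,
                             instance_type="ml.m5.large")
```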
