Data analysts, scientists, and software engineers face a serious challenge: the explosive growth in the amount of data required to build reliable models. After all, how valuable is a data mining application if its model does not scale?
The challenge of big data is addressed through a two-facet strategy: improving the efficiency of existing data mining and machine learning solutions, and leveraging scalable infrastructure (frameworks, programming languages, GPUs, and so on).
This chapter covers the Scala parallel collections, the Actor model, and the Akka framework. The next chapter introduces the Apache Spark framework and its collection of machine learning algorithms.
The following are the topics addressed in this chapter:
Support for distributed and concurrent processing is provided by a stack of frameworks and libraries. The Scala concurrent and parallel collection classes leverage the threading capabilities of the Java Virtual Machine. Akka.io implements a reliable Actor model that was originally introduced as part of the Scala standard library. The Akka framework supports remote Actors, routing, and load-balancing protocols; dispatchers, clusters, events, and configurable mailbox management; and different transport modes, supervisory strategies, and typed Actors.
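To make the Actor idea concrete, here is a minimal sketch written in plain Scala on top of a JVM executor. It is not Akka's API: the names `MiniActor`, `MiniActorDemo`, and `run` are hypothetical and chosen for illustration only. The sketch assumes that the essence of an actor is a mailbox whose messages are processed one at a time, so the message handler never runs concurrently with itself.

```scala
import java.util.concurrent.{Executors, TimeUnit}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical minimal actor (not the Akka API): a single-threaded executor
// plays the role of the mailbox, so messages are handled sequentially in the
// order in which they were sent.
final class MiniActor[M](handler: M => Unit) {
  private val mailbox = Executors.newSingleThreadExecutor()

  // Asynchronous, fire-and-forget send: enqueue the message and return.
  def !(msg: M): Unit =
    mailbox.execute(new Runnable { def run(): Unit = handler(msg) })

  // Stop accepting messages and wait for the queued ones to be processed.
  def shutdown(): Boolean = {
    mailbox.shutdown()
    mailbox.awaitTermination(5, TimeUnit.SECONDS)
  }
}

object MiniActorDemo {
  def run(): Int = {
    // Only the actor's thread updates the total; AtomicInteger is used
    // solely so that the calling thread can read the result safely.
    val total = new AtomicInteger(0)
    val counter = new MiniActor[Int](n => total.addAndGet(n))
    (1 to 100).foreach(counter ! _)
    counter.shutdown()
    total.get()
  }
}
```

Calling `MiniActorDemo.run()` sends the integers 1 to 100 to the actor and returns their sum. A production framework such as Akka layers supervision, routing, remoting, and configurable mailboxes on top of this basic message-passing idea.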
The following stack representation illustrates the interdependencies between frameworks:
The next chapter introduces the Apache Spark framework.
Each layer adds new functionality to the previous one to increase scalability. The Java Virtual Machine (JVM) runs as a process within a single host. The Scala concurrent classes support effective deployment of an application by leveraging multicore CPU capabilities without the need to write multithreaded code. Akka extends the Actor paradigm to clusters with advanced messaging and routing options. Finally, Apache Spark leverages Scala higher-order collection methods and the Akka implementation of the Actor model to provide large-scale data processing systems with better performance and reliability, through its resilient distributed datasets and in-memory persistence.
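As an illustration of using multiple cores without managing threads explicitly, the following sketch splits a computation into chunks and evaluates each chunk in a `Future` from the Scala standard library. The workload (a sum of squares) and the names `ConcurrentSum`, `sumOfSquares`, and `numChunks` are assumptions made for this example; the global execution context and the 10-second timeout are arbitrary choices as well.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object ConcurrentSum {
  // Hypothetical workload: sum of squares, computed chunk by chunk.
  // Each chunk is submitted as a Future to the global thread pool, which
  // schedules the work on the available cores.
  def sumOfSquares(data: Vector[Double], numChunks: Int): Double = {
    require(numChunks > 0, "numChunks must be positive")
    val chunkSize = math.max(1, math.ceil(data.size.toDouble / numChunks).toInt)

    // One Future per chunk; no thread is created or joined explicitly.
    val partials: Vector[Future[Double]] =
      data.grouped(chunkSize).toVector.map { chunk =>
        Future(chunk.map(x => x * x).sum)
      }

    // Combine the partial results once all chunks have completed.
    val total: Future[Double] = Future.sequence(partials).map(_.sum)
    Await.result(total, 10.seconds)
  }
}

object ConcurrentSumApp extends App {
  val data = Vector.tabulate(100)(i => (i + 1).toDouble)
  println(ConcurrentSum.sumOfSquares(data, 4)) // prints 338350.0
}
```

Blocking with `Await.result` keeps the example short; in an application, the resulting `Future` would more typically be composed with further asynchronous steps.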