Spark ecosystem in brief

Spark jobs can run on top of Hadoop YARN or Mesos clusters, which gives you access to more advanced big data processing capabilities. The core APIs in Spark, which is written in Scala, let you develop Spark applications in several programming languages: Java, Scala, Python, and R. Spark also provides several libraries, which are part of the Spark ecosystem, that add capabilities for general-purpose data processing and analytics, graph processing, large-scale structured SQL, and machine learning (ML). The Spark ecosystem consists of the following components:

Figure 1: Spark ecosystem (up to Spark 2.1.0)

The core engine of Spark is written in Scala, but you can develop Spark applications in R, Java, Python, or Scala. The main components/APIs built on top of the Spark core engine are as follows:

  1. Spark SQL: This lets you seamlessly mix SQL queries with Spark programs so that you can query structured data inside Spark programs.
  2. Spark Streaming: This is for developing large-scale streaming applications, and it integrates Spark seamlessly with streaming data sources such as Kafka, Flume, and Twitter.
  3. Spark MLlib and Spark ML: These are for RDD-based and Dataset/DataFrame-based machine learning and pipeline creation, respectively.
  4. GraphX: This is for large-scale graph computation and graph-parallel processing over data modeled as connected vertices and edges.
  5. SparkR: R on Spark helps in basic statistical computations and machine learning.
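To make the first item in the list more concrete, here is a minimal PySpark sketch of mixing a SQL query with ordinary DataFrame code. It is only an illustration under some assumptions: PySpark 2.x or later and a local JVM are available, and the view name `people`, the sample rows, the constant `ADULTS_QUERY`, and the function `run_example` are all made up for this example.

```python
# Hypothetical query text; the view name "people" is an assumption of this sketch.
ADULTS_QUERY = "SELECT name, age FROM people WHERE age >= 21"


def run_example():
    """Run a SQL query over a DataFrame and post-process it with the DataFrame API."""
    # Imported lazily so this module can be loaded even where PySpark is absent.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .master("local[*]")          # run locally, using all available cores
        .appName("spark-sql-sketch")
        .getOrCreate()
    )
    try:
        # Made-up sample data: (name, age) pairs.
        people = spark.createDataFrame(
            [("Alice", 34), ("Bob", 19), ("Carol", 45)],
            ["name", "age"],
        )
        # Register the DataFrame so that SQL can refer to it by name.
        people.createOrReplaceTempView("people")

        # SQL part: filter with a query string.
        adults = spark.sql(ADULTS_QUERY)

        # Program part: keep working on the SQL result with the DataFrame API.
        return [row["name"] for row in adults.orderBy("age").collect()]
    finally:
        spark.stop()
```

Calling `run_example()` on a machine with PySpark installed starts a local Spark session, runs the query, and returns the names of the rows that pass the filter, sorted by age. The point is that the result of `spark.sql` is itself a DataFrame, so SQL and program code can be interleaved freely.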

As we have already stated, these APIs can be combined seamlessly to develop large-scale machine learning and data analytics applications. Moreover, Spark jobs can be submitted and executed through cluster managers such as Hadoop YARN, Mesos, and Spark's own standalone manager, or in the cloud, accessing data storage and sources such as HDFS, Cassandra, HBase, Amazon S3, or even an RDBMS. However, to use the full facility of Spark, we need to deploy our Spark application on a computing cluster.
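The choice of cluster manager shows up mainly in the `--master` option passed to `spark-submit`. The following sketch builds such a command line for the managers mentioned above. It is a small illustration, not a deployment tool: the helper names `master_url` and `spark_submit_command`, the host names, and the application file `app.py` are hypothetical, and the ports 7077 and 5050 are assumed only because they are the usual defaults for standalone and Mesos masters.

```python
from typing import Dict, List, Optional

# Usual default ports for each manager that needs a host:port master URL.
_SCHEMES = {"standalone": ("spark", 7077), "mesos": ("mesos", 5050)}


def master_url(manager: str, host: Optional[str] = None,
               port: Optional[int] = None) -> str:
    """Return the --master value for a given cluster manager."""
    if manager == "yarn":
        # For YARN the cluster location comes from the Hadoop configuration.
        return "yarn"
    if manager == "local":
        # No cluster at all: run in-process on all local cores.
        return "local[*]"
    if manager not in _SCHEMES:
        raise ValueError(f"unknown cluster manager: {manager!r}")
    if host is None:
        raise ValueError(f"{manager} requires a master host")
    scheme, default_port = _SCHEMES[manager]
    return f"{scheme}://{host}:{port or default_port}"


def spark_submit_command(app: str, manager: str,
                         host: Optional[str] = None,
                         port: Optional[int] = None,
                         deploy_mode: str = "client",
                         conf: Optional[Dict[str, str]] = None) -> List[str]:
    """Build the argument list for spark-submit (it does not run anything)."""
    cmd = ["spark-submit",
           "--master", master_url(manager, host, port),
           "--deploy-mode", deploy_mode]
    # Sort the settings so the generated command is deterministic.
    for key, value in sorted((conf or {}).items()):
        cmd += ["--conf", f"{key}={value}"]
    cmd.append(app)
    return cmd
```

For example, `spark_submit_command("app.py", "standalone", host="master.example.com")` yields a command whose master is `spark://master.example.com:7077`, while passing `"yarn"` needs no host at all. The application itself stays the same; only the submission arguments change.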
