Architecture of Spark SQL

Spark SQL is a library on top of the Spark core execution engine, as shown in Figure 4.2. It exposes SQL interfaces over JDBC/ODBC for data warehousing applications and through a command-line console for executing queries interactively, so any Business Intelligence (BI) tool can connect to Spark SQL and perform analytics at memory speeds. It also exposes the Dataset API, supported in Java and Scala, and the DataFrame API, supported in Java, Scala, Python, and R. Spark SQL users can use the Data Sources API to read and write data from and to a variety of sources, creating a DataFrame or a Dataset. Figure 4.2 also shows the traditional way of creating and operating on RDDs directly from the programming languages against the Spark core engine.


Figure 4.2: Spark SQL architecture
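
To make this concrete, here is a minimal sketch of the path just described, assuming a local SparkSession and a hypothetical /tmp/people.json file with name and age fields: the Data Sources API reads the file into a DataFrame, which is then queried through the SQL interface.

import org.apache.spark.sql.SparkSession

object SparkSqlQuickstart {
  def main(args: Array[String]): Unit = {
    // Entry point for Spark SQL functionality.
    val spark = SparkSession.builder()
      .appName("SparkSqlQuickstart")
      .master("local[*]")
      .getOrCreate()

    // Read JSON through the Data Sources API to create a DataFrame;
    // the path and schema (name, age) are hypothetical.
    val people = spark.read.json("/tmp/people.json")

    // Expose the DataFrame to the SQL interface and query it.
    people.createOrReplaceTempView("people")
    spark.sql("SELECT name, age FROM people WHERE age > 21").show()

    spark.stop()
  }
}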

Spark SQL also extends the Dataset API, DataFrame API, and Data Sources API to the other Spark libraries, such as SparkR, Spark Streaming, Structured Streaming, MLlib, and GraphX, as shown in Figure 4.3. Once a Dataset or DataFrame is created, it can be used in any of these libraries; the two are interoperable and can be converted to traditional RDDs.


Figure 4.3: Spark ecosystem with Data Sources API and DataFrame API
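
As a small sketch of that interoperability (the Person case class and its fields are illustrative), the following starts from a traditional RDD, lifts it into a DataFrame, views it as a typed Dataset, and converts it back to an RDD:

import org.apache.spark.sql.SparkSession

// Illustrative case class giving the Dataset a typed schema.
case class Person(name: String, age: Long)

object ApiInterop {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ApiInterop")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // The traditional path: build an RDD directly on the core engine.
    val rdd = spark.sparkContext.parallelize(
      Seq(Person("Ann", 30), Person("Bob", 25)))

    val df = rdd.toDF()      // RDD -> untyped DataFrame
    val ds = df.as[Person]   // DataFrame -> typed Dataset
    val back = ds.rdd        // Dataset -> traditional RDD again

    println(back.map(_.name).collect().mkString(", "))
    spark.stop()
  }
}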

Spark SQL introduced an extensible optimizer called Catalyst to support a wide range of data sources and algorithms. Catalyst makes it easy to add new data sources, optimization rules, and data types for domains such as machine learning. It uses Scala's pattern-matching features to express rules, and it offers a general framework for transforming trees, which is used to perform analysis, planning, and runtime code generation.
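
The rule-writing style this enables can be sketched with a toy expression tree; the types below are illustrative stand-ins, not Spark's internal Catalyst classes. A constant-folding rule is expressed as a Scala pattern match and applied by recursively transforming the tree:

// Toy expression tree and rewrite rule in the spirit of Catalyst;
// these types are illustrative, not Spark's internal Catalyst classes.
sealed trait Expr
case class Literal(value: Int) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

object ConstantFolding {
  // Pattern matching expresses the rule: adding two literals
  // collapses into a single literal; everything else is rebuilt as-is.
  def fold(e: Expr): Expr = e match {
    case Add(l, r) =>
      (fold(l), fold(r)) match {
        case (Literal(a), Literal(b)) => Literal(a + b)
        case (fl, fr)                 => Add(fl, fr)
      }
    case other => other
  }

  def main(args: Array[String]): Unit = {
    // (1 + 2) + x folds to Add(Literal(3), Attribute(x))
    println(fold(Add(Add(Literal(1), Literal(2)), Attribute("x"))))
  }
}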
