Using PySpark

Nowadays, Apache Spark is one of the most popular projects for distributed computing. Developed in Scala, Spark was released in 2014; it integrates with HDFS and provides several advantages and improvements over the Hadoop MapReduce framework.

Unlike Hadoop MapReduce, Spark is designed to process data interactively and provides APIs for the Java, Scala, and Python programming languages. Thanks to its different architecture, in particular the fact that Spark keeps intermediate results in memory, it is generally much faster than Hadoop MapReduce.
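
To make this concrete, here is a minimal PySpark sketch (not from the original text) that starts a local Spark session, caches a small DataFrame in memory, and runs an interactive query on it; the application name and sample data are arbitrary choices for illustration.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; "local[*]" uses all available CPU cores.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("pyspark-intro") \
    .getOrCreate()

# Build a small DataFrame and cache it so repeated actions reuse the
# in-memory copy instead of recomputing it from scratch.
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
df.cache()

# Transformations such as filter() are lazy; calling count() is the
# action that actually triggers the computation on the cluster.
print(df.filter(df.age > 30).count())

spark.stop()
```

Running this script with spark-submit, or typing the same statements into an interactive pyspark shell, illustrates the interactive, in-memory style of processing described above.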
