The processing environment

If you have read my blog posts or my first book, Big Data Made Easy, you will know that I am interested in Big Data integration, and in how the big data tools connect. None of these systems exists in isolation. The data will start upstream, be processed in Spark plus H2O, and then the result will be stored, or moved to the next step in the ETL chain. With this in mind, in this example I will use Cloudera CDH HDFS for storage, and source my data from there. I could just as easily use S3, or an SQL or NoSQL database.

When I started the development work for this chapter, I had a Cloudera CDH 5.1.3 cluster installed and working. I also had various Spark versions installed and available for use. They are as follows:

  • Spark 1.0 installed as CentOS services
  • Spark 1.2 binary downloaded and installed
  • Spark 1.3 built from a source snapshot

I thought that I would experiment to see which combinations of Spark and Hadoop I could get to work together. I downloaded Sparkling Water from http://h2o-release.s3.amazonaws.com/sparkling-water/master/98/index.html and used the 0.2.12-95 version. I found that the 1.0 Spark version worked with H2O, but the Spark libraries were missing, so only some of the functionality used in many of the Sparkling Water-based examples was available. Spark versions 1.2 and 1.3 caused the following error to occur:

15/04/25 17:43:06 ERROR netty.NettyTransport: failed to bind to /192.168.1.103:0, shutting down Netty transport
15/04/25 17:43:06 WARN util.Utils: Service 'sparkDriver' could not bind on port 0. Attempting port 1.

The Spark master port number, although correctly configured in Spark, was not being picked up, and so the H2O-based application could not connect to Spark. After discussing the issue with the guys at H2O, I decided to upgrade to an H2O-certified version of both Hadoop and Spark. The recommended system versions are listed at http://h2o.ai/product/recommended-systems-for-h2o/.
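When trying several Spark and H2O combinations, it can help to scan the driver output for this kind of bind failure automatically. The following is a minimal Python sketch, not part of my original workflow: it assumes log4j-style lines in the format of the two error lines shown earlier, and the names LOG_LINE, LogEntry, parse_line, and find_bind_failures are hypothetical helpers introduced here for illustration.

```python
import re
from typing import Iterable, List, NamedTuple, Optional

# Assumed line layout, taken from the two log lines quoted in the text:
#   <yy/mm/dd> <hh:mm:ss> <LEVEL> <component>: <message>
LOG_LINE = re.compile(
    r"^(?P<date>\d{2}/\d{2}/\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) (?P<component>[\w.$]+): (?P<message>.*)$"
)


class LogEntry(NamedTuple):
    date: str
    time: str
    level: str
    component: str
    message: str


def parse_line(line: str) -> Optional[LogEntry]:
    """Parse one log line; return None if it does not match the layout."""
    match = LOG_LINE.match(line.strip())
    return LogEntry(**match.groupdict()) if match else None


def find_bind_failures(lines: Iterable[str]) -> List[LogEntry]:
    """Return the entries whose message reports a failed bind."""
    failures = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip anything that is not a recognisable log line
        text = entry.message.lower()
        if "failed to bind" in text or "could not bind" in text:
            failures.append(entry)
    return failures


# The two lines reported by Spark 1.2 and 1.3 in the text above.
SAMPLE = [
    "15/04/25 17:43:06 ERROR netty.NettyTransport: failed to bind to "
    "/192.168.1.103:0, shutting down Netty transport",
    "15/04/25 17:43:06 WARN util.Utils: Service 'sparkDriver' could not "
    "bind on port 0. Attempting port 1.",
]

if __name__ == "__main__":
    for entry in find_bind_failures(SAMPLE):
        print(entry.level, entry.component, entry.message)
```

A check like this only flags the symptom; it does not say why the bind failed. In my case, the fix was the version upgrade described next.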

I upgraded my CDH cluster from version 5.1.3 to version 5.3 using the parcels page of the Cloudera Manager interface. This automatically provided Spark 1.2, the version that has been integrated into the CDH cluster. This solved all the H2O-related issues, and provided me with an H2O-certified Hadoop and Spark environment.
