ETL with Scala

Let's have a look at the first Scala-based notebook where our ETL process is expressed. Fortunately, the IEEE website allows us to download a ZIP file containing all the data. We'll use a Scala-based notebook to perform the following steps:

  1. Download the ZIP file to a local staging area within the Apache Spark driver. The download location is already included in the notebook file, which you can download from the book's download page. For more information on the dataset, please refer to http://www.femto-st.fr/en/Research-departments/AS2M/Research-groups/PHM/IEEE-PHM-2012-Data-challenge.php.
  2. Unzip the file.
  3. Create an Apache Spark DataFrame out of the nested folder structure to get a unified, queryable view of all the data.
  4. Use an SQL statement for data transformation.
  5. Save the result into a single JSON file on the IBM Cloud OpenStack Swift-based ObjectStore.

So let's have a look at each individual step:

  1. Downloading the ZIP file to the staging area:
    In the IBM Data Science Experience platform, the Apache Spark cluster lives in a Docker/Kubernetes-based environment (more on this in the next chapter). This also holds for the Apache Spark driver container. This means that we can call ordinary shell commands within the driver container and use it as the staging area; a sketch of these shell commands follows this list.
  2. As you can see in the following screenshot, IBM provides plenty of free space in this staging area by mounting a GPFS cluster file system onto the driver container. In this case, we have more than 70 TB of space in this staging area. If we need to load more than 70 TB, we have to split the data and load it incrementally. The IBM Cloud OpenStack Swift-based ObjectStore provides unlimited storage capacity, so there is no need to be concerned about scalability. The client is charged per GB on a monthly basis. More details on pricing can be found here: https://console-regional.ng.bluemix.net/?direct=classic&env_id=ibm:yp:us-south#/pricing/cloudOEPaneId=pricing&paneId=pricingSheet&orgGuid=3cf55ee0-a8c0-4430-b809-f19528dee352&spaceGuid=06d96c35-9667-4c99-9d70-980ba6aaa6c3.
  3. Next, let's unzip the data (also sketched after this list).
  4. Now we have a huge number of individual CSV files, which we want to use to create a DataFrame. Note that we can filter individual files using a wildcard; in this case, we are only interested in the accelerometer data and ignore the temperature data for now. A sketch of this step follows the list as well.
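The original notebook presents these steps as screenshots; the following Scala sketches only illustrate what they might look like. The staging directory and download URL are placeholders (the real download location is included in the notebook file itself), as is the assumption that shell tools such as wget and unzip are available inside the driver container:

```scala
import sys.process._

// Placeholder staging directory on the GPFS file system mounted into the
// driver container, and a placeholder download URL -- use the values from
// the notebook you downloaded from the book's download page instead.
val stagingDir = "/gpfs/global_fs01/user_data/staging"
val zipUrl     = "https://example.org/ieee-phm-2012-data-challenge.zip"

// Step 1: download the ZIP file into the local staging area
s"wget -q -O $stagingDir/phm_data.zip $zipUrl".!

// Step 3: unzip it, which produces a nested folder structure full of CSV files
s"unzip -o -q $stagingDir/phm_data.zip -d $stagingDir".!
```

For step 4, a DataFrame is created over all accelerometer files at once. The wildcard pattern below assumes the archive unpacks into per-bearing folders containing acc_*.csv files (temperature files are simply not matched), so adjust it to the actual folder layout:

```scala
// Step 4: read every accelerometer CSV file below the staging directory into
// a single DataFrame. The files carry no header row, so Spark assigns the
// generic column names _c0 .. _c5 and infers the column types.
val df = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv(s"$stagingDir/*/acc_*.csv")

df.printSchema()  // verify that the schema was inferred correctly
df.show(5)        // peek at the first rows of the dataset
```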

As we can see, we are double-checking whether the schema was inferred correctly, and we also have a look at the first rows of the dataset to get an idea of what's inside.

The column names represent the following:

    • _c0: hour
    • _c1: minute
    • _c2: second
    • _c3: millisecond
    • _c4: horizontal acceleration
    • _c5: vertical acceleration
  5. It is obvious that some data transformation is necessary. The timestamp, in particular, is in a format that is hard to work with. The data scientist also told us to create an additional column containing an aggregation key called cluster, composed of hours, minutes, and seconds.
    A script along these lines is sketched after this list.
  6. The final step is to store the resulting DataFrame as JSON on the IBM Cloud OpenStack Swift-based ObjectStore (also sketched after this list).
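Again, the notebook shows these two steps as screenshots; the sketches below are one possible reading of them. For step 5, the raw DataFrame is registered as a temporary view and reshaped with a SQL statement; the concrete formula for the cluster key is an assumption, since the only requirement stated is that it combines hours, minutes, and seconds:

```scala
// Step 5: register the raw DataFrame as a temporary view and transform it
// with SQL. The generic _c0 .. _c5 columns get meaningful names, and an
// additional "cluster" aggregation key is derived from hour, minute, and
// second (the exact expression is only a sketch).
df.createOrReplaceTempView("acc_raw")

val transformed = spark.sql("""
  SELECT
    _c0 AS hour,
    _c1 AS minute,
    _c2 AS second,
    _c3 AS millisecond,
    _c4 AS horizontal_acceleration,
    _c5 AS vertical_acceleration,
    _c0 * 3600 + _c1 * 60 + _c2 AS cluster
  FROM acc_raw
""")

transformed.show(5)
```

For step 6, the result is written out as a single JSON file. coalesce(1) collapses the output into one partition so that exactly one file is produced, and the swift2d:// URL (container name, service name, and file name) is a placeholder that has to match the Object Storage credentials configured earlier in the notebook:

```scala
// Step 6: store the transformed DataFrame as a single JSON file on the
// IBM Cloud OpenStack Swift-based ObjectStore. The URL below is a placeholder.
transformed
  .coalesce(1)
  .write
  .mode("overwrite")
  .json("swift2d://mycontainer.spark/phm_transformed.json")
```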

This concludes our ETL process. Once we have run the notebook manually and convinced ourselves that everything works fine, we can schedule it to run automatically once every hour if we like.
