© Butch Quinto 2018
Butch Quinto, Next-Generation Big Data, https://doi.org/10.1007/978-1-4842-3147-0_9

9. Big Data Visualization and Data Wrangling

Butch Quinto
Plumpton, Victoria, Australia

It’s easy to understand why self-service data analysis and visualization have become popular in the past few years. They make users more productive by giving them the ability to perform their own analysis and to interactively explore and manipulate data based on their own needs, without relying on traditional business intelligence developers to build reports and dashboards, a task that can take days, weeks, or longer. Users can perform ad hoc analysis and run follow-up queries to answer their own questions, and they’re not limited to static reports and dashboards. Output from self-service data analysis can take various forms depending on the type of analysis: interactive charts and dashboards, pivot tables, OLAP cubes, predictions from machine learning models, or result sets returned by SQL queries.

Big Data Visualization

Several books have already been published about popular data visualization tools such as Tableau, Qlik, and Power BI. These tools all integrate well with Impala, and some of them have rudimentary support for big data. They’re adequate for typical data visualization tasks, and most of the time that’s all typical users will ever need. However, when an organization’s analytic requirements go beyond what traditional tools can handle, it’s time to look at tools that were specifically designed for big data. Note that when I talk about big data, I am referring to the three V’s of big data: not only the size (volume) of the data, but also its variety (can these tools analyze structured, semi-structured, and unstructured data?) and velocity (can these tools perform real-time or near real-time data visualization?). In this chapter, I explore data visualization tools that are specifically designed for big data.

SAS Visual Analytics

SAS has been one of the leaders in advanced analytics for more than 40 years. The SAS software suite has hundreds of components designed for various use cases, ranging from data mining and econometrics to Six Sigma and clinical trial analysis. For this chapter, we’re interested in its self-service analytic tool, SAS Visual Analytics.

SAS Visual Analytics is a web-based, self-service interactive data analytics application from SAS. It allows users to explore their data, create reports, and process and load data into the SAS environment. It has in-memory capabilities and was designed for big data. SAS Visual Analytics can be deployed in distributed or non-distributed mode. In distributed mode, the high-performance SAS LASR Analytic Server daemons are installed on the Hadoop worker nodes, taking advantage of the full processing capabilities of your Hadoop cluster and enabling it to crank through large volumes of data stored in HDFS.

SAS Visual Analytics distributed deployment supports Cloudera Enterprise, Hortonworks HDP, and the Teradata Data Warehouse Appliance. The non-distributed SAS VA deployment runs on a single machine with a locally installed SAS LASR Analytic Server and connects to Cloudera Impala via JDBC. i

Zoomdata

Zoomdata is a data visualization and analytics company based in Reston, Virginia. ii Its platform was designed to take advantage of Apache Spark, making it ideal for analyzing extremely large data sets. Spark’s in-memory and streaming features enable Zoomdata’s real-time data visualization capabilities. Zoomdata supports various data sources such as Oracle, SQL Server, Hive, Solr, Elasticsearch, and PostgreSQL, to name a few. One of the most exciting features of Zoomdata is its support for Kudu and Impala, making it ideal for big data and real-time data visualization. Let’s take a closer look at Zoomdata.

Self-Service BI and Analytics for Big Data

Zoomdata is a modern BI and data analytics tool designed specifically for big data. It supports a wide range of data visualization and analytic capabilities, from exploratory self-service data analysis and embedded visual analytics to dashboarding. Exploratory self-service data analysis has become more popular during the past few years. It allows users to interactively navigate and explore data with the goal of discovering hidden patterns and insights, in contrast to traditional business intelligence tools, which rely heavily on displaying predefined metrics and KPIs in static dashboards.

Zoomdata also supports dashboards. Dashboards are still an effective way of communicating information to users who just want to see results. Another nice feature is the ability to embed charts and dashboards created in Zoomdata into other web applications. Built on standard HTML5, CSS, WebSockets, and JavaScript, embedded analytics can run on most popular web browsers and on mobile devices such as iPhones, iPads, and Android tablets. Zoomdata provides a JavaScript SDK if you need extra flexibility and control over how your applications look and behave. A REST API is available to help automate the management and operation of the Zoomdata environment.

Real-Time Data Visualization

Perhaps the most popular feature of Zoomdata is its support for real-time data visualization. Charts and KPIs are updated in real time from live streaming sources such as MQTT and Kafka or an SQL-based data source such as Oracle, SQL Server, or Impala.

Architecture

At the heart of Zoomdata is a stream processing engine that processes all incoming data as data streams. The data source doesn’t have to be live; as long as the data has a time attribute, Zoomdata can stream it. Zoomdata has a technology called Data DVR, which not only lets users stream data but also rewind, replay, and pause real-time streaming data, just like a DVR. Zoomdata pushes most of the data processing down to the data source where the data resides, taking advantage of data locality and minimizing data movement. iii Zoomdata uses Apache Spark behind the scenes as a complementary data processing layer. It caches streamed data as Spark DataFrames and stores them in its result set cache. Zoomdata always inspects and tries to retrieve data from the result set cache first before going to the original source. iv Figure 9-1 shows the high-level architecture of Zoomdata. Apache Spark enables many of Zoomdata’s capabilities, such as Zoomdata Fusion and result set caching.
Figure 9-1

High-level architecture of Zoomdata

Deep Integration with Apache Spark

One of the most impressive features of Zoomdata is its deep integration with Apache Spark. As previously discussed, Zoomdata uses Spark to implement its streaming feature. Zoomdata also uses Spark for query optimization, caching data as Spark DataFrames and then performing calculations, aggregations, and filtering against the cached data. Spark also powers another Zoomdata feature called SparkIt. SparkIt boosts the performance of slow data sources such as S3 buckets and text files by caching these data sets in Spark, turning them into queryable, high-performance Spark DataFrames. v Finally, Zoomdata uses Spark to implement Zoomdata Fusion, a high-performance data virtualization layer that virtualizes multiple data sources into one.
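Zoomdata doesn’t expose SparkIt’s internals, but the general idea is easy to picture in plain Spark SQL. The sketch below (the S3 path and column positions are hypothetical, and this is not Zoomdata’s API) registers a slow CSV source as a temporary view, caches it in memory as a Spark DataFrame, and then queries the cached copy:

-- Register a slow text source (CSV files in S3) as a temporary view.
CREATE TEMPORARY VIEW sensor_raw
USING csv
OPTIONS (path 's3a://my-bucket/sensordata/*.csv', inferSchema 'true');

-- Materialize the view in memory; subsequent queries hit the cached DataFrame.
CACHE TABLE sensor_raw;

-- Fast, repeatable queries against the cached data (here column _c7 is assumed to hold the city).
SELECT _c7 AS city, count(*) AS readings
FROM sensor_raw
GROUP BY _c7;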

Zoomdata Fusion

Zoomdata includes a high-performance data virtualization layer called Zoomdata Fusion. Also powered by Apache Spark, Zoomdata Fusion makes multiple data sources appear as one source without physically moving the data sets to a common location. vi Zoomdata uses Spark to cache data when optimizing cross-platform joins but pushes processing down to the data source for maximum scalability. With Zoomdata Fusion, users can conveniently query and join disparate data sets in different formats, significantly shortening time to insight, increasing productivity, and reducing reliance on ETL (see Figure 9-2).
Figure 9-2

Diagram of Zoomdata Fusion
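Fusion is driven from the Zoomdata UI rather than from SQL you write by hand, but conceptually a fused source behaves like a single schema you can join across. As a purely illustrative sketch (the table names here are hypothetical), a query such as the following could combine a Kudu/Impala fact table with a PostgreSQL dimension table, with Zoomdata pushing each side down to its source and using Spark for the cross-source join:

-- Hypothetical fused query: sensor readings live in Impala/Kudu,
-- site reference data lives in PostgreSQL, yet both appear as one source.
SELECT s.region,
       avg(r.temperature) AS avg_temperature
FROM sensor_readings r      -- Impala/Kudu
JOIN sites s                -- PostgreSQL
  ON r.deviceid = s.deviceid
GROUP BY s.region;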

Data Sharpening

Zoomdata was awarded a patent on “data sharpening,” an optimization technique for visualizing large data sets. Data sharpening works by dividing a large query into a series of micro-queries. These micro-queries are then sent to the data source and executed in parallel. The results from the first micro-query are immediately returned and visualized by Zoomdata as an estimate of the final result. Zoomdata updates the visualization as new data arrives, showing incrementally better and more accurate estimates until the query completes and the full visualization is finally shown. While all of this is happening, users can still interact with the charts and launch other dashboards without waiting for long-running queries to finish. Underneath, Zoomdata’s streaming architecture makes all of this possible, breaking queries down into micro-queries and streaming results back to the user. vii
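Zoomdata generates and manages these micro-queries automatically, but the idea can be illustrated with ordinary SQL (the sensors table and columns below are hypothetical). A year-long aggregate like the first query can be answered incrementally by many small time-sliced queries like the second; because each slice returns a count and a sum, the running average shown on screen stays correct as slices arrive:

-- The full query: one expensive scan over a year of data.
SELECT city, avg(temperature) AS avg_temp
FROM sensors
WHERE sensortimestamp >= unix_timestamp('2017-01-01')
  AND sensortimestamp <  unix_timestamp('2018-01-01')
GROUP BY city;

-- One micro-query: the same aggregate over a single day's slice.
-- Partial counts and sums from each slice are folded into the running estimate.
SELECT city, count(*) AS n, sum(temperature) AS temp_sum
FROM sensors
WHERE sensortimestamp >= unix_timestamp('2017-03-01')
  AND sensortimestamp <  unix_timestamp('2017-03-02')
GROUP BY city;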

Support for Multiple Data Sources

Even though Zoomdata was designed for big data, it also supports the most popular MPP and SQL databases on the market, including Oracle, SQL Server, Teradata, MySQL, PostgreSQL, Vertica, and Amazon Aurora. viii In almost every enterprise environment you will find a combination of various flavors of SQL databases and Hadoop distributions. With Zoomdata Fusion, users can easily query these disparate data sources as if they were a single data source. Other Zoomdata features that enable fast visual analytics, such as data sharpening, Data DVR, and micro-queries, work on these SQL and MPP databases as well. Other popular data sources such as MongoDB, Solr, Elasticsearch, and Amazon S3 are also supported (see Figure 9-3).
Figure 9-3

Partial list of Zoomdata’s supported data sources

Note

Consult Zoomdata’s website for the latest instructions on how to install Zoomdata. You also need to install Zoomdata’s Kudu connector to connect to Kudu, which is available on Zoomdata’s website as well.

Let’s get started with an example. Start Zoomdata. You’ll be presented with a login page that looks like Figure 9-4.
Figure 9-4

Zoomdata login page

The first thing you need to do is create a new data source (see Figure 9-5).
Figure 9-5

Zoomdata homepage

Notice that the Kudu Impala data source has been added to the list of data sources (see Figure 9-6).
Figure 9-6

Available data sources

You will be required to input your credentials, including the JDBC URL and the location of the Kudu Master (Figure 9-7). In our example, the Kudu Master resides at 192.168.56.101:7051. Update the JDBC URL to jdbc:hive2://192.168.56.101:21050/;auth=noSasl. Note that even though we’re connecting to Impala, we need to specify “hive2” in our JDBC URL. Click “Next” when done (Figure 9-8).
Figure 9-7

Kudu Impala Connector page

Figure 9-8

Kudu Impala Connector Input Credentials

Pick the table we’ll be using in this example (Figure 9-9). You can also select specific fields (Figure 9-10). Unless you’re short on RAM, it’s always a good idea to turn on caching. Since our table comes from a relational data store, you don’t need to enable SparkIt; as discussed earlier, SparkIt is only for simple data sources, such as text files, that do not provide their own caching capabilities. Click Next to proceed to the next step.
Figure 9-9

Data Source Tables

Figure 9-10

Data Source Fields

Zoomdata can maintain cached result sets for certain data sources asynchronously (Figure 9-11).
Figure 9-11

Data Source Refresh

You can also schedule a job to refresh the cache and metadata (Figure 9-12).
Figure 9-12

Data Source Scheduler

For advanced users who are more comfortable with the cron Unix-style scheduler, a similar interface is also available (Figure 9-13).
Figure 9-13

Data Source Cron Scheduler

To support real-time visualization, Live Mode and Playback must be enabled (Figure 9-14). As long as the data set has a time attribute, in our case sensortimestamp, Zoomdata can visualize it over time. I set the refresh rate and delay to 1 second. I also set the range from Now-1 minute to Now.
Figure 9-14

Time Bar Global Default Settings

You can configure the available charts (Figure 9-15).
Figure 9-15

Global Default Settings – Charts

There are different configuration options available depending on the type of chart (Figure 9-16).
Figure 9-16

Map: World Countries – Options

The data source will be saved after you click “Finish.” You can immediately create a chart based on the source (Figure 9-17). In this case we’ll pick “Bars: Multiple Metrics.”
Figure 9-17

Add New Chart

A bar chart similar to Figure 9-18 will be displayed. You can adjust different bar chart options.
Figure 9-18

Bar Chart

When a chart is maximized, you’re provided with more configuration options such as Filters, Color, and Chart Style to name a few. In the example shown in Figure 9-19, I’ve changed the color of the bar chart from yellow and blue to purple and green.
Figure 9-19

Bar Chart – Color

To add more charts, click “Add Chart” located near the upper-right hand corner of the application (Figure 9-20).
Figure 9-20

Add New Chart

The pie chart is one of the most common chart types (Figure 9-21).
Figure 9-21

Pie Chart

Zoomdata has numerous chart types as shown in Figure 9-22.
Figure 9-22

Multiple Chart Types

Zoomdata has mapping capabilities as shown in Figure 9-23.
Figure 9-23

Zoomdata Map

Zoomdata can update the map with markers in real time as shown in Figure 9-24.
Figure 9-24

Zoomdata Map with Markers

Real-Time IoT with StreamSets, Kudu, and Zoomdata

One of the most exciting big data use cases is the Internet of Things (IoT). IoT enables the collection of data from sensors embedded in hardware such as smart phones, electronic gadgets, appliances, manufacturing equipment, and health care devices, to name a few. Use cases include real-time predictive health care, network pipe leakage detection, water quality monitoring, emergency alert systems, smart homes, smart cities, and connected cars. IoT data can be fed to rules engines for real-time decision making or to machine learning models for more advanced real-time predictive analytics.

For our IoT example, we will detect network pipe leakage and monitor water quality for a fictional water utility company. We’ll use a shell script to generate random data to simulate IoT data coming from sensors installed on water pipes. We’ll use Kudu for data storage, StreamSets for real-time data ingestion and stream processing, and Zoomdata for real-time data visualization.

In real IoT projects, the data source will most likely be an IoT gateway that serves data over a lightweight messaging protocol such as MQTT. Kafka is also highly recommended as a buffer between the IoT gateway and the data storage to provide high availability and replayability in case old data is needed (Figure 9-25).
Figure 9-25

A Typical IoT Architecture using StreamSets, Kafka, Kudu, and Zoomdata

Create the Kudu Table

Zoomdata requires the Kudu table to be partitioned on the timestamp field. To enable live mode, the timestamp field has to be stored as a BIGINT in epoch time. Zoomdata will recognize the timestamp field as “playable” and enable live mode on the data source (Listing 9-1).

CREATE TABLE my_sensors (
rowid BIGINT NOT NULL,
sensortimestamp BIGINT NOT NULL,
deviceid INTEGER,
temperature INTEGER,
pressure INTEGER,
humidity INTEGER,
ozone INTEGER,
sensortimestamp_str STRING,
city STRING,
temperature_warning INTEGER,
temperature_critical INTEGER,
pressure_warning INTEGER,
pressure_critical INTEGER,
humidity_warning INTEGER,
humidity_critical INTEGER,
ozone_warning INTEGER,
ozone_critical INTEGER,
lead_level INTEGER,
copper_level INTEGER,
phosphates_level INTEGER,
methane_level INTEGER,
chlorine_level INTEGER,
lead_warning INTEGER,
lead_critical INTEGER,
copper_warning INTEGER,
copper_critical INTEGER,
phosphates_warning INTEGER,
phosphates_critical INTEGER,
methane_warning INTEGER,
methane_critical INTEGER,
chlorine_warning INTEGER,
chlorine_critical INTEGER,
PRIMARY KEY(rowid,sensortimestamp)
)
PARTITION BY HASH (rowid) PARTITIONS 16,
RANGE (sensortimestamp)
(
PARTITION unix_timestamp('2017-01-01') <= VALUES <
unix_timestamp('2018-01-01'),
PARTITION unix_timestamp('2018-01-01') <= VALUES <
unix_timestamp('2019-01-01'),
PARTITION unix_timestamp('2019-01-01') <= VALUES <
unix_timestamp('2020-01-01')
)
STORED AS KUDU;
Listing 9-1

Table for the sensor data. Must be run in impala-shell
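Before wiring up the pipeline, it can be worth sanity-checking the table from impala-shell. The rows below are made up, but they confirm that epoch-time BIGINT values land in the right range partition and can be read back as readable timestamps (this assumes the current date falls inside one of the range partitions defined above):

-- Insert one hand-made test row; unix_timestamp(now()) yields the BIGINT epoch value.
INSERT INTO my_sensors (rowid, sensortimestamp, deviceid, temperature, city)
VALUES (1, unix_timestamp(now()), 99, 25, 'Melton');

-- Read it back, converting the epoch value to a human-readable timestamp.
SELECT rowid, from_unixtime(sensortimestamp) AS ts, deviceid, temperature, city
FROM my_sensors;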

Test Data Source

We’ll execute a shell script, generatetest.sh, to generate random data. I introduce a 1-second delay after every 50 records so I don’t overwhelm my test server (Listing 9-2).

#!/bin/bash
x=0
while true
do
   echo $RANDOM$RANDOM,$(( $RANDOM % 10 + 20 )),$(( ( RANDOM % 40 ) + 1 )),$(( ( RANDOM % 80 ) + 1 )),$(( ( RANDOM % 30 ) + 1 )),$(( ( RANDOM % 20 ) + 1 )),`date "+%Y-%m-%d %T"`,"Melton",0,0,0,0,0,0,0,0,$(( $RANDOM % 10 + 1 )),$(( $RANDOM % 10 + 1 )),$(( $RANDOM % 10 + 1 )),$(( $RANDOM % 10 + 1 )),$(( $RANDOM % 10 + 1 )),0,0,0,0,0,0,0,0,0,0
   # Pause for 1 second after every 50 records to avoid overwhelming the test server.
   if [ "$x" = 50 ]
   then
      sleep 1
      x=0
   fi
   ((x++))
done
Listing 9-2

generatetest.sh shell script to generate test data

The script will generate random sensor data. Listing 9-3 shows what the data looks like.

891013984,23,8,4,24,11,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,2,8,3,1,4,0,0,0,0,0,0,0,0,0,0
1191723491,29,20,68,14,10,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,7,1,6,4,3,0,0,0,0,0,0,0,0,0,0
919749,24,25,67,12,10,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,4,6,10,9,4,0,0,0,0,0,0,0,0,0,0
2615810801,22,21,59,24,11,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,5,7,10,7,2,0,0,0,0,0,0,0,0,0,0
2409532223,25,6,45,21,3,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,4,1,9,4,4,0,0,0,0,0,0,0,0,0,0
2229524773,29,20,68,12,3,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,6,7,2,4,9,0,0,0,0,0,0,0,0,0,0
295358267,22,15,16,7,1,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,1,4,6,10,8,0,0,0,0,0,0,0,0,0,0
836218647,28,25,59,3,19,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,7,2,6,3,4,0,0,0,0,0,0,0,0,0,0
2379015092,24,23,23,10,14,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,2,1,10,8,7,0,0,0,0,0,0,0,0,0,0
189463852,20,2,10,30,16,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,8,4,7,8,7,0,0,0,0,0,0,0,0,0,0
1250719778,26,15,68,30,4,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,7,6,9,8,10,0,0,0,0,0,0,0,0,0,0
1380822028,27,32,40,11,7,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,2,6,7,6,5,0,0,0,0,0,0,0,0,0,0
2698312711,21,14,5,29,19,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,2,2,8,1,3,0,0,0,0,0,0,0,0,0,0
1300319275,23,33,52,24,4,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,2,7,1,1,3,0,0,0,0,0,0,0,0,0,0
2491313552,27,25,69,24,10,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,8,2,4,3,8,0,0,0,0,0,0,0,0,0,0
149243062,21,24,2,15,8,2017-07-09 17:31:33,Melton,0,0,0,0,0,0,0,0,7,3,3,5,7,0,0,0,0,0,0,0,0,0,0
Listing 9-3

Sample test sensor data

Design the Pipeline

We can now design the StreamSets pipeline. Refer to Chapter 7 for a more in-depth discussion of StreamSets. The first thing we need to do is define an origin, or data source. StreamSets supports different types of origins such as MQTT, JDBC, S3, Kafka, and Flume, to name a few. For this example, we’ll use a File Tail origin pointed at sensordata.csv. We’ll run a script to populate this file later (Figure 9-26).
Figure 9-26

StreamSets file tail origin

The File Tail data source needs a second destination for its metadata. Let’s add a Local FS destination to store metadata (Figure 9-27).
Figure 9-27

StreamSets local file system

We’ll need a field splitter processor to split our CSV data into individual columns (Figure 9-28). We’ll specify our SDC fields in the New Split Fields configuration box (Listing 9-4).
Figure 9-28

StreamSets field splitter

[
 "/rowid",
 "/deviceid",
 "/temperature",
 "/pressure",
 "/humidity",
 "/ozone",
 "/sensortimestamp_str",
 "/city",
 "/temperature_warning",
 "/temperature_critical",
 "/pressure_warning",
 "/pressure_critical",
 "/humidity_warning",
 "/humidity_critical",
 "/ozone_warning",
 "/ozone_critical",
 "/lead_level",
 "/copper_level",
 "/phosphates_level",
 "/methane_level",
 "/chlorine_level",
 "/lead_warning",
 "/lead_critical",
 "/copper_warning",
 "/copper_critical",
 "/phosphates_warning",
 "/phosphates_critical",
 "/methane_warning",
 "/methane_critical",
 "/chlorine_warning",
 "/chlorine_critical"
]
Listing 9-4

New SDC Fields

We’ll also need a field type converter to convert the data types of the fields generated by the field splitter processor (Figure 9-29).
Figure 9-29

StreamSets field converter

When configuring the field type converter processor, we have to specify the fields that we need to convert and the data type to convert the data into. All fields are set to STRING by default by SDC. You need to convert the fields to their correct data types (Figure 9-30).
Figure 9-30

StreamSets field converter

Kudu expects the timestamp to be in Unix time format. We’ll use a JavaScript Evaluator (Figure 9-31) and write code to convert the timestamp from STRING to Unix time format as BIGINT ix (Listing 9-5). Note that this limitation has been addressed in recent versions of Kudu. However, Zoomdata still expects the date to be in Unix time format. Consult Chapter 7 to learn more about StreamSets evaluators.
Figure 9-31

StreamSets Javascript Evaluator

// Convert the sensortimestamp_str field to Unix time (seconds since the epoch).
Date.prototype.getUnixTime = function() { return this.getTime()/1000|0 };
if(!Date.now) Date.now = function() { return new Date(); }
Date.time = function() { return Date.now().getUnixTime(); }

for(var i = 0; i < records.length; i++) {
  try {
    var someDate = new Date(records[i].value['sensortimestamp_str']);
    var theUnixTime = someDate.getUnixTime();
    // Store the converted value in a new field and pass the record along.
    records[i].value['sensortimestamp'] = theUnixTime;
    output.write(records[i]);
  } catch (e) {
    // Send the record to the error stream if the date can't be parsed.
    error.write(records[i], e);
  }
}
Listing 9-5

Javascript Evaluator code

Finally, we add a Kudu destination as the target for our data. Make sure you map the SDC fields to the correct Kudu column names (Figure 9-32).
Figure 9-32

StreamSets Kudu destination

That’s it on the StreamSets end.

Configure Zoomdata

Let’s configure a data source for Zoomdata. We’ll use the Kudu Impala connector (Figure 9-33).
Figure 9-33

Zoomdata data source

Enter the connection information such as the IP address and port number for the Kudu Masters, the JDBC URL, and so on. Make sure to validate your connection (Figure 9-34).
Figure 9-34

Zoomdata data connection input credentials

Choose the table that you just created earlier. You can exclude fields if you wish (Figure 9-35 and Figure 9-36).
Figure 9-35

Zoomdata data connection tables

Figure 9-36

Zoomdata data connection fields

You can schedule data caching if you wish. If you have a large data set, caching can help improve performance and scalability (Figure 9-37).
Figure 9-37

Zoomdata data connection refresh

Configure the time bar. Make sure to enable live mode and playback (Figure 9-38).
Figure 9-38

Zoomdata time bar

You can also set default configuration options for charts (Figure 9-39).
Figure 9-39

Zoomdata time bar

Save the data source. You can now start designing the user interface (Figure 9-40).
Figure 9-40

Zoomdata add new chart

I choose Bars: Multiple Metrics as my first chart. You won’t see any data yet because we haven’t started the StreamSets pipeline (Figure 9-41).
Figure 9-41

An empty chart

Let’s start generating some data. On the server where StreamSets is installed, open a terminal window and run the script, redirecting its output to the file monitored by the File Tail origin: ./generatetest.sh >> sensordata.csv.

Start the StreamSets pipeline. Within seconds, you’ll see data coming into StreamSets, reflected in the pipeline statistics such as Record Throughput, Batch Throughput, and the Input and Output Record Count for each stage. The canvas will also show you if there are any errors in your pipeline (Figure 9-42).
Figure 9-42

StreamSets canvas

As you are designing your Zoomdata dashboard, charts are immediately populated (Figure 9-43).
Figure 9-43

Live Zoomdata dashboard

Congratulations! You’ve just implemented a real-time IoT pipeline using StreamSets, Kudu, and Zoomdata.
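If you want to confirm outside of Zoomdata that rows are landing in Kudu, a quick check from impala-shell does the trick (a sketch; adjust the table name if you changed it):

-- Newest rows written by the StreamSets pipeline, with the epoch value made readable.
SELECT rowid,
       from_unixtime(sensortimestamp) AS sensor_time,
       temperature, pressure, humidity, ozone, city
FROM my_sensors
ORDER BY sensortimestamp DESC
LIMIT 10;

-- The row count should keep growing while generatetest.sh is running.
SELECT count(*) FROM my_sensors;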

Data Wrangling

In most cases, data does not arrive in a format suitable for analysis. Some data transformation may be needed to convert a field’s data type; data cleansing may be required to handle missing or incorrect values; or data may need to be joined with other sources such as CRM, weather data, or web logs. A frequently cited survey found that data scientists spend about 80% of their time on data preparation and cleansing and only 20% on the actual data analysis. x In traditional data warehouse and business intelligence environments, data is made fit for analysis through an extract, transform, and load (ETL) process. During off-peak hours, data is cleansed, transformed, and copied from online transaction processing (OLTP) sources to the enterprise data warehouse. Various reports, key performance indicators (KPIs), and data visualizations are updated with new data and presented in corporate dashboards, ready to be consumed by users. OLAP cubes and pivot tables provide some level of interactivity. In some cases, power users are granted SQL access to a reporting database or a replicated copy of the data warehouse where they can run ad hoc SQL queries. But in general, traditional data warehousing and business intelligence are fairly static.

The advent of big data has rendered old-style ETL inadequate. Weeks or months of upfront data modeling and ETL design to accommodate new data sets is now considered rigid and old-fashioned. The enterprise data warehouse, once considered an organization’s central data repository, now shares that title with the enterprise “data hub” or “data lake,” as more and more workloads are moved to big data platforms. Where appropriate, ETL has been replaced by ELT. Extract, load, and transform (ELT) allows users to dump data with little or no transformation onto a big data platform; it’s then up to the data analyst to make the data suitable for analysis based on his or her needs. ELT is in, ETL is out. The EDW is rigid and structured; big data is fast and agile.

At the same time, users have become savvier and more demanding. Massive amounts of data are available, and waiting weeks or months for a team of ETL developers to design and implement a data ingestion pipeline to make that data available for consumption is no longer an option. A new set of activities, collectively known as data wrangling, has emerged.

Data wrangling is more than just preparing data for analysis; it is part of the data analysis process itself. By iteratively discovering, structuring, cleaning, enriching, and validating, users gain a better understanding of their data and are able to ask better questions. As users iteratively wrangle their data, hidden patterns are exposed, revealing new ways to analyze it. Data wrangling is the perfect complement to big data and ELT. The pioneers of data wrangling at Stanford and Berkeley, who went on to start Trifacta, came up with six core activities involved in data wrangling (Table 9-1). xi
Table 9-1

Six Core Data Wrangling Activities

Discovering: The data profiling and assessment stage.

Structuring: Activities that determine the structure of your data, such as creating schemas, pivoting rows, and adding or removing columns.

Cleaning: Activities that include fixing invalid or missing values, standardizing the format of specific fields (e.g., date, phone, and state), and removing extra characters.

Enriching: Activities that involve joining or blending your data with other data sources to improve the results of your data analysis.

Validating: Activities that check whether the enrichment and data cleansing actually achieved their goals.

Publishing: Once you’re happy with the output of your data wrangling, it’s time to publish the results, for example as input to a data visualization tool such as Tableau or to a marketing campaign initiative.

Note that the data wrangling activities are iterative in nature. You will usually perform the steps multiple times and in different orders to achieve the desired results.

If you’ve ever used Microsoft Excel to make numeric values conform to a certain format or removed certain characters using regular expressions, you’ve performed a simple form of data wrangling. Microsoft Excel is actually a pretty good tool for data wrangling, even though it wasn’t strictly designed for such tasks. However, Microsoft Excel falls short when used on large data sets or when data is stored in a big data platform or a relational database. A new breed of interactive data wrangling tools was developed to fill this market niche. These tools take interactive data processing to the next level, offering features that make it easy to perform tasks that would normally require coding expertise. Most of them automate transformations and offer relevant suggestions based on the data set provided. They have deep integration with big data platforms such as Cloudera Enterprise and can connect via Impala, Hive, Spark, HDFS, or HBase. I’ll show examples of how to use some of the most popular data wrangling tools.

Trifacta

Trifacta is one of the pioneers of data wrangling. It was founded by Stanford PhD Sean Kandel, UC Berkeley professor Joe Hellerstein, and University of Washington (and former Stanford) professor Jeffrey Heer, who started a joint research project called the Stanford/Berkeley Wrangler that eventually became Trifacta. xii Trifacta sells two versions of its product, Trifacta Wrangler and Trifacta Wrangler Enterprise. Trifacta Wrangler is a free desktop application aimed at individual use with small data sets. Trifacta Wrangler Enterprise is designed for teams, with centralized management of security and data governance. xiii As mentioned earlier, Trifacta can connect to Impala to provide high-performance query capabilities. xiv Please consult Trifacta’s website for instructions on how to install Trifacta Wrangler.

I’ll show you some of the features of Trifacta Wrangler. Trifacta organizes data wrangling activities based on “flows.” Trifacta Wrangler will ask you to create a new flow when you first start the application.

The next thing you need to do is to upload a data set. Trifacta can connect to different types of data sources, but for now we’ll upload a sample CSV file containing customer information. After uploading the CSV file, you can now explore the data. Click the button next to the “Details” button and select “Wrangle in new Flow” (Figure 9-44).
Figure 9-44

Wrangle in new Flow

You will be shown the transformer page that looks like Figure 9-45. Trifacta immediately detects and shows how data in individual columns are distributed.
Figure 9-45

Trifacta’s transformer page

You can scroll to the left or right of the chart that shows the distribution of data to quickly view the contents of the column. This will help you determine the data that need to be processed or transformed (Figure 9-46).
Figure 9-46

Histogram showing data distribution

Under the column name, you’ll see a data quality indicator for the specific column. Mismatched data types will be shown as red, and empty rows will be shown as gray. The indicator should be green if the data stored in that column are all valid (Figure 9-47).
Figure 9-47

Data quality bar

By clicking a specific column, you will be presented with various suggestions on how to transform the data stored in that column. This feature is powered by machine learning; Trifacta learns from your past data wrangling activities to refine its suggestions (Figure 9-48).
Figure 9-48

Trifacta’s suggestions

Different columns get different suggestions (Figure 9-49).
Figure 9-49

Trifacta has different suggestions for different types of data

Click the button next to the column name to get a list of data transformation options for that column. You can rename the column, change its data type, filter, clean, or perform aggregations, among other operations (Figure 9-50).
Figure 9-50

Additional transformation options

Almost every type of data transformation is available (Figure 9-51).
Figure 9-51

Additional transformation options

Double-clicking a column opens another window with various statistics and useful details about the data stored in that column, including top values, the most frequent values, and string length statistics (Figure 9-52).
Figure 9-52

The final output from the Job Results window

Trifacta is smart enough to know if the column includes location information such as zip code. It will try to show the information using a map (Figure 9-53).
Figure 9-53

The final output from the Job Results window

Let’s try one of the suggested data transformations. Let’s convert the last name from all uppercase to proper case (only the first letter capitalized). Choose the last suggestion and click the “Modify” button to apply the changes (Figure 9-54).
Figure 9-54

Apply the transformation

The last name has been transformed as shown in Figure 9-55.
Figure 9-55

The transformation has been applied

Click the “Generate Results” button near the upper-right corner of the page (Figure 9-56).
Figure 9-56

Generate Results

You can save the results in different formats. Choose CSV for our data set. You can optionally compress the results if you want. Make sure the “Profile Results” option is checked to generate a profile of your results (Figure 9-57).
Figure 9-57

Result Summary

A “Result Summary” window will appear. It will include various statistics and information about your data. Review it to make sure the results are what you expect. Trifacta doesn’t have built-in data visualization; it depends on other tools such as Tableau, Qlik, Power BI, or even Microsoft Excel to provide visualization.

Alteryx

Alteryx is another popular company that develops data wrangling and analytics software. The company was founded in 2010 and is based in Irvine, California. Alteryx supports several data sources such as Oracle, SQL Server, Hive, and Impala xv for fast ad hoc query capability.

You’ll be presented with a Getting Started window the first time you start Alteryx (Figure 9-58). You can go through the tutorials, open recent workflows, or create a new workflow. For now let’s create a new workflow.
Figure 9-58

Getting started window

Alteryx is known for its ease of use. Most of the user’s interactions with Alteryx involve dragging, dropping, and configuring tools from the Tool Palette (Figure 9-59).
Figure 9-59

Alteryx main window

The Tool Palette provides tools organized into categories. You drag these tools onto the canvas and connect them to design your workflows. The tools cover the operations users typically perform to process and analyze data; popular operations include select, filter, sort, join, union, summarize, and browse. These are tasks that would normally require writing SQL queries; Alteryx provides a much easier and more productive method (Figure 9-60).
Figure 9-60

Tool Palette – Favorites

Alteryx also provides a myriad of tools for more complicated operations, such as correlation, transposition, demographic analysis, behavior analysis, and spatial match. Explore the other tabs to familiarize yourself with the available tools (Figure 9-61).
Figure 9-61

Tool Palette – Preparation

Every workflow starts by specifying a data source using the Input Data tool. Drag and drop the Input Data tool onto the workflow window. You can specify a file or connect to an external data source such as Impala, Oracle, SQL Server, or any data source that provides an ODBC interface (Figure 9-62).
Figure 9-62

Input Data Tool

We’ll get our data from a file in this example (Figure 9-63).
Figure 9-63

Select file

Alteryx comes with several sample files. Let’s use the Customers.csv file (Figure 9-64).
Figure 9-64

Select Customers.csv

Now that you’ve configured a data source, let’s drag and drop the Select tool. The Select tool lets you select specific columns. The columns are all selected by default, so let’s deselect some columns for this example. Make sure to connect the Input Data and Select tools as shown in Figure 9-65.
Figure 9-65

Select tool

After we’ve selected a few columns, we can sort the results with the Sort tool. Connect the Select tool with the Sort tool. The Sort tool lets you specify how you would like to sort the results. In this example, we’ll sort by City and Customer Segment in ascending order. You can do that via the configuration option (Figure 9-66).
Figure 9-66

Sort Tool
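For readers who think in SQL, the select-and-sort workflow built so far is roughly equivalent to the query below (a sketch only; the actual column names in the sample Customers.csv file may differ):

-- Keep a handful of columns and sort ascending by City, then Customer Segment.
SELECT customer_id,
       customer_segment,
       city
FROM customers
ORDER BY city ASC,
         customer_segment ASC;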

We could keep adding tools and make the workflow as complex as we want, but we’ll stop here for this example. Almost every Alteryx workflow ends by browsing or saving the results of the workflow.

Let’s browse the results to make sure the data looks correct. Drag a Browse tool on the workflow window and make sure it connects with the Sort tool (Figure 9-67).
Figure 9-67

Browse data tool

Run the workflow by clicking the green run button located in the upper corner of the window, near the Help menu item. After a few seconds, the results of your workflow will be shown in the output window. Inspect the results and see if the correct columns were selected and the sort order is as you specified earlier.

Like Trifacta, Alteryx also profiles your data and shows helpful statistics and information about it, such as the distribution of data in each column, data quality information, the column’s data type, the number of NULL or blank values, average length, and longest length. Figure 9-68 shows statistics about the CustomerID field.
Figure 9-68

Browse data

Click on the Customer Segment field to show statistics about the column (Figure 9-69).
Figure 9-69

Data quality – Customer Segment field

Figure 9-70 shows statistics about the City field.
Figure 9-70

Data quality – City field

As mentioned earlier, we can also save the results to a file or to an external data source such as Kudu, Oracle, SQL Server, or HDFS. Delete the Browse tool from the workflow window and replace it with the Output Data tool (Figure 9-71).
Figure 9-71

Output Data tool

Let’s save the results to a file (Figure 9-71). The configuration window on the left side of the screen lets you set options on how to save the file, including the file format. As shown, you can save the file in different formats such as CSV, Excel, JSON, Tableau data extract, and QlikView data eXchange, to name a few.
Figure 9-72

Select file format

Let’s save the data in CSV format (Figure 9-72). After saving the results, let’s inspect the file with Notepad to make sure that the data was saved correctly (Figure 9-73).
Figure 9-73

Validate saved data

Congratulations! You now have some basic understanding of and hands-on experience with Alteryx.

Datameer

Datameer is another popular data wrangling tool with built-in data visualization and machine learning capabilities. It has a spreadsheet-like interface and includes over 200 analytics functions. xvi Datameer is based in San Francisco, California, and was founded in 2009. Unlike Trifacta and Alteryx, Datameer has no connector for Impala, although it has connectors for Spark, HDFS, Hive, and HBase. xvii

Let’s explore Datameer. The first thing you need to do is upload a data file (Figure 9-74).
Figure 9-74

Upload data file

Specify the file and file type and click Next (Figure 9-75).
Figure 9-75

Specify file type

You can configure the data fields. Datameer allows you to change the data type, field name, and so on (Figure 9-76).
Figure 9-76

Configure the data fields

Data is presented in spreadsheet format (Figure 9-77).
Figure 9-77

Spreadsheet-like user interface

Just like Trifacta and Alteryx, Datameer profiles your data and provides statistics and information about your data fields (Figure 9-78).
Figure 9-78

Datameer Flipside for visual data profiling

Using a feature called Smart Analytics, Datameer provides built-in support for common machine learning tasks such as clustering, classification with decision trees, and recommendations (Figure 9-79).
Figure 9-79

Smart Analytics

Performing clustering can show some interesting patterns about our data (Figure 9-80).
Figure 9-80

Clustering

The data will be grouped into clusters. The groupings are added to your data set as another column, with each cluster identified by a cluster ID (Figure 9-81).
Figure 9-81

Column ID showing the different clusters

Classification can also be performed using decision trees (Figure 9-82).
Figure 9-82

Decision trees

Data will be classified via an additional prediction field as shown in Figure 9-83.
Figure 9-83

Prediction column

As previously mentioned, Datameer includes built-in data visualization (Figure 9-84).
Figure 9-84

Datameer data visualization

Summary

Make sure your big data visualization tool can handle the three V’s of big data (volume, variety, and velocity). If you’re in the process of selecting a tool for your organization, consider whether the tool will be able to handle terabyte- or petabyte-sized data sets. Will your data sources be purely relational, or will you be analyzing semi-structured and unstructured data sets as well? Do you have requirements to ingest and process data in real time or near real time? Try leveraging your existing BI and data visualization tools first and determine whether they can fulfill your requirements. You may be connecting to a Hadoop cluster, but your data set may not be that large. This is a common situation nowadays, with organizations using big data platforms as a cost-effective way to consolidate expensive reporting servers and data marts.

Data wrangling tools have become popular these past few years. Users have become savvier and more demanding. Who better to prepare the data than the people who understand it best? If your organization has power users who insist on preparing and transforming data themselves, then it might be worthwhile to look at some of the data wrangling tools covered in this chapter.

References

i. Christine Vitron, James Holman; “Considerations for Adding SAS® Visual Analytics to an Existing SAS® Business Intelligence Deployment”, SAS, 2018, http://support.sas.com/resources/papers/proceedings14/SAS146-2014.pdf

ii. Zoomdata; “ABOUT ZOOMDATA”, Zoomdata, 2018, https://www.zoomdata.com/about-zoomdata/

iii. Zoomdata; “Real-Time & Streaming Analytics”, Zoomdata, 2018, https://www.zoomdata.com/product/real-time-streaming-analytics/

iv. Zoomdata; “Real-Time & Streaming Analytics”, Zoomdata, 2018, https://www.zoomdata.com/product/stream-processing-analytics/

v.

vi. Zoomdata; “Real-Time & Streaming Analytics”, Zoomdata, 2018, https://www.zoomdata.com/product/stream-processing-analytics/

vii. Zoomdata; “Data Sharpening in Zoomdata”, Zoomdata, 2018, https://www.zoomdata.com/docs/2.5/data-sharpening-in-zoomdata.html

viii. Zoomdata; “Real-Time & Streaming Analytics”, Zoomdata, 2018, https://www.zoomdata.com/product/stream-processing-analytics/

ix. Coderwall; “How to get the correct Unix Timestamp from any Date in JavaScript”, Coderwall, 2018, https://coderwall.com/p/rbfl6g/how-to-get-the-correct-unix-timestamp-from-any-date-in-javascript

x. Gil Press; “Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says”, Forbes, 2018, https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/#3bfc5cdc6f63

xi. Tye Rattenbury; “Six Core Data Wrangling Activities”, Datanami, 2015, https://www.datanami.com/2015/09/14/six-core-data-wrangling-activities/

xii. Stanford; “Wrangler is an interactive tool for data cleaning and transformation”, Stanford, 2013, http://vis.stanford.edu/wrangler/

xiii. Trifacta; “Data Wrangling for Organizations”, Trifacta, 2018, https://www.trifacta.com/products/wrangler-enterprise/

xiv. Trifacta; “The Impala Hadoop Connection: From Cloudera Impala and Apache Hadoop and beyond”, Trifacta, 2018, https://www.trifacta.com/impala-hadoop/

xv. Alteryx; “Alteryx enables access to Cloudera in a number of ways”, Alteryx, 2018, https://www.alteryx.com/partners/cloudera

xvi. Datameer; “Cloudera and Datameer”, Datameer, 2018, https://www.cloudera.com/partners/solutions/datameer.html

xvii. Datameer; “Connector”, Datameer, 2018, https://www.datameer.com/product/connectors/