A description of time-series aggregated CDR data

We used the Oozie coordinator in Chapter 1, Meet Hunk, to import massive amounts of data. The data is partitioned by date and stored in a binary format with a schema (Avro), which is well supported across the whole Hadoop ecosystem, so it looks like a production-ready approach. Now we are going to create a custom application using that data. Have a look at the description of the data.

Here is a description of the data stored in the base table:

  • Square ID: The ID of the square that is part of the Milano grid; type: numeric.
  • Time interval: The beginning of the time interval, expressed as the number of milliseconds elapsed since the Unix epoch (January 1, 1970, UTC). The end of the time interval can be obtained by adding 600,000 milliseconds (10 minutes) to this value.
  • Country code: The phone code of a nation. Depending on the measured activity, this value assumes different meanings, which are explained later.
  • SMS-in activity: The activity in terms of received SMS inside the square ID, during the time interval and sent from the nation identified by the country code.
  • SMS-out activity: The activity in terms of sent SMS inside the square ID, during the time interval and received by the nation identified by the country code.
  • Call-in activity: The activity in terms of received calls inside the square ID, during the time interval and issued from the nation identified by the country code.
  • Call-out activity: The activity in terms of issued calls inside the square ID, during the time interval and received by the nation identified by the country code.
  • Internet traffic activity: The activity in terms of Internet traffic performed inside the square ID, during the time interval and by the nation of the users performing the connection identified by the country code.

The following screenshot is from the site hosting the dataset:

A description of time-series aggregated CDR data

The idea of this dataset is to divide the city into equal areas and map each type of subscriber activity onto these regions. The assumption is that such a mapping can give insights into the relations between the hour of the day, the type of activity, and the area of the city.

Source data

There are two datasets. The first one contains customer activity (CDR). The second one is essentially a dictionary: it holds the exact coordinates of each activity square shown in the earlier screenshot.

Creating a virtual index for Milano CDR

You should refer to the section Setting up virtual index for data stored in Hadoop in Chapter 2, Explore Hadoop Data with Hunk; virtual index creation for the Milano CDR follows the same steps. Here is the CDR layout in HDFS:

[cloudera@quickstart ~]$ hadoop fs -ls -R /masterdata/stream/milano_cdr
drwxr-xr-x   - cloudera supergroup          0 2015-08-03 13:55 /masterdata/stream/milano_cdr/2013
drwxr-xr-x   - cloudera supergroup          0 2015-03-25 02:13 
... some output deleted to reduce its size ...
-rw-r--r--   1 cloudera supergroup   69259863 2015-03-25 02:13 /masterdata/stream/milano_cdr/2013/12/07/part-m-00000.avro
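If you are curious how much space the aggregated data actually takes, a quick size check with the standard hadoop fs -du command (using the same path as in the listing above) is enough; -s prints a single summary line and -h makes the size human-readable:

[cloudera@quickstart ~]$ hadoop fs -du -s -h /masterdata/stream/milano_cdr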

So you have relatively compact aggregated data for the first seven days of December 2013. Let's create a virtual index for December 1, 2013. It should have these settings:

Property name          | Value
Name                   | Milano_2013_12_01
Path to data in HDFS   | /masterdata/stream/milano_cdr/2013/12/01
Provider               | Choose Hadoop hunk provider from the dropdown list
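
Behind the UI, a Hunk virtual index is simply a stanza in indexes.conf. Purely for orientation, and assuming the provider you configured in Chapter 2 is called hadoop-provider (the provider name, and possibly the exact property names, will differ in your setup), the stanza would look roughly like this:

[Milano_2013_12_01]
vix.provider = hadoop-provider
vix.input.1.path = /masterdata/stream/milano_cdr/2013/12/01/...

The trailing ... in the path tells Hunk to pick up everything underneath that directory. You do not need to edit this file by hand; the UI writes it for you.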

Explore it and check that you see this sample data:

Creating a virtual index for Milano CDR
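
Beyond eyeballing the events, you can relate what you see to the field descriptions given earlier. The following sketches assume field names derived from that description (only time_interval appears verbatim later, in the Pig script; names such as square_id, sms_in, and sms_out may differ in the actual Avro schema). The first query derives the end of each 10-minute interval by adding 600,000 milliseconds; the second sums SMS activity per square:

index="Milano_2013_12_01" | eval interval_end = time_interval + 600000 | fields square_id, time_interval, interval_end | head 10

index="Milano_2013_12_01" | stats sum(sms_in) AS sms_in, sum(sms_out) AS sms_out by square_id | sort - sms_in | head 10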

Creating a virtual index for the Milano grid

There is a file that provides longitude and latitude values for squares. We need to create a virtual index on top of this dictionary, to be joined later with the aggregated data. We need actual coordinates to display squares on Google Maps.

The virtual index settings for this so-called geojson data should be:

Property name          | Value
Name                   | geojson
Path to data in HDFS   | /masterdata/dict/milano_cdr_grid_csv
Provider               | Choose Hadoop hunk provider from the dropdown list

Let's try to explore some data from that virtual index:

Creating a virtual index for the Milano grid

You have to scroll down and verify the advanced settings for the index. The names and values should be:

Property name      | Value
Name               | geojson
DATETIME_CONFIG    | CURRENT
SHOULD_LINEMERGE   | False
NO_BINARY_CHECK    | True
disabled           | False
pulldown_type      | True

Save the settings with this name: scv_with_comma_and_title.
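
These advanced settings are standard Splunk source type attributes, so what you save corresponds, roughly, to a props.conf stanza like the one below (shown only for orientation; the UI creates it for you, and the boolean values are usually written in lowercase):

[scv_with_comma_and_title]
DATETIME_CONFIG = CURRENT
SHOULD_LINEMERGE = false
NO_BINARY_CHECK = true
disabled = false
pulldown_type = true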

Verify that you can see lines with longitude, latitude, and squares.

Use the search application to verify that the virtual index is set correctly. Here is a search query; it selects several fields from the index:

index="geojson" | fields square, lon1,lat1 | head 10

The following screenshot shows the sample output:

Creating a virtual index for the Milano grid

Creating a virtual index using sample data

We would like to shorten the feedback loop while developing our application. Let's trim the source data so our queries work faster.

You need to open the Pig editor at http://quickstart.cloudera:8888/pig/#scripts and load the script stored there:

Creating a virtual index using sample data

Then you should see the script. It performs the following steps (the full script is shown assembled a little further down); click the Submit button to run it and create a sample dataset:

  • --remove output path: We need a clean output path before the script runs:
    rmf /masterdata/stream/milano_cdr_sample 
    
  • --add the jar used by AvroStorage, a Pig storage implementation for reading Avro data:
    REGISTER 'hdfs:///user/oozie/share/lib/lib_20141218043440/sqoop/avro-mapred-1.7.6-cdh5.3.0-hadoop2.jar';
    
  • --read input data:
    data = LOAD '/masterdata/stream/milano_cdr/2013/12/01' using AvroStorage();
    
  • --filter: Keep only the lines where time_interval equals 1385884800000L:
    filtered = FILTER data by time_interval ==  1385884800000L;
    
  • --store filtered data:
    store filtered into '/masterdata/stream/milano_cdr_sample' using AvroStorage();
    

Have a look at the Hue UI. It is an editor for Pig scripts; you should find the ready-to-use script on the VM, and the only thing you need to do is click the Submit button:

Creating a virtual index using sample data
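
For reference, the fragments described above assemble into the following complete script (the same code in one place; note the semicolon after the REGISTER statement, which Pig requires):

-- remove output path: we need a clean output path before the script runs
rmf /masterdata/stream/milano_cdr_sample

-- add the jar used by AvroStorage, a Pig storage implementation for reading Avro data
REGISTER 'hdfs:///user/oozie/share/lib/lib_20141218043440/sqoop/avro-mapred-1.7.6-cdh5.3.0-hadoop2.jar';

-- read input data
data = LOAD '/masterdata/stream/milano_cdr/2013/12/01' using AvroStorage();

-- keep only the lines where time_interval equals 1385884800000L
filtered = FILTER data by time_interval == 1385884800000L;

-- store filtered data
store filtered into '/masterdata/stream/milano_cdr_sample' using AvroStorage();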

This script reads the first day of the dataset and filters it by using the time_interval field. This approach significantly reduces the amount of data. You should get the output in a few minutes:

Input(s):
Successfully read 4637377 records (91757807 bytes) from: "/masterdata/stream/milano_cdr/2013/12/01"

Output(s):
Successfully stored 35473 records (2093830 bytes) in: "/masterdata/stream/milano_cdr_sample"

Counters:
Total records written : 35473
Total bytes written : 2093830
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

There will be more lines in the output; just find the ones we've mentioned. They show that Pig read about 4.6 million records and stored roughly 35 thousand, a reduction of more than a hundredfold, which gives us the small dataset we wanted for testing purposes, as described earlier.

Now create a virtual index over the sample data; we will use it while developing our application. Here are the settings:

Property name          | Value
Name                   | milano_cdr_sample
Path to data in HDFS   | /masterdata/stream/milano_cdr_sample
Provider               | Choose Hadoop hunk provider from the dropdown list

Use the search application to check that you can correctly access the sample data:

index="milano_cdr_sample" | head 10

You should see something similar to this:

Creating a virtual index using sample data
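
With both the milano_cdr_sample and geojson virtual indexes in place, you can quickly sanity-check the sample (thanks to the Pig filter it should contain a single distinct time_interval) and sketch the join with the grid dictionary that we prepared it for. As before, sms_in and square_id on the CDR side are assumed field names based on the data description; square, lon1, and lat1 come from the geojson query shown earlier:

index="milano_cdr_sample" | stats dc(time_interval) AS distinct_intervals, count AS events

index="milano_cdr_sample" | stats sum(sms_in) AS sms_in by square_id | join type=inner square_id [search index="geojson" | fields square, lon1, lat1 | rename square AS square_id] | table square_id, lon1, lat1, sms_in

This is only a sketch of the idea; the application described later builds on the same principle of joining activity per square with the grid coordinates so that the squares can be displayed on Google Maps.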