The data at a glance

At this point in the book, with any luck, we have established a familiar project cadence. That is, after a brief dialogue on the objectives of our project, the next step has always been to have a high-level peek at the actual raw data we will be using in the project.

Let's say that in this project we have been provided the following information about the available data:

  • The data gathered for us is based on online user sessions over periods of time and is in the form of a formatted MS Excel report
  • We expect that the data includes various data points about the session, such as time online, date/timestamps, ads and links clicked on, products browsed, products added to cart, products purchased, where the session was initiated, and so on
  • Various demographic information has been added to the data, which includes specific details on the user as well as the products involved in the user sessions, including those products' historical performance

As we have been demonstrating, it is quite easy to load data into Watson Analytics. Before loading it, however, it's advantageous for us to perform some data preparation to ensure that all of the analyses we'll perform are as accurate as possible. Since our data is being provided to us as an MS Excel formatted worksheet, let's take a look at it.

Opening the file in Excel, we see a pretty report:

Scrolling through the data, we notice several things:

  • There is Excel conditional formatting applied
  • There are descriptive/useful column headings
  • There are subtotal lines (this one shows Dollar Amount Sold by Product ID Purchased) and other total lines:

Dollar Amount Sold by Product ID Purchased columns

Generally speaking, conditional formatting that only colors and arranges cells for readability neither helps nor hurts Watson Analytics, but as a rule, you should strip it out of the file before loading it.

Some of the must-do data preparation tasks include the following:

  • Remove filters and hidden rows or columns
  • Remove total lines/columns as well as nested lines and columns
  • Verify that all columns have names

Since our data is Excel-based, it's an easy (although manual) process to perform the data reformatting. Once we've accomplished the cleanup, we can save our data as an unformatted CSV file, a portion of which is shown as follows in Windows Notepad:

Data file 
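Because the cleanup is manual, it can be worth a quick scripted sanity check of the saved CSV before loading it. The following is only a minimal sketch in Python, not part of Watson Analytics: the function name check_csv, the sample column names, and the rule that any cell containing the word "total" marks a total line are all assumptions made for illustration, and the rule would need adjusting to the labels used in your own report.

```python
import csv
import io


def check_csv(text):
    """Return a list of problems found in CSV text (hypothetical helper)."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]

    problems = []
    header = rows[0]

    # Verify that all columns have names.
    for idx, name in enumerate(header, start=1):
        if not name.strip():
            problems.append(f"column {idx} has no name")

    # Flag leftover total/subtotal lines and blank lines.
    # Assumption: a total line contains the word "total" in some cell.
    for line_no, row in enumerate(rows[1:], start=2):
        if not any(cell.strip() for cell in row):
            problems.append(f"line {line_no} is blank")
        elif any("total" in cell.lower() for cell in row):
            problems.append(f"line {line_no} looks like a total line")
        elif len(row) != len(header):
            problems.append(f"line {line_no} has {len(row)} cells, expected {len(header)}")

    return problems


# Made-up sample data; the real report's columns will differ.
sample = (
    "Session ID,Product ID Purchased,,Dollar Amount Sold\n"
    "1001,P-17,web,19.99\n"
    "Subtotal P-17,,,19.99\n"
    "1002,P-22,email,5.00\n"
    "Grand Total,,,24.99\n"
)

for problem in check_csv(sample):
    print(problem)
```

An empty result does not prove the file is clean (hidden rows or filters removed in Excel leave no trace in the CSV), but it catches the two most common leftovers: unnamed columns and total lines.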

Now the data can be loaded into Watson Analytics without concern:
