Look at your data - au naturel

The very first thing you should do with unfamiliar data is look at a good sample of it in raw form. This is an important step that is often overlooked, even by experienced analysts. Your first instinct will be to run averages and standard deviations to look for trends. Resist that impulse.

You want to view the file first without any formatting or interpretation, so you need a software tool that applies none. On a Windows laptop, Notepad works well as long as the file is not too large. Excel is a poor choice here, as it applies its own formatting and interpretation to text files such as .csv.

On a Mac or a Linux machine, use a command-line text viewer. On a Mac, open Terminal, navigate to the directory containing the file, and type the command cat filename.csv | less (or simply less filename.csv). Pressing the spacebar advances one screen; type q to exit. The same commands work from a Linux command line. The following shows how the raw data looks on a Mac or Linux machine; Windows will show a very similar format:
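If you prefer a small script to a pager, the same unformatted look can be had in Python. This is only a sketch: the file here is an invented stand-in (the real NOAA file has more fields), and reading in binary mode is one way to guarantee nothing is decoded or reinterpreted on the way to your screen:

```python
# Stand-in for your real data file (these two rows are hypothetical).
with open("sample.csv", "w") as f:
    f.write("STATION,DATE,QPCP\nCOOP:010008,20130101 00:15,-9999\n")

# Dump the first few lines exactly as stored: binary mode means no
# decoding and no CSV interpretation, and printing the bytes objects
# makes hidden characters such as \r or \t visible.
with open("sample.csv", "rb") as f:
    for _ in range(10):
        line = f.readline()
        if not line:
            break
        print(line)
```

Seeing the line endings and separators literally (b'STATION,DATE,QPCP\n' and so on) is exactly the kind of "au naturel" view this step calls for.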

Example raw data from NOAA U.S. 15 Minute Precipitation Data

From this initial look, you can learn a few things about the makeup of the data and areas you will need to investigate. Some examples are listed next:

  • The first row has the names of the fields in it: This is good, as you will not have to figure out and enter the names yourself. However, you will need to make sure that when the file is loaded into a data store such as HDFS, the first row is not treated as a data row.
  • There are two fields each named Measurement Flag, Quality Flag, and Units: This could be duplicate data, or the position of each field could carry meaning in relation to another field. You will need to know what each means, and which one is which, before you analyze the dataset. The order may differ between the analysis tool and the text file. To prepare for this, find some rows that have different values in the two fields, then note the fields that uniquely identify those rows (such as STATION and DATE). You can use those in the analysis tool to verify which field is which.
  • Values in the QPCP field are either very small positive numbers or -9999: -9999 appears to signify something other than a measurement. Good thing you did not run averages using it.
  • There are quite a few missing values in the Measurement Flag and Quality Flag fields: You will definitely need to review the documentation to understand what this means.
  • The date-time values are not in a common standard format: You might need to do some parsing to get your analytic tools to recognize the value as a date. Note the structure (yyyymmdd hh:mm).
  • There may not be a record for every 15-minute interval: The rows appear to be in date and time order, but there does not appear to be a record for every 15-minute interval in a day. This may not be the case when the entire file is sorted, but you should note it as something to investigate.
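Several of the observations above can be checked programmatically once you understand the layout. The sketch below uses only Python's standard library; the field names (STATION, DATE, QPCP), the -9999 sentinel, and the yyyymmdd hh:mm timestamp format come from the text, but the sample rows are invented, so adapt the code to the real file:

```python
import csv
from datetime import datetime, timedelta

# Invented sample rows resembling the NOAA 15-minute precipitation file.
rows_text = """STATION,DATE,QPCP
COOP:010008,20130101 00:15,0.01
COOP:010008,20130101 00:30,-9999
COOP:010008,20130101 01:00,0.02
"""

SENTINEL = -9999  # appears to mean "not a measurement"; exclude before averaging

values, gaps, prev = [], [], None
for row in csv.DictReader(rows_text.splitlines()):
    # Parse the non-standard "yyyymmdd hh:mm" timestamp.
    ts = datetime.strptime(row["DATE"], "%Y%m%d %H:%M")
    # Flag any row not exactly 15 minutes after the previous one.
    if prev is not None and ts - prev != timedelta(minutes=15):
        gaps.append(ts)
    prev = ts
    qpcp = float(row["QPCP"])
    if qpcp != SENTINEL:
        values.append(qpcp)

print("mean QPCP (sentinels excluded):", sum(values) / len(values))
print("rows preceded by a missing interval:", gaps)
```

Note how the sentinel is filtered out before averaging; including the -9999 values would make the mean meaningless, which is exactly why the raw look comes before any statistics.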

Scan through several screens' worth of data and look for any patterns. If the dataset seems to be sorted, scan through at least one transition from one source device to another to see how the values change. We are not attempting to be scientific at this stage of getting to know the data; the goal is to spot obvious patterns and data quality issues.
