Some Complications in Datasets

Capturing Dates & Times

Dates and times can carry very valuable and interesting information. For example, a customer's sign-up date on a credit card indicates the customer's tenure (how long the customer has been with the company), cohort (a group of people joining at the same time), etc. An end date is an indication of customer turnover. The distance between start and end dates is an indication of total tenure or retention. Times during the day might be captured in situations such as employee shift-work; for instance, you might log call center agents in and out. Therefore, capturing and analyzing dates and times can be very important.
Capturing dates or times is easy enough. All database or spreadsheet programs have methods for entering them, although there are complexities too. Actually analyzing dates or times in a sensible manner is sometimes more difficult. For example, say I want to estimate the tenure (number of months or days employed with the company) of employees based on each employee's start date, and today is 22 July 2016. How do I do it? If Nelson started on 15 February 2010, how do I calculate the tenure, that is, the gap between now and that start date?
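The gap can be computed directly once dates are stored as real date values. A minimal sketch in Python, using the Nelson example from the text (the variable names are my own):

```python
from datetime import date

# Nelson's start date and the "today" used in the text.
start_date = date(2010, 2, 15)
as_of = date(2016, 7, 22)

# Subtracting two dates gives the exact gap in days.
tenure_days = (as_of - start_date).days
print(tenure_days)            # 2349 days

# A rough conversion to years (365.25 accounts for leap years).
print(round(tenure_days / 365.25, 1))   # about 6.4 years
```

This only works because the software treats dates as numbers underneath, which is exactly the protocol described next.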
To solve this problem, data analysis software uses specific protocols. For example, in the background, Microsoft Excel stores dates as serial numbers counting days from 1 January 1900, thereby turning dates into proper numerical variables. The background number for 1 January 2008 is 39448, because that is this date's position in Excel's day count starting at 1 January 1900. Similarly, Excel stores times as fractions of days. SAS uses a similar protocol, just with 1 January 1960 as the start date.
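These serial-number schemes can be reproduced with plain date arithmetic. A small sketch (the epoch constants are my own naming; note that for modern dates Excel's effective epoch is 30 December 1899, because Excel's day 1 is 1 January 1900 and, for historical compatibility, it also counts a non-existent 29 February 1900):

```python
from datetime import date

# Effective epoch reproducing Excel's serial numbers for modern dates.
EXCEL_EPOCH = date(1899, 12, 30)

def excel_serial(d):
    """Return the Excel serial number for a date after Feb 1900."""
    return (d - EXCEL_EPOCH).days

print(excel_serial(date(2008, 1, 1)))   # 39448, matching the text

# SAS counts days from 1 January 1960 (day 0); earlier dates are negative.
SAS_EPOCH = date(1960, 1, 1)
print((date(2008, 1, 1) - SAS_EPOCH).days)   # 17532

# Times are fractions of a day in Excel, so noon on a date is serial + 0.5.
print(excel_serial(date(2008, 1, 1)) + 0.5)  # 39448.5
```

Because dates become ordinary numbers, subtraction, sorting and averaging all work exactly as they do for any other numeric variable.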
This is the protocol used in practically all analysis programs. It allows the program to do date-based analysis very easily, as long as it can recognize that the data represents dates in the first place.

Some More Complex Dataset Types

Sometimes datasets are slightly more complex than those shown so far. Three major areas of complexity (although the datasets are easy enough to construct once you know how) include the following:
  1. Longitudinal datasets, where each observation is measured on the same variables at multiple points in time (e.g. once per year). For example, say you have a sample of customers and each year you measure their sales, satisfaction and trust. In some types of statistical analysis for this case, the dataset captures each observation on multiple rows, one row for each timeframe;
  2. Multi-level datasets where you have data from multiple levels of observations, such as data on both employees and the departments to which they belong.
  3. Multi-row datasets. Another snag is how to deal with observations that appear in different states. Take, for example, an employee database where there has been internal movement of staff, i.e. transfers, promotions or demotions. Many systems deal with this by creating a new data line for the same person every time that person changes job definition, as if the same employee were more than one person. You may encounter this multi-row method in some systems; it sometimes makes statistical analysis a little harder, but SAS deals with the issue comfortably.
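The longitudinal case above can be sketched as a small "long-format" dataset, with one row per customer per year. This is purely illustrative; the customer IDs and measured values are made up:

```python
from collections import Counter

# Long-format data: each customer contributes one row per year,
# measured on the same variables (sales, satisfaction, trust).
rows = [
    {"customer_id": 1, "year": 2014, "sales": 1200, "satisfaction": 7, "trust": 6},
    {"customer_id": 1, "year": 2015, "sales": 1350, "satisfaction": 8, "trust": 7},
    {"customer_id": 2, "year": 2014, "sales": 900,  "satisfaction": 6, "trust": 5},
    {"customer_id": 2, "year": 2015, "sales": 950,  "satisfaction": 6, "trust": 6},
]

# Each customer appears on multiple rows, one per timeframe.
rows_per_customer = Counter(r["customer_id"] for r in rows)
print(rows_per_customer)   # each customer has 2 rows (2 years of data)
```

The same multi-row shape arises in the employee example: one row per person per job state, distinguished by an identifier column rather than by one row per observation.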
Last updated: April 18, 2017