Data-gathering exercise

Since the team wanted to leverage internal data, a data quality exercise had to be conducted. Multiple variables were likely to be needed to predict CPI, and these variables were in turn classified into various categories. The categories were based on the areas where the modelers thought consumers were primarily spending money. The hypothesis was that, by using these categories, they would be able to identify the significant ones that predict CPI. The fewer categories that emerged as significant predictors, the less the team would have to invest in keeping data quality consistent once the model was implemented. The initial list of categories of interest was as follows:

  • Clothing and generic shopping
  • Communication (phone, broadband, and so on)
  • Eating out (restaurants, takeaways, and so on)
  • Education (nursery and school fees)
  • Entertainment (TV subscription, iTunes, Google Play Store, Netflix, and so on)
  • Furniture and home improvements
  • Grocery shopping
  • Spending and savings ratio
  • Travel, including leisure

To derive these classifications, considerable effort was needed to ensure that the variables mapped into each category met the data quality requirements. For the entertainment category, the modelers wanted to include spending on the iPhone and Android app stores. This meant that every such customer transaction, across the various forms of payment, had to be tracked and classified in a data warehouse. Conversations with the IT team revealed that they had started this tracking and classification exercise a few years earlier, and had even run a historic classification exercise to make data going back to 2012 available. They believed that the transaction mappings were correct and that the data was of good quality. The modelers decided to have a look at some of the app spend data.

The following code shows the data quality checks:

/* Playstore app-spend observation dates, read with the DATE. informat */
DATA Playstore;
   FORMAT Date DATE.;
   INPUT Date : DATE. @@;   /* @@ holds the line so several dates are read per record */
DATALINES;
21JAN12 20FEB12 22MAR12 21APR12 21APR12 13MAY12 21JUN12 20JUL12 21AUG12 22SEP12 
15OCT12 16NOV12 21DEC12 21DEC12 20JAN13 19FEB13 19MAR13 18MAY13 17JUN13 18JUL13 
17OCT13 18NOV13 17DEC13 01JAN14 02FEB14 03MAR14 04APR14 05MAY14 06JUN14 07AUG14 
11SEP14 31OCT14 01NOV14 01DEC14 25FEB15 01MAR15 01APR15 01MAY15 01JUN15 01JUL15 
01AUG15 01SEP15 01OCT15 01NOV15 01DEC15 05JAN16 01FEB16 01MAR16 01APR16 01MAY16 
01JUN16 01JUL16 01AUG16 01SEP16 01OCT16 01NOV16 01DEC16 05JAN17 01FEB17 01FEB17 
01APR17 01MAY17 01JUN17 01JUL17 01AUG17 01SEP17 01NOV17 01DEC17 01FEB18 01MAR18 
; 
 
ODS GRAPHICS ON;
/* Diagnose the interval, duplicate observations, and gaps in the series */
PROC TIMEID DATA=Playstore PRINT=ALL PLOT=ALL;
   ID Date INTERVAL=MONTH;
RUN;

They used the TIMEID procedure, which reports the interval between observations, the number of duplicate observations, and pointers to any missing observations. The observations are assumed to be at a monthly interval:

Figure 5.4: TIMEID procedure–partial results of time component

In Figure 5.4, we have the partial results of the time component analysis. As mentioned earlier, we have assumed that the data is at a monthly interval. The Playstore data has a record for January 21, 2012 and one for February 20, 2012. For these rows, the offset is the number of days between the 1st of the month, the start of the monthly interval, and the observation date. It highlights how many days it took for the report to be published with the previous month-end data, which sets expectations on how regularly the data feed will arrive. In an ideal scenario, the previous month-end data should reach the modeler in an aggregate form on the first of each month. This expectation isn't met at the start of the time series. Looking at the following table, it seems that, as the data-collection process evolved, the data began to be made available on the 1st of each month, and the offset value dropped to 0. There is still a four-day offset in JAN2017, when the data wasn't made available on the first. This could be a processing delay owing to holidays at the start of the year:
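The offset logic can be sketched in a few lines of Python. This is an illustrative re-implementation of what TIMEID reports, not part of the SAS workflow, and it assumes the offset is simply the number of days from the 1st of the month to the observation date:

```python
from datetime import date

def monthly_offset(d: date) -> int:
    """Days from the start of the monthly interval (the 1st) to the observation."""
    return d.day - 1

# A few observation dates mirrored from the Playstore series
for d in [date(2012, 1, 21), date(2016, 1, 5), date(2016, 2, 1)]:
    print(d.isoformat(), "offset =", monthly_offset(d))
```

Under this assumption, 21JAN2012 gives an offset of 20, 05JAN2016 gives 4, and an on-time 1st-of-month feed gives 0.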

Figure 5.5: TIMEID procedure–regular reporting instances

Going back to Figure 5.4, we also have the Span column. If you look at the date MAY2013, you can see that the value of Span is 2. This is because no data was available for APR2013. The Span highlights the gap in the data between MAR2013 and MAY2013: since we have specified a monthly interval, the procedure reports that the step from MAR2013 to MAY2013 covers two monthly intervals. The Interval Count column for APR2012 has a value of 2. This again points to a data quality issue, as two rows of data were entered in APR2012, both dated the 21st.
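Both diagnostics can be mimicked in Python. This is a hedged sketch of the idea, not the TIMEID implementation: the span between consecutive observations is the difference in their month indices, and the interval count is the number of observations falling in the same month:

```python
from collections import Counter
from datetime import date

def month_index(d: date) -> int:
    # Consecutive calendar months differ by exactly 1
    return d.year * 12 + (d.month - 1)

# Duplicate APR2012 rows -> an interval count of 2 for that month
apr_rows = [date(2012, 4, 21), date(2012, 4, 21)]
interval_counts = Counter((d.year, d.month) for d in apr_rows)
print(interval_counts[(2012, 4)])   # 2

# MAR2013 followed directly by MAY2013 -> span of 2 (APR2013 missing)
print(month_index(date(2013, 5, 18)) - month_index(date(2013, 3, 19)))  # 2
```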

Figure 5.6 illustrates graphs of Interval Count, Offset, and Span:

Figure 5.6: TIMEID procedure–data decomposition charts

Figure 5.7 provides further details on duplicated time IDs and the number of offset days when the series is assumed to be monthly:

Figure 5.7: TIMEID procedure–offset frequency details

In Figure 5.8, we can see that there are seven instances where a month is missing in the series, each producing a Span of 2. There is also one instance of a Span of 3, as data for both August and September 2013 is missing:
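The frequency summary in Figure 5.8 is essentially a tally of the span values across consecutive distinct months. A minimal Python sketch of that tally, using a small hypothetical series (not the Playstore data) with one one-month gap and one two-month gap:

```python
from collections import Counter
from datetime import date

def month_index(d: date) -> int:
    # Consecutive calendar months differ by exactly 1
    return d.year * 12 + (d.month - 1)

def span_frequencies(dates):
    """Tally the spans (in monthly intervals) between consecutive distinct months."""
    idx = sorted({month_index(d) for d in dates})
    return Counter(b - a for a, b in zip(idx, idx[1:]))

# Hypothetical series: FEB missing (span 2 at MAR), MAY and JUN missing (span 3 at JUL)
series = [date(2013, 1, 15), date(2013, 3, 15), date(2013, 4, 15), date(2013, 7, 15)]
print(span_frequencies(series))  # one span of 1, one of 2, one of 3
```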

Figure 5.8: TIMEID procedure–span component frequency details

This sort of analysis was carried out for the multiple variables that formed the categories of interest for forecasting inflation. Conversations with IT, historic data correction where possible, and imputation of missing values helped resolve most of the data quality issues. After this, the modelers shifted their focus to building the model.
