Chapter 8 described a data handling task involving acquiring data from spreadsheets, a database, JSON, XML, and fixed-width text files. The formats and layouts of the data are documented in that chapter. In this appendix, we give some extra hints about how to proceed. We recommend trying the exercise first, without referring to this appendix until you need to.
Some of these hints come in the form of “pseudocode.” This is the programmer's term for instructions that describe an algorithm in ordinary English, rather than in the strict form of code in R or another computer language. The actual R code we used can be found in the cleaningBook package.
Reading, cleaning, and combining the loan portfolios (Section 8.4) is the first task in the exercise, and perhaps the most time-consuming. However, none of the steps needed to complete it is particularly challenging from a technical standpoint.
If you have a spreadsheet program such as Excel that can open the file, the very first step in this process might be to use that program to view the file. Look at the values. Are there headers? Can you see any missing value codes? Are some columns empty? Do any rows appear to be missing a lot of values? Are values unexpectedly duplicated across rows? Are there dates, currency amounts, or other values that might require special handling?
Then, it is time to read the two data sets into R and produce data frames, using one of the read.table() functions. We normally start with quote = "" or quote = "\"" and comment.char = "". You might select stringsAsFactors = FALSE, in which case numeric fields in the data will appear as numeric columns in the data frame. This sets empty entries in those columns of the spreadsheet to NA and also has the effect of stripping leading zeros in fields that look numeric, like the application number. As one alternative, you might set colClasses = "character", in which case all of the columns of the data frame are defined to be of type character. In this case, empty fields appear as the empty string "". Eventually, columns that are intended to be numeric will have to be converted; on the other hand, leading zeros in application numbers are preserved. From the standpoint of column classes, perhaps the best way to read the data in is to pass colClasses as a vector to specify the classes of individual columns – but this requires extra exploration up front to determine those classes.
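For example, reading everything as character preserves the leading zeros, after which columns meant to be numeric are converted explicitly. The in-line data here are invented for illustration:

```r
# Invented stand-in for a portfolio file
txt <- "Appl,Amount\n0012345678,1000\n0012345679,2500"

# All columns come in as character, so the application numbers
# keep their leading zeros and empty fields would appear as ""
d <- read.csv(textConnection(txt), colClasses = "character")

# Columns intended to be numeric must then be converted explicitly
d$Amount <- as.numeric(d$Amount)
```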
Pseudocode for the step of reading, cleaning, and combining the portfolio data files might look like this. In this description, we treat the files separately. You might equally well treat them simultaneously, creating the joint data set as you go.
for each of {Beachside, Wilson}
read in file
examine keys and missing values
discard unneeded columns
convert formatted columns (currencies, dates) to R types
ensure categorical variables have the correct levels
construct derived variables
ensure categorical variables levels match between data sets
ensure column names match between data sets
ensure keys are unique across data sets
add an identifying column to each data set
join data sets row-wise (vertically)
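As a concrete sketch of these steps, here is a toy version in R. The file contents, column names, and layouts are invented for illustration; the real portfolio files are messier:

```r
# Toy stand-ins for the two portfolio files (all values invented)
beach.txt <- "Appl,Amount,Active\n0012345678,1000,2021-03-01\n0012345679,2500,2021-04-15"
wils.txt  <- "APPLNO,AMT,ACTDATE\n0099887766,1800,2021-02-10"

# Keep the application numbers as character to preserve leading zeros
beach <- read.csv(textConnection(beach.txt),
                  colClasses = c("character", "numeric", "character"))
wils  <- read.csv(textConnection(wils.txt),
                  colClasses = c("character", "numeric", "character"))

# Ensure column names match between data sets
names(wils) <- names(beach)

# Convert formatted columns (here, the date) to R types
beach$Active <- as.Date(beach$Active)
wils$Active  <- as.Date(wils$Active)

# Ensure keys are unique across data sets, add an identifying
# column to each, then join row-wise
stopifnot(!anyDuplicated(c(beach$Appl, wils$Appl)))
beach$Source <- "Beachside"
wils$Source  <- "Wilson"
big <- rbind(beach, wils)
```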
As you will see, the portfolio data sets are messy, just as real-life data is messy. As you move forward, you will need to keep one eye on the data itself, one eye on the data dictionary, and one eye on the list of columns to construct. (We never said this would be easy.) Consider performing the sorts of checks listed here, as well as any others you can think of. Remember to specify exclude = NULL or useNA = "ifany" in calls to table() to help identify missing values.

- Missing values. Are there missing values, and how are they coded?
- Categorical levels. If there are a number of values that apply to only a few records each, it might be wise to combine them into an Other category.
- Currency amounts. Are there values formatted like $1,234.56? Convert them to numeric.
- Dates. Consider whether to convert dates to the Date or POSIXt classes. If all you need to do is to sort dates, you can also use a text format like YYYYMMDD, since these text values will sort alphabetically just as the underlying date values would. Keeping dates as characters can save you a lot of grief when you might inadvertently turn a Date or POSIXt object into numeric form. The date classes are easier to summarize and to use in plots, however. Examine the dates. Are there missing values? What do the ranges look like? Consider plotting dates by month or quarter to see if there are patterns.

Once the two loan portfolios have been combined, we can start in on the remaining data sets. In the sections that follow, we will refer to the combined portfolio data set as big. When we completed this step, our big data frame had 19,196 rows and 17 columns; yours may be slightly different, depending on the choices you made in this section.
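As a small illustration of two of these checks – converting currency strings and surfacing missing values in tables – consider the following, with invented values:

```r
# Currency amounts as they might arrive from a spreadsheet
amt <- c("$1,234.56", "$990.00", "", "$12.00")

# Strip dollar signs and commas before converting; the empty
# string becomes NA (with a coercion warning, suppressed here)
amt.num <- suppressWarnings(as.numeric(gsub("[$,]", "", amt)))

# useNA = "ifany" makes the missing entry show up in the table
table(amt.num > 100, useNA = "ifany")
```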
The agency scores (Section 8.5) are stored in an SQLite database. The first order of business is to connect to the database and look at a few of its records. Do the extracted records match the descriptions in the data dictionary in Chapter 8? Count the numbers of records. If these are manageable, we might just as well read the entire tables into R, but if there are many millions or tens of millions of records we might be better off reading them bit by bit.
In this case, the tables are not particularly large. Pseudocode for the next steps might look as follows:
read Cscore table into an R data frame called cscore
read CusApp table into an R data frame called cusapp
add application number from cusapp to cscore
add active date from big to cscore
discard records for which ScoreMonth > active date
discard records with invalid scores
order cscore by customer number and then by date, descending
for each customer i, for each agency j
find the first non-missing score and date
We did not find a particularly efficient way to perform this last step.
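One workable, if not especially efficient, approach to that last step sorts the scores and then uses duplicated(). The data frame below is invented for illustration:

```r
# Invented scores: one customer/agency pair has a missing score
cscore <- data.frame(Cust       = c(1, 1, 1, 2),
                     Agency     = c("RAY", "RAY", "KSC", "RAY"),
                     ScoreMonth = c("202103", "202101", "202102", "202012"),
                     Score      = c(NA, 650, 700, 610))

# Drop missing scores, then sort so the most recent remaining score
# for each customer-agency pair comes first
ok <- cscore[!is.na(cscore$Score), ]
ok <- ok[order(ok$Cust, ok$Agency, ok$ScoreMonth, decreasing = TRUE), ]

# duplicated() keeps only the first (most recent non-missing) row
# for each customer-agency pair
first <- ok[!duplicated(paste(ok$Cust, ok$Agency)), ]
```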
As with almost all data sets, the agency scores have some data quality issues. Items to consider include the following:

- What do the application numbers in cusapp look like? Do they have the expected numbers of digits? Are any duplicated or missing? What proportion of application numbers from cusapp appears in big, and what proportion of application numbers from big appears in cusapp? Are there customer numbers in cscore that do not appear in cusapp, or vice versa?
- The ScoreMonth column in the cscore table is in YYYYMM format, without the day of the month. However, the active date field from big has days of the month (and, depending on your strategy, might be in Date or POSIXt format). How should we compare these two types of date?

In the final step, we merge the data set of scores with the big data set created by combining the two loan portfolios. Recall that duplicate keys are particularly troublesome in R's merge() function – so it might be worthwhile to double-check for duplicates in the key, which in this case is the application number, in the two data frames being merged. Does the merged version of big have the same number of rows as the original? If the original has, say, 17 columns and the new version 26, you might ensure that the first 17 columns of the new version match the original data, with a command like all.equal(orig, new[, 1:17]).
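A sketch of that double-check, using small invented data frames in place of the real portfolio and score data:

```r
# Hypothetical versions of the combined portfolio and the scores
orig   <- data.frame(Appl = c("001", "002", "003"), Amt = c(10, 20, 30))
scores <- data.frame(Appl = c("002", "003"), KSC = c(700, 650))

# Duplicate keys multiply rows in merge(), so check both sides first
stopifnot(!anyDuplicated(orig$Appl), !anyDuplicated(scores$Appl))

new <- merge(orig, scores, by = "Appl", all.x = TRUE)

# The merged version should preserve the row count, and its first
# columns should still match the original data
stopifnot(nrow(new) == nrow(orig))
stopifnot(isTRUE(all.equal(orig, new[, 1:2])))
```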
The co-borrower scores (Section 8.6) make up one of the trickiest of the data sets, because of their custom transactional format (Section 7.4). We start by noting that there are 73,477 records in the data set, so it can easily be handled entirely inside R. If there had been, say, a billion records, we would have needed to devise a different approach. We start off with pseudocode as follows:
read data into R using scan() with sep = "\n"
discard records with codes other than ERS or J01
discard ERS records with invalid scores
At this point, we are left with the J01 records, which name the application number, and the ERS records, which give the scores. It will now be convenient to construct a data frame with one row for each score. This will serve as an intermediate result, a step toward the final output, which will be a data frame with one row for each application number. This intermediate data frame will, of course, contain an application number field. (This is just one approach, and other equally good ones are possible.) We take the following steps:
extract the application numbers from the J01 records
use the rle() function on the part of the string
that is always either J01 or ERS
Recall that each application is represented by a J01 record followed by a series of ERS records. If the data dictionary is correct, the result of the rle() function will be a run of length 1 of J01, followed by a run of ERS, followed by another run of length 1 of J01, followed by another run of ERS, and so on. So, values 2, 4, 6, and so on of the lengths component of the output of rle() will give the number of ERS records for each account. Call that subset of length values lens. Then, we operate as follows:
construct a data frame with a column called Appl,
produced by replicating the application numbers
from the J01 records lens times each
add the score identifier (RAY, KSC) etc.
add the numeric score value
add the score date
insert the active date from big
delete records for which the score date > active date
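Here is a toy version of the rle() bookkeeping, with invented codes and application numbers. (Selecting the runs whose value is ERS is equivalent to taking elements 2, 4, 6, and so on when the records alternate as described.)

```r
# Record codes in file order: each J01 header is followed by the
# ERS score records for that application
codes <- c("J01", "ERS", "ERS", "J01", "ERS", "J01", "ERS", "ERS", "ERS")
appl  <- c("0012345678", "0012345679", "0012345680")  # from the J01 records

r <- rle(codes)
lens <- r$lengths[r$values == "ERS"]   # number of ERS records per account

# Replicate each application number once per ERS record
Appl <- rep(appl, lens)
```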
At this stage, we have a data set with the information we want, except that some application numbers might still have multiple records for the same agency. A quick way to get rid of the older, unneeded rows is to construct a key consisting of a combination of application number and agency (e.g., 3004923564.KSC). Now sort the data set by key, and by date within key, in descending order. We need to keep only the first record for every key – that is, we can delete records for which duplicated(), acting on that newly created key field, returns TRUE.
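In miniature, with invented keys and dates:

```r
# Combined application-number-plus-agency keys and score dates
key  <- c("3004923564.KSC", "3004923564.KSC", "3004923565.RAY")
date <- c("20201115", "20210301", "20210110")

# Sort by key, and by date within key, both descending
o <- order(key, date, decreasing = TRUE)
key  <- key[o]
date <- date[o]

# Keep only the first (most recent) record for each key
keep <- !duplicated(key)
```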
Following the deletion of older scores, we have a data set (call it co.df) with all the co-borrower scores, arrayed one per score. (When we did this, we had 4928 scores corresponding to 1567 application numbers.) Now we can use a matrix subscript (Section 3.2.5) to assemble these into a matrix that has one row per application number. We start by creating a data frame with the five desired columns: places for the JNG, KSC, NTH, and RAY scores and the application number. It is convenient to put the application number last. Call this new data frame co.scores. Then, we continue as follows:
construct the vector of row indices by matching the
application number from co.df to the application number
from co.scores. This produces a vector of length 4,928
whose values range from 1 to 1,567
construct the vector of column indices by matching the score
id from co.df to the vector c("JNG", "KSC", "NTH",
"RAY"). This produces a vector of length 4,928
whose values range from 1 to 4
combine those two vectors into a two-column matrix
use the matrix to insert scores from co.df into co.scores
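The matrix-subscript assembly might look like this on a tiny invented co.df:

```r
# Invented intermediate data: one row per co-borrower score
co.df <- data.frame(Appl  = c("0000000001", "0000000001", "0000000002"),
                    Id    = c("KSC", "RAY", "JNG"),
                    Score = c(700, 650, 610))

agencies <- c("JNG", "KSC", "NTH", "RAY")
appls <- unique(co.df$Appl)

# One row per application number, one column per agency
m <- matrix(NA_real_, nrow = length(appls), ncol = length(agencies),
            dimnames = list(NULL, agencies))

# Row indices from the application numbers, column indices from the
# score identifiers; combine into a two-column matrix subscript
ij <- cbind(match(co.df$Appl, appls), match(co.df$Id, agencies))
m[ij] <- co.df$Score

# Application number goes last, as in the text
co.scores <- data.frame(m, Appl = appls)
```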
As with the other data sets, the most important issues with the co-borrower data are key matching, missing values, and illegal values. You might examine points like the following:

- What proportion of the application numbers appear in big? In our experience the proportion of loans with co-borrowers has often been on the order of 20%. Are there application numbers that do not appear in big? If there are a lot of these we might wonder if the application numbers had been corrupted somehow.
- Are the score identifiers all expected values – KSC, NTH, RAY, JNG, and so on? Are the score values mostly in the 200–899 range? Do the dates look valid?

When the co-borrower data set is complete, we can merge it to big. Since big already has columns named JNG, KSC, NE, and RAY, it might be useful to rename the columns of our co.scores before the merge to Co.JNG and so on. (At this point, you have probably noticed that the NorthEast score is identified by NE in the scores database but by NTH in the co-borrower scores. It might be worthwhile making these names consistent.)
A final task in this step is to add a co-borrower indicator, which is TRUE when an application number from big is found in our co.scores and FALSE otherwise. Actually, this approach has the drawback that it will report FALSE for applications with co-borrowers for which every score was invalid or more recent than the active date (if there are any). If this is an issue, you will need to go back and modify the script by creating the indicator before any ERS records are deleted. This is a good example of why documenting your scripts as you go is necessary.
The updated KScores (Section 8.7) are given in a series of JSON messages. For each message, we need to extract fields named appid and record-date, if they exist. We start with pseudocode as follows:
read the JSON into R with scan() and sep = "\n"
remove messages in which the string "KScore" does not appear
write a function to handle one JSON message: return appid,
KScore, record-date if found, and (say) "none" if not
apply that function to each message
Using sapply() in that last command produces a three-row matrix that should be transposed. Using lapply() produces a list of character vectors of length 3 that should be combined row-wise, perhaps via do.call(). In either case, it will be convenient to produce a data frame with three columns. We continue to work on that data frame with operations as follows:
remove rows with invalid scores
add active date from big
remove rows for which KScore date is > active date
keep the most recent row for each application id
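The per-message extraction function from the earlier pseudocode might be roughed out in base R with regular expressions, without committing to a particular JSON package. The message format and the helper name here are invented:

```r
# One invented JSON message
msg <- '{"appid": "0012345678", "KScore": "712", "record-date": "20210415"}'

# Extract the quoted value following "field": ; return "none" if absent
get.field <- function(field, msg) {
  pat <- paste0('"', field, '"\\s*:\\s*"([^"]*)"')
  m <- regmatches(msg, regexpr(pat, msg))
  if (length(m) == 0) return("none")
  sub(pat, "\\1", m)
}

# Apply the helper to each wanted field of the message
vals <- sapply(c("appid", "KScore", "record-date"), get.field, msg = msg)
```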
This final data set contains the KScores that are eligible to update the ones in big. Presumably, it only makes sense to update scores for which the one in big is either missing, or carries an earlier date than the one in our updated KScore data. When we did this, we found 1098 applications in the updated KScore data set, and they all corresponded to scores that were missing in big. So, all 1098 of these KScore entries in big should be updated. In a real example, you would have to be prepared to update non-missing KScores as well.
As with the other data sets, the important checks are for keys, missing, duplicated, and illegal values. In this case, we might consider some of the following points:

- Do the application ids in the updated KScore messages match application numbers in big?

The Excluder files (Section 8.8) are in XML format. We need to read each XML file and determine whether it has (i) a field called APPID and (ii) a field called EXCLUDE, found under one called DATA. In pseudocode form, the task for the excluder data might look as follows:
acquire a vector of XML file names from the Excluders dir.
use sapply to loop over the vector of names of XML files:
read next XML file in, convert to R list with xmlParse()
if there is a field named "APPID", save it as appid
otherwise set appid to None
if there is a field named "EXCLUDE" save it as exclude
otherwise set exclude to None
return a vector of appid and exclude
The result of this call to sapply() will be a matrix with two rows. It will be convenient to transpose it, then convert it into a data frame and rename the columns, perhaps to Appl and ExcCode. Call this data frame excluder.

As usual, there are checks to perform:

- Are any application numbers duplicated? If an application number appears more than once in the excluder data set, the corresponding record will be deleted, just as if it had appeared once.
- Are the values in the ExcCode column what we expect (“Yes,” “No,” “Contingent”)? We can now drop records for which ExcCode is anything other than Yes or Contingent.
- Do the application numbers in excluder match the ones in the big (combined) data set? Do they have 10 digits? If there are records with no application number, or whose application numbers do not appear elsewhere, they can be dropped from the excluder data set.

When we are satisfied with the excluder data set, we can remove the matching rows from big. When we did this, our data set ended up with 19,034 rows and between 30 and 40 columns, depending on exactly which columns we chose to save.
Recall from Section 8.9 that every row of the payment matrix contains a fixed-layout record consisting of a 10-digit application number, then a space, and then 48 repetitions of numeric fields of lengths 12, 12, 8, and 3 characters. Those values will go into a vector that will define the field widths. We might also set up a vector of column names, by combining names like Appl and Filler with 48 repetitions of four names like Pay, Delq, Date, and Mo. We pasted those 48 replications together with the numbers 48:1 to create unique and easily tractable names. We also know that the first two columns should be categorical; among the repeaters, we might specify that all should be numeric, or perhaps that the two amounts should be numeric and the date and number of months, character.
With that preparation, we are ready to read in the file, using read.fwf() and the widths, names, and column classes just created. We called that data frame pay.
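A scaled-down version of that read, with two repetitions of the four fields instead of 48 (all values invented):

```r
# Build one fixed-width record: 10-digit application number, a
# space, then two repetitions of fields of widths 12, 12, 8, and 3
flds <- c("000000100.00", "000000000.00", "20210101", "  0",
          "000000050.00", "000000025.00", "20210201", "  1")
line <- paste0("0012345678 ", paste(flds, collapse = ""))

tf <- tempfile()
writeLines(line, tf)

widths <- c(10, 1, rep(c(12, 12, 8, 3), 2))
nms <- c("Appl", "Filler",
         paste0(c("Pay", "Delq", "Date", "Mo"), rep(2:1, each = 4)))

pay <- read.fwf(tf, widths = widths, col.names = nms,
                colClasses = "character", strip.white = TRUE)
unlink(tf)
```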
Our column naming convention makes it easy to extract the names of the date columns with a command like grep("Date", names(pay)). For example, recall that we will declare as “Indeterminate” any record with six or fewer non-zero dates. We can use apply() across the rows of our pay data frame, operating just on the subset of date columns, and run a function that counts the non-zero entries, so that we can check whether that count is six or fewer. (Normally, we are hesitant to use apply() on the rows of a data frame. In this case, we are assured that the relevant columns are all of identical types and lengths.)
Now, we create a column of status values, and wherever the result of this apply() is TRUE, we insert Indet into the corresponding spots in that vector. Our code might look something like the following:
set up pay$Good as a character vector in the pay data frame
apply to the rows of pay[, grep("Date", names(pay))]
    a function that counts the number of entries that are
    not 0 or "00000000"
set pay$Good to Indet when that count is <= 6
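On a scaled-down pay data frame with four date columns instead of 48 (and the Indeterminate threshold scaled down accordingly, from six to one), this might look like:

```r
# Invented payment data: the second account has only one payment date
pay <- data.frame(Appl  = c("0000000001", "0000000002"),
                  Date4 = c("20210101", "00000000"),
                  Date3 = c("20210201", "00000000"),
                  Date2 = c("20210301", "00000000"),
                  Date1 = c("20210401", "20210501"))

datecols <- grep("Date", names(pay))

# Count the non-zero dates in each row; all of these columns are
# character of identical layout, so apply() over rows is safe here
nonzero <- apply(pay[, datecols], 1,
                 function(x) sum(x != "0" & x != "00000000"))

pay$Good <- ""                      # status column, filled in stages
pay$Good[nonzero <= 1] <- "Indet"   # threshold scaled down from 6
```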
Our column naming convention also makes it easy to check our work. For example, we might pick out some number haphazardly – say, 45 – and look at the entries in pay for the 45th Indet entry. (This feels wiser than using the very first one, since the first part of the file might be different from the middle part.) In this example, our code might look as follows:
pick some number, like, say, 45
extract the row of pay corresponding to the 45th Indet
-- call that rr
pay[rr, grep ("Date", names (pay))] shows that row
Now we perform similar actions for the other possible outcomes. At this stage, it makes sense to keep the different sorts of bad outcomes separate, combining them only at the end. One bad outcome is when an account is 3 months delinquent. We apply() a function that determines whether the maximum value in any of the month fields is 3. For those rows that produce TRUE, we insert a value like bad3 into the pay$Good column, unless it is already occupied with an Indet value. Again, it makes sense to examine a couple of these records to ensure that our logic is correct.
Another bad outcome occurs when there are at least two instances of month values equal to 2, and a third is when the delinquency value passes $50,000. In each of these cases, we update the pay$Good vector, ensuring that we only update entries that have not already been set. It is a good idea to check a few of the records for each indicator as we did earlier.
Records that are assigned neither Indet nor one of the bad values are assigned Good. Once we have tabulated the bad values separately, we can combine them into a single Bad value. This way, we can compare the frequencies of the different outcomes to see if what we see matches what we expect.
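Collapsing the separate bad codes into a single Bad value at the end might look like this; the specific code values (bad3 and so on) are names we made up earlier:

```r
# Invented status vector after the staged assignments
Good <- c("Good", "bad3", "Indet", "bad50", "Good", "bad2")

table(Good)   # inspect the frequencies of the separate outcomes first

# Combine the detailed bad outcomes into a single Bad value
Good[Good %in% c("bad3", "bad2", "bad50")] <- "Bad"
```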
There are lots of ways that these payment records can be inconsistent. The extent of your exploration here might depend on the time available. But we give, as examples, some of the questions you might ask of this data.

- Do the application numbers in pay match the ones in big? What is the right action to take for booked records with no payment information – set their response to Indet?
- If a large proportion of the records ends up assigned Indet we might be concerned.
- Each date should be either in YYYYMMDD format or else a zero or set of zeros. Are they? Are adjacent dates 1 month apart, as we expect? Are there zeros in the interior of a set of non-zero dates?

Once the response variable has been constructed, we can merge big with pay. Actually, since we only want the Good column from pay, we merge big with the two-column data set consisting of Appl and Good extracted from pay. In this way, we add the Good column into big.
Our data set is now nearly ready for modeling, although we could normally expect to convert character columns to factor first. Moreover, we often find errors in our cleaning processes or anomalous data when the modeling begins – requiring us to modify our scripts. In the current example, let us see if the Good column constructed above is correlated with the average of the four scores. This R code shows one way we might examine that hypothesis using the techniques in the book.
> big$AvgScore <- apply (big[,c("RAY", "JNG", "KSC", "NE")],
1, mean, na.rm = T)
> gb <- is.element (big$Good, c("Good", "Bad")) &
!is.na (big$AvgScore)
The gb vector is TRUE for those rows of big for which the Good column has one of the values Good or Bad, and AvgScore is present, meaning that at least one agency score was present.
We can now divide the records into groups, based on average agency score. Here we show one way to create five groups; we then tabulate the numbers of Good and Bad responses by group.
> qq <- quantile (big$AvgScore, seq (0, 1, len=6),
na.rm = TRUE)
> (tbl <- table (big$Good[gb], cut (big$AvgScore[gb], qq,
include.lowest = TRUE), useNA = "ifany"))
[420,598] (598,649] (649,696] (696,748] (748,879]
Bad 641 620 514 449 384
Good 607 721 811 926 1067
> round (100 * prop.table (tbl, 2), 1)
[420,598] (598,649] (649,696] (696,748] (748,879]
Bad 51.4 46.2 38.8 32.7 26.5
Good 48.6 53.8 61.2 67.3 73.5
We can see a relationship between agency scores and outcome from the tables. The first shows the counts, and the second uses prop.table() to compute percentages within each column. In the group with the lowest average score (left column), there are about as many Bad entries as Good ones. As we move to the right, the proportion of Good entries increases; in the group with the highest average scores, almost 75% of entries are Good ones. This is consistent with what we expect, since higher scores are supposed to be an indication of a higher probability of good response.