Introduction

What is data cleaning? In this book, we define data cleaning to include:

  • Making sure that the raw data were accurately entered into a computer-readable file.

  • Checking that character variables contain only valid values.

  • Checking that numeric values are within predetermined ranges.

  • Checking whether there are missing values for variables where complete data are necessary.

  • Checking for and eliminating duplicate data entries.

  • Checking for uniqueness of certain values, such as patient IDs.

  • Checking for invalid date values.

  • Checking that an ID number is present in each of “n” files.

  • Verifying that more complex multi-file rules have been followed. For example, if an adverse event of type X occurs in one data set, you expect an observation with the same ID number in another data set. In addition, the date of this observation must be after the adverse event and before the end of the trial.
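The last two checks in the list can be made concrete with a small sketch. It is written in Python purely as an illustration and is not one of the book's SAS programs; the data, the field names, the trial end date, and the helper names `ids_missing_from` and `followup_violations` are all hypothetical. It also assumes that "after" and "before" mean strictly after and strictly before.

```python
from datetime import date

# Hypothetical sample data: names, event types, and dates are assumptions.
TRIAL_END = date(1999, 12, 31)

adverse_events = [
    {"id": "001", "type": "X", "date": date(1999, 3, 1)},
    {"id": "002", "type": "X", "date": date(1999, 5, 10)},
    {"id": "003", "type": "Y", "date": date(1999, 6, 2)},
]
followups = [
    {"id": "001", "date": date(1999, 3, 15)},  # after the event, before trial end
    {"id": "002", "date": date(1999, 4, 1)},   # precedes the adverse event
]


def ids_missing_from(files):
    """Map each file name to the IDs found in some file but not in that one."""
    all_ids = set().union(*({r["id"] for r in recs} for recs in files.values()))
    return {
        name: sorted(all_ids - {r["id"] for r in recs})
        for name, recs in files.items()
    }


def followup_violations(events, followups, event_type="X", trial_end=TRIAL_END):
    """Return IDs that have an event of event_type but no follow-up record
    dated strictly after the event and strictly before the trial end."""
    bad = []
    for ev in events:
        if ev["type"] != event_type:
            continue
        has_valid_followup = any(
            f["id"] == ev["id"] and ev["date"] < f["date"] < trial_end
            for f in followups
        )
        if not has_valid_followup:
            bad.append(ev["id"])
    return bad


missing = ids_missing_from({"adverse": adverse_events, "followup": followups})
violations = followup_violations(adverse_events, followups)
```

Here `missing` reports which IDs are absent from each file, and `violations` lists the patients whose follow-up observation breaks the date rule.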

This book provides many programming examples to accomplish the tasks listed above. In many cases, a given problem is solved in several ways. For example, numeric outliers are detected in a DATA step by using formats and informats, by using SAS procedures, and by SQL queries, which are presented together in Chapter 8. Throughout the book, there are useful macros that you may want to add to your collection of data cleaning tools. Even if you are not experienced with SAS macros, most of the macros first appear in non-macro form, so you should still be able to follow the programming concepts.

But there is another purpose for this book. It provides instruction on intermediate and advanced SAS programming techniques. One of the reasons for providing multiple solutions to data cleaning problems is to demonstrate specific features of SAS programming. The more complex programs and macros in this book are described in detail.

It is impossible to provide an example of every data cleaning task. Indeed, some studies require custom programming. For those cases, the tools that are developed in this book can be the jumping-off point for more complex programs.

Many applications that require accurate data entry use customized, and sometimes very expensive, data entry and verification programs. A chapter on PROC COMPARE shows how SAS software can be used in a double-entry data verification process.
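The idea behind double-entry verification is that the same forms are keyed twice and the two resulting files are compared value by value. The following Python sketch illustrates only that idea; it is not PROC COMPARE and does not reproduce its output. The records, the field names, and the function `compare_entries` are hypothetical.

```python
def compare_entries(first, second, key="id"):
    """Return a list of (key, field, value_in_first, value_in_second) mismatches.

    A record present in only one file is reported with the field "<record>"
    and two booleans showing which file contains it.
    """
    by_key_1 = {rec[key]: rec for rec in first}
    by_key_2 = {rec[key]: rec for rec in second}
    diffs = []
    for k in sorted(by_key_1.keys() | by_key_2.keys()):
        r1, r2 = by_key_1.get(k), by_key_2.get(k)
        if r1 is None or r2 is None:
            diffs.append((k, "<record>", r1 is not None, r2 is not None))
            continue
        for field in sorted(r1.keys() | r2.keys()):
            if r1.get(field) != r2.get(field):
                diffs.append((k, field, r1.get(field), r2.get(field)))
    return diffs


# Hypothetical first and second keying of the same forms.
entry_1 = [
    {"id": "001", "sbp": 120, "dbp": 80},
    {"id": "002", "sbp": 140, "dbp": 90},
]
entry_2 = [
    {"id": "001", "sbp": 120, "dbp": 80},
    {"id": "002", "sbp": 410, "dbp": 90},  # transposed digits
    {"id": "003", "sbp": 118, "dbp": 76},  # keyed only the second time
]

differences = compare_entries(entry_1, entry_2)
```

Each reported difference points to a value that should be checked against the original form.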

Chapter 9 describes the use of validation data sets. In a step-by-step process, programs and macros are developed that can read all of the rules for character and numeric variables from a raw data file (called a validation data file) and produce a validation data set and an exception report. The use of integrity constraints, new with Version 7 SAS software, is also discussed.
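A rough sketch of the rule-driven approach, written in Python rather than SAS, is shown below. The layout of the validation data file (variable name, a type code of C or N, then either a list of valid values or a low and high bound) is an assumption made for this illustration, as are the function names `read_rules` and `exception_report`. The sketch also assumes that a missing numeric value counts as an exception.

```python
import io

# Hypothetical validation data file, one rule per line:
#   variable  type  valid values (type C)  or  low high (type N)
RULES_TEXT = """\
gender C M F
heart_rate N 40 100
sbp N 80 200
"""


def read_rules(stream):
    """Parse rule lines into {variable: ("C", set_of_values) or ("N", (low, high))}."""
    rules = {}
    for line in stream:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        name, kind, args = parts[0], parts[1].upper(), parts[2:]
        if kind == "C":
            rules[name] = ("C", set(args))
        elif kind == "N":
            rules[name] = ("N", (float(args[0]), float(args[1])))
        else:
            raise ValueError(f"unknown rule type {kind!r} for {name}")
    return rules


def exception_report(records, rules, key="id"):
    """Return (key, variable, value) for every value that violates its rule."""
    report = []
    for rec in records:
        for name, (kind, spec) in rules.items():
            value = rec.get(name)
            if kind == "C":
                ok = value in spec
            else:
                # Assumption: a missing numeric value is reported as an exception.
                ok = value is not None and spec[0] <= value <= spec[1]
            if not ok:
                report.append((rec[key], name, value))
    return report


rules = read_rules(io.StringIO(RULES_TEXT))
records = [
    {"id": "001", "gender": "M", "heart_rate": 72, "sbp": 120},
    {"id": "002", "gender": "f", "heart_rate": 210, "sbp": None},
]
exceptions = exception_report(records, rules)
```

Keeping the rules in a separate file means that a new study needs a new rule file, not new checking code.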

Although all of the programs in this book were tested by using either Version 7 or Version 8 SAS software, most of the programs should run under Release 6.12, perhaps with some minor changes (such as shortening variable names). However, the integrity constraints discussed in Chapter 9 require using Version 7 or later.

I have enjoyed writing this book. Writing any book is a learning experience and this book is no exception. I hope that most of the egregious errors have been eliminated. If any remain, I take full responsibility for them. Every program in the text has been run against sample data. However, as experience will tell, no program is foolproof.
