Debugging Errors [3-5]


Data Error

A data error is another type of error you will encounter in SAS.

It is an error in the data itself (e.g. age = -300 or sex = "K").

In this section, we will look at how to identify data errors that are not easy to spot by simply browsing the data set.

Copy and run the code from the yellow box below:

The DEMO data set contains six columns:

  • ID
  • Gender
  • Ethnicity
  • Age
  • Weight
  • Height

At first glance, the data looks fine, with nothing unusual.

However, let's check the data in a more systematic way.

We will run both Proc Means and Proc Freq on the data:

proc means data=demo;
run;

proc freq data=demo; 
table gender ethnicity; 
run;

Proc Means generates summary statistics for the numeric variables.

The means of age, weight and height are 40, 75 and 171, respectively.

These seem fine. However, the maximum values look strange:

It looks like we have some subjects who are 200 years old, weigh 2000 kg and are 300 cm tall.

Upon checking, there are no Avengers characters in this data set, so we can conclude that these values are data errors and need to be corrected.
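Since the minimum and maximum are what exposed the problem here, it can help to request only the range-related statistics. The step below is a minimal sketch: it assumes the numeric columns are named age, weight and height, as in the column list above.

```sas
/* Show only the statistics most useful for spotting impossible values. */
proc means data=demo n nmiss min max mean maxdec=1;
  var age weight height;   /* assumed names of the numeric columns */
run;
```

The N and NMISS columns also show how many values are present and how many are missing for each variable.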

Let's also look at the Proc Freq results:

There is one observation where the gender is marked as "K". 

Also, there are two subjects with an ethnicity of "Blue".

These also need to be corrected.
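By default, Proc Freq leaves missing values out of the frequency table. As a sketch, the MISSING option can be added so that blank values of gender or ethnicity appear as their own category, and NOCUM drops the cumulative columns to keep the output short:

```sas
/* List every distinct value, including blanks, for the character columns. */
proc freq data=demo;
  tables gender ethnicity / missing nocum;
run;
```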

Now we are going to create a data set that contains these outliers.

Example

data outliers;
set demo;
where gender not in ('M', 'F') or
      ethnicity = 'Blue' or
      age > 150 or
      height > 250 or
      weight > 300;
run;

The data step above creates a new data set called "outliers" that keeps only the rows meeting at least one of the WHERE conditions.
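One simple way to list the flagged rows is Proc Print. This is only a sketch; the variable names in the VAR statement are assumed to match the six columns described earlier.

```sas
/* Print the suspicious records without the observation number column. */
proc print data=outliers noobs;
  var id gender ethnicity age weight height;   /* assumed column names */
run;
```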

Below are the six records that contain incorrect values:

These values all look strange. We will blank out these values in our data set.

Example

data demo2;
set demo;
if gender not in ('M', 'F') then gender = '';
if ethnicity = 'Blue' then ethnicity = '';
if age > 150 then age = .;
if height > 250 then height = .;
if weight > 300 then weight = .;
run;

The code above finds all the incorrect values and replaces them with missing values: a blank for character variables and a period (.) for numeric variables.

Now, let's run Proc Means and Proc Freq on the new data set:
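Note that the numeric checks above test upper bounds only, so a value such as age = -300 would pass through unchanged. The variation below is a sketch that also applies lower bounds. The lower limits and the output name demo_checked are assumptions made for illustration; they are not part of the original example.

```sas
data demo_checked;   /* hypothetical output data set name */
  set demo;
  if gender not in ('M', 'F') then gender = '';
  if ethnicity = 'Blue' then ethnicity = '';
  /* In SAS, a missing numeric value compares as less than any number,
     so values that are already missing simply stay missing here. */
  if age < 0 or age > 150 then age = .;            /* assumed lower bound */
  if height <= 0 or height > 250 then height = .;  /* assumed lower bound */
  if weight <= 0 or weight > 300 then weight = .;  /* assumed lower bound */
run;
```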

proc means data=demo2;
run;

proc freq data=demo2; 
table gender ethnicity; 
run;

The results look fine now without the outliers:


Exercise

Copy and run the code below in SAS:

The STUDY data set contains six columns:

  • STATUS: Dead or Alive
  • SEX: Male or Female
  • AGEATSTART: Age (must be 18 or above)
  • SMOKING: number of cigarettes smoked each day
  • CHOLESTEROL: Cholesterol
  • SMOKING_STATUS: the smoking classification based on the number of cigarettes smoked each day

Carefully review each column and report if there are any data issues.
