A data error is another type of error you will encounter in SAS.
It is an error in the data itself (e.g. age = -300 or sex = "K").
In this section, we will look at how to identify data errors that are hard to spot by simply looking through the data set.
Copy and run the code from the yellow box below:
The DEMO data set contains six columns:
At first glance, the data looks fine, with nothing unusual.
However, let's check the data in a more systematic way.
We will run both Proc Means and Proc Freq on the data:
proc means data=demo;
run;

proc freq data=demo;
    table gender ethnicity;
run;
Proc Means generates summary statistics for the numeric variables.
The means of age, weight and height are 40, 75 and 171, respectively.
These seem fine. However, the maximum values look strange:
It looks like we have some subjects who are 200 years old, weigh 2000 kg and are 300 cm tall.
Upon checking, there are no Avengers characters in this data set, so we can conclude that these values are data errors and need to be corrected.
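Because these errors show up in the extremes rather than in the means, it can help to ask Proc Means for only the range statistics. The following is a minimal sketch, assuming the numeric columns are named age, weight and height as in the text above:

```sas
** Range check: count, missing count, minimum and maximum only **;
proc means data=demo n nmiss min max;
    var age weight height;
run;
```

Listing the variables on the VAR statement keeps the output limited to the columns being checked.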
Let's also look at the Proc Freq results:
There is one observation where the gender is marked as "K".
Also, there are two subjects with an ethnicity of "Blue".
These also need to be corrected.
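Rare or unexpected categories are easier to spot when the frequency table is sorted by count. As a sketch, the ORDER=FREQ option sorts each table by descending frequency, so unusual values such as "K" or "Blue" fall to the bottom, and the MISSING option makes blank values appear as their own row:

```sas
** Sort categories by descending frequency and show missing values as a level **;
proc freq data=demo order=freq;
    tables gender ethnicity / missing;
run;
```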
Now we are going to create a data set that contains these outliers.
data outliers;
    set demo;
    where gender not in ('M', 'F') or
          ethnicity = 'Blue' or
          age > 150 or
          height > 250 or
          weight > 300;
run;
The data step above creates a new data set called "outliers".
Below are the six records that contain incorrect values:
These values all look strange. We will blank out these values in our data set.
data demo2;
    set demo;
    if gender not in ('M', 'F') then gender = '';
    if ethnicity = 'Blue' then ethnicity = '';
    if age > 150 then age = .;
    if height > 250 then height = .;
    if weight > 300 then weight = .;
run;
The code above will find all the incorrect values and replace them with a missing value.
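The numeric checks above only test upper bounds, so a negative value such as the age = -300 mentioned at the start of this section would pass through unchanged. If the data could contain such values, a lower bound can be added in the same way. This is only a sketch: the lower bound of 0 and the data set name demo3 are assumptions, not part of the original example.

```sas
** Sketch only: the lower bound of 0 and the name demo3 are assumptions **;
data demo3;
    set demo2;
    ** SAS sorts a missing numeric value below any number, so a value **;
    ** that is already missing also meets these tests and stays missing **;
    if age < 0 then age = .;
    if height < 0 then height = .;
    if weight < 0 then weight = .;
run;
```

The same behavior matters when filtering rather than cleaning: a WHERE clause such as age < 0 would also select records where age is missing.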
Now, let's run Proc Means and Proc Freq on the new data set:
proc means data=demo2;
run;

proc freq data=demo2;
    table gender ethnicity;
run;
The results look fine now without the outliers:
Copy and run the code below in SAS:
The STUDY data set contains six columns:
Carefully review each column and report if there are any data issues.
Need some help?
Use Proc Means and Proc Freq to compute the summary statistics of each variable.
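** Summary statistics for numeric and character variables **;
proc means data=study;
run;
** With no TABLE statement, Proc Freq tabulates every variable **;
proc freq data=study;
run;
** Found issues with age and smoking **;
** Proc Print is an assumed wrapper that lists the flagged records **;
proc print data=study;
    where ageatstart < 18 or
          smoking > 100;
run;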