Master SAS in 30 days!

A jargon-free, easy-to-learn SAS base course that is tailor-made for students with no prior knowledge of SAS.

The Ultimate Guide To Proc SQL

Wondering how you can use SQL code within SAS? Looking to become a more efficient, dynamic programmer?
 
While most tasks that one can do with Proc SQL can also be accomplished using Base SAS, there are often ways to do a task in Proc SQL with fewer lines of code and in a more efficient manner. Depending on your datasets and environment, Proc SQL code can sometimes be faster to run than Base SAS code as well!
 
In this article, we will show you 9 different ways to manipulate and analyze your data using the PROC SQL procedure. Comparisons on how to accomplish the same task with base SAS code are also made throughout the article to help demonstrate any efficiencies that can be gained with the Proc SQL method shown.

Software

Before we continue, make sure you have access to SAS Studio. It's free!

Data Sets

The examples used in this article are based on the CLASS and CLASSFIT data sets from the SASHelp library.

For some examples in this article, you will also need to use a modified version of the CLASSFIT dataset. Here, we will create a dataset called CLASSFIT_MALES which contains all variables from CLASSFIT but only those records where Sex is ‘M’ (i.e. only those records for males). To create this temporary dataset and save it in your work directory, use the following code:

data classfit_males;
 set sashelp.classfit;
  where sex = 'M';
run; 

After running the code above, the classfit_males dataset will remain in your WORK directory for the duration of your SAS session. If you need to close and re-open your SAS session for any reason, simply re-run the code above to re-create the classfit_males dataset to be used with some examples later in this article.

The classfit_males dataset should appear as follows after running the above code:

[Don't have the software yet? Download SAS Studio here for free.]

1. Display Data Set

The most basic usage of Proc SQL is to display (or print) all variables (columns) and observations (rows) from a given dataset in the SAS Results window. Using the SASHELP.CLASS dataset with Base SAS code, you can see here how to print the entire dataset to the results window using the PRINT procedure:
 
proc print data=sashelp.class;
run; 
 
With Proc SQL, the equivalent results can be obtained using a SELECT statement. To display all columns in the Results window, an  asterisk(*) is used following a SELECT to indicate that you would like to keep all variables (columns) in the output. A call to Proc SQL is concluded with a semi-colon, followed by a QUIT statement and another semi-colon as shown here:

proc sql;
 select * from sashelp.class
 ;
quit;   

Both Proc Print and the Proc SQL with SELECT shown above produce the following results:

For datasets with a large number of variables, it may be preferable to only view a subset of those variables. With Proc Print, this would be accomplished with a VAR statement. Here, the VAR statement is used to print only the Name and Age variables from SASHELP.CLASS:
 
proc print data=sashelp.class;
 var name age;
run;
 
In Proc SQL, the same output is produced by replacing the  asterisk(*) from the previous example with a comma-delimited list of the desired variables as shown here:

proc sql;
 select name,age from sashelp.class;
quit;

Both the Proc Print and Proc SQL statements shown above produce the following output of Name and Age in the Results window:

2. Creating Data Set From Existing Data

Similar to the Data Step in base SAS programming, Proc SQL can also be used to create new datasets from existing data. To create a new dataset in the WORK library called class_new, which contains all the variables and observations from SASHELP.CLASS, the Base SAS data step is used along with a SET statement as follows:
 
Data class_new;
 Set sashelp.class;
Run;
 
In Proc SQL, the equivalent output dataset is produced by using CREATE TABLE and AS statements before the SELECT statement:

proc sql;
 create table class_new as
  select * from sashelp.class
  ;
quit; 

To limit which variables are retained in the resulting dataset, the asterisk(*) can be replaced in the Proc SQL code with a comma delimited list of variables. To keep only name and height in the class_new dataset, the following code can be used:

proc sql;
 create table class_new as
  select name,height from sashelp.class;
quit;

This would be equivalent to using a keep statement with the data step code in Base SAS:
 
Data class_new;
 Set sashelp.class;
 Keep name height;
Run;
 
Both the Proc SQL and Base SAS Code shown above produce the following dataset:

3. Dynamically Create Variables

So far, the examples shown require roughly the same amount of code with either Base SAS or Proc SQL. However, using Proc SQL, variables can easily be created dynamically within the SELECT statement, something that requires separate lines of code in base SAS.
 
Using the CLASS dataset as an example, say that you would like to know what the height is for each person in both inches and centimeters. Using a single select statement, the name variable is kept, the current height variable can be renamed to height_inches, and a new height variable called height_cm can be created dynamically by multiplying the height in inches by 2.54 to convert to centimeters. 

proc sql;
 create table class_heights as
 select name, height as height_inches, (height*2.54) as height_cm from sashelp.class;
quit;

The same result can also be produced with the following Base SAS code, but requires the use of the rename and keep statement, in addition to a separate statement for creating the new height_cm variable.
 
data class_heights_base;
 set sashelp.class;
 
 rename height = height_inches;
 height_cm = height*2.54;
 
 keep name height height_cm;
run;
 
Both the Proc SQL and Base SAS code shown above produce the following table:

Do you have a hard time learning SAS?

Take our Practical SAS Training Course for Absolute Beginners and learn how to write your first SAS program!

4. Produce Frequencies/Counts

Proc SQL is also a useful tool for producing frequencies/counts of values within groups. Recall, that using Proc Freq, you can easily obtain the frequencies (or counts) of the distinct values found within a variable using the TABLES statement:

Proc Freq data=sashelp.class;
 Tables sex;
Run;

Which produces the following table (the frequencies are highlighted below):

A similar result, without the percentages, can be produced by using the COUNT function in Proc SQL. The variable for which the values you’d like to count are enclosed in parentheses after the COUNT function. The “as Count” is included after the count(sex) to also give the newly created variable the name count (otherwise it will appear blank in the Result window). 

proc sql;
 select sex,count(sex) as Count from sashelp.class
run;

However, you will notice the results produced by this code are not typically the desired results:

At this point, SAS does not know that you would like the counts for both Males and Females separately, so it simply counts how many values are found with the variable sex throughout the entire dataset.
 
To obtain the desired results in this case, (i.e. the counts of both males and females separately) an additional GROUP BY statement needs to be added to the Proc SQL code:

proc sql;
 select sex,count(sex) as count from sashelp.class
  group by sex;
run;

With the GROUP BY added in, SAS now knows to retrieve the counts for each group, Males and Females and outputs the following result:

5. Frequencies/Count with Multiple By Variables

When compared to PROC FREQ, Proc SQL also has some advantages when it comes to reporting on multiple by variables or variable groupings. Continuing to use the SASHELP.CLASS dataset as an example, consider that you would like to determine the distribution of males and females within each age group for this set of people.
 
Using PROC FREQ in base SAS, this requires two steps. First, you need to sort your dataset by age and sex (in that order) as shown here:
 
proc sort data=sashelp.class out=class;
 by age sex;
run; 
 
Note that we also need to output a new temporary dataset in the WORK library called CLASS, as we are not able to modify datasets stored in the SASHELP library.
 
With the dataset properly sorted, you can now use PROC FREQ as before, except with two by variables, age and sex:
 
proc freq data=class;
 tables sex;
 by age sex;
run;
 
Unfortunately, the PROC FREQ results that are output are split into multiple tables, with one table for each unique combination of BY variable values. For datasets where there are many potential values for either BY variable, this will result in a very lengthy output, as shown here:

Here, there are 2 distinct advantages to using PROC SQL over PROC FREQ when you are looking to determine counts within your dataset using multiple grouping (BY) variables.
 
The first advantage is that you do NOT need to sort your dataset before running the PROC SQL code. This not only requires less code (and thus less typing) but it can be more efficient as well, depending on your environment and the size of your data. The second advantage is that the counts for all groups are included into a single easy to read table.
 
To obtain counts for multiple grouping variables in your dataset with PROC SQL, the code is nearly identical to the code for producing frequencies with a single variable. You simply specify the variables you would like to report on in your SELECT statement (age, sex), include the COUNT function with the variable that you would like frequencies for, and then end with a GROUP BY statement that lists both variables, in the order you would like to group them:

proc sql;
 select age,sex,count(sex) as count from sashelp.class
  group by age,sex;
quit;

Which produces the following easy to read output:

6. Combining/Merging/Joining Datasets

One of the best uses of PROC SQL is for combining datasets. When it comes to combining data, PROC SQL not only requires less code but can often be more computationally efficient when compared with Base SAS.

Depending on your datasets and the task at hand, there are many different scenarios for combining data. The four simplest and likely the most common scenarios are as follows:

There are other combinations of joins that one may come across, however, once you understand the fundamentals behind these 4 joins it is easy to apply your knowledge to other scenarios you may come across.
 
In addition to the desired output, it is also important to consider the contents of the datasets which you are intending to combine. When combining data, there are essentially 4 different types of merges: one-to-one, one-to-many, many-to-one and many-to-many.
 
A one-to-one merge is when you have a unique ID (or Key) variable in each dataset that you are combining. For example, if you are are combining File A with File B shown below, notice that there is only 1 record for each ID in both File A and File B

Proc freq data = sashelp.cars order=freq;
 Tables type /out=cars_freq outcum;
Run;

With a one-to-many merge, there is only 1 record for each ID in File A and multiple records for each ID in file B:

A many-to-one merge is simply the reverse of the example above, with repeating ID values in File A and unique ID values in File B.
 
In a many-to-many merge, both File A and File B have repeating ID values:

These different dataset scenarios are important to consider when deciding whether or not to combine your data using a Base SAS Data Step Merge or a PROC SQL join. The key difference to note is that when you are doing a one-to-one or one-to-many merge, the Data Step Merge and equivalent PROC SQL join produce identical results. However, with a many-to-many merge, the PROC SQL join produces a different result than what you would expect the equivalent Data Step Merge to produce. With a PROC SQL JOIN, a many-to-many join produces what is known as the Cartesian product, which is explained later in the article.
 
In most practical situations of combining data, one-to-one merge or one-to-many merge are the most common. As such, the following X examples on Full Join, Inner Join, Left and Right Join use datasets which have a unique ID variable, so the results produced by the Data Step Merge and PROC SQL join will be identical. Keep in mind that if you use similar code on your data which has a many-to-many relationship, the Data Step and SQL results may not always be the same. This difference is later outlined in the Cartesian product section.

Full Join

First, let’s look at a simple two dataset join (merge) which will keep all records that match between datasets A and B, the non-matching dataset A records, and the non-matching dataset B records. This is the first diagram where all circles are shaded shown above, and will be accomplished with a FULL JOIN in PROC SQL.
 
In this example, we will combine the SASHELP.CLASS dataset to the WORK.CLASSFIT_MALES dataset which we created earlier. Note that when referring to a dataset saved in your temporary work directory, it is not necessary to specify the WORK library name before the dataset name.
 
Let’s look at how this would be accomplished in Base SAS first. Using Base SAS, combining datasets is a two-step process. The first step is to identify which variable in each dataset you are going to merge on and then sort both datasets by that variable. Note that these variables must also have identical variable names. Here, we will sort and merge both SASHELP.CLASS and WORK.CLASSIT_MALES on the variable NAME, as shown here:
 
proc sort data=sashelp.class out=class;
 by name;
run;
 
proc sort data=classfit_males;
 by name;
run;
 
Recall that we also need to create a temporary copy of the CLASS dataset in the WORK library as we are unable to modify datasets found in the SASHELP library.
 
After the datasets have been sorted, you will use the SAS data step combined with the MERGE statement. The BY statement after the MERGE also needs to be included here to ensure that records with matching NAME from both datasets are combined, as shown here:
 
data classfit_combined;
 merge class classfit_males;
  by name;
run
 
The result is a dataset which contains all matching records (based on NAME) from CLASS and CLASSFIT, as well as those records which did not match on the NAME variable.
 
An important note here is that any variables with common names between the two datasets will be overwritten by the second dataset listed in the MERGE statement. For example, if there is a record for “Alfred” in both CLASS and CLASSFIT_MALES, the values for SEX, AGE, HEIGHT and WEIGHT will be overwritten by the values for those same variables found in CLASSFIT_MALES because CLASSFIT_MALES also contains SEX, AGE, HEIGHT and WEIGHT variables.
 
In this example, we know that the Females found in CLASS will not be found in the CLASSFIT_MALES dataset we created, and so all those records with missing values for the variable PREDICT are actually the non-matching records since the PREDICT values only come from the CLASSFIT_MALES dataset, as shown here:

Performing an equivalent task with PROC SQL has two advantages:

  • The first advantage is that you do not need to sort your datasets prior to completing the merge (know as a “join” in SQL).
  • The second advantage is that the variables you are merging on from both datasets do not need to have identical variable names (unlike with a data step merge in Base SAS). Note that in this example, the variables do happen to have identical names already, so we are not taking advantage of this here.

To complete a JOIN using PROC SQL, you first need to start with a CREATE TABLE statement (assuming you would like to create a dataset), followed by a SELECT statement pointing to the first dataset that you plan to join to.

In this example, we plan to create a dataset called CLASSFIT_COMBINED_SQL, starting with selecting all records from the CLASS dataset, as shown here:

proc sql;
 create table classfit_combined_sql as
 select * from class
;
quit;  

Building on the above, we now add 1 more line to indicate the type of join we would like to do, which in this case is a FULL JOIN followed by the variable(s) that we are joining on (1 from each table in this example):

proc sql;
 create table classfit_combined_sql as
 select * from class
  full join classfit_males on class.name = classfit_males.name
;
quit;  

The resulting output should now be identical to the data step merge shown above.
 
Although the output dataset should be the same as the above data step merge, you will notice that with this PROC SQL code you will see the following warnings in your SAS log, which we did not see with the data step merge:

Become a Certified SAS Specialist

Get access to two SAS base certification prep courses and 150+ practice exercises

These appear in our select statement because in this example, the Name, Age, Sex, Height and Weight variables are included in both datasets, thus prompting a WARNING as two variables can not exist in the same dataset with the same name.

Similar to the Data Step, when you complete a SQL join and use “SELECT *”, SAS will automatically pick one of the duplicate variables to keep. In the case of PROC SQL, the first instance of the variable that is selected will be kept. In our example, the value for Sex found in the CLASS dataset will be kept, and the value for Sex found in CLASSFIT_MALES will be ignored.

To get around this, you need to be specific about which variables you would like to keep from which table.

For example, say you would like to keep all the variables from CLASSFIT, but only the PREDICT, LOWERMEAN and UPPERMEAN variables from CLASSFIT_MALES.

To do this, you need to include the table name, followed by a period and then the desired variable name. You can also use a table name with the asterisk(*) to include all variables from a given table a shown here:

proc sql;
 create table classfit_combined_sql as
 select class.*,
 classfit_males.predict,
 classfit_males.lowermean,
 classfit_males.uppermean from class
  full join classfit_males on class.name = classfit_males.name
;
quit;  

Which produces the following dataset:

You may be thinking that if you have long dataset names, listing variables in this manner in your PROC SQL code will become long and tedious. Fortunately, you can also take advantage of SQL alias’ within PROC SQL to create short forms for your table names.
 
To assign an alias, you simply put “AS <ALIAS NAME>” after either a FROM <TABLE NAME> or JOIN <TABLE NAME> within your PROC SQL code, as shown here: 

proc sql;
  create table classfit_combined_sql as
  select cl.*,
  clm.predict,
  clm.lowermean,
  clm.uppermean from class as cl
   full join classfit_males as clm on cl.name = clm.name
  ;
quit;

Once you have assigned an alias, you can use that alias anywhere within your PROC SQL code that you need to reference a table, such as in your SELECT statement or with your JOIN statements, as shown above.
 
Ideally, you want to make your alias’ as short as possible without making them unrecognizable. Usually two or three letter references with some meaningful letters will work for most queries (here “cl” is used a short for  class and an m is added at the end to make “clm” used for classfit_males).

Inner Join

An Inner Join is the second scenario shown in the Venn diagram earlier in this article whereby only those records that are found in both datasets A+B are preserved.
 
Again, let’s start by demonstrating how this would be done in Base SAS. In this example, we will also combine the SASHELP.CLASS dataset to the WORK.CLASSFIT_MALES dataset which we created earlier, but only keep those people who are found in both the CLASS and CLASSFIT_MALES datasets, and exclude any records which do not match based on NAME.
 
As before, we will sort and merge both SASHELP.CLASS and WORK.CLASSIT_MALES on the variable NAME, as shown here:
 
proc sort data=sashelp.class out=class;
 by name;
run;
 
proc sort data=classfit_males;
 by name;
run;
 
After the datasets have been sorted, you will again use the SAS data step combined with the MERGE statement.
 
Under this scenario, we will also need to use the IN dataset option which creates a variable to track which observation originated from which dataset. Note the variable created by the IN option is not available in the SAS dataset itself but can be used throughout the SAS data step in which it was created.
 
The BY statement is again included after the MERGE which also needs to be here to ensure that records with matching NAME from both datasets are combined.
 
Lastly, we include a simple IF statement to tell SAS that if the record is from dataset “a” and dataset “b”, then we would like to include it in our final dataset:
 
data classfit_combined_inner;
 merge class(in=a) classfit_males(in=b);
  by name;
 
 if a and b;
run;

Using PROC SQL, the code is nearly identical to the first example of a full join. As before, we do not need to pre-sort our datasets before combining them and all we need to do here is simply replace the “FULL JOIN” with an “INNER JOIN”.

proc sql;
 create table classfit_combined_sql as
 select * from class
  inner join classfit_males on class.name = classfit_males.name
;
quit;  

With both the above Base SAS code and PROC SQL inner join, the resulting dataset should contain all variables from both CLASS and CLASSFIT_MALES, but only have 10 rows since only the matching names (i.e. the males in this case) are kept from both datasets:

*Note the remaining 4 variables LOWERMEAN, UPPERMEAN, LOWER and UPPER are not shown in the screenshot, but should also be found in your dataset

Left Join

A Left Join is the third scenario shown in the Venn diagram earlier in this article whereby the matching A+B records are preserved in addition to the non-matching records from dataset A.
 
As before, let’s look at how this would be done in Base SAS. In this example, we will again combine the SASHELP.CLASS dataset to the WORK.CLASSFIT_MALES dataset which we created earlier, but keep those people who are found in both the CLASS and CLASSFIT_MALES datasets in addition to those people found in the CLASS dataset but not in the CLASSFIT_MALES.
 
As before, we will sort and merge both SASHELP.CLASS and WORK.CLASSIT_MALES on the variable NAME, as shown here:
 
proc sort data=sashelp.class out=class;
 by name;
run;
 
proc sort data=classfit_males;
 by name;
run;
 
After the datasets have been sorted, you will again use the SAS data step combined with the MERGE statement and the IN operator described earlier.
 
The BY statement is again included after the MERGE which also needs to be here to ensure that records with matching NAME from both datasets are combined.
 
Lastly, we include a simple IF statement to tell SAS that if the record is from dataset “a”, then we would like to include it in our final dataset:
 
data classfit_combined_inner;
 merge class(in=a) classfit_males(in=b);
  by name;
 
 if a; 
run;
 
Using PROC SQL, the code is again nearly identical to the FULL JOIN and INNER JOIN. As before, we do not need to pre-sort our datasets before combining them and all we need to do here is simply replace the “INNER JOIN” from the previous example with a “LEFT JOIN”.

proc sql;
 create table classfit_combined_sql as
 select * from class
  left join classfit_males on class.name = classfit_males.name
;
quit;  

With both the above Base SAS Data Step Merge and PROC SQL LEFT JOIN, the resulting dataset should contain all variables from both CLASS and CLASSFIT_MALES. The resulting dataset should also have 19 records, the same number of records the original CLASS dataset since we are keeping all matching records plus all original records from CLASS:

*Note the remaining 4 variables LOWERMEAN, UPPERMEAN, LOWER and UPPER are not shown in the screenshot, but should also be found in your dataset

Right Join

A RIGHT JOIN is the fourth scenario shown in the Venn diagram earlier in this article and is simply the opposite of the LEFT JOIN. With a RIGHT JOIN, the matching A+B records are preserved in addition to the non-matching records from dataset B (as opposed to the LEFT JOIN where the non-matching records from dataset A are preserved in addition to the matching A+B records).
 
Similar to the LEFT JOIN example, we will again combine the SASHELP.CLASS dataset to the WORK.CLASSFIT_MALES dataset which we created earlier, but this time keep those people who are found in both the CLASS and CLASSFIT_MALES datasets, in addition to those people found in the CLASSFIT_MALES dataset but not in CLASS.
 
As before, we will sort and merge both SASHELP.CLASS and WORK.CLASSIT_MALES on the variable NAME, as shown here:
 
proc sort data=sashelp.class out=class;
 by name;
run;
 
proc sort data=classfit_males;
 by name;
run;
 
The Base SAS data step code is essentially identical to the code used for the LEFT JOIN example, but this time we will modify the IF statement to tell SAS that if the record is from dataset “b”, then we would like to include it in our final dataset:
 
data classfit_combined_inner;
 merge class(in=a) classfit_males(in=b);
  by name;
 
 if b; 
run;
 
Using PROC SQL, the code is again nearly identical to the LEFT JOIN example. As before, we do not need to pre-sort our datasets before combining them and all we need to do here is simply replace the “LEFT JOIN” from the previous example with a “RIGHT JOIN”.

proc sql;
 create table classfit_combined_sql as
 select * from class
  right join classfit_males on class.name = classfit_males.name
;
quit;  

With both the above Base SAS Data Step Merge and PROC SQL RIGHT JOIN, the resulting dataset should contain all variables from both CLASS and CLASSFIT_MALES. The resulting dataset from this example should also have 10 records, the same number of records as the original CLASSFIT_MALES dataset since we are keeping all matching records plus all original records from CLASSFIT_MALES (and excluding any non-matching records from CLASS):

*Note the remaining 4 variables LOWERMEAN, UPPERMEAN, LOWER and UPPER are not shown in the screenshot, but should also be found in your dataset
 

Cartesian Product

As noted earlier, a unique feature of PROC SQL is that when you are attempting a many-to-many merge, the results from a PROC SQL join are different than the comparable code from a Base SAS data step merge. With a many-to-many join in PROC SQL, SAS produces what is known as the Cartesian product.
 
With a basic Cartesian product join, the number of rows in the resulting table is the product of the number of rows found in each of the input tables.
 
In this basic example, PROC SQL is used to select all records from both SASHELP.CLASS and SASHELP.CLASSFIT:

proc sql;
 create table cart_product as
 select * from sashelp.class,sashelp.classfit
 ;
quit

Under this scenario, PROC SQL combines every row from the first table (CLASS, n=19) with every row from the second table (CLASSFIT, n=19) and does not factor in an ID or matching variable. The result is a dataset that is 19x19=361 rows long.
 
This may or may not be useful depending on the desired outcome or task at hand but it is important to understand how the Base SAS data step MERGE and a PROC SQL JOIN behave differently when doing a many-to-many merge.
 
To demonstrate how Base SAS data step MERGE and PROC SQL JOIN behave differently, we will create two new datasets which have duplicate ID values, and then attempt to combine them in the following example.
 
In order to get duplicate ID values using SASHELP data, we will append a copy of the SASHELP.CLASS dataset to itself in addition to combining a copy of the SASHELP.CLASSFIT dataset with itself. To combine two datasets vertically (i.e. stack one on top of the other) we will use the SET statement:
 
data classfit_duplicates;
 set sashelp.classfit sashelp.classfit;
run;
 
data class_duplicates;
 set sashelp.class sashelp.class;
run;
 
After sorting these datasets by name, you can see that we now have two records for every person in each of the two datasets:
 
proc sort data=classfit_duplicates;
 by name;
run;
 
proc sort data=class_duplicates;
 by name;
run; 
 
Here is what WORK.CLASS_DUPLICATES should look like now:

 
Similarly, this is what WORK.CLASSFIT_DUPLICATES should look like:
 

*Note the remaining 4 variables LOWERMEAN, UPPERMEAN, LOWER and UPPER are not shown in the screenshot, but should also be found in your dataset

Next, to combine these datasets and keep only the matching records, we would use the following data step merge code as before:

data classfit_combined_cart;
 merge class_duplicates(in=a) classfit_duplicates(in=b);
  by name;

 if a and b;
run;

The resulting dataset should contain 38 records, the same number of records as both CLASS_DUPLICATES and CLASS_DUPLICATES. Here, the first occurrence of a name from A is matched with the first occurrence of the same name from B, and the second occurrence of a name from A is combined with the second occurrence of the same name from B: 

  • CLASS Name 1, 1st occurrence <merged with> CLASSFIT Name 1, 1st occurrence
  • CLASS Name 1, 2nd occurrence <merged with> CLASSFIT Name 1, 2nd occurrence
  • CLASS Name 2, 1st occurrence <merged with> CLASSFIT Name 2, 1st occurrence
  • CLASS Name 2, 2nd occurrence <merged with> CLASSFIT Name 2, 2nd occurrence

Which produces the following dataset:

*Note the remaining 4 variables LOWERMEAN, UPPERMEAN, LOWER and UPPER are not shown in the screenshot, but should also be found in your dataset

However, when doing a comparable INNER JOIN with PROC SQL, the results are quite different:

proc sql;
 create table classfit_combined_sql_cart as
 select * from class_duplicates
  inner join classfit_duplicates on class_duplicates.name = classfit_duplicates.name
  order by name
;
quit;

Here, the resulting table contains the Cartesian product as described earlier but only keeps those records that satisfy the INNER JOIN condition. In other words, the first occurrence of Name 1 from CLASS_DUPLICATES is matched to the first and also the second occurrence of Name 1 in CLASSFIT_DUPLICATES. In addition, the second occurrence of Name 1 is also matched to the first and second occurrence of Name 1 in CLASSFIT_DUPLICATE, as shown here:
 
  • CLASS Name 1, 1st occurrence <merged with> CLASSFIT Name 1, 1st occurrence
  • CLASS Name 1, 1st occurrence <merged with> CLASSFIT Name 1, 2nd occurrence
  • CLASS Name 1, 2nd occurrence <merged with> CLASSFIT Name 1, 1st occurrence
  • CLASS Name 1, 2nd occurrence <merged with> CLASSFIT Name 1, 2nd occurrence
 
The resulting table should now contain 4 records for each name for a total of 76 rows, as shown here:
*Note the remaining 4 variables LOWERMEAN, UPPERMEAN, LOWER and UPPER are not shown in the screenshot, but should also be found in your dataset
 

7. Summary Statistics

In addition to having the ability to create frequencies in PROC SQL, you can also use PROC SQL to calculate other summary statistics on a single variable such as the sum, minimum, maximum or mean (average). While these calculations can also be done with Base SAS in PROC MEANS, PROC SQL gives you the ability to add these directly to your dataset or use them in other calculations dynamically on the fly.
 
First, let’s look at a simple example of  how to calculate the sum, min, max  and mean of the HEIGHT variable in SASHELP.CLASS using both Base SAS and PROC SQL.
 
Using Base SAS, this can be done using PROC MEANS as noted above. In the SAS code below, the summeanmin and max options with the proc means the statement tells SAS which statistics we would like to calculate. The VAR statement is used to specify which variable(s) we would like to calculate these statistics for.
 
proc means data=sashelp.classfit sum mean min max;
 var height;
run;
 
Using the above code, the following output is produced:
 

With PROC SQL, the same summary statistics can be calculated in a SELECT statement using the sum, min, max and mean functions followed by the variable of interest in parentheses.

Since each calculation is creating a new variable at the same time, these variables will have a blank name in the output. To add a column name to the output, you can use AS followed by an appropriate name:

proc sql;
 select sum(height) AS Sum,
        min(height) AS Minimum,
        max(height) AS Maximum,
        mean(height) AS Mean from sashelp.class
 ;
quit;

Using the code above, the following output is produced:

While both PROC MEANS and PROC SQL can produce similar output for summary statistics, PROC SQL provides you with the unique ability to add these values directly to a dataset and also use them to calculate new values.
 
Using the SASHELP.CLASS dataset, we can for example calculate the difference between each student’s height and the average height of all students, and then add the result directly to a dataset with a single PROC SQL call.
 
As before, we use CREATE TABLE to create a new dataset called height_diff. Next, we use the SELECT statement to specify which variables we would like to keep in the height_diff dataset. Here, we will keep nameageheight from the original dataset and add two new variables. The first new variable, avg_height is created by using AVG function with height in parentheses. The variable height_diff is then created by subtracting the average height from the original height variable as shown here:
 

proc sql;
 create table height_diff as
  select name,
              age,
              height,
              avg(height) as avg_height,
              height-AVG(height) as height_diff from sashelp.class
  ;
quit;

Using the code above, the following dataset is produced:

To further enhance this output, we can use additional code in the same PROC SQL call to add variable labels and format the calculated values to 1 decimal place. To add a variable label, simply add label=”<variable label>” after each variable that you would like to label in your SELECT statement. Similarly, to add a format simply include format=<format> after each variable that you wish to apply the format to. In this example, we use the numeric format 4.1 to ensure we have a numeric length of 4 and round to 1 decimal place for that variable. This is shown here:

proc sql;
 create table height_diff as
  select name,
              age,
              height label='Original Height',
              avg(height) as avg_height label='Average Height of all Students' format=4.1,
              height-AVG(height) as height_diff label='Difference in Height from Average' format=4.1
              from sashelp.class
  ;
quit;

With the label and format statements added, the resulting dataset should look like the following:

*Note that in SAS studio, you must select “View: Column labels” as highlighted above to see the column labels instead of column names in the dataset.

8. Creating Macro Variables

Another useful feature of PROC SQL is the ability to easily create macro variables based on existing data. The simplest example, which becomes a useful utility in more complex SAS macros, is to create a macro variable which contains the number of records found in a dataset.
 
First, let’s look at how to get the number of records in a dataset. Again using the SASHELP.CLASS dataset, we simply use the COUNT function with an asterisk in parentheses to indicate that we would like a count of all the records, as shown here:

proc sql;
 select count(*) from sashelp.class
 ;
quit;

This produces the following output, indicating that we have 19 records in the SASHELP.CLASS dataset.

To put this value into a macro variable, the INTO statement is used followed by a colon and the name of the macro variable which we would like to create. In this example, we are creating a macro variable called numrecs, as shown here:

proc sql;
 select count(*) into :numrecs from sashelp.class
 ;
quit;

To test that the macro variable was properly created, the %put statement can be used to output the value of the macro variable to the log. To reference the newly created macro variable, an ampersand (&) needs to be added directly in front of the macro variable name, as shown here:

%put The number of records is &numrecs.;

After running the above PROC SQL statement in conjunction with the %put statement, you will see the following in the SAS log:

9. SQL Dictionary Tables

The dictionary tables are a valuable resource available to SAS PROC SQL programmers. In the simplest terms, the dictionary tables contain what is known as “Metadata” or, in other words, data about data. The dictionary tables contain high level, real-time information about the datasets, options, macros and various other characteristics applicable in your SAS session.
 
There are a variety of uses for the dictionary tables, which are limited only by your imagination. Here, we will outline a few use examples of the dictionary tables, but there are of course many other potential uses of these tables.
 
First, before determining how you might take advantage of the dictionary views, it is important to know the contents and structure of the tables. To determine the contents of a dictionary table with PROC SQL, the DESCRIBE TABLE statement can be used.
 
For example, to see the contents of a dictionary table, you simply use DESCRIBE TABLE followed by the predefined libname “dictionary”, and the name of the dictionary table you would like more information about. In this example, we would like to know the contents of the dictionary table “tables”. 

proc sql;
 describe table dictionary.tables
 ;
quit;

After running the code above, the DESCRIBE TABLE statement produces a list of the columns, their type, length, as well as any available labels in the SAS log as shown here:

The DICTIONARY.TABLES table is useful for getting a high level overview and attributes of all datasets available in your SAS session.
 
For the first example, the dictionary tables will be used to determine the number of observations found in each dataset of the SASHELP library. To help us determine which datasets have the most observations within SASHELP, the results will then be ordered from most observations to least observations.
 
To generate a list of datasets and the number of observations, we start with a SELECT statement, which indicates that we would like to keep the memname (the dataset name) and nobs (the number of observations) from the DICTIONARY.TABLES table.
 
Next, we add a WHERE statement to indicate that we would only like to keep those records with memtype = ‘DATA’ (i.e. list SAS datasets in the SASHELP library and ignore any SAS views).
 
Then we add an ORDER BY statement to indicate that we would like to order the results by the variable nobs. Finally, the “descending” option is included with the ORDER BY, to sort the output from records with most observations to least observations.

proc sql;
 select memnamenobs from dictionary.tables
  where memtype = 'DATA'
  order by nobs descending
 ;
quit;

After running the above code, the following output is produced in the results window:

*Note that this is only a partial screenshot of the output, there will be additional rows in your output.

It is also worth noting that the SQL dictionary tables have equivalent SASHELP views, which are accessible using Data Step code in Base SAS through the the pre-defined library SASHELP. A complete list of SQL dictionary tables and the corresponding SASHELP views are shown below:

Master SAS in 30 Days

0 0 votes
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments
iconmail

Get latest articles from SASCrunch

SAS Base Certification Exam Prep Course

Two Certificate Prep Courses and 300+ Practice Exercises