Master SAS in 30 days!

A jargon-free, easy-to-learn SAS base course that is tailor-made for students with no prior knowledge of SAS.

RETAIN Statement Tutorial

Are you struggling to complete complex data manipulations and data step calculations across observations? Do you need a way to calculate counts, subtotals and cumulative totals across observations and within groups of observations? Would you like to better understand how to use the RETAIN statement with FIRST. and LAST. processing?
 
This article will demonstrate how the RETAIN statement works and provide a series of examples with varying complexity to help you get the most out the RETAIN statement.
 
In particular, this article will cover: 
  1. Retain Statement – Basic Usage
  2. Retain Statement with BY Groups
    1. FIRST. and LAST. Processing
    2. Calculating cumulative values within BY Groups
    3. Creating sequential numbers within BY Groups
  3. Re-ordering Variables in Data Sets
 

Software

Before we continue, make sure you have access to SAS Studio. It's free!

Data Sets

The SASHELP.PRDSALE dataset used in this article contains furniture sales data. To simplify the examples, the code below is used to create a subset of PRDSALE by keeping only those products with the name “SOFA” sold in the country “CANADA”. The variables REGION, DIVISION, PRODTYPE, PREDICT and QUARTER are dropped in the WORK.PRDSALE_CDN_SOFA data set created.
 
Any time the article refers to the PRDSALE_CDN_SOFA dataset, simply run the code below each time to create it:

data prdsale_cdn_sofa;
 set sashelp.prdsale;
  where country = 'CANADA' and product = 'SOFA';
 
 drop region division prodtype predict quarter;
run;

After running the code above, we will also sort the PRDSALE_CDN_SOFA dataset by MONTH and ACTUAL for the purposes of the examples in this article:

proc sort data = prdsale_cdn_sofa;
 by month actual ;
run;

You should now have a data set in your SAS work directly that looks like the following:
 

Retain Statement – Basic Usage

In the most general terms, the RETAIN statement is used to keep (retain) values from one observation to the next during iterations of the SAS Data Step. By default, SAS initializes variable values to missing after each iteration of a Data Step until a value is assigned through an INPUT statement or other assignment statement. The RETAIN statement overrides this default behaviour, and retains the value from the previous observation before completing the next iteration of the Data Step.
 
Let’s walk through a simple example to see how RETAIN works. In this example, we will demonstrate how to calculate the cumulative profit after each month of sales in a simple data set which is generated below.
 
The following syntax is used to create the data set PROFITS which contains two variables. The variable MONTH contains the month of sales (numbered from 1 to 12) and the variable PROFIT contains the profits for each given month. The goal is to create a third variable, CUMULATIVE_PROFIT, which contains the cumulative profits from the first month up until the 12th month. So, for example, the cumulative profit in the 3rd month should be the sum of the 1st, 2nd and 3rd months and the cumulative profit in the 4th month should be the sum of the 1st through 4th months.
 
First, let’s create the dataset PROFITS by running the following code:

data profits;
 input month profit;
 datalines;
 1 12451
 2 54325
 3 43514
 4 13455
 5 45161
 6 54151
 7 54261
 8 43251
 9 43515
 10 83711
 11 45236
 12 54361
 ;
run;

After running the code above, you should now have a dataset that looks like the following:
 
Without using the RETAIN statement, let’s see what happens when we use the following code in an attempt to sum the values of profit and cumulative profit:

data cumulative_profit;
 set profits;
 cumulative_profit = sum(profit,cumulative_profit);
run;

As mentioned, SAS simply resets the values of CUMULATIVE_PROFIT to missing for each observation, and the final resulting values of cumulative profit are exactly equal to the values of profit, which is not our desired result. You can see this in the output data set, as shown below:

By simply adding the RETAIN statement to tell SAS to retain the values of CUMULATIVE_PROFIT, we can calculate the cumulative profit by adding the next month’s profit to the previous month’s profit with each Data Step iteration:

data cumulative_profit_retain;
 set profits;
 retain cumulative_profit;
 
 cumulative_profit = sum(profit,cumulative_profit);
run;  

As you can see in the Output Data, we now have the desired result with cumulative profits calculated for each month:

By default, the variable specified in the RETAIN statement will be initialized to the first observation encountered. Depending on the use case, you can modify the RETAIN statement to initialize your retained variable to a different starting value of your choice.
 
Say for example you’d like to start your cumulative profit calculations at -500 to account for $500 of upfront costs. To initialize the variable you are going to retain to a custom value, simply add the desired value after the variable name in your RETAIN statement, as shown below:

data cumulative_profit_retain2;
 set profits;
 retain cumulative_profit -500;
 
 cumulative_profit = sum(profit,cumulative_profit);
run;      

Now, in the Output Data set shown below, you can see that the cumulative profits start at 500 less than in the previous example:

Do you have a hard time learning SAS?

Take our Practical SAS Training Course for Absolute Beginners and learn how to write your first SAS program!

Retain Statement with BY Groups

While the RETAIN statement can be useful for calculations across an entire data set, using RETAIN with BY Group processing can allow you to tackle even more complex data manipulation tasks.
 
Before we can take full advantage of the RETAIN statement, it is important to understand the FIRST. “first dot “ and LAST.  “last dot” variables in SAS. The FIRST. and LAST. Variables are temporary variables created by SAS during DATA step programming but are not added to the output dataset. The FIRST. and LAST. variables can be used to easily identify both the first and last observation within a BY group.
 
The values of the FIRST.variable and LAST.variable can be only take the values of either 1 or 0. The value of FIRST.variable will be 1 for the first observation within the BY group and the value of the LAST.variable will be 1 when the last observation within a BY group is encountered. All other observations will have a value of 0 for the FIRST.variable and LAST.variable
 
If you have not done so already, use the syntax at the beginning of the article to create the WORK.PRDSALE_CDN_SOFA dataset. When creating BY groups with the BY statement, it’s critical that our input dataset is sorted correctly. Here, we will sort the PRDSALE_CDN_SOFA dataset by MONTH since we are creating monthly BY groups but we also sort on ACTUAL so that the first value of actual sales (ACTUAL) within each month starts at the smallest value and the last value of actual sales is the largest value. 

proc sort data = prdsale_cdn_sofa;
 by month actual ;
run;

After running the PROC SORT code above, you can see in the Output Data shown partially below that for each month, there are multiple observations for SOFA, with varying amounts of actual sales (ACTUAL):

Note that the sorting of ACTUAL is not necessary in these examples, but it will help to better illustrate the RETAIN behaviour in the examples that follow.
 
To demonstrate how the FIRST. and LAST. variables work, we will create a new variable, MONTH_ORDER which will have a value of “FIRST” for the first observation within each BY group and a value of “LAST” for the last observation within each BY group.
 
In the syntax below, MONTH is used as the grouping value in our BY group, as defined with the BY statement immediately after the SET statement. The FIRST.month and LAST.month variables are then used to create the new variable, MONTH_ORDER as described above. When FIRST.month = 1 SAS has encountered the first observation in the BY group and when LAST.month = 1 SAS has uncounted the last observation.
 
Note this code uses the WORK.PRDSALE_CDN_SOFA data set created at this beginning of this article and also applies the sort procedure to ensure the input dataset is correctly sorted before creating our BY groups:

proc sort data = prdsale_cdn_sofa;
 by month actual ;
run;
 
data first_last;
 set prdsale_cdn_sofa;
  by month;
 
 if first.month = 1 then month_order = 'FIRST';
 if last.month = 1 then month_order = 'LAST';

 
run;

As you can see in the dataset shown partially below, we have correctly generated values for the new variable, MONTH_ORDER which clearly indicates the first and last observations within each monthly BY group:

To simplify your SAS programming with FIRST. and LAST. variables, you can OMIT the “ = 1” portion of the code, and simply write your code as follows:

data first_last;
 set prdsale_cdn_sofa;
  by month;
 
 if first.month then month_order = 'FIRST';
 if last.month then month_order = 'LAST';

 
run;

Running the code above should yield identical results to the previous example:

Calculating Cumulative Values within BY Groups

Now that you have a good understanding of FIRST. and LAST. processing, we can utilize the FIRST. and LAST. temporary variables together with the RETAIN statement to calculate cumulative values within BY groups.
 
For example, you would like to determine the cumulative actual sales amount within each monthly BY group so that the last observation within the BY group contains the total actual sales for that month and then resets the calculation for the next month.
 
First, let’s start by building our code without the retain statement to see what happens. As before, we will sort our dataset by MONTH and ACTUAL, and then SET the data set by MONTH to create our BY groups. To calculate the cumulative ACTUAL sales, we will initialize CUMULATIVE_ACTUAL to be equal to ACTUAL for the first observation within the BY group. We will then attempt to add the actual sales to this amount for all subsequent observations within the BY group. As before, we will also ensure our WORK.PRDSALE_CDN_SOFA dataset is correctly sorted before running our DATA Step code:

proc sort data = prdsale_cdn_sofa;
 by month;
run;
 
data prdsale_cdn_sofa_noretain;
 set prdsale_cdn_sofa;
  by month;
 
 if first.month then cumulative_actual = actual;
 else cumulative_actual = cumulative_actual + actual;
 
run;

As you can see in the Output Data set PRDSALE_CDN_SOFA_NORETAIN, the results are not what we are looking for as CUMULATIVE_ACTUAL is only equal to ACTUAL for the first observation and then missing for all remaining observations:
 
This occurs because unless otherwise specified, SAS actually resets the value to missing for CUMULATIVE_ACTUAL for each observation, resulting in a missing value for everything except the first observation within the BY group.
 
The solution to this problem is to add a RETAIN statement which tells SAS to RETAIN the values of CUMULATIVE_ACTUAL for each observation within the BY group:
 

data prdsale_cdn_sofa_retain;
 set prdsale_cdn_sofa;
  by month;
 
 retain cumulative_actual;
 
 if first.month then cumulative_actual = actual;
 else cumulative_actual = cumulative_actual + actual;
 
run;  

In the Output Data set shown partially below, you can see we now have the desired result of cumulative actual sales values within each monthly BY group:

Become a Certified SAS Specialist

Get access to two SAS base certification prep courses and 150+ practice exercises

Creating Sequential Numbers within By Groups

The combination of FIRST. and LAST. variables with RETAIN can also be used to generate sequential numbers within each BY group. For example, using RETAIN with the FIRST.month variable you can generate a sequential number from 1 to N within each monthly BY group.
 
In this case, we will start by creating a new variable, COUNTER, and tell SAS to retain that variable. Next, we will initialize the value of COUNTER to 1 at the start of each BY group using the FIRST.MONTH variable. For all remaining observations after the first observation, we will add 1 to the value of COUNTER, always retaining the previous value of COUNTER before adding 1. Here is what the syntax looks like:

data prdsale_cdn_sofa_counter;
 set prdsale_cdn_sofa;
  by month;
 
 retain counter;
 
 if first.month then counter = 1;
 else counter = counter + 1;
 
run;

By reviewing the Output Data set shown partially below, you can see that we have successfully created a sequential number, COUNTER, within each MONTH By group:

Once you know how to create the COUNTER variable, you can easily manipulate this dataset to tell you many observations are within each monthly BY group, without having to run a PROC FREQ, PROC MEANS or other method.
 
Using the SAS syntax from the previous example, we can add a statement that tells SAS when the LAST observation within each BY group is encountered that record should be output in the final dataset (and as a consequence, no other records will be output). A KEEP statement is also added to simplify the output:
 

data prdsale_cdn_sofa_freq;
 set prdsale_cdn_sofa;
  by month;
 
 retain counter;
 
 if first.month then counter = 1;
 else counter = counter + 1;
 
 if last.month then output;
 
 keep year month counter;

 
run; 

In the Output Data shown partially below, you can now easily see the LAST value of counter within each BY group, which indicates how many observations occur within each group. In this case, all BY groups are found to have 4 observations in each group:

Re-Ordering Variables in Data Sets

Another unique but helpful use of the RETAIN statement is to re-order variables in a dataset. While there are other methods to re-order variables in SAS such as the LENGTH or ATTRIBUTE statement, the benefit of using RETAIN to re-order a dataset is that it does not require you to modify the data like you would with a LENGTH or ATTRIBUTE statement.
 
If for example you would like to re-order the PRDSALE_CDN_SOFA dataset to have the variables YEAR, MONTH, PRODUCT , COUNTRY and ACTUAL in this order from left to right, you can use the RETAIN statement as follows:

data prdsale_cdn_sofa_order;
 retain YEAR MONTH PRODUCT COUNTRY ACTUAL;
 set prdsale_cdn_sofa;
run;

As you can see in the Output Data set shown partially below, the variables have now been re-ordered when viewing the data set in SAS:

It’s important to keep in mind that telling SAS to retain all these variables could produce undesirable results if you add additional data processing steps after the RETAIN and SET statements. When using RETAIN to re-order variables, it’s recommended to avoid adding any other data processing steps to a Data Step or exercise extreme caution to ensure you don’t unintentionally modify your data set.

Master SAS in 30 Days

4 1 vote
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments
iconmail

Get latest articles from SASCrunch

SAS Base Certification Exam Prep Course

Two Certificate Prep Courses and 300+ Practice Exercises