SAS is the Kapil Dev(Cricket) of computer programming languages. For my non cricket friends, what I mean is it was the best alrounder of its time. Google says ‘SAS language is a computer programming language used for statistical analysis’ but I bet someone has used it to program their refrigerator to tell them it is time to buy eggs. I mean you can pretty much do everything in it. But, just like Kapil Dev and Diego Maradona, SAS had its time. With bigger computers (in speed and computing power), cluster computing and cloud computing, people are skipping their breakfasts to get more done. They want something like Virat Kohli, Joe Root or Babar Azam (another cricket Reference ?).
Over the last few years, I’ve had the chance to learn SAS. And let me tell you, you can do a lot of pretty stupid good things in SAS that you shouldn’t (I think). Here are few of the most common issues.
Leaving the source dataset missing
In SAS, if you don’t specify the source data for a datastep, it assumes the last dataset as source for the current step. Here’s an example
Bad
/* Code 1 */
data lib1.table2;
set table2;
run;
data lib1.table3;
set;
run;
Better
/* Code 2 */
data lib1.table2;
set table2;
run;
data lib1.table3;
set lib1.table2;
run;
In Code1, in the first datastep (line 2-4), table1 dataframe is used as source and table2 dataset is getting created in lib1 library. In the 2nd datastep (line 6-8), the source dataset is missing. So, the last dataset created in the code i.e. lib1.table2 will be assumed as source. Code1 and Code2 may look different (line 7 and line 17) but their functionality is same. It gets crazier when there are multiple datasteps missing source dataset like Code3.
Bad
/* Code 3 */
data lib1.table2;
set table1;
run;
data lib1.table3;
set;
run;
data lib1.table4;
set;
run;
data lib1.table5;
set;
run;
Better
/* Code 4 */
data lib1.table2;
set table1;
run;
data lib1.table3;
set lib1.table2;
run;
data lib1.table4;
set lib1.table3;
run;
data lib1.table5;
set lib1.table4;
run;
Sprinkling in extra spaces
SAS is not strict with spaces in many cases and it gets pretty annoying for new developers. Here’s an example of SQL code from SAS which is valid but will fail in other programming languages.
Bad Spacing In PROC SQL
/* Code 5 */
PROC SQL;
CREATE TABLE table3 as
SELECT
t1. *,
t2.col1,
t2. col2,
FROM table1 t1
LEFT JOIN table2 t2
ON t1.id = t2.id;
QUIT;
In line 5 and line 7, there is an extra space between table alias and column name which is valid in SAS sql.
Bad Spacing In Function Calls
/* Code 5.1 */
data qwe;
set asd;
x = round (y);
z = round( z1);
run;
Code 5.1 is another good example of extra spaces in the code. It is also a valid code but the extra spaces in line 4 and 5 makes it hard to read.
Putting group by anywhere
In SAS, you can put a ‘group by’ statement without declaring any aggregate function in the select statement. It will simply ignore it if not needed. Here are 2 sample codes that will produce same output.
Bad
/* Code 6 */
PROC SQL;
CREATE TABLE table1 as
SELECT
col1,
col2,
col3
FROM table2
GROUP BY col1;
QUIT;
Better
/* Code 7 */
PROC SQL;
CREATE TABLE table1 as
SELECT
col1,
col2,
col3
FROM table2;
QUIT;
Using columns that are not in dataset
In SAS, you can use a column name in your code that is not even in the dataset. This can be done in calculations also. If used in a select statement, SAS will simply create the column in the final output and if used in calculation, it will just ignore it with a warning. Here’s a sample valid code where SAS will just ignore the keep statement for 3 columns that don’t exist.
Bad Reference To Columns That Don't Exist
/* Code 8 */
data table2;
set table1;
keep col4 col5 col6;
run;
Duplicate columns in SAS SQL
Multiple columns with same name can be declared in SAS SQL. It will only keep the last declared one. In the following example (Code9), SAS will ignore line 106 and will keep ‘(col3 * 5)’ as new_col’s value
Duplicate Columns In SQL
/* Code 9 */
PROC SQL;
CREAT TABLE table1 AS
SELECT
col1,
(col2 + col3) as new_col,
(col3 * 5) as new_col
FROM table2;
QUIT;
‘not is’ and ‘is not’ are the same but don’t mix and match
Same logic code can be written in different ways which could be a little confusing for new SAS users. In sample code Code10 and Code11, filter conditions are written differently but they mean the same thing. Pick one way and stick to it, my preference is to use the way that reads easier in plain english (i.e. “is not”).
Bad
/* Code 10 */
data table2;
set table1;
if col1 not is missing;
if col2 is not missing;
run;
Better
/* Code 11 */
data table2;
set table1;
if col1 is not missing;
if col2 is not missing;
run;
Abusing the retain statement
The actual use of retain statement is to retain the value of a variable from one iteration of the data step to the next that is created by an INPUT or assignment statement. But, it can be used for different purposes. Abusing the retain statement to reorder columns is a common problem. Retain can lead to subtle bugs and when used before the set/merge statement is different than after. Many developers don’t understand the difference. Use attrib or label to reorder columns, it’s way safer.
Macrofying everything
Another common mistake that I have seen many times is macrofying everything for no reason. I have seen the simplest code being turned into a macro which is not required and makes the code harder to understand. This overly complicated looking code is just sorting the data and this function can be easily performed by avoiding macro.
Over "macrofying"
/* Code 12 */
%macro sort_nodup(inds=, outds=, key=);
proc sort data=&inds. out=&outds. nodupkey;
by &key.;
run;
%mend;
Conclusion
These are few of the examples where I found SAS just too flexible with the coding styles. Because SAS is an over 50 year old language, there’s so many inconsistencies and many ways to do the same thing. Just because you can, doesn’t mean you should. In the modern world, Python and PySpark by contrast, have considerably stricter enforcement of good coding practices.