Not to date myself, but I’ve been using SAS for over 25 years. I’ve worked at banks like Capital One and Citibank, telecom companies like Verizon and Bell Canada, government institutions like Statistics Canada, and the company behind the SAS language itself. In all those years, one challenge was omnipresent: when the data gets large, as it often does, it becomes bitterly painful to manage. So, for much of my career, I’ve had to spend considerable time and resources on workarounds for managing big datasets.
These workarounds typically took the form of offloading data processing that ought to be done in SAS (from a usability perspective), pushing it down instead to databases and legacy MPP systems like Teradata. The whole point of using an analytics platform like SAS was that you learned one language, and that opened the door to many different data sources and analytical techniques. One ring to rule them all, so to speak. So having to learn how to push your work away from SAS always seemed backwards to me.
But there is one form of data processing that you simply couldn’t do inside any other language or database. It is the kind of processing that the SAS language is uniquely capable of handling, and arguably the secret sauce behind SAS’s nearly 50-year reign as a data processing king. I’m talking, of course, about the datastep language. Invented long before I was even born, datasteps let you plainly express business logic and combine many steps into a long data processing pipeline to avoid I/O.
There are many features in the datastep that are simply not available in other languages, including the modern industry standard, PySpark. These include the ability to define looping and conditional processing logic within each record, the ability to retain column values from previous records, the ability to use arrays to flexibly shuffle column values around, and my personal favorite: the ability to output records arbitrarily, at will.
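To make that per-record style concrete, here is a rough, hypothetical Python sketch of what it looks like: a loop that carries retained state from one record to the next and emits zero, one, or many output records per input record. The column names and business rules are purely illustrative, not taken from any real conversion.

```python
# A rough Python analogue of datastep-style, per-record processing.
# The schema and rules here are made up purely for illustration.
rows = [
    {"account": "A", "amount": 120.0},
    {"account": "A", "amount": -180.0},
    {"account": "B", "amount": 300.0},
]

output = []
balance = 0.0                      # a "retained" value carried across records
for row in rows:
    balance += row["amount"]       # per-record logic, free to loop or branch
    if balance < 0:                # conditionally emit an extra record
        output.append({"account": row["account"], "flag": "overdrawn"})
    output.append({**row, "balance": balance})  # output the current record
```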
Many of these features require that the data be processed iteratively, one record after the other. That makes them very difficult to scale, and it is the Achilles’ heel of the datastep API. SAS notionally solved this with the DS2 sub-language, but DS2 is so clumsy and convoluted that it is rarely used in the real world.
What we really needed back then was datasteps that just worked in parallel, with no extra API baggage. It’s something I have spent a great deal of time over the years trying to solve. In fact, I authored a paper back in 2010 that developed the foundation of a solution to the datastep scalability problem. That paper has been cited dozens of times over the years by banks, telcos, insurance companies, NGOs, and government institutions. While nowhere near as elegant as the general solution to parallelization in Apache Spark, developed by fellow Canadian Matei Zaharia, it does solve some of the challenges of large-scale data processing in SAS.
The datastep processing features we are discussing are especially incompatible with shared-nothing architectures like PySpark. PySpark and other MPP architectures achieve parallelization by assuming rows can be processed independently, and they have limited APIs for defining dependencies between rows. PySpark achieved a massive breakthrough in scalability compared with other data processing engines, but it did so by imposing other constraints. The most notable is that the DataFrame (and its underlying RDD) is an immutable data structure, and it incurs significant overhead when a column definition is changed dynamically. Combined, these constraints mean that many SAS datastep features are simply incompatible with PySpark DataFrames.
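As a minimal illustration of that constraint (not of any particular conversion), consider a simple running balance. A datastep would express it as a retained value updated row by row; in ordinary PySpark it has to be recast as a window expression, and each withColumn call produces a new, immutable DataFrame rather than modifying the existing one. The column names below are assumptions made up for the example.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical transaction data, just to show the shape of the code.
txns = spark.createDataFrame(
    [("A", 1, 120.0), ("A", 2, -180.0), ("B", 1, 300.0)],
    ["account", "seq", "amount"],
)

# A cross-row dependency (a running balance) cannot be written as
# row-by-row mutable state; it must be recast as a window expression.
w = (Window.partitionBy("account")
           .orderBy("seq")
           .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# withColumn does not mutate txns; it returns a brand-new DataFrame.
with_balance = txns.withColumn("running_balance", F.sum("amount").over(w))
with_balance.show()
```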
After many years of toiling, the breakthrough finally came late in 2020. The solution took form as the Row Iterative Processing Language, more commonly known as RIPL. I worked with our incredible R&D team to develop an entirely new data processing language within PySpark. By building our own API on top of PySpark, we were able to offer all the richness of the datastep language while simultaneously leveraging PySpark’s awesome performance and scalability. Apparently you can have your cake and eat it too.
The DataFrame API is still our target for most datastep code conversion; it is, after all, the simplest and most powerful way to express most data processing tasks. But we can now also offer group-by processing, retained columns, arrays, do loops, and more, all in a modern, simple programming language: Python. Personally, the day we got this all working was one of the highlights of my career, a day I’ll never forget. I was finally able to get closure on something that had plagued me personally, and indeed an entire community of SAS users, for decades.
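For the majority of steps that do fit the DataFrame model, the converted code stays very plain. The sketch below is a generic, hypothetical example of that style (ordinary PySpark, not the RIPL API), with made-up column names.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical transaction data, purely for illustration.
txns = spark.createDataFrame(
    [("A", 120.0), ("A", -180.0), ("B", 300.0)],
    ["account", "amount"],
)

# A typical converted step: a group-by summary expressed directly
# in the DataFrame API.
summary = (
    txns.groupBy("account")
        .agg(F.sum("amount").alias("total_amount"),
             F.count("*").alias("n_txns"))
)
summary.show()
```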
Want to learn more about the RIPL API? Please contact us at [email protected]