At WiseWithData, our engineers have developed the most sophisticated automation on the planet to transpile code from one language to another. Weighing in at over 500,000 lines, SPROCKET is simultaneously a marvel of simplicity, elegance and complexity.
Complexity? Really? But I just said simplicity and elegance. It can actually be all three. The underlying problem we are trying to solve is a complete, ugly mess, hence the complexity. The SAS language was invented in the 1960s and evolved organically, without any proper design or software engineering best practices. Unlike modern languages, nothing was ever deprecated or removed from the API surface area, so it’s just layer upon layer of cruft. Our job is to unpack that mess and create simple, elegant and modern PySpark code from it, which is no small feat.
Adding to the challenge, a meta-language called SAS Macro sits on top of this antique language. Macro was invented to solve many problems of the underlying SAS language by allowing for reusability and parameterization of SAS code, but it does so in a totally backwards way. Modern software engineers would cringe at the way the SAS Macro language works.
We’re always working to simplify our solution and generate the best quality code possible. A conversation recently erupted within our engineering team about how to solve some specific issues with the way we handle the SAS Macro language. On the one hand, we identified a generalized way to solve the issue that will always work, regardless of the use case. On the other hand, that solution means the generated code will be far more complex (and more complex than typical human-written code).
Let’s take a look at a simple two-part example to highlight the trade-off:
/* part 1 */
%let x = ABC2022;
%let this_year = %substr(&x, 4, 4);
/* part 2 */
%if &this_year. = 2020 %then %do;
%put Bad year;
%end;
Let’s also assume that, as the developer translating part 2, we don’t know ahead of time what x and this_year hold (part 1 was defined elsewhere in the code).
A developer might naively translate the two parts as:
# part 1
x = 'ABC2022'
# %substr is 1-based: position 4, length 4 maps to the 0-based slice [3:7]
this_year = x[3:7]
# part 2
if this_year == 2020:
    print('Bad year')
Each segment looks like a reasonable translation on its own, but run together they do not behave as intended. Python, rightly so, won’t treat a string as equal to an integer: the == test between this_year (a string) and 2020 simply evaluates to False, so the branch can never run, and an ordering comparison such as < would raise a TypeError. Of course, we could generate a bunch of logic to check every type and cast accordingly, but that would significantly complicate the code. In this example we know from the context of the logic that both sides of the comparison should be integers, so we’ll just wrap an int() call around this_year. But what if the types on both sides of the comparison aren’t trivial to discover?
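To make this concrete, here is a minimal sketch of how Python treats the comparison in the example. The helper name sas_substr is hypothetical, introduced only to show how the 1-based position and length of %substr map onto a 0-based Python slice; it is an illustration, not code that SPROCKET emits.

```python
def sas_substr(text: str, position: int, length: int) -> str:
    """Hypothetical helper: emulate %substr's 1-based position and length."""
    start = position - 1          # SAS counts from 1, Python from 0
    return text[start:start + length]


x = 'ABC2022'
this_year = sas_substr(x, 4, 4)   # equivalent to x[3:7]

# Equality between a str and an int never raises; it is simply False.
naive_match = this_year == 2020

# Casting the string first makes both sides integers.
typed_match = int(this_year) == 2020

# Ordering comparisons between str and int do raise a TypeError.
ordering_fails = False
try:
    this_year < 2020
except TypeError:
    ordering_fails = True
```

The int() cast is the one-line fix discussed above; it assumes the macro variable always holds digits, which is exactly the contextual knowledge a human reviewer brings.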
Let’s bring this back to automation, where we are faced with a dilemma. Do we try to code for every possible situation, generating very complex Python code in the process? Or do we code it the way it would most commonly be written, and deal with errors for edge cases as they come up at runtime?
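The two options can be sketched side by side. The function macro_equals below is a hypothetical illustration of the "handle every case" style. It assumes a deliberately simplified rule: operands that both look like integers are compared numerically, and anything else is compared as text. It is neither SPROCKET's actual output nor a complete model of how SAS Macro evaluates expressions.

```python
import re

# Assumed rule for "looks like an integer": optional sign followed by digits.
_INTEGER = re.compile(r'[+-]?\d+')


def macro_equals(left, right) -> bool:
    """Hypothetical generalized comparison that accepts any operand types."""
    left_text = str(left).strip()
    right_text = str(right).strip()
    if _INTEGER.fullmatch(left_text) and _INTEGER.fullmatch(right_text):
        # Both operands are integer-like: compare numerically.
        return int(left_text) == int(right_text)
    # Otherwise fall back to a plain text comparison.
    return left_text == right_text


this_year = '2020'

# Option 1: defensive, works whatever the operand types are, but noisier.
general_result = macro_equals(this_year, 2020)

# Option 2: the common case, written the way a person would write it.
simple_result = int(this_year) == 2020
```

Option 2 raises a ValueError if this_year turns out not to be numeric, which is the kind of runtime edge case the second approach accepts in exchange for readable code.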
At WiseWithData, we believe very strongly in the latter approach. Generating clean, consistent and readable code is of paramount importance. Our customers already need to modernize millions of lines of code from a language that is ancient, verbose and complicated. The last thing we should be doing is exacerbating that situation by further complicating the code, simply to guarantee that the generated code will always work without modification.
This is one of the many reasons why our solution is offered as a service, not just a tool. The tool helps us deliver at incredible speeds, far exceeding the competition, but our skilled delivery team is there to ensure the code is always of the best quality possible. Everything in life is a tradeoff, but what’s most important is that you understand the tradeoffs you’re making.