When did PySpark become THE modernization path for legacy SAS?

Back in 2015, when my colleagues and I first started reading about and experimenting with Apache Spark and PySpark, we knew there was something special brewing in open analytics. Having been in the data and analytics world since 1997, I’ve seen lots of trends come and go. Trendy technologies abound: R, Teradata, Netezza, Hadoop, and our latest trendy poster boy, Snowflake. Snowflake, it seems, is finally melting down, at least based upon investors’ reactions to its latest dismal financials. These are just a few of the trendy technologies of the day. All of them blew over, for various reasons. But with Python and Spark, I saw something totally different, unique and almost magical.

Of course, the magic I describe is not something that simply appears out of thin air. It’s the combination of thousands of developers, data engineers, and data scientists all working together with a common goal: to create the next-generation analytics platform that is simple to use, feature rich, fast and scalable. We’re not talking about some off-the-side-of-your-desk volunteer effort either. This technology was, and still is, being created and supported by the world’s largest and most tech-savvy companies. With thousands of companies and individuals contributing, it’s impossible for any proprietary analytics platform to keep up. Once the ball started rolling, it was only a matter of time until it dominated the analytics landscape.

There are so many use cases for PySpark that go far above and beyond what can be accomplished in legacy SAS. Anything to do with streaming, graphs, transactional, unstructured, semi-structured or nested data, NLP, or Deep Learning models is just about impossible to do using legacy SAS. Structured Streaming, GraphFrames, Spark-NLP, Delta lakes: I’d love to talk all about those, but let’s focus just on what legacy SAS users can currently do.

With version 3.4 due out any day, Apache Spark is not only a viable replacement for nearly all SAS use cases, it meets and exceeds SAS’ capabilities in every respect. Combined with our industry-proven SPROCKET Runtime, you have a complete and user-friendly solution to SAS-based workloads. But when did that tipping point actually occur, and when did Spark become the de facto platform for the future of analytics? The answer is both “it depends” and “right now”.

Spark 1 Era – (2014 – 2016)

In 2015, Apache Spark was still in its early days, but it was clear something incredible was in motion. Starting in versions 1.3 and 1.4, two critical features were coming into view: SparkSQL and the Spark DataFrame API. These were the key features that not only enabled simple, user-friendly APIs, but also became the foundation of almost unfathomable scalability and performance gains, as much as 100x faster than most legacy tools. For us at WiseWithData, that’s when we knew that the time was right to start building a solution to help customers migrate to this amazing new world.

Appropriate SAS-based use cases in those early days revolved around solving for SAS’ 50-year-old Achilles’ heel: performance and scalability. Because SAS9 is single threaded, processing times can quickly become ridiculously long as the size of the data grows. Feature-wise, Spark was still lacking many capabilities that would be critical to many SAS customers. But over the next decade of development, we would witness tens of thousands of new features and capabilities being added to Apache Spark and PySpark.

Spark 2 Era – (2016 – 2019)

Spark 2.0, released in the summer of 2016, was what I considered to be the first release of Spark that was really appropriate for many SAS users. There were massive performance gains via “whole-stage code generation”, but more importantly, Python really became a first-class citizen. Python, as we all know now, is the lingua franca of Data Science & Engineering (and indeed all scientific computing). The Spark community really took that message to heart and implemented near-perfect feature parity with Scala, the base language of Spark. Back in the early days of Spark, it wasn’t so obvious to many in the community that Python would become the dominant language for using Spark. V2.0 was also the first release to steer users away from RDD concepts and fully invest in a DataFrame-focused future. Support for ANSI SQL was another important milestone on the way to broad-based adoption.

Subsequent releases in the V2 series were mainly refinements to the developments started within V2.0, with the bulk of the changes being under-the-hood enhancements. But from an end-user perspective, there are a few notable changes in the V2 series to discuss.

Releases 2.1-2.4 brought in a whole lot of scalable versions of ML and statistics-based capabilities that align closely with features in SAS/STAT. Of course, with PySpark you always have the ability to push processing to any Python ML library (scikit-learn, SciPy, statsmodels, etc.), but those aren’t high performance and scalable like those in the native PySpark ML library, so they have limited usefulness in the enterprise realm.

Spark 3 Era – (2020+)

Version 3 is where things start getting really interesting. In this post, I’ve deliberately tried not to go down the rabbit hole of how all of Spark’s magic really happens, because I wanted to keep the focus on the end-user functionality. But I do have to take a side trip here, because it directly impacts SAS use cases.

PySpark is the undisputed king of high-performance, scalable big data computing, but that came at a cost. Prior to V3, it was actually pretty bad when it came to dealing with “small data”, the stuff many SAS users process all the time. Many small SAS workloads converted into PySpark actually performed far worse, not an ideal situation for an engine that is commonly understood to be 100x faster than SAS9. This is where the Adaptive Query Execution (AQE) engine kicks in. AQE uses statistics gathered while the query runs to see the actual size of the data being processed, and adjusts the execution plan on the fly. It’s kind of like a route planner that adapts to current traffic patterns in real time. While not perfect, AQE effectively closes the gap on the small data issue. So PySpark is now the best engine for processing data, BIG or small.

One of the most exciting features for SAS users in Spark 3 is the DataFrame transform() method, which brings a concise syntax for chaining custom transformations (think SAS Macros, but way more powerful). Now you can build entire pipelines that fit within other pipelines in a really modular way. For WWD, this allowed us to drastically simplify many code generation paths, delivering on a core design principle of having a 1-to-1 conversion experience for our customers. It also paved the way for us to be able to build the award-winning RIPL API, which brings a full-featured implementation of the SAS datastep API directly into PySpark.

Subsequent releases of V3 brought in many new features that made life easier. Project Zen brought far better and more Pythonic integration for PySpark with Python. Of particular note, error messages in PySpark are now much improved. Gone are the days of horrible, long Java stack traces. Now you get nice, clean Python exception messages which describe the issue and make resolution a snap. A project to allow you to use the Pandas API on top of PySpark DataFrames was also merged, furthering the idea of PySpark as the unified analytics platform.

Spark 3.4 – The Now Era

Now we come to the pending release of Spark 3.4, which is what triggered my idea for this post. But before we get to the big headline, let’s just recap a few of the other features coming in this release that are relevant to SAS users. As usual in minor releases, there are many new higher-order functions to make your code more concise and performant. One notable addition is that PySpark finally gets melt() functionality for data reshaping (in addition to pivot). Now there’s a complete implementation of the features of proc transpose, with better performance, simpler syntax, and without the limit of transposing just one series at a time. There’s also much improved coverage of the Pandas API on PySpark.

Now onto the really exciting stuff: Spark Connect. If that name sounds familiar, you might just be having déjà vu. Isn’t there a SAS Connect? Indeed there is, and it also formed the foundation of the much-maligned SAS Grid architecture. The role that Spark Connect plays is actually quite similar to the SAS incarnation, now more than 25 years old, but with significant improvements.

Based on a client-server architecture, Spark Connect is designed to allow any plain Python session (or another PySpark session) to connect to the driver on a Spark cluster. This differs from the traditional way that PySpark works, where your PySpark driver session has to do double duty: it acts as the interactive Python REPL session, and it also coordinates the work done by the Spark workers/executors and collects their results. With Spark Connect, you have both a local “client” and a remote “server” side that are able to interoperate in a seamless way. I won’t get into the technical details, but basically the local session turns your DataFrame operations into unresolved logical plans, and then passes them over gRPC to the “server” side driver to analyze and execute.
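As a rough sketch of what the client side looks like, assuming PySpark 3.4+ installed with the Connect extras (`pip install "pyspark[connect]"`) and a Spark Connect server that is already running and reachable. The helper names `connect_url`, `open_remote_session` and `demo`, and the host used in the usage note, are hypothetical. `SparkSession.builder.remote()` and the `sc://` URL scheme are the actual entry points, and 15002 is the default server port.

```python
def connect_url(host: str = "localhost", port: int = 15002) -> str:
    """Build a Spark Connect URL; 15002 is the default Spark Connect port."""
    return f"sc://{host}:{port}"


def open_remote_session(url: str):
    """Open a thin client session against a running Spark Connect server."""
    # Imported lazily so the URL helper works even where PySpark is not installed.
    from pyspark.sql import SparkSession

    return SparkSession.builder.remote(url).getOrCreate()


def demo() -> None:
    """Assumes a Spark Connect server is listening on localhost:15002."""
    spark = open_remote_session(connect_url())
    # The plan is built locally, then sent to the remote driver to execute.
    spark.range(5).show()
```

Calling `demo()` only works against a live server. Pointing the same code at a cluster is a matter of changing the URL, for example `connect_url("spark.example.internal")`, which is about as close as it gets to the local-session-plus-remote-server habit that SAS Connect users already have.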

The community has been working around the limitations of the current driver/worker architecture for a very long time. There were a number of attempts to bridge the gaps, such as Livy and Toree, but they came up short of an optimal solution. Spark Connect now properly decouples the Python session of the user from Spark itself, providing better process isolation and allowing easier upgrades. Many SAS users are accustomed to having access to a local SAS session while being able to push heavier workloads to SAS servers/grids via Connect. The big advantage for those users is that the same architecture and use cases are now possible with Spark Connect in PySpark, and it’s easier to use than rsubmit.

SAS users hesitant to make the shift to a modern platform have lost their very last excuses not to modernize. PySpark in 2023 is now easier, safer, faster, cheaper, more scalable and more capable than SAS, not just by a little, but by 10 – 100 times! SAS, while a solid platform for over 52 years, has clearly passed its best-before date. Now is the time to make the switch and gain the benefits of modern open analytics.

Learn more about how PySpark and SPROCKET can help you modernize your data science and data engineering right now – hello@wisewithdata.com