As enterprises look to rationalize investments and re-architect their technology stacks, their data needs to be open and accessible, not locked into closed formats or proprietary systems. In the world of data science, Apache Spark stands out as the clear leader, with a unified analytics engine that delivers unmatched scalability, performance and productivity. Converting legacy SAS code to PySpark thus becomes a challenge, and one that many are attempting to solve with a brute-force approach.
Here’s why a brute-force SAS-to-PySpark conversion isn’t advisable:
Time: at best, a developer can convert 200-300 lines of fully validated code per day. With most large enterprises having hundreds of thousands of lines of code to convert, that can mean months, even years, of manual effort.
Cost: converting from SAS to PySpark manually requires at least two resources, a SAS expert and a PySpark coding expert. Add time for a business analyst to capture requirements, plus project management, IT staff and engineering, and the cost to migrate from SAS to PySpark can be steep. And when timelines slip, costs escalate further with SAS license renewals and annual increases.
Quality: when migrating code, consistency of approach, formatting and structure is key, which is especially challenging if you are relying on third-party outsourced expertise. When your code isn’t consistently converted, it forces PySpark re-engineering down the line.
Let’s not forget what I like to call the “functional gaps” between platforms. Let me explain what I mean. Assume you have a simple process of 100 lines of code to convert, and it includes a DATA step with a kurtosis() call that operates across multiple columns within a row. PySpark has a kurtosis function, but it works on a single column, aggregating across rows. A developer will have multiple options and will be forced to iterate, because SAS does not use the standard kurtosis calculation but one adjusted to minimize statistical bias, and there is no documentation to that effect. They’ll eventually figure out they need a Python UDF calling the scipy.stats kurtosis function, because that function has optional arguments that replicate the SAS version of the equation (a sketch of that workaround follows below). All said and done, those 100 lines just took a SAS and a PySpark developer 10 days to debug and get working correctly. Time is money.
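To make the gap concrete, here is a minimal sketch of that workaround: a PySpark UDF wrapping scipy.stats.kurtosis, using fisher=True and bias=False to match SAS’s bias-adjusted excess kurtosis. The column names (x1 through x4) and the toy DataFrame are purely illustrative, and scipy must be available on the executors; treat this as one possible approach, not a definitive implementation.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType
from scipy.stats import kurtosis

spark = SparkSession.builder.getOrCreate()

@udf(returnType=DoubleType())
def sas_kurtosis(*cols):
    # SAS KURTOSIS(x1, x2, ...) operates across a row's columns and uses
    # the bias-corrected sample formula; scipy.stats.kurtosis matches it
    # with fisher=True (excess kurtosis) and bias=False (bias adjustment).
    values = [c for c in cols if c is not None]  # SAS ignores missing values
    if len(values) < 4:
        return None  # the bias-corrected formula needs 4+ nonmissing values
    return float(kurtosis(values, fisher=True, bias=False))

# Illustrative data: column names and values are hypothetical.
df = spark.createDataFrame(
    [(1.0, 2.0, 4.0, 8.0), (3.0, 3.0, 3.0, 9.0)],
    ["x1", "x2", "x3", "x4"],
)
df.withColumn("kurt", sas_kurtosis("x1", "x2", "x3", "x4")).show()
```

Arriving at those two keyword arguments is exactly the kind of undocumented detail that eats up debugging days in a manual migration.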
Want an easier way to convert your SAS code to PySpark? WiseWithData has developed the world’s only fully automated SAS-to-PySpark migration solution. We call it SPROCKET: it’s fast, simple and accurate.
For more information, contact us at [email protected]