Moving your SAS libraries into a Lake House architecture using Apache Spark, Parquet and Delta

SAS Libraries

SAS supports a wide array of different library engines that provide a semantic layer between the underlying data source and SAS. The engine provides an abstraction between SAS code and the physical location of the data. In addition to the native SAS dataset format, they support access to a number of other file types including XML, XLS/XLSX(MS Excel). They also support many common databases, providing a translation layer to the native database SQL language. SAS supports passing user credentials to a database (a best practice), but also supports global credentials such as a system or shared account. Libraries that are able to be predefined with the SAS session so that users don’t have to deal with the complexity of managing the library setup. In this way, the physical location of the tables is not presented to users.

In addition to all the file and database engines. The SAS META (metadata) library engine enforces a specific metadata view of a on top of another library engine. The META library definition might be using the BASE engine to a access directory of SAS datasets, or it might refer to a DBMS source using the TERADATA or ORACLE engine. The tables/columns that a user sees in a library are defined as a Metadata construct, and can have access controls layered on top. It is on the users/administrators to ensure the library metadata is in sync with the physical data. This is a fundamental problem with META libraries, in that they do not maintain a proper link with the underlying physical storage, causing a wide variety of issues. Users can see tables/columns that no longer exist, and conversely they can miss tables/columns that exists, but that the Metadata has not been updated to reflect.

Many organizations have configured their SAS environment to make use of predefined libraries to many or all data sources in the organization, including databases. In the context of other databases, SAS libraries in essence act as a sort of lightweight Data Federation system. While useful in theory, the implementation of SAS’ library engine is so problematic that many SAS professionals recommend against the practice. They abstract out useful information about the underlying data that can be used to properly optimize queries. SAS’ query engine has virtually no optimization logic, so it doesn’t use modern Data Federation concepts like cost based optimization, column pruning, aggregate/predicate/filter push down, etc. Often, SAS needs to bring back the entire table into SAS from a database in order to join to a single column.

The problem with Data Federation systems

On the surface, Data Federation concepts are hard to argue with; they leave the data in place, eliminating duplication, and maintain the idea a single copy of your data. In years past, database courses taught that this is was good idea, so what’s the problem?

The notion that Federated Data systems eliminate data duplication is typically just a mirage. Often whole tables are repeatedly copied over network connections to satisfy different queries and processes. This duplication adds additional database and network strain during peak business hours. To add to the problem, at a reporting level, there’s no consistency in the data update timing. A query run a few minutes later than another one, might yield very different results, causing conflicts in end user reports.

Querying operational systems in an ad-hoc way can lead to all kinds of negative operational impacts. End-users often don’t understand that they are executing a query that could have major performance issues, such as the classic accidental Cartesian join problem. Most operational databases were not designed for analytical use cases, they are optimized for rapid Create/Read/Update/Delete (CRUD) transactions on a small number of records at a time.

Another challenge is that operational data can change form, and this can have serious implications for downstream analytical processes. Being able to create views of the data that are independent of the source systems is often a key strategy for handling changes in operational systems.

The search for something better

So if Data Federation isn’t the answer, what is? Historically this has been a complex and expensive issue to address. Many organizations spent millions on complex software platforms that promised to put data all in one spot, in nice organized tables, so it can used by all. In the early 2000’s analytical database vendors like Vertica, Netezza, and Teradata became all the rage. They claimed they could store all your data for analytics and you would never have to move data again. However, they were difficult to operate, they only stored structured data, they lacked key analytical capabilities, and were extremely expensive. The expense limited their use cases, and most organizations never realized the potential value of the solution.

Next came the open source Apache Hadoop in the the early 2010’s, promising to unleash cheep analytical storage to the masses. On top of the Hadoop Distributed File System (HDFS), Hadoop had analytical tools that could use the data in-place, such as Hive, Pig and Mahout, but they were really complicated.

The idea with Hadoop is that everything just gets dumped in the files system as a “data lake”, filed with unstructured files (often csv or json). You only structure what you want to actually use, when you want it. This strategy worked well for the Hadoop vendors like Cloudera and MapR for a while, since they sold much bigger platforms to their customers. Unfortunately, most companies ended up with data swamps, not data lakes. Users were overwhelmed, and without even some basic metadata, they couldn’t find what they needed. At the same time, their Hadoop platform storage requirements grew so big and so fast, files had to be zipped up, creating another barrier to the using the data.

A solution emerges

With the Hadoop compatible Apache Spark exploding on the scene in the mid-to-late 2010’s, change was afoot. Suddenly, data was easy to access in any format. In a single line statement you could read in data from all kinds of different sources, and automatically infer data schema, with incredible processing speed. Spark is 100’s of times faster than other data processing systems including SAS.

Spark also uses an incredible columnar storage format called Apache Parquet as it’s default format. Compatible with Hadoop and local/shared file-systems, Parquet is vastly more efficient that almost every other format, including SAS datasets. Parquet is able to store a variety of data including unstructured data, semi-structured data, binary data, and complex nested structures (like XML and JSON), all within a flexible but defined data schema. It also comes with a number of performance, functionality, and security related features such as partitioning, column statistics, indexing, data skipping, column pruning, partition level updates, schema evolution, and table / column level encryption. Combined together, these feature make Parquet fast and simple.

In a recent blog article, SAS’ R&D team admitted that Parquet is a staggering 25x+ more efficient at storing typical enterprise data than the SAS dataset format. In their comparison, an 11.6GB dataset in SAS was only 483MB in Parquet. This efficiency makes it possible to bring huge amounts of the enterprise data into your data lake without incurring the high storage costs associated with raw text files.

In addition to these software advancements, high speed Solid State Disk (SSD) storage technology has developed at an incredible pace in the last few years. SSD’s have developed so fast, their performance blew past the limits of traditional storage interfaces (SATA, SAS) which are now considered obsolete, and Non-Volatile Memory express (NVMe) is the de-facto standard for analytic storage. It is now possible to get 100TB SSD drives, and NVMe drives with read/write performance capabilities > 5GB/s, something unthinkable just 5 years ago.

With the confluence of these technological advancements, it’s finally possible to realize the dream of a perfect enterprise analytics data solution, one where data is available in one spot, and available to all. Data is stored in a way that is both efficient and accessible, with the ability to add file and column level access controls. But building the processes that transfer the data can still be difficult. Doing full copies of large tables daily, weekly or even monthly poses serious challenges. Updating just the changes (deltas) is a better option, but Parquet by itself won’t work, because it doesn’t support atomic transactions. A partition must be completely rewritten to make updates, and during that process the data is not in a readable state. Keeping the analytical data synchronized with the operational data can be challenging, requiring many additional process layers.

Delta – The icing on the Lake

Delta Lake was developed by much of the same team that developed Apache Spark. Delta was built to provide a transactional layer on top of Spark and Parquet. It leverages Parquet as the underlying data store, but provides additional capabilities. It is fully ACID compliant, meaning data updates can be guaranteed to be Atomic, Consistent, Isolated, and Durable. The key hear is that you can read a Delta table as it’s being updated.

Delta also supports some truly game changing features such as streaming updates, Time-travelling and Rollback, and Data Quality constraints. Time-travelling allows users to query the data “as of” a certain time. Incredibly useful in production settings, failures are able to be easily backed out and consistent data restored. It is also handy for analytical purposes, where model scoring processes can be “retro-scored” for re-grounding policies. It’s worth noting that Delta is not the only solution available in Spark; Apache Hudi and Iceberg also share many of the same features, but Delta is the most elegant and mature.

Data Lakes with Data Warehouses – Paradise at the Lake House

Bringing together raw data from operational sources into your Delta Lake is a great step, but to make the most of your data there’s more to be done. To make the data truly safe, accessible and easy to use, dimensional data warehousing concepts are still required.

The authors of Delta Lakes have laid out the concept of 3 layers of organizational data. Bronze tables contain raw extracts from source systems for power users and for downstream processing. Silver tables contain cleansed, filtered, augmented or re-organized data tables that are geared towards more general consumption. Finally Gold tables are business level aggregates that leverage dimensional data warehousing concepts such as fact and dimension tables. This make end user consumption fast and simple, with consistent analytical results.

The powerful combination of Spark, Parquet and Delta Lake makes this panacea an achievable end state for enterprises. With the simplicity of these solutions, this end-state is far simpler than you would expect. Of course our SPROCKET SAS to PySpark migration solution supports database and meta libraries concepts, but these are only bridges. In the cloud or on-premise, the preferred end-state solution is to leverage the the incredible power of Parquet and Delta tables as sources for converted data sources. Interested in knowing more, or looking for help to modernize your legacy SAS analytics with Spark, let us know by sending us an email: hello@wisewithdata.com