The concept of dark data (information assets that organizations collect, process and store during regular business activities but generally fail to use for other purposes such as analytics, business relationships and direct monetization) isn’t new. Neither is the plethora of ML/AI applications being developed to “get access” to unstructured information.
What is relatively new, and what few people know or even talk about, is that Apache Spark, an open source unified analytics platform, handles both structured and unstructured data, at processing speeds unmatched by proprietary platforms. Considering we estimate that 80-90% of an enterprise’s big data is “dark data” or unstructured, Spark becomes a huge game changer, and that is in addition to it being open source and free.
At its core, Spark’s most basic abstraction is a data structure called the RDD (Resilient Distributed Dataset), which uses distributed computing to process unstructured data. RDDs offer various transformations for parsing and processing unstructured data, such as map, flatMap, filter, union and reduceByKey, to name a few. Alongside RDDs, Spark’s supported API models now include Datasets and DataFrames.
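To make the RDD transformations a little more concrete, here is a minimal PySpark sketch that counts words in raw, free-form text lines using filter, flatMap, map and reduceByKey. The helper names (`tokenize`, `word_counts`), the tokenizing rule and the local `local[*]` master are our own illustrative assumptions, not a prescribed approach, and the example assumes PySpark is installed on the machine running it.

```python
import re
from operator import add

# Illustrative rule: a "word" is a run of lowercase letters or apostrophes.
TOKEN = re.compile(r"[a-z']+")


def tokenize(line):
    """Lowercase a raw text line and split it into word tokens."""
    return TOKEN.findall(line.lower())


def word_counts(lines, app_name="dark-data-rdd-sketch"):
    """Count words across free-text lines with RDD transformations.

    Hypothetical helper for illustration; runs Spark in local mode.
    """
    # Imported here so tokenize() can be used and tested without Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()
    )
    try:
        rdd = spark.sparkContext.parallelize(lines)
        counts = (
            rdd.filter(lambda line: line.strip())  # drop blank lines
            .flatMap(tokenize)                     # one record per word
            .map(lambda word: (word, 1))           # pair each word with a count
            .reduceByKey(add)                      # sum the counts per word
        )
        return dict(counts.collect())
    finally:
        spark.stop()
```

Calling `word_counts` with a list of text lines, for example the bodies of a few emails, returns a plain dictionary mapping each word to the number of times it appears. In a real deployment the lines would more likely come from files or a stream than from an in-memory list.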
Together, these data structures yield a number of benefits, including better performance and space efficiency across Spark components. However, let’s not overlook a key advantage here: the ability to handle unstructured data of any type or format in your data analytics efforts.
While most organizations are still grappling with how to categorize or classify dark data, consider that Spark allows you not only to do this easily but also to start reaping the rewards of applying natural language processing and advanced analytics models, and of combining SQL, streaming and complex analytics, all on a single, API-enabled platform.
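As a rough illustration of categorizing free text and then querying it with SQL on the same platform, consider the sketch below. The keyword rules in `classify`, the `messages` view name and the `category_counts` helper are all hypothetical choices made for this example; a real solution would likely use a proper NLP model rather than keyword matching. It assumes PySpark is installed and runs in local mode.

```python
def classify(text):
    """Assign a toy category to a free-text message (illustrative rules only)."""
    lowered = text.lower()
    if "refund" in lowered or "complaint" in lowered:
        return "complaint"
    if "wire" in lowered or "transfer" in lowered:
        return "payment"
    return "other"


def category_counts(messages, app_name="dark-data-sql-sketch"):
    """Classify messages in Spark, then count them per category with SQL.

    Hypothetical helper for illustration; runs Spark in local mode.
    """
    # Imported here so classify() can be used and tested without Spark.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder.master("local[*]").appName(app_name).getOrCreate()
    )
    try:
        # Tag each raw message with a category using an RDD transformation.
        tagged = spark.sparkContext.parallelize(messages).map(
            lambda message: (message, classify(message))
        )
        # Give the tagged data a schema so it can be queried with SQL.
        df = spark.createDataFrame(tagged, ["text", "category"])
        df.createOrReplaceTempView("messages")
        result = spark.sql(
            "SELECT category, COUNT(*) AS n FROM messages GROUP BY category"
        )
        return {row["category"]: row["n"] for row in result.collect()}
    finally:
        spark.stop()
```

The point of the sketch is the hand-off: the same unstructured text moves from an RDD transformation into a DataFrame and then into a SQL query without leaving Spark.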
A few scenarios start playing out in my mind … like:
1) What if you could add video surveillance, social media, email and text data to your anti-money laundering solutions?
2) What if you could add customer text, social media and email to your CRM analytics?
3) How could you improve your call center if it were armed with customers’ text, email, phone messages and/or video?
A whole new paradigm of “big data analytics” possibilities starts to surface.
First, however, enterprises have to get their data off legacy systems and onto the Spark platform. This is where Wise With Data comes in: we’ve developed the world’s only SAS to PySpark migration solution. While moving your data to Spark is a key consideration, migrating your SAS code is a prerequisite. Our goal is to help you get into the world of Apache Spark. Want to learn more? Contact us at: [email protected]