RDD, DataFrame, and Dataset are the three most common data structures in Spark, and they make processing very large amounts of data easy and convenient. Because of Spark's lazy evaluation, these data structures are not computed right away when they are created or transformed. Only when they encounter an action does Spark actually traverse the data and run the computation.
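The snippet below is a minimal Scala sketch of this lazy behavior; the application name, master setting, and sample numbers are illustrative assumptions rather than anything prescribed by Spark.

```scala
import org.apache.spark.sql.SparkSession

object LazyEvalExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("lazy-eval-example")   // assumed name for this sketch
      .master("local[*]")             // run locally for illustration
      .getOrCreate()

    val rdd = spark.sparkContext.parallelize(1 to 1000000)

    // Transformations only record a lineage graph; nothing runs yet.
    val doubled  = rdd.map(_ * 2)
    val filtered = doubled.filter(_ % 3 == 0)

    // The action `count` triggers the actual traversal of the data.
    println(filtered.count())

    spark.stop()
  }
}
```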
Spark RDD – since Spark 1.0
RDD stands for Resilient Distributed Dataset. It is an immutable, partitioned collection of records and the fundamental data structure of Spark; its partitions can be shuffled, sent across nodes, and operated on in parallel. It allows programmers to perform complex in-memory analysis on large clusters in a fault-tolerant manner. RDD can handle structured and unstructured data easily and effectively, as it has many built-in functional operators such as group, map, and filter.
However, when the logic gets complex, RDD has a very obvious disadvantage: operators cannot be reused. This is because RDD knows nothing about the schema of the stored data, so the structure of the data is a black box, which forces the user to write a very specific aggregation function for each execution. Therefore, RDD is preferable for unstructured data and for low-level transformations and actions.
RDD provides users with a familiar object-oriented programming style, and because it is a distributed collection of JVM objects, it is compile-time type safe. Using RDD is very flexible, as it provides Java, Scala, Python, and R APIs. But RDD has a big limitation: it cannot be used with Spark SQL, and it does not benefit from Spark's built-in query optimizations.
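The following sketch shows the low-level, schema-free style of RDD work described above. It assumes the SparkSession from the previous snippet (for example, the spark value available in spark-shell), and the sample lines are made up for illustration.

```scala
// Hand-written RDD aggregation: the records are plain JVM objects, so the
// compiler checks their types, but Spark itself treats each record as a black box.
val lines = spark.sparkContext.parallelize(Seq(
  "spark rdd example",
  "spark dataframe example"
))

val wordCounts = lines
  .flatMap(_.split(" "))     // low-level transformation on raw strings
  .map(word => (word, 1))    // key/value pairs built by hand
  .reduceByKey(_ + _)        // user-written aggregation; no schema or optimizer involved

wordCounts.collect().foreach(println)
```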
Spark DataFrame – since Spark 1.3
DataFrame is a distributed dataset based on RDD that organizes the data in named columns; before Spark 2.0 it was a separate class in its own right. It is similar to a two-dimensional table in a relational database, so it introduces the database notion of a schema. Because of that, it can be treated as an optimization on top of RDD: unlike RDD, a DataFrame knows the structure of the stored data, which allows users to perform high-level operations. Accordingly, it handles structured and semi-structured data. Users can specify which operations to perform on which columns, which allows an operator to be applied to multiple columns and makes operators reusable.
DataFrame also makes revamping operations easier and more flexible. If a new requirement means another column must be included in an existing operation, users can simply write an operator for that extra column and add it to the operation, whereas with RDD the existing aggregation would need many changes. Compared to RDD, DataFrame does not provide compile-time type safety, as it is a distributed collection of Row objects. Like RDD, DataFrame supports various language APIs. Unlike RDD, DataFrame can be used with Spark SQL because it knows the structure of the data it stores, so it provides more functional operators and allows users to perform expression-based operations and UDFs. Last but not least, it can improve execution efficiency, reduce the cost of loading data, and optimize logical plans.
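Below is a short DataFrame sketch of the column-oriented, expression-based style described above. It assumes the same SparkSession as the earlier snippets, and the column names, sample rows, and the initial UDF are illustrative assumptions.

```scala
import org.apache.spark.sql.functions.{avg, col, udf}

// Named columns give Spark a schema, so operators can target specific columns.
val employees = spark.createDataFrame(Seq(
  ("Alice", "Engineering", 85000),
  ("Bob",   "Engineering", 92000),
  ("Carol", "Marketing",   70000)
)).toDF("name", "department", "salary")

// Expression-based operations that Spark's optimizer can rearrange.
val avgByDept = employees
  .filter(col("salary") > 60000)
  .groupBy("department")
  .agg(avg("salary").alias("avg_salary"))
avgByDept.show()

// A user-defined function (UDF) applied to a single named column.
val initial = udf((name: String) => name.take(1))
employees.select(col("name"), initial(col("name")).alias("initial")).show()
```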
Spark Dataset – since Spark 1.6
The Dataset API is an extension and enhancement of the DataFrame API. Externally, a Dataset is a strongly typed collection of JVM objects. Internally, Dataset has an untyped view called DataFrame, which since Spark 2.0 is simply a Dataset of Row. Dataset merges the advantages of RDD and DataFrame. Like RDD, it supports structured, unstructured, and custom data, and it provides users with an object-oriented programming style and compile-time type safety. Like DataFrame, it takes advantage of the Catalyst optimizer and lets users perform structured, SQL-style queries on the data, although its typed operations can be slower than the equivalent DataFrame operations. Unlike RDD and DataFrame, Dataset only supports Java and Scala APIs; APIs for Python and R are still under development.
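The sketch below shows this mix of typed, object-oriented operations and the untyped DataFrame view. It assumes the same SparkSession used above (in a compiled application the case class should be defined at the top level rather than inside a method), and Employee and its fields are made-up names.

```scala
import spark.implicits._

// A strongly typed record; the compiler checks field names and types.
case class Employee(name: String, department: String, salary: Long)

val ds = Seq(
  Employee("Alice", "Engineering", 85000L),
  Employee("Bob",   "Marketing",   70000L)
).toDS()

// Typed transformation on JVM objects, as with RDD, but still backed by Catalyst.
val raises = ds.map(e => e.copy(salary = e.salary + 5000L))

// The untyped DataFrame view of the same data, queried by column name.
raises.toDF().groupBy("department").avg("salary").show()
```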
To learn more about Apache Spark, please contact us at: [email protected]