Since the end of 2014, the number of Google searches comparing Apache Spark to Hadoop has been rising. What draws experts in Big Data, Data Science, and Data Analysis to Apache Spark (Spark)? Spark is a fast and expressive cluster computing engine compatible with Apache Hadoop, and it can be up to 10 times faster on disk and 100 times faster in memory. To understand where this speed comes from, let us compare Spark with a similar data processing framework: MapReduce.

MapReduce data processing actually involves five steps: Map, Sort, Combine, Shuffle, and Reduce. As the name suggests, Map and Reduce are the two most important of these. Consider the following:

  1. the Map step is a transformation: it turns input data into key-value pairs. The key is convenient because it identifies the value, determines where the value is routed during the Shuffle, and acts as the dimension along which values are later aggregated
  2. the Reduce step executes aggregations or folds; Map tasks and Reduce tasks each run independently and in parallel on different machines
  3. the Shuffle step is the most expensive of the five: it incurs a massive amount of network I/O, nearly three times the size of the dataset. Moreover, because one MapReduce job can have one and only one Map step and one Reduce step, the full dataset has to be persisted to HDFS after every job, which is also costly: three-way replication amounts to roughly three times the size of the dataset in disk I/O (the sketch below walks through this flow on a toy word count)
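
To make these steps concrete, here is a minimal, single-process Python sketch of the Map, Shuffle, and Reduce phases on a toy word count. It only imitates the flow; it is not the Hadoop API, and the function names and sample lines are ours for illustration.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: transform each input record into (key, value) pairs.
    # Here the key is a word and the value is a count of 1.
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values that share the same key.
    # In Hadoop this is the step that moves data across the network.
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Reduce: aggregate (fold) the values collected for each key.
    for key, values in grouped:
        yield key, sum(values)

if __name__ == "__main__":
    lines = ["spark is fast", "hadoop mapreduce is batch", "spark keeps data in memory"]
    counts = dict(reduce_phase(shuffle_phase(map_phase(lines))))
    print(counts)  # e.g. {'batch': 1, 'data': 1, ..., 'spark': 2}
```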

Spark is well known for its lazy evaluation model, which helps it avoid unnecessary loads and write-backs. Like MapReduce, Spark generally groups data processing into Map and Reduce steps, but with an important enhancement: a single job can contain a whole sequence of Map and Reduce steps, so intermediate results can be handed efficiently to the next operation instead of being written out. Even more relevant, Spark automatically pipelines these operations to optimize the order of the sequence, and it uses partitioning to avoid unnecessary Shuffle steps.

More importantly, Spark introduces an in-memory caching abstraction, which is much faster than disk but non-persistent. This innovation lets multiple operations access the same dataset. Users can indicate which datasets they want to keep in memory, and if a dataset is too big to fit, Spark loads what it can and uses disk storage for the rest (see the second sketch below). Datasets that are reused across several operations therefore do not have to be reloaded from disk each time, and fully cached data can improve speed by up to a factor of six. This is why, in some experiments, Spark improves performance by 10 to 100 times.
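
As a concrete illustration of the lazy, pipelined execution described above, here is a minimal PySpark sketch, assuming a local Spark installation; the application name and the numbers are placeholders. Nothing is computed until the final action, and the chained transformations run together in a single pass over the data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lazy-pipeline-demo").getOrCreate()
sc = spark.sparkContext

# Transformations are lazy: nothing is computed when these lines run.
numbers = sc.parallelize(range(1_000_000))
squared = numbers.map(lambda x: x * x)          # a Map-style step
evens = squared.filter(lambda x: x % 2 == 0)    # another Map-style step, pipelined with the one above

# Only an action (here, sum) triggers execution; Spark runs the whole
# pipeline in one pass, with no intermediate write to HDFS between steps.
print(evens.sum())

spark.stop()
```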

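The caching behaviour can be sketched in the same way. This is a minimal example, again assuming a local Spark installation; the HDFS path is hypothetical. `MEMORY_AND_DISK` tells Spark to keep as many partitions in memory as fit and spill the remainder to local disk, so later actions reuse the cached data instead of re-reading it from HDFS.

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-demo").getOrCreate()
sc = spark.sparkContext

# A dataset that several operations will reuse (the path is a placeholder).
words = sc.textFile("hdfs:///data/corpus.txt").flatMap(lambda line: line.split())

# Keep what fits in memory, spill the rest to disk.
words.persist(StorageLevel.MEMORY_AND_DISK)

# The first action materialises and caches the data ...
print(words.count())
# ... later actions reuse the cached partitions instead of re-reading HDFS.
print(words.distinct().count())

spark.stop()
```
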
To learn more about Apache Spark, please contact us at: [email protected]