From the course: Data Engineering Foundations


Spark

- [Instructor] The other parallel computation framework we'll introduce is called Spark. Spark distributes data processing tasks between clusters of computers. But why do we need a tool like Spark? MapReduce-based systems tend to need expensive disk writes between jobs, while Spark tries to keep as much processing as possible in memory. In that sense, Spark was an answer to the limitations of MapReduce: its disk writes were especially limiting in interactive, exploratory data analysis, where each step builds on top of a previous one. Spark originates from the University of California, where it was developed at Berkeley's AMPLab, and the project is currently maintained by the Apache Software Foundation. Spark relies on a data structure called resilient distributed datasets, or RDDs. Without diving into technicalities, this is a data structure that maintains data which is distributed between multiple…
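
To make the RDD idea concrete, here is a minimal PySpark sketch. It is an illustration, not part of the course material: it assumes the pyspark package is installed, runs Spark locally, and uses made-up sample numbers. It shows how parallelize() turns a local collection into an RDD, how a transformation like map() is lazy and stays in memory, and how an action like collect() triggers the actual computation.

from pyspark import SparkContext

# Minimal sketch; assumes the pyspark package is installed.
# "local[*]" runs Spark on this machine using all available cores.
sc = SparkContext(master="local[*]", appName="rdd-example")

# parallelize() distributes a local collection across the cluster as an RDD.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# map() is a lazy, in-memory transformation; nothing executes yet.
squares = numbers.map(lambda x: x * x)

# collect() is an action: it triggers the computation and returns the results.
print(squares.collect())  # [1, 4, 9, 16, 25]

sc.stop()

Keeping the intermediate RDD in memory, rather than writing it to disk between steps, is exactly the difference from MapReduce described above.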
