Spark
- [Instructor] The other parallel computation framework we'll introduce is called Spark. Spark distributes data processing tasks between clusters of computers. But why do we need a tool like Spark? MapReduce-based systems tend to need expensive disk writes between jobs, while Spark tries to keep as much processing as possible in memory. In that sense, Spark was an answer to the limitations of MapReduce. The disk writes of MapReduce were especially limiting in interactive, exploratory data analysis, where each step builds on top of a previous one. Spark originates from the University of California, where it was developed at Berkeley's AMPLab, and the project is currently maintained by the Apache Software Foundation. Spark relies on a data structure called resilient distributed datasets, or RDDs. Without diving into technicalities, this is a data structure that maintains data which is distributed between multiple…
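To make the RDD idea concrete, here is a minimal sketch in PySpark (not from the course itself). It assumes a local Spark installation; the app name and the sample data are hypothetical. It shows how a local collection becomes a distributed RDD, how transformations stay in memory and are lazy, and how an action finally triggers the computation.

```python
# Minimal RDD sketch, assuming PySpark is installed locally.
from pyspark import SparkContext

# "local[*]" runs Spark on all local cores; "rdd-demo" is a hypothetical app name.
sc = SparkContext("local[*]", "rdd-demo")

# parallelize() distributes a local collection across the cluster as an RDD.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations like map() are lazy and kept in memory; nothing runs yet.
squares = numbers.map(lambda x: x * x)

# collect() is an action: it triggers the computation and gathers the results.
print(squares.collect())  # [1, 4, 9, 16, 25]

sc.stop()
```

Note that the intermediate RDD (squares) never needs to be written to disk, which is exactly the advantage over MapReduce described above.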