Spark
- [Instructor] The other parallel computation framework we'll introduce is called Spark. Spark distributes data processing tasks between clusters of computers. But why do we need a tool like Spark? MapReduce-based systems tend to need expensive disk writes between jobs, while Spark tries to keep as much processing as possible in memory. In that sense, Spark was an answer to the limitations of MapReduce. The disk writes of MapReduce were especially limiting in interactive, exploratory data analysis, where each step builds on top of a previous one. Spark originates from the University of California, where it was developed at Berkeley's AMPLab, and the project is currently maintained by the Apache Software Foundation. Spark relies on a data structure called resilient distributed datasets, or RDDs. Without diving into technicalities, this is a data structure that maintains data which is distributed between multiple…
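To make the RDD idea concrete, here is a minimal sketch in PySpark (not from the course itself). It assumes a local Spark installation; the app name and the sample data are hypothetical. It shows how a local collection becomes a distributed RDD, how transformations stay in memory and are lazy, and how an action finally triggers the computation.

```python
# Minimal RDD sketch, assuming PySpark is installed locally.
from pyspark import SparkContext

# "local[*]" runs Spark on all local cores; "rdd-demo" is a hypothetical app name.
sc = SparkContext("local[*]", "rdd-demo")

# parallelize() distributes a local collection across the cluster as an RDD.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# Transformations like map() are lazy and kept in memory; nothing runs yet.
squares = numbers.map(lambda x: x * x)

# collect() is an action: it triggers the computation and gathers the results.
print(squares.collect())  # [1, 4, 9, 16, 25]

sc.stop()
```

Note that the intermediate RDD (squares) never needs to be written to disk, which is exactly the advantage over MapReduce described above.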