The sparklyr webinar describes how to use R and Apache Spark with the new sparklyr package from RStudio. You can access the presentation materials [here](sparklyr webinar 2016.pdf). The main sparklyr site also has a nice set of examples.
Three scripts are referenced in the webinar. If you install the sparklyr package, you can reproduce the first two in local mode. The third script runs only on a Spark cluster with preprocessed data loaded into Hive, so it is included here for instructional purposes only.
- Initialize a Spark connection and load data into it
- Run sparklyr using dplyr in local mode
- Analyze 1 billion records of NYC taxi data in a Spark cluster
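
The workflow covered by the first two scripts can be sketched roughly as follows. This is a minimal illustration rather than the webinar's own code: it assumes the sparklyr and dplyr packages are installed and that a local Spark installation is available (for example, one set up with `spark_install()`), and it uses the built-in `mtcars` data frame purely as a stand-in dataset. The names `mtcars_tbl` and `mpg_by_cyl` are made up for this example.

```r
library(sparklyr)
library(dplyr)

# One-time setup if no local Spark is installed yet (assumption: defaults are fine):
# spark_install()

# Open a connection to Spark running in local mode
sc <- spark_connect(master = "local")

# Copy an R data frame into Spark; the result is a remote table reference
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL and executed inside Spark;
# collect() brings the small summarised result back into R
mpg_by_cyl <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl) %>%
  collect()

print(mpg_by_cyl)

# Close the connection when finished
spark_disconnect(sc)
```

Only the final `collect()` moves data into the R session, so the same pattern should carry over to tables far larger than local memory, provided the summarised result stays small.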
For a complete set of scripts, see the sparkdemos GitHub repository.
We had a lot of great questions during the webinar and were not able to answer all of them. I have gone through and tried to answer each question below. If you have more questions, please post them to this Google group. If you have issues with the software, please submit them to the GitHub repos.