As I prepared for the Developer Certification for Apache Spark by Databricks and O’Reilly, I noticed that there weren’t many resources around, so I thought I’d collect and share the ones I used to prepare for the exam. Hope it helps!
I do a lot of different development on my laptop and I’m well and truly over debugging issues with software versions or missing libraries. These days I try to isolate development environments as either Docker containers or Vagrant images, which lets me run multiple versions locally. For Spark I use the Vagrant image from Gustavo here, which includes Spark and Zeppelin. If you want to run a later version such as 1.6, you’ll need to edit three of the install scripts: install-02-spark.sh to add the latest Spark and Hadoop versions, install-04.zeppelin.sh to add the latest Maven build, and finally install-99-cleanup.sh.
Follow the steps in the blog post linked above and you’ll have a functioning Spark test environment in a little under an hour. This is great for working through examples quickly, but it’s important to remember that the exam is aimed at people with experience running Spark in a production environment. That means you’ll need some time on a cluster to understand how Spark works.
The 5 Main Themes
- Understanding the breadth of the Spark API usage across Scala, Java and Python
- Applying Best Practices to avoid runtime issues and performance bottlenecks
- Distinguishing Spark features and practices from MapReduce usage
- Integrating SQL, Streaming, ML and Graph atop the Spark unified engine
- Solving typical use cases with Spark in Scala, Java and Python
The exam isn’t language specific and includes code examples in Scala, Java, Python and SQL. Some questions ask you to compare or identify equivalent Spark techniques across languages, but the exam is specifically about Spark: it doesn’t go into language nuances or require you to write any code, only to pick the best answer from several code blocks. You’ll need to be able to read all of these languages, so I’d suggest going through examples in multiple languages rather than just the one you’re most familiar with.
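To give a feel for the kind of snippet you’ll be asked to read, here is a minimal sketch of the classic word count written in plain Python so it runs without a Spark install. The helper name `word_count` and the sample lines are hypothetical, made up for illustration; the comments note the PySpark RDD call that each step roughly corresponds to. It is not taken from the exam or from any of the resources below.

```python
# Plain-Python sketch of a word count, mirroring the shape of a Spark RDD pipeline.
# Illustrative only: word_count and the sample data are hypothetical.
from collections import defaultdict
from itertools import chain


def word_count(lines):
    # Roughly what rdd.flatMap(lambda line: line.split()) does:
    # one input line becomes many words.
    words = chain.from_iterable(line.split() for line in lines)

    # Roughly what .map(lambda word: (word, 1)) does:
    # pair each word with a count of one.
    pairs = ((word, 1) for word in words)

    # Roughly what .reduceByKey(lambda a, b: a + b) does:
    # sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


sample = ["spark makes rdds", "rdds are lazy", "spark is fast"]
result = word_count(sample)
print(sorted(result.items()))
```

When you work through the real Spark version, try reading the same three-step chain in Scala and Java as well, since the exam expects you to recognise equivalent code across languages.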
You can take the exam either online or at a test center. O’Reilly use Kryterion test centers so you’re bound to find one near you. Note that taking it online has some prerequisites, and the online exam has some known issues and bugs. Please read this information for workarounds and support.
You’re given 90 minutes to complete the 40 multiple choice questions, although most people find they can complete the exam in an hour.
- Introduction to Apache Spark – Essentials to Get Started Running Apps by Paco Nathan. These videos are an almost five-hour introduction to Spark, from downloading and installing through some theory and building Spark apps. If you’re new to Spark this is a great way to get up to speed on Spark 1.3 quickly and easily. Paco’s style is easy going, and although some videos are very basic the series quickly moves on to more detailed examples. If you want to follow along with the examples, download the files here.
- Apache Spark – The SDK for All Big Data Platforms from Cassandra Summit 2014.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing – Matei Zaharia’s excellent RDD paper.
- Introduction to AmpLab Spark Internals – Matei’s Internals of Spark presentation.
- A Deeper Understanding of Spark Internals – Aaron Davidson’s presentation.
- Tuning and Debugging in Apache Spark – Patrick Wendell’s excellent internals and debugging presentation.
- Anomaly Detection with Spark – Sean Owen from Cloudera on anomaly detection using Spark.
- Spark Summit Training – both an introductory track and an advanced track. Slides, files and video are all available for download from the Spark Summit site. Make sure you’re comfortable with the materials from the advanced track if you’re preparing for the certification.
- Learning Spark. Make sure you’ve read and understood everything in this book and gone through the examples.
- Databricks Spark Reference Applications on GitBook.
- Databricks Spark Knowledge Base on GitBook. Good guide on best practices, certainly worth reading.
- Apache Spark Quick Start – Go through all the examples on the Apache site.
- Spark Interview Questions & Answers
- Spark Interview Questions
- Spark best practices