Distributed Computing with Spark

As computer clusters scale up, data flow models such as MapReduce have emerged as a way to run fault-tolerant computations on commodity hardware. Unfortunately, MapReduce is limited in efficiency for many numerical algorithms. We show how new data flow engines, such as Apache Spark, enable much faster iterative and numerical computations, while keeping the scalability and fault-tolerance properties of MapReduce. In this tutorial, we will begin with an overview of data flow computing models and the commodity cluster environment in comparison with traditional HPC and message-passing environments. We will then introduce Spark and show how common numerical and machine learning algorithms have been implemented on it. We will cover both algorithmic ideas and a practical introduction to programming with Spark.

Speaker Bio

Reza Zadeh is a Consulting Professor of Computational Mathematics at Stanford, and technical Advisor at Databricks. He focuses on Discrete Applied Mathematics, Machine Learning Theory and Applications, and Large-Scale Distributed Computing. More information available on his website: http://stanford.edu/~rezab/

Stop Thinking, Just Do!