The Spark SQL Optimizer and External Data Sources API
This meet-up is geared towards advanced users of Spark SQL, in particular those who are interested in contributing to the project. I will walk through the optimization workflow, explaining how Spark SQL automatically rewrites query plans to execute more efficiently. I’ll also preview the new external data sources API that is being added for Spark 1.2 and show how we can easily add support for reading new types of data.
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.