
Advancing Spark - The Photon Whitepaper


3 January 2023


Article Source



We’ve been seeing what the Databricks Photon engine can do for quite some time now, and it clearly makes a dramatic difference to certain types of queries. But how does it actually work? What’s the secret sauce that provides the acceleration we see?

Earlier this year, Databricks released the Photon whitepaper, digging into the depths of the engine and the design choices behind it. In this extra-long video, Simon walks through the Photon whitepaper, discussing the design choices and attempting to explain the tech… warning: this is a long and nerdy video!

Photon: A Fast Query Engine for Lakehouse Systems

Abstract

Many organizations are shifting to a data management paradigm called the “Lakehouse,” which implements the functionality of structured data warehouses on top of unstructured data lakes. This presents new challenges for query execution engines. The execution engine needs to provide good performance on the raw uncurated datasets that are ubiquitous in data lakes, and excellent performance on structured data stored in popular columnar file formats like Apache Parquet. Toward these goals, we present Photon, a vectorized query engine for Lakehouse environments that we developed at Databricks. Photon can outperform existing cloud data warehouses in SQL workloads, but implements a more general execution framework that enables efficient processing of raw data and also enables Photon to support the Apache Spark API. We discuss the design choices we made in Photon (e.g., vectorization vs. code generation) and describe its integration with our existing SQL and Apache Spark runtimes, its task model, and its memory manager. Photon has accelerated some customer workloads by over 10X and has recently allowed Databricks to set a new audited performance record for the official 100TB TPC-DS benchmark.
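To make the integration point in the abstract concrete, here is a minimal PySpark sketch of the kind of Parquet scan-and-aggregate workload Photon targets. The path and column names are hypothetical; the point is that on a Photon-enabled Databricks cluster the same Spark DataFrame API is used unchanged, and the physical plan reported by explain() is where Photon operators would be expected to appear in place of the default Spark ones.

```python
# Minimal sketch (hypothetical dataset path and column names) of a Photon-eligible
# workload: a columnar Parquet scan followed by a SQL-style aggregation.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("photon-demo").getOrCreate()

# Scan a Parquet dataset (the columnar format called out in the abstract).
events = spark.read.parquet("/mnt/lake/events")  # hypothetical path

# A typical filter / group-by / aggregate pipeline, expressed through the
# ordinary Spark API that Photon supports transparently.
daily = (
    events
    .where(F.col("event_type") == "purchase")   # hypothetical column
    .groupBy("event_date")                      # hypothetical column
    .agg(
        F.sum("amount").alias("total_amount"),
        F.count("*").alias("num_events"),
    )
)

# Inspect the physical plan; with Photon enabled, Photon-prefixed operators
# typically replace the default whole-stage-codegen operators for the stages
# the engine takes over.
daily.explain(mode="formatted")
```

This is only an illustration of how Photon is consumed from the Spark side, not a reproduction of anything in the paper: the whitepaper itself covers why the engine is vectorized rather than code-generated, and how it plugs into the existing SQL and Spark runtimes.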
