Spark SQL: Why the choice of language doesn’t impact performance

Christian Henrik Reich
3 min read · May 24, 2024

Introduction

At my work and in my communities, we mainly write Spark SQL code in Python and SQL. Other places might use R, Scala, or even .NET. Nothing wrong with that. One unique feature of Spark SQL is that the choice of language doesn't really matter with regard to performance. This lets teams pick a language that fits them, without paying a penalty.

Spark is mainly written in Scala, and Spark jobs run as Java bytecode on the Spark workers, so one might expect some performance gain from using Scala or Java with Spark SQL.

So let’s dissect why this is not the case.

Spark and Spark SQL

Spark SQL is a module within Spark. When Spark is mentioned without the SQL part, it refers to core Spark. Core Spark has no concept of data frames; it works with so-called RDDs (Resilient Distributed Datasets) and controls the cluster.
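To make the difference concrete, here is a minimal sketch contrasting the two APIs. The data and column names are invented for the example:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("core-vs-sql").getOrCreate()

# Core Spark: an RDD is a distributed collection of opaque objects.
# Spark only sees black-box Python functions; there is nothing to optimize.
rdd = spark.sparkContext.parallelize([("Ann", 34), ("Bo", 19)])
adults_rdd = rdd.filter(lambda row: row[1] >= 21)

# Spark SQL: a DataFrame carries a schema, and every transformation is
# recorded as a node in a plan that the Catalyst optimizer can reason about.
df = spark.createDataFrame([("Ann", 34), ("Bo", 19)], ["name", "age"])
adults_df = df.filter(df.age >= 21)
```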

Spark SQL has officially been a part of Spark since version 1.0.0. It brings many features to Spark, the best known being data frames.

At the heart of Spark SQL lies the Catalyst optimizer, which processes AST trees rather than interpreting and optimizing code from specific languages.

This key aspect makes Spark SQL language-agnostic regarding performance.

AST trees

Abstract syntax trees, also known as ASTs or AST trees, are a common technique within computer science, especially when building compilers. An AST is a data structure that represents the structure of a program.

As an example, here is a conceptual representation of how a SQL statement could be laid out in an AST tree. Once in tree form, it is no longer SQL or any other language; it is a data structure:
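Below is a minimal sketch in Python. The node classes are hypothetical and only loosely mirror Catalyst's logical operators:

```python
from dataclasses import dataclass

# Hypothetical node types, loosely mirroring Catalyst's logical operators.
@dataclass
class Scan:            # FROM people
    table: str

@dataclass
class Filter:          # WHERE age > 21
    condition: str
    child: object

@dataclass
class Project:         # SELECT name
    columns: list
    child: object

# SELECT name FROM people WHERE age > 21, laid out as a tree of plain objects
plan = Project(["name"], Filter("age > 21", Scan("people")))
```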

AST trees are easy to write optimization rules for, and optimization is typically about reducing a tree by applying such rules. AST trees are also easy to transform into other languages or into bytecode, such as Java bytecode, which can be distributed across the worker nodes of a Spark cluster.
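To hint at what a rule looks like, here is a toy rule reusing the node classes from the sketch above. Catalyst has a real rule called CombineFilters that does something similar, but as Scala pattern matches over plan nodes; this Python version is only an analogy:

```python
def combine_filters(node):
    # Toy analogue of Catalyst's CombineFilters rule: two stacked
    # Filter nodes are merged into a single Filter node.
    if isinstance(node, Filter) and isinstance(node.child, Filter):
        merged = f"({node.condition}) AND ({node.child.condition})"
        return combine_filters(Filter(merged, node.child.child))
    return node

tree = Filter("age > 21", Filter("name IS NOT NULL", Scan("people")))
print(combine_filters(tree))
# Filter(condition='(age > 21) AND (name IS NOT NULL)', child=Scan(table='people'))
```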

I will cover Catalyst optimization in more depth in a future blog post. For this blog, however, AST trees can be seen as an intermediate stop between your code and the tasks running on the worker nodes of the Spark cluster.

Processing data frames and building the AST trees

When a Spark SQL DataFrame is created, an AST tree is generated, regardless of the language used. Similarly, when transformations are performed on the DataFrame, nodes are added to that same AST tree, again regardless of the language. Throughout a Spark SQL script, one can even switch between languages, and it is still the same AST tree that gets extended.
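Here is a small sketch of that language switching, continuing with the df from the earlier example. The temp view name is invented for the illustration:

```python
# Start in Python ...
df.createOrReplaceTempView("people")

# ... continue in SQL: this extends the same logical plan ...
result = spark.sql("SELECT name, age FROM people WHERE age > 21")

# ... and finish in Python again.
final = result.orderBy("name")

# explain(True) prints the parsed, analyzed, optimized, and physical plans:
# one tree, with no trace of which language built which node.
final.explain(True)
```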

This gradual building of the AST tree, without any data being touched, is the lazy evaluation for which Spark SQL is known.

However, when an action like collect(), show(), write(), or display() is called, the language used to build the DataFrame is no longer in play. Instead, the optimized AST tree is turned into Spark tasks defined in Java bytecode. At this point, all traces of the source language and its features are gone (except possibly for a serialized Python UDF, which, in itself, is not a Spark SQL transformation).
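The UDF caveat is worth a short illustration, again reusing df from above. A built-in function stays inside the AST tree, while a Python UDF ships the function itself to Python workers, which does cost performance:

```python
from pyspark.sql.functions import udf, upper
from pyspark.sql.types import StringType

# A built-in function is just another node in the AST tree and ends up
# as optimized Java bytecode like everything else.
fast = df.select(upper(df.name))

# A Python UDF is the exception: the function is serialized and executed
# in a separate Python process, outside the optimized plan.
shout = udf(lambda s: s.upper(), StringType())
slow = df.select(shout(df.name))

fast.show()  # only here, at the action, is the tree optimized and executed
slow.show()
```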

In the end, all languages end up as Java bytecode, and it’s the Java bytecode that does the heavy lifting. This is why the choice of Spark SQL language doesn’t matter for performance.

Short discussion

I believe a team should use the languages it can agree on. I say languages, in the plural, because some things are easier to express in SQL, while others are easier in another language.

