Azure Batch (HPC): Introducing extreme data processing with Apache Arrow in Azure
Motivation
Data comes in many shapes, and sometimes technologies like Apache Spark are not always the best solution for certain processing tasks. What is common among almost all data technologies is that data must conform to specific shapes and patterns to make these technologies most effective.
However, sometimes data doesn’t conform, this isn’t as rare as it may seem. It could be many small files from IoT sensors, large amounts of audio or video, parallel ingestion from APIs, or even rendering the next animation movie hit. All these scenarios with more, can require some form of demanding processing.
High-Performance Computing (HPC)
HPC is a category of computing where workloads are allocated to pools of servers for parallel processing, which can be taken to the extreme. One difference between this and a tool like Apache Spark, which is also a form of HPC, is that Spark carries overhead costs for administration and task management. Spark generally works best with structured data, though it is possible to push beyond its boundaries using user-defined functions (UDFs), which can be costly in terms of performance. Still, in many cases, Spark is a good choice because it is a general-purpose technology, and users don’t have to be dedicated developers.
HPC often requires writing dedicated jobs. These jobs can be based on any language and framework, so the choice truly matters. While Python is an option, native compiled languages like C++, Rust, or Go can significantly improve performance.
Azure Batch
Azure Batch is one of Azure’s HPC Services. It is a quite simple Virtual Machine orchestrator. It has 3 main concepts:
- Pools: A collection of Azure VMs of own choice. For some workloads GPU-enabled VMs can be an advantage. All VMs can be configured as any VM in Azure. They should have a shared storage.
- Jobs: Is an allocation of a Pool, and is administration of Tasks. A job runs until it’s stopped. Tasks can be added at any time the Job is running.
- Tasks: A unit of work. The simple explanation of a Task, is that it is a wrapped bash/cmd script calling some installed code. The code is installed by so-called Application Packages, which is zip files with the file extension zip. Such installations are done before running Tasks.
Resource limits
Scale out and up possibilties are really extreme:
Easy CI/CD makes Azure Batch easy to use
CI/CD is very straightforward, and there’s almost no limit to how you can implement it. I have open-sourced my boilerplate setup for Azure Batch jobs on GitHub: https://github.com/ChristianHenrikReich/AzureBatchCICD
The repository contains a few files — some for provisioning Azure resources and configuring Azure Batch, and deploying a code example. It also creates one job and submits one task.
Selecting programming languages and frameworks
While unstructured and semi-structured data might benefit from specialized code, structured data can often take advantage of more general-purpose frameworks. I’ve had the pleasure of meeting Matt Topol a few times. Matt is a developer on the “unsung hero” (as he calls it) Apache Arrow project, and the last time I met him, I learned about ADBC.
Apache Arrow is a high-performance, in-memory format used by many technologies, such as Apache Spark, Polars, pandas, and DuckDB. Its power lies in its ability to share data between processes both locally and over the network without needing to serialize or deserialize (changing format). Just as importantly, Arrow has the ADBC specification, which enables databases like Postgres (that are ADBC-compliant) to be read from or written to with high performance.
It’s still possible to write Parquet and other formats to storage.
Additionally, there are official Apache Arrow modules for Go. One benefit of Go is that it compiles natively into a single binary, making deployment much easier. Recent support for containers on Azure Batch has further created possibilities for deployment processes.
Wrapping up
Having a tool like Azure Batch is a strong option when everything else doesn’t seem to perform. It’s easy to get started. If you take inspiration from my boilerplate setup, don’t be discouraged by the Go example — Python or any other language can be used.
Github: https://github.com/ChristianHenrikReich/AzureBatchCICD