Apache Spark shuffles intermediate data between stages by writing it to local executor disks. In cloud-native deployments where compute and storage are disaggregated, this creates a hard dependency on local storage that limits elasticity and can become a performance bottleneck. The Spark S3 Shuffle Plugin removes this dependency by redirecting shuffle I/O to S3-compatible object storage.
I developed this plugin at IBM Research Zurich as an open-source project. The initial implementation was based on Apache Spark PR #34864.
## How It Works
The plugin replaces Spark’s default shuffle manager and data I/O layer. It implements the ShuffleManager and ShuffleDataIO interfaces, routing all shuffle reads and writes through the Hadoop FileSystem abstraction. This makes it compatible with multiple storage backends: AWS S3 (via S3A), IBM Cloud Object Storage (via Stocator), and local filesystems.
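Enabling the plugin amounts to swapping in its shuffle manager and I/O plugin and pointing them at a bucket. The keys and class names below are illustrative and should be verified against the repository's README:

```
spark.shuffle.manager               org.apache.spark.shuffle.sort.S3ShuffleManager
spark.shuffle.sort.io.plugin.class  org.apache.spark.shuffle.S3ShuffleDataIO
spark.shuffle.s3.rootDir            s3a://<bucket>/shuffle
```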
On the write path, shuffle output is buffered and written to object storage as data and index files. On the read path, the plugin uses a prefetching iterator with adaptive concurrency: it measures I/O latencies at runtime and dynamically adjusts the number of prefetch threads to maximize throughput.
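The adaptive behavior can be sketched as a simple feedback loop. This is a hypothetical simplification, not the plugin's actual Scala code: the class name, threshold, and step sizes are invented for illustration.

```python
# Hypothetical sketch of latency-driven prefetch concurrency control.
# Concurrency is probed upward while observed latency stays healthy and
# backed off when latency degrades, bounded by [1, max_threads].
class AdaptivePrefetcher:
    def __init__(self, max_threads: int = 32, target_latency_ms: float = 50.0):
        self.max_threads = max_threads
        self.target_latency_ms = target_latency_ms
        self.concurrency = 1

    def record_latency(self, latency_ms: float) -> int:
        """Adjust prefetch concurrency after each observed fetch latency."""
        if latency_ms > self.target_latency_ms:
            # Object store is slow: back off to avoid queueing more requests.
            self.concurrency = max(1, self.concurrency - 1)
        else:
            # Latency is healthy: probe for more parallelism.
            self.concurrency = min(self.max_threads, self.concurrency + 1)
        return self.concurrency
```

In the real plugin the loop runs inside the shuffle read iterator, so concurrency tracks the object store's current response times rather than a static setting.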
## Key Features
- Adaptive prefetching: The prefetch iterator monitors I/O latencies and automatically tunes concurrency between 1 and a configurable maximum, adapting to varying network conditions and object store response times.
- Prefix folding: Shuffle files are distributed across configurable prefix folders (`${rootDir}/${mapId % prefixes}/...`) to work around S3 per-prefix rate limits.
- Checksum validation: Optional ADLER32 or CRC32 integrity checks on shuffle data, with in-memory caching of checksums and partition metadata to reduce redundant reads.
- Configurable buffering: Write buffer sizes and per-task memory limits are tunable to balance memory usage against write throughput.
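The prefix-folding and checksum features can be illustrated in a few lines. This is hypothetical Python, not the plugin's Scala code; the function names are invented:

```python
import zlib

# Prefix folding: spread shuffle files over `prefixes` folders keyed by map
# task id, mirroring the ${rootDir}/${mapId % prefixes}/... layout above.
def shuffle_path(root_dir: str, map_id: int, prefixes: int, name: str) -> str:
    return f"{root_dir}/{map_id % prefixes}/{name}"

# Checksum validation: an ADLER32 digest of a shuffle block, as provided by
# zlib (CRC32 via zlib.crc32 works the same way).
def adler32_of(block: bytes) -> int:
    return zlib.adler32(block)

# Map task 23 with 10 prefix folders lands in folder 3, so S3 request load
# is spread across 10 key prefixes instead of one.
path = shuffle_path("s3a://bucket/shuffle", 23, 10, "shuffle_23_0.data")
```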
## Compatibility
The plugin supports Spark 3.2 through 3.5 on the main branch. Spark 3.1 is supported on a dedicated spark-3.1 branch. Spark 4.0 has not been tested but should work, given that the plugin targets stable Spark interfaces.
## Source Code
The project is open source under the Apache 2.0 license: IBM/spark-s3-shuffle on GitHub.