uber/petastorm


Scalable data loading library that enables deep learning frameworks (TensorFlow, PyTorch) to ingest data directly from Apache Parquet format, bridging the gap between Big Data (Spark/Hadoop) and ML training.

by uber

Published Jun 15, 2018

Utility: 7.0/10

Stars: 1,882

Forks: 285

Platform Domination: medium

Market Consolidation: high

Displacement Horizon: 1-2 years

REASONING

Petastorm is a mature, infrastructure-grade project developed by Uber that solved a critical bottleneck in the ML lifecycle: moving data from analytical stores (Parquet/Spark) into training-ready tensors. Its defensibility rests on 'data gravity' and deep integration with the Apache ecosystem: once an enterprise has petabytes of Parquet data formatted for Petastorm, switching costs are high. With 1,882 stars and 285 forks, it remains a standard tool in Spark-heavy environments. However, its 'zero velocity' suggests it is in maintenance mode or being superseded by newer paradigms. It faces stiff competition from modern alternatives such as Ray Data, which offers more flexible distributed execution, and from specialized libraries such as MosaicML's StreamingDataset and Hugging Face 'datasets', which are better optimized for cloud-native and LLM workflows. Frontier labs are unlikely to compete directly, but platform providers such as Databricks and AWS are increasingly abstracting this layer away, which poses a medium risk of platform domination. The displacement horizon is 1-2 years, as teams migrate toward more performance-optimized streaming loaders for multi-node GPU clusters.

COMPOSABILITY

TECH STACK

Python, Apache Parquet, PySpark, PyArrow, TensorFlow, PyTorch

INTEGRATION

pip_installable

distributed_training, data_loading, parquet_integration, spark_ml, dataset_serialization

READINESS

Composability: component

Depth: production

PATTERNS

The reusable building blocks distilled from this project, each a mechanism you could lift into your own.

framework-agnostic reader adapter

other, transform

Stream<Row> -> FrameworkDataset

Wrap a thread-safe background row-generator stream into PyTorch IterableDataset or TensorFlow Dataset adapters.
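
The adapter idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Petastorm's actual reader API: `BackgroundRowStream` and `RowStreamDataset` are hypothetical names, rows are plain dicts, and the base class falls back to a stub when PyTorch is not installed. A TensorFlow adapter would wrap the same stream in a generator-backed dataset instead.

```python
import queue
import threading

try:  # use the real PyTorch base class when it is available
    from torch.utils.data import IterableDataset
except ImportError:  # stub so the sketch still runs without PyTorch
    class IterableDataset:
        pass


class BackgroundRowStream:
    """Hypothetical helper: reads rows on a worker thread through a bounded queue."""

    def __init__(self, row_source, max_buffered=64):
        # row_source: zero-argument callable returning a fresh iterable of rows
        self._row_source = row_source
        self._max_buffered = max_buffered

    def __iter__(self):
        buffer = queue.Queue(maxsize=self._max_buffered)

        def produce():
            try:
                for row in self._row_source():
                    buffer.put(("row", row))  # blocks when the consumer falls behind
                buffer.put(("done", None))
            except Exception as exc:  # forward reader failures to the consumer
                buffer.put(("error", exc))

        threading.Thread(target=produce, daemon=True).start()
        while True:
            kind, payload = buffer.get()
            if kind == "done":
                return
            if kind == "error":
                raise payload
            yield payload


class RowStreamDataset(IterableDataset):
    """Hypothetical adapter: exposes a row stream through the framework's dataset type."""

    def __init__(self, stream, transform=None):
        super().__init__()
        self._stream = stream
        self._transform = transform or (lambda row: row)

    def __iter__(self):
        for row in self._stream:
            yield self._transform(row)
```

The bounded queue is the design choice that matters here: it lets the reader thread run ahead of training while capping memory use, and the framework only ever sees an ordinary iterable.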

schema-driven tensor decoding

other, transform

Row<SerializedBytes> -> Row<Tensor>

Deserialize multi-dimensional tensors and images stored as raw bytes in Parquet columns using custom metadata schemas.
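
A rough sketch of schema-driven decoding follows, using only the standard library so it runs anywhere. The `FieldSpec` type, the `decode_row` function, and the little-endian raw-bytes layout are assumptions made for illustration and are not Petastorm's own schema or codec classes. In practice the decoded value would normally be a NumPy array or framework tensor rather than nested lists.

```python
import struct
from dataclasses import dataclass
from math import prod
from typing import Dict, Tuple


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical schema entry describing how one byte column is laid out."""
    fmt: str                # struct format character, e.g. "f" (float32) or "B" (uint8)
    shape: Tuple[int, ...]  # tensor shape, e.g. (2, 3)


def _reshape(flat, shape):
    """Fold a flat sequence into nested lists matching `shape`."""
    if not shape:
        return flat[0]  # scalar field
    if len(shape) == 1:
        return list(flat)
    step = len(flat) // shape[0]
    return [_reshape(flat[i * step:(i + 1) * step], shape[1:])
            for i in range(shape[0])]


def decode_row(row: Dict[str, object], schema: Dict[str, FieldSpec]) -> Dict[str, object]:
    """Decode every byte column named in `schema`; pass other columns through."""
    decoded = {}
    for name, value in row.items():
        spec = schema.get(name)
        if spec is None:
            decoded[name] = value
            continue
        layout = f"<{prod(spec.shape)}{spec.fmt}"  # assumed little-endian, tightly packed
        if len(value) != struct.calcsize(layout):
            raise ValueError(f"column {name!r}: byte length does not match schema")
        decoded[name] = _reshape(struct.unpack(layout, value), spec.shape)
    return decoded
```

For example, `decode_row({"id": 7, "image": bytes(range(6))}, {"image": FieldSpec("B", (2, 3))})` leaves `id` untouched and turns the six raw bytes into a 2x3 nested list. Keeping shape and dtype in the schema rather than in each row is what lets the stored column remain plain bytes.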

rank-based dataset sharding

two-tier parquet shuffling