MosaicML Optimizes Cloud-Native Training with PyTorch 'Streaming' Library
Bridging the gap between cloud object storage and GPU memory for the generative AI era
In October 2022, just one month prior to the industry-shifting release of ChatGPT, MosaicML introduced 'Streaming,' a library designed to resolve critical bottlenecks in loading terabyte-scale datasets from cloud object storage directly into PyTorch training loops.
As deep learning models transitioned from gigabyte-scale specialized networks to terabyte-scale foundation models in late 2022, data infrastructure became a primary constraint on training velocity. MosaicML, an infrastructure startup focused on efficient model training (acquired by Databricks in June 2023), released 'Streaming' to address the limitations of standard PyTorch data loading mechanisms when interfacing with cloud storage buckets like AWS S3 or Google Cloud Storage.
The Latency Bottleneck
Prior to this release, training large language models (LLMs) often required copying an entire dataset to a local cluster's disk before training could commence, a process that became untenable as datasets grew into the multi-terabyte range. Alternatively, engineers used the standard PyTorch IterableDataset, which allowed streaming but suffered from poor shuffling and no deterministic resumption after crashes: because an iterable dataset exposes no random-access index, shuffling is confined to a bounded in-memory buffer, and a failure mid-epoch forces the job to restart from the first sample.
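To make the limitation concrete, here is an illustrative sketch (not code from the release) of the naive pattern: an IterableDataset that streams remote files through a bounded shuffle buffer. The read_remote_file helper is hypothetical.

```python
import random
from torch.utils.data import IterableDataset

def read_remote_file(uri):
    # Hypothetical helper standing in for an S3/GCS client that
    # downloads one remote file and parses it into samples.
    yield from ()

class NaiveRemoteDataset(IterableDataset):
    """Illustrative only: the pattern 'Streaming' set out to replace."""

    def __init__(self, remote_uris, buffer_size=10_000):
        self.remote_uris = remote_uris
        self.buffer_size = buffer_size

    def __iter__(self):
        buffer = []
        for uri in self.remote_uris:
            for sample in read_remote_file(uri):
                buffer.append(sample)
                if len(buffer) >= self.buffer_size:
                    random.shuffle(buffer)  # shuffle is local to the buffer,
                    yield from buffer       # never global across the dataset
                    buffer.clear()
        random.shuffle(buffer)
        yield from buffer
        # No epoch position is tracked: a crash restarts from sample 0.
```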
MosaicML’s library was positioned as a "drop-in replacement for the PyTorch IterableDataset class," engineered to sustain the throughput GPUs demand while reading directly from remote storage. According to the release, the library "allows users to stream training data from cloud-based object storage" while maintaining the performance characteristics of local data access.
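In practice, the drop-in claim means a StreamingDataset plugs into an ordinary PyTorch DataLoader. The sketch below follows the library's documented usage; the bucket path and cache directory are placeholders, and exact keyword arguments may vary across versions.

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# remote points at the bucket holding pre-converted shards;
# local is a cache directory on the training node's own disk.
dataset = StreamingDataset(
    remote='s3://my-bucket/my-dataset',  # placeholder path
    local='/tmp/streaming-cache',
    shuffle=True,
    batch_size=32,
)

# The dataset hands off to standard PyTorch data loading machinery.
loader = DataLoader(dataset, batch_size=32, num_workers=8)

for batch in loader:
    ...  # training step goes here
```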
Technical Architecture and Determinism
The 'Streaming' library took a different approach to data granularity than competing solutions like WebDataset, which reads samples sequentially out of tar archives. By storing samples in a purpose-built binary format (MDS) whose index permits random access to individual samples, the library enabled elastic determinism: training runs could be paused, resumed, or rescaled across different numbers of nodes without losing the position in the dataset or altering the data shuffling order, a critical requirement for training foundation models that can take weeks to converge.
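As a hedged sketch of what elastic determinism looks like in user code, the library ships a StreamingDataLoader wrapper whose position can be checkpointed and restored; the paths below are placeholders, and state-handling details may differ by version.

```python
from streaming import StreamingDataLoader, StreamingDataset

dataset = StreamingDataset(
    remote='s3://my-bucket/my-dataset',  # placeholder path
    local='/tmp/streaming-cache',
    shuffle=True,
    shuffle_seed=42,  # a fixed seed pins the global shuffle order
)
loader = StreamingDataLoader(dataset, batch_size=32)

# Persist the loader's position alongside each model checkpoint.
loader_state = loader.state_dict()

# After a crash, or after rescaling to a different node count,
# restoring the state resumes mid-epoch at the same sample with
# the same shuffle order.
loader.load_state_dict(loader_state)
```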
The library also addressed the "network latency dependency for cloud streaming" by implementing smart local caching and pre-fetching. While the primary copy of the data lives in cloud storage, the library "reads files from local disk" if they have already been cached, effectively creating a tiered storage system that keeps GPUs fed even when network performance fluctuates.
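Conceptually, each shard read falls through a two-level hierarchy. This is an illustration of the tiering idea rather than the library's internal code; fetch_from_remote stands in for an object-storage download call.

```python
import os

def read_shard(shard_name, local_dir, fetch_from_remote):
    """Serve a shard from the local cache when present; otherwise
    download it once and cache it for subsequent epochs."""
    local_path = os.path.join(local_dir, shard_name)
    if not os.path.exists(local_path):
        fetch_from_remote(shard_name, local_path)  # pay the network cost once
    with open(local_path, 'rb') as f:              # later reads hit local disk
        return f.read()
```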
Retrospective: The Infrastructure War of 2023
Viewed in retrospect, the October 2022 release was strategically timed. The ecosystem was then fragmented across data loading solutions including PyTorch DataPipes, WebDataset, and proprietary loaders inside labs like OpenAI and DeepMind. MosaicML’s entry targeted the "mid-market" of AI developers: enterprises and research labs that needed to train LLMs but lacked the internal tooling teams of a hyperscaler.
The release of 'Streaming' presaged the massive demand for cloud-native training infrastructure that exploded in 2023. As organizations rushed to train LLaMA derivatives and custom GPT models, the ability to stream tokens efficiently from S3 became a non-negotiable requirement. This specific capability likely contributed to the valuation rationale behind Databricks' $1.3 billion acquisition of MosaicML less than a year later. The library demonstrated that efficient data plumbing was just as valuable as the model architecture itself.
Limitations and Ecosystem Fit
Despite the performance claims, the library introduced friction in data preparation. To achieve optimal streaming performance, raw data (JSON, Parquet, images) typically requires conversion into MosaicML’s shard format, as sketched below. While this conversion is a one-time cost, it creates a degree of vendor or ecosystem lock-in compared to reading raw files directly. Furthermore, while the release highlighted support for cloud storage, comprehensive benchmarks comparing throughput against highly optimized C++ loaders or Ray Data were not available at launch.
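The conversion step itself is short. Below is a sketch using the library's MDSWriter, with the schema, sample data, and output path as placeholders.

```python
from streaming import MDSWriter

# Declare the schema of each sample; supported encodings include
# 'str', 'int', 'bytes', and common image formats.
columns = {'text': 'str', 'label': 'int'}

samples = [
    {'text': 'hello world', 'label': 0},        # placeholder raw data
    {'text': 'foundation models', 'label': 1},
]

# out may be a local directory or a cloud bucket path; shards are
# written, and optionally compressed, as they fill up.
with MDSWriter(out='s3://my-bucket/my-dataset', columns=columns,
               compression='zstd') as writer:
    for sample in samples:
        writer.write(sample)
```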
Ultimately, MosaicML Streaming represented a shift in ML engineering: an acknowledgment that in the era of generative AI, the dataset is too large to move wholesale, so training must reach the data where it lives via intelligent streaming pipelines.
Key Takeaways
- **Cloud-Native Design**: The library enables direct streaming from object storage (S3/GCS) to GPU memory, bypassing the need for massive local disk arrays.
- **Deterministic Resumption**: Unlike standard iterables, 'Streaming' supports deterministic shuffling and resumption, critical for long-running LLM training jobs.
- **Strategic Timing**: Released one month before ChatGPT, the tool addressed a specific infrastructure gap for the impending wave of foundation model training.
- **Acquisition Context**: This infrastructure capability became a core component of the Databricks Mosaic AI stack following the 2023 acquisition.