MosaicML Optimizes Cloud-Native Training with PyTorch 'Streaming' Library

Bridging the gap between cloud object storage and GPU memory for the generative AI era

Editorial Team

In October 2022, just one month prior to the industry-shifting release of ChatGPT, MosaicML introduced 'Streaming,' a library designed to resolve critical bottlenecks in loading terabyte-scale datasets from cloud object storage directly into PyTorch training loops.

As deep learning models transitioned from gigabyte-scale specialized networks to terabyte-scale foundation models in late 2022, data infrastructure became a primary constraint on training velocity. MosaicML, an infrastructure startup focused on efficient model training (acquired by Databricks in June 2023), released 'Streaming' to address the limitations of standard PyTorch data loading mechanisms when interfacing with cloud storage buckets like AWS S3 or Google Cloud Storage.

The Latency Bottleneck

Prior to this release, training large language models (LLMs) often required copying entire datasets to a local cluster's disk before training could commence, a process that became untenable as datasets grew into the multi-terabyte range. Alternatively, engineers used the standard PyTorch IterableDataset, which allowed streaming but typically offered only weak shuffling and no deterministic resumption after crashes.
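For context, the baseline approach looks like the following sketch: a minimal, hypothetical IterableDataset that walks remote shards in order. Everything here is standard PyTorch; `fetch_records` is a stand-in for any remote reader. Samples arrive in fixed shard order, and a crash loses the stream position entirely.

```python
from torch.utils.data import DataLoader, IterableDataset

def fetch_records(url):
    """Placeholder remote reader; a real one would stream from S3/GCS."""
    for i in range(4):  # simulate a few samples per shard
        yield {"shard": url, "sample": i}

class NaiveRemoteDataset(IterableDataset):
    """Sequential streaming with no global shuffle and no resumable state."""

    def __init__(self, urls):
        self.urls = urls

    def __iter__(self):
        # Samples arrive in fixed shard order; shuffling would require an
        # in-memory buffer, and a crash here loses the stream position.
        for url in self.urls:
            yield from fetch_records(url)

loader = DataLoader(NaiveRemoteDataset(["s3://bucket/shard-0", "s3://bucket/shard-1"]))
for batch in loader:
    pass  # training step would go here
```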

MosaicML positioned the library as a "drop-in replacement for the PyTorch IterableDataset class," engineered to sustain the throughput GPUs demand while reading directly from remote storage. The release stated that the library "allows users to stream training data from cloud-based object storage" while maintaining the performance characteristics of local data access.
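In practice, the drop-in pattern is short. The following is a minimal sketch based on the library's documented usage; the bucket path and cache directory are placeholders, and parameter names have shifted slightly across versions:

```python
from torch.utils.data import DataLoader
from streaming import StreamingDataset

# Placeholders: `remote` points at an MDS-formatted bucket prefix,
# `local` at a scratch directory used as the on-disk shard cache.
dataset = StreamingDataset(
    remote="s3://my-bucket/my-dataset",  # hypothetical bucket
    local="/tmp/streaming-cache",
    shuffle=True,    # deterministic, seed-driven shuffling
    batch_size=32,   # lets the dataset partition samples per rank
)

# The dataset plugs into an unmodified PyTorch DataLoader.
loader = DataLoader(dataset, batch_size=32, num_workers=8)
for batch in loader:
    ...  # training step
```

Because StreamingDataset subclasses the PyTorch IterableDataset, everything downstream of the DataLoader is unchanged.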

Technical Architecture and Determinism

The 'Streaming' library handled data granularity differently from competing solutions such as WebDataset: instead of sequentially read tar archives, it stored samples in an indexed binary shard format (MDS) that supports random access to individual samples. That design enabled elastic determinism, meaning training runs could be paused, resumed, or rescaled across different numbers of nodes without losing the position in the dataset or altering the shuffle order, a critical requirement for foundation models that can take weeks to converge.
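This determinism is surfaced through checkpointable loader state. A sketch of the pattern, assuming the library's StreamingDataLoader wrapper (exact method names may vary by version):

```python
from streaming import StreamingDataset, StreamingDataLoader

dataset = StreamingDataset(
    remote="s3://my-bucket/my-dataset",  # hypothetical bucket
    local="/tmp/streaming-cache",
    shuffle=True,
)
loader = StreamingDataLoader(dataset, batch_size=32)

# Save the loader's position alongside the model checkpoint...
state = loader.state_dict()

# ...and restore it after a crash or on a rescaled cluster; training
# resumes at the same sample with the same shuffle order.
loader.load_state_dict(state)
```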

The library also addressed the "network latency dependency for cloud streaming" by implementing smart local caching and pre-fetching. While the primary source is cloud storage, the library "reads files from local disk" if they have already been cached, effectively creating a tiered storage system that keeps GPUs fed with data even when network performance fluctuates.
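Both tiers are configurable on the dataset itself. The sketch below names the relevant knobs as they appear in the library's documentation; exact parameter names and defaults vary by release:

```python
from streaming import StreamingDataset

dataset = StreamingDataset(
    remote="s3://my-bucket/my-dataset",  # hypothetical bucket
    local="/tmp/streaming-cache",  # cached shards are read straight from disk
    cache_limit="50gb",            # evict cold shards once the cache exceeds this
    predownload=8_192,             # samples each worker fetches ahead of the loop
)
```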

Retrospective: The Infrastructure War of 2023

Viewed retrospectively, the timing of this October 2022 release was strategic. The ecosystem was then fragmented across data loading solutions including PyTorch DataPipes, WebDataset, and proprietary loaders built inside labs such as OpenAI and DeepMind. MosaicML's entry targeted the "mid-market" of AI developers: enterprises and research labs that needed to train LLMs but lacked the internal tooling teams of a hyperscaler.

The release of 'Streaming' presaged the massive demand for cloud-native training infrastructure that exploded in 2023. As organizations rushed to train LLaMA derivatives and custom GPT models, the ability to stream tokens efficiently from S3 became a non-negotiable requirement. This specific capability likely contributed to the valuation rationale behind Databricks' $1.3 billion acquisition of MosaicML less than a year later. The library demonstrated that efficient data plumbing was just as valuable as the model architecture itself.

Limitations and Ecosystem Fit

Despite the performance claims, the library introduced friction in data preparation. To achieve optimal streaming performance, raw data (JSON, Parquet, images) typically must first be converted into MosaicML's MDS shard format. While this conversion is a one-time cost, it creates a degree of ecosystem lock-in compared to reading raw files directly. Furthermore, while the release highlighted support for cloud storage, it shipped with few benchmarks comparing throughput against highly optimized C++ loaders or Ray Data.
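The conversion itself is a short script. A sketch using the library's MDSWriter, with illustrative column names and output path (early releases spelled some parameters differently):

```python
from streaming import MDSWriter

# Map each sample field to an MDS encoding; names here are illustrative.
columns = {"text": "str", "label": "int"}

samples = [
    {"text": "first document", "label": 0},
    {"text": "second document", "label": 1},
]

# Writes binary shards plus an index into the output location
# (a local path or a cloud prefix such as s3://...).
with MDSWriter(out="/tmp/mds-out", columns=columns, compression="zstd") as writer:
    for sample in samples:
        writer.write(sample)
```

The resulting shard directory is what StreamingDataset later consumes as its remote source.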

Ultimately, MosaicML Streaming represented a shift in ML engineering: acknowledging that for the era of Generative AI, the dataset is too large to move, and the compute must effectively come to the data via intelligent streaming pipelines.

Key Takeaways

- Released in October 2022, 'Streaming' is a drop-in replacement for PyTorch's IterableDataset built to stream terabyte-scale datasets from object stores such as AWS S3 and Google Cloud Storage.
- Its indexed MDS shard format enables elastic determinism: runs can pause, resume, or change node counts without losing dataset position or shuffle order.
- Local caching and pre-fetching mitigate network latency, keeping GPUs fed when cloud throughput fluctuates.
- The main friction is a one-time conversion of raw data into MDS shards, a mild form of ecosystem lock-in.