CI-VID: BAAI Releases 6.5TB Dataset to Bridge the Gap Between Video Clips and Coherent Narratives
New open-source dataset targets the 'amnesia' problem in generative AI video by focusing on interleaved sequences.
The current landscape of video generation is dominated by models capable of producing high-fidelity clips lasting a few seconds. However, extending these clips into a coherent narrative often results in temporal hallucinations, where protagonists spontaneously change appearance or backgrounds shift inexplicably between frames. The Beijing Academy of Artificial Intelligence (BAAI) has released CI-VID to provide the infrastructure necessary for the next generation of "long-form" video models.
The Shift to Interleaved Data
Most existing datasets, such as WebVid-10M or Panda-70M, rely on a simple one-to-one mapping of video clips to text descriptions. While effective for training models to understand basic motion and semantics, this approach fails to teach models how sequential shots relate to one another. CI-VID distinguishes itself by offering a dataset of over 340,000 interleaved sequences.
These sequences are not merely collections of random clips; they are structured to enable multi-clip coherent video generation. By training on data that links multiple shots under a unified narrative thread, models can learn to preserve subject identity and logical progression over longer durations. The dataset comprises 334,000 training samples and 8,000 test samples, totaling 6.5TB of data hosted on Hugging Face.
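For practitioners who want to inspect the release without pulling the full 6.5TB, the captions and metadata can be fetched selectively from Hugging Face. The sketch below assumes a repo id of BAAI/CI-VID and a JSON-based metadata layout, both of which should be verified against the actual dataset card.

```python
# Minimal sketch: selectively downloading CI-VID metadata from Hugging Face.
# The repo id and file patterns are assumptions for illustration, not the
# confirmed layout of the release.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="BAAI/CI-VID",       # assumed repo id; check the dataset card
    repo_type="dataset",
    allow_patterns=["*.json"],   # grab caption/metadata files first, not 6.5TB of video
)
print("Metadata downloaded to:", local_dir)
```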
Hierarchical Captioning Architecture
To build a dataset capable of teaching narrative consistency, the researchers designed a multi-stage construction pipeline that uses similarity analysis and subject entity segmentation to split longer videos into distinct but related clips.
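The article does not reproduce BAAI's exact pipeline, but the core idea of similarity-based splitting is simple to illustrate. The sketch below assumes per-frame embeddings from some visual encoder (for example, CLIP features) and a hypothetical similarity threshold; it is a stand-in for the authors' method, not a reimplementation of it.

```python
# Illustrative sketch of similarity-based clip splitting (not BAAI's exact pipeline).
# `frame_features` is assumed to be an (N, D) array of per-frame embeddings from
# any visual encoder; `threshold` is a hypothetical tuning knob.
import numpy as np

def split_into_clips(frame_features: np.ndarray, threshold: float = 0.85) -> list[tuple[int, int]]:
    """Return (start, end) frame-index pairs, cutting where adjacent frames diverge."""
    # Cosine similarity between each frame and the next one.
    normed = frame_features / np.linalg.norm(frame_features, axis=1, keepdims=True)
    sims = np.sum(normed[:-1] * normed[1:], axis=1)

    clips, start = [], 0
    for i, sim in enumerate(sims):
        if sim < threshold:          # a visual break -> close the current clip
            clips.append((start, i + 1))
            start = i + 1
    clips.append((start, len(frame_features)))
    return clips
```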
The labeling process goes beyond simple tagging. The system generates hierarchical captions that function on two levels:
- Single-clip captions: Describing the immediate action within a specific timeframe.
- Cross-clip transition captions: Describing the logical flow or change between sequences.
This dual-layer approach provides the semantic glue missing from previous datasets, allowing models to understand not just what is happening in a scene, but how that scene connects to the ones before and after it.
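In practice, an interleaved sample can be thought of as alternating clip and transition captions. The sketch below uses hypothetical field names (the actual schema may differ) to show one way such a record might be structured and flattened into a single narrative string for training.

```python
# Hypothetical record layout for an interleaved sample with two caption levels.
# Field names are illustrative; consult the dataset card for the actual schema.
from dataclasses import dataclass, field

@dataclass
class Clip:
    video_path: str
    caption: str  # single-clip caption: the immediate action in this shot

@dataclass
class InterleavedSample:
    clips: list[Clip]
    transitions: list[str] = field(default_factory=list)  # cross-clip captions, one per adjacent pair

    def as_training_text(self) -> str:
        """Interleave clip and transition captions into one narrative string."""
        parts = []
        for i, clip in enumerate(self.clips):
            parts.append(clip.caption)
            if i < len(self.transitions):
                parts.append(self.transitions[i])
        return " ".join(parts)
```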
Evaluation and Benchmarking
Measuring success in creative generation is notoriously difficult. To validate the utility of CI-VID, the team established a multi-dimensional evaluation framework. This system combines professional human review with automated assessments based on Vision-Language Models (VLMs) and objective similarity calculations. This triangulation attempts to quantify subjective qualities like "story coherence" alongside technical metrics like frame consistency.
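The exact scoring recipe is not spelled out here, but one plausible automated component is a cross-clip consistency score derived from per-clip embeddings. The sketch below is an illustrative stand-in for that idea, not the team's actual metric.

```python
# One plausible automated component: cross-clip consistency scored as the mean
# pairwise cosine similarity of per-clip subject embeddings. Illustrative only;
# assumes at least two clips per generated sequence.
import numpy as np

def cross_clip_consistency(clip_embeddings: np.ndarray) -> float:
    """clip_embeddings: (num_clips, D) array, one embedding per generated clip."""
    normed = clip_embeddings / np.linalg.norm(clip_embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T                          # pairwise cosine similarities
    upper = sims[np.triu_indices(len(normed), k=1)]   # exclude self-similarity
    return float(upper.mean())
```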
Limitations and Industry Impact
The release of CI-VID comes with specific constraints. The dataset is currently restricted to a non-commercial license, limiting its immediate utility to academic and research environments. Furthermore, the sheer size of the dataset—6.5TB—presents a significant computational barrier, likely restricting its use to well-funded research labs and enterprise-grade infrastructure.
Despite these limitations, CI-VID represents a critical step in the maturation of generative video. As the industry looks to move beyond the "stock footage" era of AI video—exemplified by tools like Runway and Pika—toward actual filmmaking and storytelling, data infrastructure must evolve to support temporal coherence. CI-VID suggests that the solution lies not just in larger models, but in data that reflects the complexity of narrative structure.