Cosmos-Transfer1-7B Overview
Description:
Cosmos-Transfer1-7B belongs to Cosmos-Transfer1, a family of highly performant pre-trained world foundation models designed to generate videos aligned with input control conditions. These are diffusion-based models that generate dynamic, high-quality videos from text and control video inputs, and they can serve as building blocks for applications and research related to world generation. This model is ready for commercial use.
License/Terms of Use:
The trial service is governed by the NVIDIA API Trial Terms of Service; use of the Cosmos-Transfer1-7B model is governed by the NVIDIA Open Model License.
Deployment Geography:
Global
Use Case:
World generation for Physical AI developers.
Release Date:
GitHub 03/02/2025 via [https://github.com/nvidia-cosmos/cosmos-transfer1]
Hugging Face 03/02/2025 via [https://huggingface.co/nvidia/Cosmos-Transfer1-7B]
NGC 06/03/2025 via [https://catalog.ngc.nvidia.com/orgs/nvidia/teams/cosmos-transfer1/containers/cosmos-transfer1-7b]
Reference(s):
Cosmos-Transfer1: A Suite of Diffusion-based World-to-World Transfer Models
GitHub repo
Model Architecture:
Cosmos-Transfer1-7B is a diffusion transformer model designed for video denoising in the latent space, modulated by multiple control branches with a spatiotemporal control map.
The diffusion transformer network ("the base model") is composed of interleaved self-attention, cross-attention, and feedforward layers as its building blocks. The cross-attention layers allow the model to condition on the input text throughout the denoising process. Before each layer, adaptive layer normalization is applied to embed the denoising time-step information.
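As an illustrative sketch only (not the released implementation; all module names and sizes are assumptions), the block structure described above could look like this in PyTorch:

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Adaptive layer norm: scale/shift predicted from the denoising-time embedding."""
    def __init__(self, dim: int, t_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(t_dim, 2 * dim)

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(t_emb).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

class TransformerBlock(nn.Module):
    """One interleaved self-attention / cross-attention / feedforward block."""
    def __init__(self, dim: int, t_dim: int, n_heads: int = 8):
        super().__init__()
        self.norm1 = AdaLN(dim, t_dim)
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm2 = AdaLN(dim, t_dim)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm3 = AdaLN(dim, t_dim)
        self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, text_emb, t_emb):
        # adaLN before each sub-layer embeds the denoising time step.
        h = self.norm1(x, t_emb)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cross-attention conditions the video tokens on the text embedding.
        h = self.norm2(x, t_emb)
        x = x + self.cross_attn(h, text_emb, text_emb, need_weights=False)[0]
        x = x + self.ff(self.norm3(x, t_emb))
        return x
```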
The control branch is formed by replicating the first few transformer blocks of the base model. It processes the control input video to extract control signals, which are then injected into the corresponding transformer blocks of the base model, guiding the denoising process with structured control. When multiple control input videos are provided, each is processed by a dedicated control branch, trained to extract modality-specific control signals. These signals are then combined through a weighted sum and injected into the corresponding transformer blocks in the base model.
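Similarly, a minimal sketch of the multi-branch control scheme, reusing the block signature from the previous sketch; the structure (replicated early blocks, weighted-sum injection) follows the description above, while all names and shapes are assumptions:

```python
import copy
import torch
import torch.nn as nn

class ControlledDenoiser(nn.Module):
    """Base blocks plus one replicated control branch per control modality."""
    def __init__(self, base_blocks: nn.ModuleList, n_control_blocks: int, n_modalities: int):
        super().__init__()
        self.base_blocks = base_blocks
        self.n_control_blocks = n_control_blocks
        # Each branch replicates the first few transformer blocks of the base model.
        self.branches = nn.ModuleList(
            copy.deepcopy(base_blocks[:n_control_blocks]) for _ in range(n_modalities)
        )

    def forward(self, x, control_latents, weights, text_emb, t_emb):
        # control_latents: one latent per control modality; weights: per-modality scalars.
        signals = [torch.zeros_like(x) for _ in range(self.n_control_blocks)]
        for branch, c, w in zip(self.branches, control_latents, weights):
            h = c
            for i, block in enumerate(branch):
                h = block(h, text_emb, t_emb)
                signals[i] = signals[i] + w * h  # weighted sum across modalities
        for i, block in enumerate(self.base_blocks):
            x = block(x, text_emb, t_emb)
            if i < self.n_control_blocks:
                x = x + signals[i]  # inject the combined control signal
        return x
```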
Input:
Input Type(s): Text+Video
Input Format(s):
Text: String
Video: mp4
Control Input Video: mp4
Input Parameters:
Text: One-dimensional (1D)
Video: Three-dimensional (3D)
Control Input Video: Three-dimensional (3D)
Other Properties Related to Input:
* The input string should contain fewer than 300 words and should provide descriptive content for world generation, such as a scene description, key objects or characters, background, and any specific actions or motions to be depicted within the 5-second duration.
* The model supports control input videos of varying lengths, but lengths that are multiples of 120 frames (e.g., 120, 240, or 360 frames) perform best.
* The model supports five types of control input videos: blurred video, Canny edge, depth map, segmentation mask, and human keypoints. When multiple control inputs are provided, they must be derived from the same source video, representing different modalities of the same content while maintaining identical spatio-temporal dimensions. A hypothetical preprocessing sketch follows this list.
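As referenced above, a hypothetical preprocessing sketch using OpenCV that checks the prompt-length and frame-count guidance and derives one control modality (Canny edge) from a source video; the file paths, thresholds, and warning logic are illustrative assumptions, not part of the released tooling:

```python
import cv2

prompt = "A robot arm picks up a red cube from a cluttered workbench."
assert len(prompt.split()) < 300, "the input string should contain fewer than 300 words"

reader = cv2.VideoCapture("source_video.mp4")  # placeholder input path
fps = reader.get(cv2.CAP_PROP_FPS)
n_frames = int(reader.get(cv2.CAP_PROP_FRAME_COUNT))
if n_frames % 120 != 0:
    print(f"note: {n_frames} frames; multiples of 120 frames perform best")

writer = None
while True:
    ok, frame = reader.read()
    if not ok:
        break
    # Derive the Canny-edge modality; it keeps the source's spatio-temporal dimensions.
    edges = cv2.Canny(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), 100, 200)
    edges_bgr = cv2.cvtColor(edges, cv2.COLOR_GRAY2BGR)
    if writer is None:
        h, w = edges_bgr.shape[:2]
        writer = cv2.VideoWriter(
            "edge_control.mp4", cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h)
        )
    writer.write(edges_bgr)
reader.release()
if writer is not None:
    writer.release()
```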
Output:
Output Type(s): Video
Output Format: mp4
Output Parameters: Three-dimensional (3D)
Other Properties Related to Output: By default, the generated video has the same length, height, width, and frames-per-second (fps) as the input. The model transforms the input video into a photorealistic scene that follows the input text description.
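A small illustrative check of this default behavior (file names are placeholders), comparing basic video properties with OpenCV:

```python
import cv2

def video_props(path: str):
    """Return (frame count, height, width, fps) for an mp4."""
    cap = cv2.VideoCapture(path)
    props = (
        int(cap.get(cv2.CAP_PROP_FRAME_COUNT)),
        int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)),
        int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
        round(cap.get(cv2.CAP_PROP_FPS), 2),
    )
    cap.release()
    return props

assert video_props("input_video.mp4") == video_props("generated_video.mp4")
```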
Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster inference times compared to CPU-only solutions.
Software Integration:
Runtime Engine(s):
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Hopper H200
- NVIDIA Hopper H100 SXM
- NVIDIA Hopper H100 NVL
- NVIDIA Hopper H100 PCIe
- NVIDIA Ampere A100
- NVIDIA Ada Lovelace L40S
- NVIDIA Hopper H20
- NVIDIA Blackwell
Preferred/Supported Operating System(s):
- Linux
Model Version(s):
Cosmos-Transfer1-7B
Evaluation Dataset:
Please see our technical paper for detailed evaluations of the base model. The control models are built upon the base foundation model.
Data Collection Method by dataset:
- Hybrid: Human, Synthetic, Automated
Ethical Considerations:
NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
Users are responsible for model inputs and outputs. Users are responsible for ensuring safe integration of this model, including implementing guardrails as well as other safety mechanisms, prior to deployment.
For more detailed information on ethical considerations for this model, please see the Model Card++ Explainability, Bias, Safety & Security, and Privacy Subcards.
Please report security vulnerabilities or NVIDIA AI Concerns here.