nvidia / cosmos-predict1-7b

Cosmos-Predict1: A Suite of Diffusion-Based World Foundation Models

Cosmos | Code | Paper

Model Overview

Description:

Cosmos World Foundation Models: A family of highly performant pre-trained world foundation models purpose-built for accelerating synthetic data generation.

Cosmos world foundation models include:

  1. Cosmos Predict: Generalist model for predictive video generation from text, image, or video prompts that produces future frame sequences based on input context. Trained on 20M hours of physical AI data, the model serves as a strong foundation for post-training into specialized models for autonomous systems.
  2. Cosmos Transfer: Multi-control model that generates videos conditioned on ground-truth simulations or structured video inputs for physical accuracy, enabling amplification of data into diverse environments and lighting conditions.
  3. Cosmos Reason: Multimodal reasoning model that produces text responses for planning, based on spatial and temporal understanding of input video.

Model Developer: NVIDIA

Model Versions

The Cosmos-Predict1 release includes the following models:

  • Cosmos-Predict1-7B-Text2World
  • Cosmos-Predict1-14B-Text2World
  • Cosmos-Predict1-7B-Video2World
  • Cosmos-Predict1-14B-Video2World

License:

This model is released under the NVIDIA Open Model License. For a custom license, please contact [email protected].

Under the NVIDIA Open Model License, NVIDIA confirms:

  • Models are commercially usable.
  • You are free to create and distribute Derivative Models.
  • NVIDIA does not claim ownership of any outputs generated using the Models or Derivative Models.

Important Note: If you bypass, disable, reduce the efficacy of, or circumvent any technical limitation, safety guardrail or associated safety guardrail hyperparameter, encryption, security, digital rights management, or authentication mechanism contained in the Model, your rights under the NVIDIA Open Model License Agreement will automatically terminate.

Model Architecture:

Cosmos-Predict1-7B-Text2World and Cosmos-Predict1-7B-Video2World are diffusion transformer models designed for video denoising in the latent space. The network is composed of interleaved self-attention, cross-attention, and feedforward layers as its building blocks. The cross-attention layers allow the model to condition on input text throughout the denoising process. Before each layer, adaptive layer normalization is applied to embed the denoising timestep information. When an image or video is provided as input, its latent frames are concatenated with the generated frames along the temporal dimension. Augmentation noise is added to the conditional latent frames to bridge the gap between training and inference.
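To make the block structure concrete, below is a minimal PyTorch sketch of one such transformer block. All module names, dimensions, and the exact adaptive-layer-norm parameterization are illustrative assumptions for exposition, not the actual Cosmos implementation.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Illustrative diffusion-transformer block: self-attention over video
    latent tokens, cross-attention to text embeddings, and a feedforward
    MLP, each preceded by adaptive layer normalization driven by the
    denoising-timestep embedding. Dimensions are placeholders."""

    def __init__(self, dim=1024, n_heads=16, text_dim=1024):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(
            dim, n_heads, kdim=text_dim, vdim=text_dim, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        # Affine parameters come from the timestep, so the norms carry none.
        self.norms = nn.ModuleList(
            [nn.LayerNorm(dim, elementwise_affine=False) for _ in range(3)])
        # Timestep embedding -> (shift, scale) pairs for the three norms.
        self.ada_ln = nn.Linear(dim, 6 * dim)

    def forward(self, x, text_emb, t_emb):
        # x: (B, T*H*W, dim) video latent tokens; text_emb: (B, L, text_dim);
        # t_emb: (B, dim) embedding of the denoising timestep.
        s = self.ada_ln(t_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norms[0](x) * (1 + s[1]) + s[0]
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norms[1](x) * (1 + s[3]) + s[2]
        x = x + self.cross_attn(h, text_emb, text_emb, need_weights=False)[0]
        h = self.norms[2](x) * (1 + s[5]) + s[4]
        return x + self.mlp(h)
```

Cross-attention injects the text condition at every block, and the timestep embedding modulates each normalization through learned shift and scale pairs, matching the description above.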

Cosmos-Predict1-7B-Text2World Input/Output Specifications

  • Input

    • Input Type(s): Text
    • Input Format(s): String
    • Input Parameters: One-dimensional (1D)
    • Other Properties Related to Input:
      • The input string should contain fewer than 300 words and should provide descriptive content for world generation, such as a scene description, key objects or characters, background, and any specific actions or motions to be depicted within the 5-second duration.
  • Output

    • Output Type(s): Video
    • Output Format(s): mp4
    • Output Parameters: Three-dimensional (3D)
    • Other Properties Related to Output: The generated video will be a 5-second clip with a resolution of 1280x704 pixels at 24 frames per second (fps). The content of the video will visualize the input text description as a short animated scene, capturing the main elements mentioned in the input within the time constraints.
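As a hedged illustration of the input constraint above, the following sketch validates a prompt before it is submitted for generation. The helper name is hypothetical and not part of the Cosmos codebase.

```python
def validate_text2world_prompt(prompt: str, max_words: int = 300) -> str:
    """Check a Text2World prompt against the stated spec: a descriptive
    string of fewer than 300 words. Hypothetical helper for illustration."""
    words = prompt.split()
    if not words:
        raise ValueError("Prompt must be a non-empty scene description.")
    if len(words) >= max_words:
        raise ValueError(
            f"Prompt has {len(words)} words; the model expects fewer "
            f"than {max_words}.")
    return prompt

prompt = validate_text2world_prompt(
    "A robot arm picks up a red cube from a conveyor belt in a brightly "
    "lit factory, then places it into a bin over the 5-second clip.")
```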

Cosmos-Predict1-7B-Video2World Input/Output Specifications

  • Input

    • Input Type(s): Text+Image, Text+Video
    • Input Format(s):
      • Text: String
      • Image: jpg, png, jpeg, webp
      • Video: mp4
    • Input Parameters:
      • Text: One-dimensional (1D)
      • Image: Two-dimensional (2D)
      • Video: Three-dimensional (3D)
    • Other Properties Related to Input:
      • The input string should contain fewer than 300 words and should provide descriptive content for world generation, such as a scene description, key objects or characters, background, and any specific actions or motions to be depicted within the 5-second duration.
      • The input image should have a resolution of 1280x704.
      • The input video should have a resolution of 1280x704 and provide 9 input frames.
  • Output

    • Output Type(s): Video
    • Output Format(s): mp4
    • Output Parameters: Three-dimensional (3D)
    • Other Properties Related to Output: The generated video will be a 5-second clip with a resolution of 1280x704 pixels at 24 frames per second (fps). The content of the video will use the provided image as the first frame and visualize the input text description as a short animated scene, capturing the main elements mentioned in the input within the time constraints.
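The sketch below shows one way to prepare inputs that match the specification above, using OpenCV. The function names are hypothetical, and taking the first 9 frames of a longer clip is an assumption made here for illustration; the spec only requires that 9 frames at 1280x704 be supplied.

```python
import cv2  # OpenCV for basic image and video handling

TARGET_W, TARGET_H = 1280, 704   # resolution stated in the spec
NUM_COND_FRAMES = 9              # conditioning frames for video input

def prepare_image(path: str):
    """Load an image and resize it to the expected 1280x704 input."""
    img = cv2.imread(path)
    if img is None:
        raise FileNotFoundError(path)
    return cv2.resize(img, (TARGET_W, TARGET_H))

def prepare_video(path: str):
    """Read 9 frames from an mp4 and resize each to 1280x704."""
    cap = cv2.VideoCapture(path)
    frames = []
    while len(frames) < NUM_COND_FRAMES:
        ok, frame = cap.read()
        if not ok:
            raise ValueError(f"Video has fewer than {NUM_COND_FRAMES} frames.")
        frames.append(cv2.resize(frame, (TARGET_W, TARGET_H)))
    cap.release()
    return frames
```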

Software Integration

Runtime Engine(s):

  • Cosmos

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Blackwell
  • NVIDIA Hopper
  • NVIDIA Ampere

Note: We have only tested inference with BF16 precision.
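As a small illustration of BF16 inference in PyTorch (the only precision reported as tested), a stand-in module can be run under autocast as follows; the model here is a placeholder, not the Cosmos network.

```python
import torch

# Run a forward pass under bfloat16 autocast on a CUDA device.
model = torch.nn.Linear(16, 16).cuda().eval()
x = torch.randn(1, 16, device="cuda")
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    y = model(x)
print(y.dtype)  # torch.bfloat16
```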

Operating System(s):

  • Linux (We have not tested it on other operating systems.)

Usage

See the Cosmos code repository (linked above) for setup and inference instructions.

Evaluation

Please see our technical paper for detailed evaluations.

Inference Time and GPU Memory Usage

The numbers provided below may vary depending on system specs and are for reference only.

We report the maximum observed GPU memory usage during end-to-end inference. Additionally, we offer a series of model offloading strategies to help users manage GPU memory usage effectively.

For GPUs with limited memory (e.g., RTX 3090/4090 with 24 GB memory), we recommend fully offloading all models. For higher-end GPUs, users can select the most suitable offloading strategy considering the numbers provided below.

Cosmos-Predict1-7B-Text2World

| Offloading Strategy | 7B Text2World | 14B Text2World |
|---|---|---|
| Offload prompt upsampler | 74.0 GB | > 80.0 GB |
| Offload prompt upsampler and guardrails | 57.1 GB | 70.5 GB |
| Offload prompt upsampler, guardrails and T5 encoder | 38.5 GB | 51.9 GB |
| Offload prompt upsampler, guardrails, T5 encoder and tokenizer | 38.3 GB | 51.7 GB |
| Offload prompt upsampler, guardrails, T5 encoder, tokenizer and diffusion model | 24.4 GB | 39.0 GB |
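As a hedged sketch, the peak-memory figures above can drive a simple strategy picker that selects the least aggressive offloading that fits the available GPU memory. The numbers are copied from the 7B Text2World column; the function is illustrative, not part of the Cosmos tooling.

```python
# Peak GPU memory (GB) from the 7B Text2World column above, ordered from
# least to most aggressive offloading.
TEXT2WORLD_7B_STRATEGIES = [
    ("offload prompt upsampler", 74.0),
    ("offload prompt upsampler + guardrails", 57.1),
    ("offload prompt upsampler + guardrails + T5 encoder", 38.5),
    ("offload prompt upsampler + guardrails + T5 encoder + tokenizer", 38.3),
    ("offload prompt upsampler + guardrails + T5 encoder + tokenizer "
     "+ diffusion model", 24.4),
]

def pick_strategy(gpu_memory_gb, strategies=TEXT2WORLD_7B_STRATEGIES):
    """Return the least aggressive strategy whose reported peak memory
    fits the given GPU memory budget."""
    for name, peak_gb in strategies:
        if peak_gb <= gpu_memory_gb:
            return name
    raise MemoryError(
        f"No listed strategy fits in {gpu_memory_gb} GB; 7B Text2World "
        "needs at least 24.4 GB even with everything offloaded.")

print(pick_strategy(80.0))  # H100 80 GB -> "offload prompt upsampler"
print(pick_strategy(40.0))  # -> strategy that also offloads the T5 encoder
```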

The table below presents the end-to-end inference runtime on a single H100 GPU, excluding model initialization time.

| 7B Text2World (offload prompt upsampler) | 14B Text2World (offload prompt upsampler, guardrails) |
|---|---|
| ~380 seconds | ~590 seconds |

Cosmos-Predict1-7B-Video2World

| Offloading Strategy | 7B Video2World | 14B Video2World |
|---|---|---|
| Offload prompt upsampler | 76.5 GB | > 80.0 GB |
| Offload prompt upsampler and guardrails | 59.9 GB | 73.3 GB |
| Offload prompt upsampler, guardrails and T5 encoder | 41.3 GB | 54.8 GB |
| Offload prompt upsampler, guardrails, T5 encoder and tokenizer | 41.1 GB | 54.5 GB |
| Offload prompt upsampler, guardrails, T5 encoder, tokenizer and diffusion model | 27.3 GB | 39.0 GB |

The following table shows the end-to-end inference runtime on a single H100 GPU, excluding model initialization time:

| 7B Video2World (offload prompt upsampler) | 14B Video2World (offload prompt upsampler, guardrails) |
|---|---|
| ~383 seconds | ~593 seconds |

Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

For more detailed information on ethical considerations for this model, please see the subcards of Explainability, Bias, Safety & Security, and Privacy below. Please report security vulnerabilities or NVIDIA AI Concerns here.