Creates a model response for the given chat conversation.

Given a list of messages comprising a conversation, the model will return a response. Compatible with OpenAI. See https://platform.openai.com/docs/api-reference/chat/create
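Because the endpoint is OpenAI-compatible, a request body is a plain JSON object. A minimal sketch in Python that assembles one (the defaults mirror the Body Params section below; the endpoint URL and authentication depend on your deployment and are not shown):

```python
import json

def build_chat_request(messages, model="igenius/colosseum_355b_instruct_16k",
                       temperature=0.2, top_p=0.7, max_tokens=1024, stream=False):
    """Assemble an OpenAI-compatible chat completions request body.

    `messages` is a list of {"role": ..., "content": ...} dicts; roles must
    alternate between "user" and "assistant", ending with a "user" message.
    """
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "stream": stream,
    }

body = build_chat_request([{"role": "user", "content": "Summarize Basel III in one sentence."}])
print(json.dumps(body, indent=2))
```

The resulting JSON is POSTed to the deployment's chat completions route with a Bearer token in the Authorization header.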

Model Overview

Description:

Colosseum-355B is a large language model (LLM) that is designed for use cases in regulated industries—such as financial services, government, and heavy industry. It supports multi-lingual single-turn and multi-turn chat formats, with a context length of up to 16,384 tokens.

The baseline model was pre-trained on a corpus of 9 trillion tokens drawn from a diverse assortment of English texts, more than 50 natural languages, and a wide range of coding languages.

During Continual Pretraining (CPT), the model was trained on an additional 2 trillion tokens, bringing the total to 11 trillion. This phase expanded the model’s parameters and context length, implemented continual pretraining in FP8, and aligned the model's capabilities with specific domain expertise.

Finally, the Colosseum-355B model went through alignment steps, including:

  • Supervised Fine-tuning (SFT)
  • Direct Preference Optimization (DPO)

Because the pre-training data primarily consists of freely available web documents and open-source repositories, the CPT dataset was built to preserve the original data distribution and remain consistent with the composition of the original corpus.

The model supports over 50 languages, with a strong focus on European languages such as German, French, Italian, Spanish, Portuguese, Russian, Romanian, and Polish. Its training dataset also includes substantial representations of non-European languages, including Japanese, Chinese, Arabic, Vietnamese, and Korean.

Additionally, the model integrates specialized sources from domains such as finance and reasoning, drawing from high-quality datasets to enhance performance in these areas.

This model is for research and development only. For commercial use, please follow the Terms of Use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case.

License/Terms of Use:

GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the License agreement for Colosseum.

Intended use

Colosseum-355B is an instruct model intended for multi-lingual use cases.

It is designed for highly regulated industries and general purpose use cases to enable developers and enterprises to build and customize their own LLMs and LLM applications.

The instruct model can be further customized using the NeMo Framework, which offers a suite of customization tools including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA, and more), and Model Alignment (SFT, SteerLM, RLHF, and more) using NeMo-Aligner. For examples and detailed instructions, refer to the documentation.

Model Architecture:

Architecture Type: Transformer Decoder (auto-regressive language model)

Network Architecture: Colosseum-355B

Input:

Input Type(s): Text

Input Format(s): String

Input Parameter(s): 1D

Output:

Output Type(s): Text

Output Format: String

Output Parameter(s): 1D

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Hopper

Supported Operating System(s):

  • Linux
  • Windows

Model Version(s):

Colosseum-355B v1.0

Inference:

Engine: [TensorRT_LLM, Triton, vLLM]

BF16 Inference:
8x H200 (1x H200 node)
16x H100 (2x H100 nodes)
16x A100 80GB (2x A100 80GB nodes)

FP8 Inference:
8x H200 (1x H200 node)
8x H100 (1x H100 node)
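The hardware configurations above are consistent with simple weight-memory arithmetic. A rough sketch, counting model weights only and ignoring KV cache, activations, and runtime overhead (80 GB per H100/A100 and 141 GB per H200 are assumptions about the GPU variants in the table):

```python
PARAMS = 355e9  # 355B parameters

def weight_memory_gb(bytes_per_param):
    """Approximate memory needed for the model weights alone, in GB."""
    return PARAMS * bytes_per_param / 1e9

bf16_gb = weight_memory_gb(2)  # BF16: 2 bytes per parameter -> ~710 GB
fp8_gb = weight_memory_gb(1)   # FP8:  1 byte per parameter  -> ~355 GB

# ~710 GB of BF16 weights exceeds one 8x H100 node (640 GB) but fits two;
# ~355 GB of FP8 weights fits a single 8x H100 node.
print(f"BF16 weights: ~{bf16_gb:.0f} GB, FP8 weights: ~{fp8_gb:.0f} GB")
```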

Prompt Format

Single Turn

<extra_id_0>System

<extra_id_1>User
{prompt}
<extra_id_1>Assistant

Multi-Turn or Few-shot

<extra_id_0>System

<extra_id_1>User
{prompt 1}
<extra_id_1>Assistant
{response 1}
<extra_id_1>User
{prompt 2}
<extra_id_1>Assistant
{response 2}
...
<extra_id_1>User
{prompt N}
<extra_id_1>Assistant
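The template above can be rendered programmatically. A sketch that follows the tags exactly as shown (the line after the System tag carries the optional system text, left empty here as in the examples):

```python
def render_prompt(messages, system=""):
    """Render chat messages into the Colosseum-355B prompt template.

    `messages` is a list of {"role": "user"|"assistant", "content": str},
    alternating and ending with a user message; generation continues after
    the trailing <extra_id_1>Assistant tag.
    """
    parts = [f"<extra_id_0>System\n{system}\n"]
    for m in messages:
        tag = "User" if m["role"] == "user" else "Assistant"
        parts.append(f"<extra_id_1>{tag}\n{m['content']}\n")
    parts.append("<extra_id_1>Assistant\n")
    return "".join(parts)

print(render_prompt([{"role": "user", "content": "What is CPT?"}]))
```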

Evaluation Results

MT-Bench (GPT-4-Turbo)
Evaluated using MT-Bench with GPT-4-0125-Preview as the judge, as described in Appendix H of the HelpSteer2 dataset paper.

8.11

IFEval

Evaluated using the Instruction Following Eval (IFEval) introduced in Instruction-Following Evaluation for Large Language Models.

Prompt-Strict Acc: 78.34
Instruction-Strict Acc: 80.02

MMLU

Evaluated using the Multi-task Language Understanding benchmarks as introduced in Measuring Massive Multitask Language Understanding.

5-shot: 82.04

MMLU (CoT)

84.5

ARC-C
The AI2’s Reasoning Challenge (ARC-C) dataset is a multiple-choice question-answering dataset, containing questions from science exams from grade 3 to grade 9.

5-shot: 94.81

Usage

Deployment and inference with Colosseum-355B can be done in several ways:

  • Deployment of TensorRT-LLM engines with Triton using the TensorRT-LLM backend (multi-node in BF16 / single-node in FP8)
  • Deployment as a NIM (NVIDIA Inference Microservice)
  • Deployment with PyTriton (multi-node, BF16)

Limitations

The model was trained on data collected from the internet, which may contain language that is biased or inappropriate. As a result, the model might occasionally reflect these biases or generate responses that are inaccurate, omit key information, or include irrelevant or redundant text. There is also the possibility that it could produce content that is socially unacceptable or undesirable, even if the prompt does not include any offensive material.

Ethical Considerations:

We believe that developing trustworthy AI is a shared responsibility and have established policies and practices to support the development of a wide array of AI applications. When using this model in accordance with our terms of service, developers are encouraged to work with their internal teams to ensure the model meets the requirements of their specific industry and use case, and to address any potential misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.

Body Params

model
string
Defaults to igenius/colosseum_355b_instruct_16k
messages
array of objects
required

A list of messages comprising the conversation so far. The roles of the messages must alternate between user and assistant. The last input message should have the role user. A message with the system role is optional, and must be the very first message if present; context is also optional, but must come before a user question.

temperature
number
0 to 1
Defaults to 0.2

The sampling temperature to use for text generation. The higher the temperature value is, the less deterministic the output text will be. It is not recommended to modify both temperature and top_p in the same call.

top_p
number
≤ 1
Defaults to 0.7

The top-p sampling mass used for text generation. The top-p value determines the probability mass that is sampled at sampling time. For example, if top_p = 0.2, only the most likely tokens (summing to 0.2 cumulative probability) will be sampled. It is not recommended to modify both temperature and top_p in the same call.
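The cumulative-mass behavior described above can be illustrated with a small sketch of nucleus (top-p) selection. This is purely for intuition; actual sampling happens server-side:

```python
def nucleus(probs, top_p):
    """Return the indices kept under top-p sampling: the smallest set of
    highest-probability tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return kept
```

With probabilities [0.5, 0.3, 0.15, 0.05], top_p = 0.2 keeps only the most likely token, while top_p = 0.7 keeps the top two.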

seed
integer
0 to 18446744073709552000

If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result.

frequency_penalty
number
-2 to 2
Defaults to 0

Indicates how much to penalize new tokens based on their existing frequency in the text so far, decreasing model likelihood to repeat the same line verbatim.

presence_penalty
number
-2 to 2
Defaults to 0

Positive values penalize new tokens based on whether they appear in the text so far, increasing model likelihood to talk about new topics.
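The two penalties differ in how they scale. A common formulation for OpenAI-compatible APIs subtracts frequency_penalty once per prior occurrence of a token, and presence_penalty once if the token has appeared at all. An illustrative sketch, not the server's actual implementation:

```python
from collections import Counter

def penalized_logits(logits, generated_tokens,
                     frequency_penalty=0.0, presence_penalty=0.0):
    """Apply frequency/presence penalties to a {token: logit} mapping."""
    counts = Counter(generated_tokens)
    return {
        tok: logit
        - frequency_penalty * counts[tok]       # grows with repetition count
        - presence_penalty * (counts[tok] > 0)  # flat penalty for any appearance
        for tok, logit in logits.items()
    }
```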

max_tokens
integer
1 to 4096
Defaults to 1024

The maximum number of tokens to generate in any given call. Note that the model is not aware of this value, and generation will simply stop at the number of tokens specified.

stream
boolean
Defaults to false

If set, partial message deltas will be sent. Tokens will be sent as data-only server-sent events (SSE) as they become available (JSON responses are prefixed by data: ), with the stream terminated by a data: [DONE] message.
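The data-only SSE framing described above can be consumed with a few lines of parsing. A sketch that assumes each chunk follows the OpenAI streaming schema (choices[0].delta.content):

```python
import json

def parse_sse_chunks(lines):
    """Extract content deltas from data-only SSE lines, stopping at [DONE]."""
    deltas = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments, blank keep-alive lines, etc.
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta is not None:
            deltas.append(delta)
    return deltas
```

In practice the lines would come from iterating over the streaming HTTP response body rather than a list.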

stop
string or array of strings

A string or a list of strings where the API will stop generating further tokens. The returned text will not contain the stop sequence.
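The trimming behavior described above is applied server-side, but a client-side illustration makes it concrete: cut the text at the earliest occurrence of any stop sequence, excluding the sequence itself.

```python
def truncate_at_stop(text, stop):
    """Cut `text` at the earliest occurrence of any stop sequence.

    `stop` may be a single string or a list of strings, mirroring the
    parameter's accepted types; the stop sequence is not included.
    """
    if isinstance(stop, str):
        stop = [stop]
    cut = len(text)
    for s in stop:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]
```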

Headers

Accept
string
enum
Defaults to application/json
Allowed: application/json, text/event-stream

Credentials: Bearer token

Responses

Response content types: application/json, text/event-stream