Creates a model response for the given chat conversation.

Given a list of messages comprising a conversation, the model will return a response. Compatible with OpenAI. See https://platform.openai.com/docs/api-reference/chat/create
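Because the endpoint is OpenAI-compatible, a request body is a plain JSON object. A minimal sketch in Python that assembles one (the defaults mirror the Body Params section below; the endpoint URL and authentication depend on your deployment and are not shown):

```python
import json

def build_chat_request(messages, model="igenius/colosseum_355b_instruct_16k",
                       temperature=0.2, top_p=0.7, max_tokens=1024, stream=False):
    """Assemble an OpenAI-compatible chat completions request body.

    `messages` is a list of {"role": ..., "content": ...} dicts; roles must
    alternate between "user" and "assistant", ending with a "user" message.
    """
    return {
        "model": model,
        "messages": messages,
        "temperature": temperature,
        "top_p": top_p,
        "max_tokens": max_tokens,
        "stream": stream,
    }

body = build_chat_request([{"role": "user", "content": "Summarize Basel III in one sentence."}])
print(json.dumps(body, indent=2))
```

The resulting JSON is POSTed to the deployment's chat completions route with a Bearer token in the Authorization header.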

Model Overview

Description:

Colosseum-355B is a large language model (LLM) that is designed for use cases in regulated industries—such as financial services, government, and heavy industry. It supports multi-lingual single-turn and multi-turn chat formats, with a context length of up to 16,384 tokens.

The baseline model was pre-trained on a corpus of 9 trillion tokens drawn from a diverse assortment of English texts, more than 50 natural languages, and a wide range of coding languages.

During Continual Pretraining (CPT), the model was trained on an additional 2 trillion tokens, bringing the total to 11 trillion. This phase expanded the model’s parameters and context length, implemented continual pretraining in FP8, and aligned the model's capabilities with specific domain expertise.

Finally, the Colosseum-355B model went through alignment steps, including:

  • Supervised Fine-tuning (SFT)
  • Direct Preference Optimization (DPO)

Because the pre-training data primarily consists of freely available web documents and open-source repositories, the CPT dataset was built to preserve the original data distribution and remain consistent with the composition of the original corpus.

The model supports over 50 languages, with a strong focus on European languages such as German, French, Italian, Spanish, Portuguese, Russian, Romanian, and Polish. Its training dataset also includes substantial representations of non-European languages, including Japanese, Chinese, Arabic, Vietnamese, and Korean.

Additionally, the model integrates specialized sources from domains such as finance and reasoning, drawing from high-quality datasets to enhance performance in these areas.

This model is for research and development only. For commercial use, please follow the Terms of Use.

Third-Party Community Consideration

This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case.

License/Terms of Use:

GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the License agreement for Colosseum.

Intended use

Colosseum-355B is an instruct model intended for multi-lingual use cases.

It is designed for highly regulated industries and general purpose use cases to enable developers and enterprises to build and customize their own LLMs and LLM applications.

The instruct model can be further customized using the NeMo Framework, which offers a suite of customization tools including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA, and more), and Model Alignment (SFT, SteerLM, RLHF, and more) using NeMo-Aligner. For examples and detailed instructions, refer to the documentation.

Model Architecture:

Architecture Type: Transformer Decoder (auto-regressive language model)

Network Architecture: Colosseum-355B

Input:

Input Type(s): Text

Input Format(s): String

Input Parameter(s): 1D

Output:

Output Type(s): Text

Output Format: String

Output Parameter(s): 1D

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Hopper

Supported Operating System(s):

  • Linux
  • Windows

Model Version(s):

Colosseum-355B v1.0

Inference:

Engine: [TensorRT_LLM, Triton, vLLM]

BF16 Inference:
8x H200 (1x H200 node)
16x H100 (2x H100 nodes)
16x A100 80GB (2x A100 80GB nodes)

FP8 Inference:
8x H200 (1x H200 node)
8x H100 (1x H100 node)
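The hardware configurations above are consistent with simple weight-memory arithmetic. A rough sketch, counting model weights only and ignoring KV cache, activations, and runtime overhead (80 GB per H100/A100 and 141 GB per H200 are assumptions about the GPU variants in the table):

```python
PARAMS = 355e9  # 355B parameters

def weight_memory_gb(bytes_per_param):
    """Approximate memory needed for the model weights alone, in GB."""
    return PARAMS * bytes_per_param / 1e9

bf16_gb = weight_memory_gb(2)  # BF16: 2 bytes per parameter -> ~710 GB
fp8_gb = weight_memory_gb(1)   # FP8:  1 byte per parameter  -> ~355 GB

# ~710 GB of BF16 weights exceeds one 8x H100 node (640 GB) but fits two;
# ~355 GB of FP8 weights fits a single 8x H100 node.
print(f"BF16 weights: ~{bf16_gb:.0f} GB, FP8 weights: ~{fp8_gb:.0f} GB")
```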

Prompt Format

Single Turn

<extra_id_0>System

<extra_id_1>User
{prompt}
<extra_id_1>Assistant

Multi-Turn or Few-shot

<extra_id_0>System

<extra_id_1>User
{prompt 1}
<extra_id_1>Assistant
{response 1}
<extra_id_1>User
{prompt 2}
<extra_id_1>Assistant
{response 2}
...
<extra_id_1>User
{prompt N}
<extra_id_1>Assistant
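The template above can be rendered programmatically. A sketch that follows the tags exactly as shown (the line after the System tag carries the optional system text, left empty here as in the examples):

```python
def render_prompt(messages, system=""):
    """Render chat messages into the Colosseum-355B prompt template.

    `messages` is a list of {"role": "user"|"assistant", "content": str},
    alternating and ending with a user message; generation continues after
    the trailing <extra_id_1>Assistant tag.
    """
    parts = [f"<extra_id_0>System\n{system}\n"]
    for m in messages:
        tag = "User" if m["role"] == "user" else "Assistant"
        parts.append(f"<extra_id_1>{tag}\n{m['content']}\n")
    parts.append("<extra_id_1>Assistant\n")
    return "".join(parts)

print(render_prompt([{"role": "user", "content": "What is CPT?"}]))
```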

Evaluation Results

MT-Bench (GPT-4-Turbo)
Evaluated using MT-Bench with GPT-4-0125-Preview as the judge, as described in Appendix H of the HelpSteer2 dataset paper.

8.11

IFEval

Evaluated using the Instruction Following Eval (IFEval) introduced in Instruction-Following Evaluation for Large Language Models.

Prompt-Strict Acc: 78.34
Instruction-Strict Acc: 80.02

MMLU

Evaluated using the Multi-task Language Understanding benchmarks as introduced in Measuring Massive Multitask Language Understanding.

5-shot: 82.04

MMLU (CoT)

84.5

ARC-C
The AI2’s Reasoning Challenge (ARC-C) dataset is a multiple-choice question-answering dataset, containing questions from science exams from grade 3 to grade 9.

5-shot: 94.81

Usage

Deployment and inference with Colosseum-355B can be done in several ways:

  • Deployment of TensorRT-LLM engines with Triton using the TensorRT-LLM backend (multi-node in BF16 / single-node in FP8)
  • Deployment as a NIM (NVIDIA Inference Microservice)
  • Deployment with PyTriton (multi-node, BF16)

Limitations

The model was trained on data collected from the internet, which may contain language that is biased or inappropriate. As a result, the model might occasionally reflect these biases or generate responses that are inaccurate, omit key information, or include irrelevant or redundant text. There is also the possibility that it could produce content that is socially unacceptable or undesirable, even if the prompt does not include any offensive material.

Ethical Considerations:

We believe that developing trustworthy AI is a shared responsibility and have established policies and practices to support the development of a wide array of AI applications. When using this model in accordance with our terms of service, developers are encouraged to work with their internal teams to ensure the model meets the requirements of their specific industry and use case, and to address any potential misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.

Body Params

model
string
Defaults to igenius/colosseum_355b_instruct_16k
messages
array of objects
required

A list of messages comprising the conversation so far. The roles of the messages must alternate between user and assistant. The last input message should have the role user. A message with the system role is optional, and must be the very first message if present; context is also optional, but must come before a user question.

temperature
number
0 to 1
Defaults to 0.2

The sampling temperature to use for text generation. The higher the temperature value is, the less deterministic the output text will be. It is not recommended to modify both temperature and top_p in the same call.

top_p
number
≤ 1
Defaults to 0.7

The top-p sampling mass used for text generation. The top-p value determines the probability mass that is sampled at sampling time. For example, if top_p = 0.2, only the most likely tokens (summing to 0.2 cumulative probability) will be sampled. It is not recommended to modify both temperature and top_p in the same call.
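The cumulative-mass behavior described above can be illustrated with a small sketch of nucleus (top-p) selection. This is purely for intuition; actual sampling happens server-side:

```python
def nucleus(probs, top_p):
    """Return the indices kept under top-p sampling: the smallest set of
    highest-probability tokens whose cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return kept
```

With probabilities [0.5, 0.3, 0.15, 0.05], top_p = 0.2 keeps only the most likely token, while top_p = 0.7 keeps the top two.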

seed
integer
0 to 18446744073709552000

If specified, our system will make a best effort to sample deterministically, such that repeated requests with the same seed and parameters should return the same result.

frequency_penalty
number
-2 to 2
Defaults to 0

Indicates how much to penalize new tokens based on their existing frequency in the text so far, decreasing model likelihood to repeat the same line verbatim.

presence_penalty
number
-2 to 2
Defaults to 0

Positive values penalize new tokens based on whether they appear in the text so far, increasing model likelihood to talk about new topics.
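The two penalties differ in how they scale. A common formulation for OpenAI-compatible APIs subtracts frequency_penalty once per prior occurrence of a token, and presence_penalty once if the token has appeared at all. An illustrative sketch, not the server's actual implementation:

```python
from collections import Counter

def penalized_logits(logits, generated_tokens,
                     frequency_penalty=0.0, presence_penalty=0.0):
    """Apply frequency/presence penalties to a {token: logit} mapping."""
    counts = Counter(generated_tokens)
    return {
        tok: logit
        - frequency_penalty * counts[tok]       # grows with repetition count
        - presence_penalty * (counts[tok] > 0)  # flat penalty for any appearance
        for tok, logit in logits.items()
    }
```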

max_tokens
integer
1 to 4096
Defaults to 1024

The maximum number of tokens to generate in any given call. Note that the model is not aware of this value, and generation will simply stop at the number of tokens specified.

stream
boolean
Defaults to false

If set, partial message deltas will be sent. Tokens will be sent as data-only server-sent events (SSE) as they become available (JSON responses are prefixed by data: ), with the stream terminated by a data: [DONE] message.
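The data-only SSE framing described above can be consumed with a few lines of parsing. A sketch that assumes each chunk follows the OpenAI streaming schema (choices[0].delta.content):

```python
import json

def parse_sse_chunks(lines):
    """Extract content deltas from data-only SSE lines, stopping at [DONE]."""
    deltas = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments, blank keep-alive lines, etc.
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta is not None:
            deltas.append(delta)
    return deltas
```

In practice the lines would come from iterating over the streaming HTTP response body rather than a list.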

stop
string or array of strings

A string or a list of strings where the API will stop generating further tokens. The returned text will not contain the stop sequence.
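The trimming behavior described above is applied server-side, but a client-side illustration makes it concrete: cut the text at the earliest occurrence of any stop sequence, excluding the sequence itself.

```python
def truncate_at_stop(text, stop):
    """Cut `text` at the earliest occurrence of any stop sequence.

    `stop` may be a single string or a list of strings, mirroring the
    parameter's accepted types; the stop sequence is not included.
    """
    if isinstance(stop, str):
        stop = [stop]
    cut = len(text)
    for s in stop:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]
```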

Headers

Accept
string
enum
Defaults to application/json
Allowed: application/json, text/event-stream

Credentials: Bearer token

Responses

Response content types: application/json, text/event-stream