Given a list of messages comprising a conversation, the model returns a response. The endpoint is compatible with the OpenAI Chat Completions API; see https://platform.openai.com/docs/api-reference/chat/create
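As a minimal sketch of calling this OpenAI-compatible endpoint with the openai Python client: the base URL, model identifier, and API-key environment variable below are illustrative assumptions, not confirmed values, so substitute whatever your trial account or deployment uses.

```python
# Minimal sketch: querying the OpenAI-compatible chat endpoint.
# base_url, the model name, and the API-key variable are assumptions
# for illustration; replace them with the values for your account.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed trial endpoint
    api_key=os.environ["NVIDIA_API_KEY"],            # assumed env variable
)

completion = client.chat.completions.create(
    model="colosseum-355b-instruct",                 # assumed model identifier
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the key ideas of continual pretraining."},
    ],
    temperature=0.2,
    max_tokens=512,
)
print(completion.choices[0].message.content)
```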
Model Overview
Description:
Colosseum-355B is a large language model (LLM) designed for use cases in regulated industries such as financial services, government, and heavy industry. It supports multilingual single-turn and multi-turn chat formats, with a context length of up to 16,384 tokens.
The baseline model was pre-trained on a corpus of 9 trillion tokens drawn from a diverse assortment of English texts, more than 50 natural languages, and a wide range of coding languages.
During Continual Pretraining (CPT), the model was trained on an additional 2 trillion tokens, bringing the total to 11 trillion. This phase expanded the model's parameter count and context length, was carried out in FP8 precision, and aligned the model's capabilities with specific domain expertise.
Finally, the Colosseum-355B model went through alignment steps, including:
- Supervised Fine-tuning (SFT)
- Direct Preference Optimization (DPO)
Because the training data consists primarily of freely available web documents and open-source repositories, the CPT dataset was built to preserve the distribution and composition of the original pre-training corpus.
The model supports over 50 languages, with a strong focus on European languages such as German, French, Italian, Spanish, Portuguese, Russian, Romanian, and Polish. Its training dataset also includes substantial representations of non-European languages, including Japanese, Chinese, Arabic, Vietnamese, and Korean.
Additionally, the model integrates specialized sources from domains such as finance and reasoning, drawing from high-quality datasets to enhance performance in these areas.
This model is for research and development only. For commercial use, please follow the Terms of Use.
Third-Party Community Consideration
This model is not owned or developed by NVIDIA. It has been developed and built to a third party's requirements for this application and use case.
License/Terms of Use:
GOVERNING TERMS: This trial service is governed by the NVIDIA API Trial Terms of Service. Use of this model is governed by the License agreement for Colosseum.
Intended use
Colosseum-355B is an instruct model intended for multilingual use cases.
It is designed for highly regulated industries and general purpose use cases to enable developers and enterprises to build and customize their own LLMs and LLM applications.
The instruct model can be further customized using the NeMo Framework, which offers a suite of customization tools including Parameter-Efficient Fine-Tuning (P-tuning, Adapters, LoRA, and more), and Model Alignment (SFT, SteerLM, RLHF, and more) using NeMo-Aligner. For examples and detailed instructions, refer to the documentation.
Model Architecture:
Architecture Type: Transformer Decoder (auto-regressive language model)
Network Architecture: Colosseum-355B
Input:
Input Type(s): Text
Input Format(s): String
Input Parameter(s): 1D
Output:
Output Type(s): Text
Output Format(s): String
Output Parameter(s): 1D
Supported Hardware Microarchitecture Compatibility:
- NVIDIA Ampere
- NVIDIA Hopper
Supported Operating System(s):
- Linux
- Windows
Model Version(s):
Colosseum-355B v1.0
Inference:
Engine: TensorRT-LLM, Triton, vLLM
BF16 Inference:
- 8x H200 (1x H200 node)
- 16x H100 (2x H100 nodes)
- 16x A100 80GB (2x A100 80GB nodes)
FP8 Inference:
- 8x H200 (1x H200 node)
- 8x H100 (1x H100 node)
Prompt Format
Single Turn
<extra_id_0>System
<extra_id_1>User
{prompt}
<extra_id_1>Assistant
Multi-Turn or Few-shot
<extra_id_0>System
<extra_id_1>User
{prompt 1}
<extra_id_1>Assistant
{response 1}
<extra_id_1>User
{prompt 2}
<extra_id_1>Assistant
{response 2}
...
<extra_id_1>User
{prompt N}
<extra_id_1>Assistant
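As a small helper for assembling prompts in this format, the sketch below joins the tags shown above with newlines; the function name, the handling of an empty system message, and the exact newline placement are illustrative assumptions rather than a reference implementation.

```python
# Illustrative sketch of assembling the documented prompt format.
# Tag strings are taken verbatim from the template above; the exact
# newline handling is an assumption, not a confirmed specification.
def build_prompt(system: str, turns: list[dict], new_user_prompt: str) -> str:
    """turns: prior turns as [{"user": ..., "assistant": ...}, ...]."""
    parts = ["<extra_id_0>System", system]
    for turn in turns:
        parts += ["<extra_id_1>User", turn["user"],
                  "<extra_id_1>Assistant", turn["assistant"]]
    parts += ["<extra_id_1>User", new_user_prompt, "<extra_id_1>Assistant"]
    return "\n".join(parts) + "\n"

# Single-turn example (empty system message)
print(build_prompt("", [], "What is continual pretraining?"))
```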
Evaluation Results
MT-Bench (GPT-4-Turbo)
Evaluated using MT-Bench with GPT-4-0125-Preview as the judge, as described in Appendix H of the HelpSteer2 dataset paper.
8.11
IFEval
Evaluated using the Instruction Following Eval (IFEval) introduced in Instruction-Following Evaluation for Large Language Models.
Prompt-Strict Acc: 78.34
Instruction-Strict Acc: 80.02
MMLU
Evaluated using the Multi-task Language Understanding benchmarks as introduced in Measuring Massive Multitask Language Understanding.
5-shot: 82.04
MMLU (CoT)
84.5
ARC-C
The AI2 Reasoning Challenge (ARC-C) dataset is a multiple-choice question-answering dataset containing questions from grade 3 to grade 9 science exams.
5-shot: 94.81
Usage
Deployment and inference with Colosseum-355B can be done in several ways:
- Deployment of TensorRT-LLM engines with Triton using the TensorRT-LLM backend (multi-node in BF16 / single-node in FP8)
- Deployment as a NIM (NVIDIA Inference Microservice); see the request sketch after this list
- Deployment with PyTriton in multi-node BF16
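For the self-hosted NIM path, the sketch below sends a request to a locally deployed OpenAI-compatible endpoint; the host, port, route, and served model name are assumptions that depend on how the microservice is launched.

```python
# Hedged sketch: querying a self-hosted, OpenAI-compatible deployment
# (e.g. a NIM container). The URL, port, and model name are assumptions;
# adjust them to match your deployment.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",   # assumed local endpoint
    json={
        "model": "colosseum-355b-instruct",        # assumed served model name
        "messages": [{"role": "user", "content": "List three uses of FP8 inference."}],
        "max_tokens": 256,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```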
Limitations
The model was trained on data collected from the internet, which may contain language that is biased or inappropriate. As a result, the model might occasionally reflect these biases or generate responses that are inaccurate, omit key information, or include irrelevant or redundant text. There is also the possibility that it could produce content that is socially unacceptable or undesirable, even if the prompt does not include any offensive material.
Ethical Considerations:
We believe that developing trustworthy AI is a shared responsibility and have established policies and practices to support the development of a wide array of AI applications. When using this model in accordance with our terms of service, developers are encouraged to work with their internal teams to ensure the model meets the requirements of their specific industry and use case, and to address any potential misuse. If you have any concerns or wish to report security vulnerabilities, please contact us here.
Please report security vulnerabilities or NVIDIA AI Concerns here.