dbrx-instruct
Model Information
Description:
DBRX is a transformer-based, decoder-only large language model (LLM) trained using next-token prediction. It uses a fine-grained mixture-of-experts (MoE) architecture with 132B total parameters, of which 36B are active on any given input.
Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts. DBRX has 16 experts and chooses 4, while Mixtral-8x7B and Grok-1 have 8 experts and choose 2. This provides 65x more possible combinations of experts, and we found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA). It uses a converted version of the GPT-4 tokenizer as defined in the tiktoken repository. We made these choices based on exhaustive evaluation and scaling experiments.
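The 65x figure is straightforward combinatorics over the expert counts stated above. A minimal sketch of the arithmetic (values taken from this description; shown for illustration only):

```python
# Sketch of the "65x more expert combinations" arithmetic from the description above.
from math import comb

dbrx = comb(16, 4)   # 16 experts, 4 active per token -> 1820 possible expert subsets
coarse = comb(8, 2)  # 8 experts, 2 active (Mixtral-8x7B, Grok-1) -> 28 possible subsets

print(dbrx, coarse, dbrx // coarse)  # 1820 28 65
```

Using a larger number of smaller experts gives the router many more distinct parameter subsets to choose from per token, which is the effect the description above credits for the quality improvement.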
Third-Party Community Consideration
This model is not owned or developed by NVIDIA. This model has been developed and built to a third party's requirements for this application and use case; see the Non-NVIDIA DBRX Model Card.
License and Terms of Use
GOVERNING TERMS: Your use of this API is governed by the NVIDIA API Trial Service Terms of Use; and the use of this model is governed by the NVIDIA AI Foundation Models Community License and Databricks Open Model License.
Reference(s):
Model Architecture:
Architecture Type: Transformer
Network Architecture: Fine-grained Mixture of Experts (MoE)
Input:
Input Format: Text
Input Parameters: Temperature, Top P, Max Output Tokens
Output:
Output Format: Text
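The parameters listed under Input map directly onto an OpenAI-compatible chat completion request. The sketch below is illustrative only: the endpoint URL, model identifier, and environment variable name are assumptions, so check the API catalog page for the exact values.

```python
# Hedged sketch: calling the hosted model through an OpenAI-compatible client.
# base_url, model name, and NVIDIA_API_KEY are assumptions, not confirmed values.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",  # assumed API endpoint
    api_key=os.environ["NVIDIA_API_KEY"],            # assumed env var holding your key
)

completion = client.chat.completions.create(
    model="databricks/dbrx-instruct",                # assumed model identifier
    messages=[{"role": "user", "content": "Explain mixture-of-experts in two sentences."}],
    temperature=0.5,   # Temperature
    top_p=0.9,         # Top P
    max_tokens=256,    # Max Output Tokens
)
print(completion.choices[0].message.content)
```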
Software Integration:
- Supported Hardware Platform(s): Hopper
Supported Operating System(s):
- Linux
Training, Testing, and Evaluation Datasets:
Training Dataset:
Properties (Quantity, Dataset Descriptions, Sensor(s)): Pre-trained on 12T tokens of text and code data.
Inference:
Engine: Triton, TRT-LLM
Test Hardware: H100
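For a locally deployed TRT-LLM engine served by Triton Inference Server, requests can be issued with the standard tritonclient package. The sketch below assumes a typical TRT-LLM ensemble model repository; the model name ("ensemble") and the tensor names are deployment-specific assumptions, not confirmed values.

```python
# Hedged sketch: querying a Triton server that hosts a TRT-LLM engine.
# The model name ("ensemble") and the tensor names are deployment-specific
# assumptions; adjust them to match your model repository.
import numpy as np
import tritonclient.http as httpclient
from tritonclient.utils import np_to_triton_dtype

client = httpclient.InferenceServerClient(url="localhost:8000")

def tensor(name, value, dtype):
    # Wrap a scalar value as a [1, 1] tensor and attach it as a Triton input.
    arr = np.array([[value]], dtype=dtype)
    t = httpclient.InferInput(name, list(arr.shape), np_to_triton_dtype(arr.dtype))
    t.set_data_from_numpy(arr)
    return t

inputs = [
    tensor("text_input", "What is a fine-grained MoE?", object),  # serialized as BYTES
    tensor("max_tokens", 256, np.int32),
    tensor("temperature", 0.5, np.float32),
    tensor("top_p", 0.9, np.float32),
]

result = client.infer(model_name="ensemble", inputs=inputs)
print(result.as_numpy("text_output"))
```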