license: other
license_name: databricks-open-model-license
library_name: gguf
license_link: https://www.databricks.com/legal/open-model-license
pipeline_tag: text-generation
base_model: databricks/dbrx-instruct

Quants from @phymbert (author of the llama.cpp support for this model) are posted here.
The quants in this repo are meant for testing imatrix-quantized weights.
If you run on Metal, you may need this PR.

Added ggml-dbrx-instruct-16x12b-f16_imatrix-wiki.dat, an imatrix computed over 2K batches (1M tokens) of wiki.train using the FP16 weights.

| Quant | IMatrix Quant/Dataset/Chunks | Size (GiB) | PPL (wiki.test) |
| --- | --- | --- | --- |
| IQ4_XS | Q8_0/wiki.train/200 | 65.29 | 5.2260 +/- 0.03558 |
| IQ4_XS | FP16/wiki.train/2000 | 65.29 | 5.2241 +/- 0.03559 |
| IQ4_XS | - | 66.05 | 5.2546 +/- 0.03570 |
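
To put the perplexity numbers in the table side by side, here is a minimal Python sketch that compares each imatrix quant against the IQ4_XS quant without an imatrix. The values are copied from the table; the dictionary keys and the `compare` helper are hypothetical names used only for illustration, and treating the `+/-` value as a rough uncertainty band is an assumption, not a formal significance test.

```python
# (PPL, reported +/- value) copied from the table above.
results = {
    "Q8_0/wiki.train/200": (5.2260, 0.03558),
    "FP16/wiki.train/2000": (5.2241, 0.03559),
    "no imatrix": (5.2546, 0.03570),
}

baseline_ppl, baseline_err = results["no imatrix"]


def compare(ppl: float) -> tuple[float, float, bool]:
    """Return (absolute PPL gain, relative gain in %, within baseline +/- band)."""
    delta = baseline_ppl - ppl
    relative = 100.0 * delta / baseline_ppl
    # Assumption: the reported +/- is used as a rough uncertainty band.
    within_band = abs(delta) < baseline_err
    return delta, relative, within_band


for name, (ppl, _err) in results.items():
    if name == "no imatrix":
        continue
    delta, relative, within_band = compare(ppl)
    print(f"{name}: delta={delta:.4f} ({relative:.2f}%), within +/- band: {within_band}")
```

Under that reading, both imatrix quants improve on the baseline by less than the reported `+/-` value, so the table is best taken as a small, tentative gain.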

2024-04-13: Support for this model has just been merged - PR #6515.
You will need llama.cpp commit 4bd0f93e or later to run this model.

Quants in this repo are tested by running the following command (quants below IQ3 are very sensitive and unreliable so far; the imatrix may need to be trained on FP16 weights rather than Q8_0, and for more than 200 chunks):

./build/bin/main -ngl 41 -c 4096 -s 0 -e -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWrite an essay about AI.<|im_end|>\n<|im_start|>assistant\n" -m ggml-dbrx-instruct-16x12b-<<quant-to-test>>.gguf

DBRX is a transformer-based decoder-only large language model (LLM) that was trained using next-token prediction. It uses a fine-grained mixture-of-experts (MoE) architecture with 132B total parameters of which 36B parameters are active on any input. It was pre-trained on 12T tokens of text and code data. Compared to other open MoE models like Mixtral-8x7B and Grok-1, DBRX is fine-grained, meaning it uses a larger number of smaller experts. DBRX has 16 experts and chooses 4, while Mixtral-8x7B and Grok-1 have 8 experts and choose 2. This provides 65x more possible combinations of experts and we found that this improves model quality. DBRX uses rotary position encodings (RoPE), gated linear units (GLU), and grouped query attention (GQA). It uses the GPT-4 tokenizer as provided in the tiktoken repository. We made these choices based on exhaustive evaluation and scaling experiments.
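
The "65x more possible combinations" figure follows from counting how many distinct expert subsets the router can select per token. A quick Python check, assuming the order of the selected experts does not matter (the variable names are illustrative only):

```python
from math import comb

# Number of distinct expert subsets the router can pick for a token.
dbrx_choices = comb(16, 4)     # 16 experts, choose 4
mixtral_choices = comb(8, 2)   # 8 experts, choose 2 (Mixtral-8x7B, Grok-1)

ratio = dbrx_choices // mixtral_choices
print(dbrx_choices, mixtral_choices, ratio)  # 1820 28 65
```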

| Layers | Context | Template |
| --- | --- | --- |
| 40 | 32768 | <\|im_start\|>system<br>{system}<\|im_end\|><br><\|im_start\|>user<br>{prompt}<\|im_end\|><br><\|im_start\|>assistant |

  • 16x12B MoE
  • 16 experts (12B params per single expert; top_k=4 routing)
  • 36B active params (132B total params)
  • Trained on 12T tokens
  • 32k sequence length training
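
The prompt template listed above can be filled in programmatically. The following is a minimal Python sketch; `build_prompt` is a hypothetical helper, not part of llama.cpp, and the trailing newline after `assistant` mirrors the test command in this README rather than anything the template itself specifies.

```python
def build_prompt(system: str, prompt: str) -> str:
    """Fill the system/user/assistant template with real newlines."""
    return (
        f"<|im_start|>system\n{system}<|im_end|>\n"
        f"<|im_start|>user\n{prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


text = build_prompt("You are a helpful assistant.", "Write an essay about AI.")
print(text)
```

The result contains real newline characters. The test command instead passes literal `\n` sequences and relies on the `-e` flag to process them, so a string built this way would be passed without escape processing.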