---
license: apache-2.0
language:
- en
- es
- fr
- de
- it
- pt
- ru
- ar
- hi
- ko
- zh
library_name: transformers
base_model:
- arcee-ai/Trinity-Mini
base_model_relation: quantized
---
<div align="center">
<picture>
<img
src="https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/i-v1KyAMOW_mgVGeic9WJ.png"
alt="Arcee Trinity Mini"
style="max-width: 100%; height: auto;"
>
</picture>
</div>

# Trinity Mini FP8-Block

**This repository contains the FP8 block-quantized weights of Trinity-Mini (FP8 weights and activations with per-block scaling).**

Trinity Mini is an Arcee AI 26B-parameter MoE model with 3B active parameters. It is the medium-sized model in our new Trinity family, a series of open-weight models for enterprises and tinkerers alike.

This model is tuned for reasoning, but in testing it uses a total token count similar to competitive instruction-tuned models.

***

Trinity Mini was trained on 10T tokens gathered and curated through a key partnership with [Datology](https://www.datologyai.com/), building upon the excellent dataset we used for [AFM-4.5B](https://huggingface.co/arcee-ai/AFM-4.5B) with additional math and code.

Training was performed on a cluster of 512 H200 GPUs powered by [Prime Intellect](https://www.primeintellect.ai/) using HSDP parallelism.

More details, including key architecture decisions, can be found on our blog [here](https://www.arcee.ai/blog/the-trinity-manifesto).

Try it out now at [chat.arcee.ai](http://chat.arcee.ai/).

***

## Model Details

* **Model Architecture:** AfmoeForCausalLM
* **Parameters:** 26B total, 3B active
* **Experts:** 128 total, 8 active, 1 shared
* **Context length:** 128k
* **Training tokens:** 10T
* **License:** [Apache 2.0](https://huggingface.co/arcee-ai/Trinity-Mini#license)
* **Recommended settings:**
  * temperature: 0.15
  * top_k: 50
  * top_p: 0.75
  * min_p: 0.06

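As a rough sanity check on the parameter counts above, here is a back-of-envelope sketch. The per-expert split and the assumption that expert FFNs dominate the parameter budget are illustrative guesses, not figures from the model card:

```python
# Back-of-envelope: how the 3B active parameters could break down.
# Assumption (not from the model card): the 26B total is split evenly
# across 129 experts (128 routed + 1 shared) as an upper bound.
TOTAL = 26e9
ACTIVE = 3e9
N_EXPERTS = 128 + 1       # routed + shared
ACTIVE_EXPERTS = 8 + 1    # active per token: 8 routed + 1 shared

per_expert = TOTAL / N_EXPERTS               # ~0.20B per expert (upper bound)
expert_active = ACTIVE_EXPERTS * per_expert  # ~1.81B from active experts
dense_active = ACTIVE - expert_active        # remainder: always-on params

print(f"~{per_expert / 1e9:.2f}B per expert")
print(f"~{expert_active / 1e9:.2f}B active expert params")
print(f"~{dense_active / 1e9:.2f}B implied always-active (attention/embedding) params")
```

Even with every parameter attributed to experts, the 9 active experts account for under 2B of the 3B active budget, which suggests a sizeable always-active dense component.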
64
+ ***
65
+
66
+ ## Quantization Details
67
+
68
+ - **Scheme:** `FP8 Block` (FP8 weights and activations, per-block scaling with E8M0 scale format)
69
+ - **Format:** `compressed-tensors`
70
+ - **Intended use:** High-throughput FP8 deployment of Trinity-Mini with near-lossless quality, optimized for NVIDIA Hopper GPUs
71
+ - **Supported backends:** [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM), vLLM CUTLASS, Triton
72
+
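To illustrate the per-block E8M0 scaling idea, here is a minimal sketch of how a power-of-two block scale can be derived for FP8 E4M3 values. This is not the actual kernel; the E4M3 maximum of 448 is the standard format limit, and the block contents are made up:

```python
# Illustrative sketch: per-block, power-of-two (E8M0-style) scale
# for FP8 E4M3 quantization. E8M0 stores only an 8-bit exponent,
# so the scale must be a power of two.
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value

def e8m0_block_scale(block):
    """Smallest power-of-two scale so block / scale fits in [-448, 448]."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0
    return 2.0 ** math.ceil(math.log2(amax / FP8_E4M3_MAX))

block = [0.5, -3.0, 1200.0, 0.01]      # hypothetical weight block
scale = e8m0_block_scale(block)         # 1200/448 ~ 2.68, rounds up to 4.0
scaled = [x / scale for x in block]
print(scale)                                         # 4.0
print(max(abs(x) for x in scaled) <= FP8_E4M3_MAX)   # True
```

Rounding the scale up to a power of two guarantees the scaled block never overflows the E4M3 range, at the cost of slightly coarser quantization than an arbitrary-precision scale would give.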
73
+ ## Benchmarks
74
+
75
+ ![](https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/UMV0OZh_H1JfvgzBTXh6u.png)
76
+
77
+ <div align="center">
78
+ <picture>
79
+ <img src="https://cdn-uploads.huggingface.co/production/uploads/6435718aaaef013d1aec3b8b/sSVjGNHfrJKmQ6w8I18ek.png" style="background-color:ghostwhite;padding:5px;" width="17%" alt="Powered by Datology">
80
+ </picture>
81
+ </div>
82
+
83
+ ### Running our model
84
+
85
+ - [VLLM](https://huggingface.co/arcee-ai/Trinity-Mini-FP8-Block#vllm)
86
+ - [Transformers](https://huggingface.co/arcee-ai/Trinity-Mini-FP8-Block#transformers)
87
+
88
+ ## VLLM
89
+
90
+ Supported in VLLM release 0.18.0+ with DeepGEMM FP8 MoE acceleration.
91
+
92
+ ```
93
+ # pip
94
+ pip install "vllm>=0.18.0"
95
+ ```
96
+
97
+ Serving the model with DeepGEMM enabled:
98
+
99
+ ```
100
+ VLLM_USE_DEEP_GEMM=1 vllm serve arcee-ai/Trinity-Mini-FP8-Block \
101
+ --trust-remote-code \
102
+ --max-model-len 4096 \
103
+ --enable-auto-tool-choice \
104
+ --reasoning-parser deepseek_r1 \
105
+ --tool-call-parser hermes
106
+ ```
107
+
108
+ Serving without DeepGEMM (falls back to CUTLASS/Triton):
109
+
110
+ ```
111
+ vllm serve arcee-ai/Trinity-Mini-FP8-Block \
112
+ --trust-remote-code \
113
+ --max-model-len 4096 \
114
+ --enable-auto-tool-choice \
115
+ --reasoning-parser deepseek_r1 \
116
+ --tool-call-parser hermes
117
+ ```
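Once a server is up, it can be queried through vLLM's OpenAI-compatible API using the recommended sampling settings from the model card. The sketch below only builds and prints the request payload; the local URL and the commented-out network call assume a server running at the default address, which is not guaranteed here:

```python
# Hypothetical client sketch for the vLLM OpenAI-compatible server
# started above. Assumes the default address http://localhost:8000/v1.
import json
import urllib.request

# Recommended sampling settings from the model card.
payload = {
    "model": "arcee-ai/Trinity-Mini-FP8-Block",
    "messages": [{"role": "user", "content": "Who are you?"}],
    "temperature": 0.15,
    "top_k": 50,
    "top_p": 0.75,
    "min_p": 0.06,
}

req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])

print(json.dumps(payload, indent=2))
```

vLLM accepts `top_k` and `min_p` as extra sampling parameters in the request body alongside the standard OpenAI fields.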
118
+
119
+ ## Transformers
120
+
121
+ Use the `main` transformers branch
122
+
123
+ ```
124
+ git clone https://github.com/huggingface/transformers.git
125
+ cd transformers
126
+
127
+ # pip
128
+ pip install '.[torch]'
129
+
130
+ # uv
131
+ uv pip install '.[torch]'
132
+ ```
133
+
134
+ ```python
135
+ from transformers import AutoTokenizer, AutoModelForCausalLM
136
+ import torch
137
+
138
+ model_id = "arcee-ai/Trinity-Mini-FP8-Block"
139
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
140
+ model = AutoModelForCausalLM.from_pretrained(
141
+ model_id,
142
+ torch_dtype=torch.bfloat16,
143
+ device_map="auto",
144
+ trust_remote_code=True
145
+ )
146
+
147
+ messages = [
148
+ {"role": "user", "content": "Who are you?"},
149
+ ]
150
+
151
+ input_ids = tokenizer.apply_chat_template(
152
+ messages,
153
+ add_generation_prompt=True,
154
+ return_tensors="pt"
155
+ ).to(model.device)
156
+
157
+ outputs = model.generate(
158
+ input_ids,
159
+ max_new_tokens=256,
160
+ do_sample=True,
161
+ temperature=0.15,
162
+ top_k=50,
163
+ top_p=0.75,
164
+ min_p=0.06
165
+ )
166
+
167
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
168
+ print(response)
169
+ ```
170
+
171
+ ## API
172
+
173
+ Trinity Mini is available today on openrouter:
174
+
175
+ https://openrouter.ai/arcee-ai/trinity-mini
176
+
177
+ ```
178
+ curl -X POST "https://openrouter.ai/v1/chat/completions" \
179
+ -H "Authorization: Bearer $OPENROUTER_API_KEY" \
180
+ -H "Content-Type: application/json" \
181
+ -d '{
182
+ "model": "arcee-ai/trinity-mini",
183
+ "messages": [
184
+ {
185
+ "role": "user",
186
+ "content": "What are some fun things to do in New York?"
187
+ }
188
+ ]
189
+ }'
190
+ ```
191
+
192
+ ## License
193
+
194
+ Trinity-Mini-FP8-Block is released under the Apache-2.0 license.