Create README.md
# SuperNova Medius Compressed Model (W4A16)

[![Model Size](https://img.shields.io/badge/Size-Compressed-green)]()
[![Quantization](https://img.shields.io/badge/Quantization-W4A16-blue)]()
[![Max Sequence Length](https://img.shields.io/badge/Max%20Length-4096-orange)]()

> **Model ID**: `arcee-ai/SuperNova-Medius-CM-w4a16`

## Table of Contents
- [Overview](#overview)
- [Quick Start](#quick-start)
- [Model Details](#model-details)
- [Usage Guide](#usage-guide)
- [Quantization Process](#quantization-process)
- [Technical Details](#technical-details)
- [Limitations & Biases](#limitations--biases)
- [Citations & Acknowledgements](#citations--acknowledgements)
- [Version History](#version-history)

## Overview

SuperNova Medius CM W4A16 is a quantized version of the `arcee-ai/SuperNova-Medius` model, optimized for efficient deployment. It was quantized with GPTQ, a post-training quantization method, which substantially reduces the model's size while keeping performance close to the original.

### Key Features
- 4-bit weight quantization
- 16-bit activations (activations are not quantized)
- 4096-token context window
- Optimized for deployment on consumer hardware

## Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer.
# Loading the compressed checkpoint needs the `compressed-tensors` package
# (see Dependencies).
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
model = AutoModelForCausalLM.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")

# Simple inference
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
# Set an explicit generation budget instead of relying on the default length
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Model Details

### Specifications
- **Base Model**: arcee-ai/SuperNova-Medius
- **Quantization Method**: GPTQ
- **Maximum Sequence Length**: 4096
- **Calibration Samples**: 1024

### Quantization Parameters
| Parameter | Value |
|-----------|-------|
| Weight Bits | 4 |
| Activation Bits | 16 |
| Ignored Layers | lm_head |
| Dampening Fraction | 0.1 |
| Calibration Dataset | neuralmagic/LLM_compression_calibration |

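To make the `Weight Bits = 4` entry concrete, the sketch below shows symmetric round-to-nearest quantization of one group of weights to 4-bit signed integers. It is a simplified illustration, not the GPTQ algorithm itself: GPTQ additionally adjusts the not-yet-quantized weights to compensate for rounding error. The function names and sample values are hypothetical.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization of one group of weights."""
    qmax = 2 ** (bits - 1) - 1   # 7 for 4-bit signed integers
    qmin = -(2 ** (bits - 1))    # -8 for 4-bit signed integers
    max_abs = max(abs(w) for w in weights)
    # One shared scale per group; fall back to 1.0 for an all-zero group
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Illustrative values only (not taken from the model)
group = [0.70, -0.33, 0.12, 0.02]
codes, scale = quantize_group(group)
restored = dequantize_group(codes, scale)
print(codes)
print([round(r, 3) for r in restored])
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is the rounding error GPTQ tries to compensate for across the layer.
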
## Usage Guide

### Basic Usage
See the [Quick Start](#quick-start) section above.

### Advanced Usage

```python
# Advanced generation with parameters.
# Assumes `model` and `tokenizer` are loaded as in Quick Start.
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_length=100,          # total length: prompt plus generated tokens
    num_beams=4,
    temperature=0.7,
    no_repeat_ngram_size=2,
    do_sample=True,          # combined with num_beams > 1, this samples within beam search
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

### Memory Optimization

```python
import torch
from transformers import AutoModelForCausalLM

# Load model with a device map for a multi-GPU setup.
# device_map="auto" requires the `accelerate` package.
model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/SuperNova-Medius-CM-w4a16",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
```

## Quantization Process

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Configuration
MODEL_ID = "arcee-ai/SuperNova-Medius"
NUM_SAMPLES = 1024
MAX_LENGTH = 4096
SEED = 42

# Calculate device map
device_map = calculate_offload_device_map(
    MODEL_ID,
    num_gpus=torch.cuda.device_count(),
    reserve_for_hessians=True,
    torch_dtype=torch.bfloat16
)

# Load model and tokenizer
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Prepare calibration dataset
ds = load_dataset("neuralmagic/LLM_compression_calibration")
ds = ds["train"].shuffle(seed=SEED).select(range(NUM_SAMPLES))

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_LENGTH,
        truncation=True,
        add_special_tokens=False
    )

ds = ds.map(tokenize)

# Configure quantization
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.1
)

# Execute quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    oneshot_device=device_map,
    max_seq_length=MAX_LENGTH,
    num_calibration_samples=NUM_SAMPLES,
    accelerator_config={
        'split_batches': True,
        'dispatch_batches': None,
        'even_batches': True,
        'use_seedable_sampler': True,
        'non_blocking': False,
        'gradient_accumulation_kwargs': None,
        'use_configured_state': False
    }
)

# Save quantized model
model.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16", save_compressed=True)
tokenizer.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16")
```

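The `dampening_frac=0.1` setting in the recipe can be illustrated in isolation. In the reference GPTQ implementation, a constant equal to the dampening fraction times the mean of the Hessian's diagonal is added to every diagonal entry before inversion, which keeps the matrix well conditioned. The sketch below assumes LLM Compressor follows the same convention; `dampen_hessian` and the toy matrix are hypothetical names and values used only for illustration.

```python
from typing import List


def dampen_hessian(hessian: List[List[float]], dampening_frac: float = 0.1) -> List[List[float]]:
    """Return a copy of `hessian` with dampening_frac * mean(diag) added to the diagonal."""
    n = len(hessian)
    mean_diag = sum(hessian[i][i] for i in range(n)) / n
    damp = dampening_frac * mean_diag
    return [
        [hessian[i][j] + (damp if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]


def det2(m: List[List[float]]) -> float:
    """Determinant of a 2x2 matrix, used here only to show invertibility."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


# A singular toy Hessian (determinant 0), so it cannot be inverted as-is
H = [[4.0, 2.0], [2.0, 1.0]]
H_damped = dampen_hessian(H, dampening_frac=0.1)

print(det2(H))         # singular before dampening
print(det2(H_damped))  # strictly positive after dampening
```

A larger dampening fraction makes the inversion more stable at the cost of a less exact error-compensation step, which is why it is exposed as a tunable parameter.
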
## Technical Details

### Dependencies
| Package | Version |
|---------|---------|
| Python | 3.9.x |
| torch | 2.5.1 |
| transformers | 4.46.2 |
| llmcompressor | 0.5.0 |
| vllm | 0.6.4 |
| datasets | 3.1.0 |
| huggingface_hub | 0.24.7 |
| compressed-tensors | 0.8.0 |

### Hardware Requirements
- **Minimum**: 8GB VRAM
- **Recommended**: 16GB VRAM
- **Optimal**: 24GB VRAM or multiple GPUs

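As a rough way to reason about the VRAM figures above, the following sketch estimates the memory taken by the weights alone. It rests on assumptions not stated in this README: a hypothetical parameter count, a quantization group size of 128, and one 16-bit scale per group. It ignores the unquantized `lm_head`, the KV cache, and activation memory, so real usage will be higher.

```python
from typing import Optional


def estimate_weight_gib(
    num_params: float,
    weight_bits: int,
    group_size: Optional[int] = None,
    scale_bits: int = 16,
) -> float:
    """Approximate weight storage in GiB.

    If `group_size` is given, each group of weights also stores one scale
    of `scale_bits` bits (an assumption about the storage layout).
    """
    bits_per_weight = float(weight_bits)
    if group_size is not None:
        bits_per_weight += scale_bits / group_size
    return num_params * bits_per_weight / 8 / 1024**3


# Hypothetical parameter count; substitute the real value for the base model
NUM_PARAMS = 14e9

full_precision = estimate_weight_gib(NUM_PARAMS, weight_bits=16)
quantized = estimate_weight_gib(NUM_PARAMS, weight_bits=4, group_size=128)

print(f"16-bit weights: {full_precision:.2f} GiB")
print(f"W4A16 weights:  {quantized:.2f} GiB")
print(f"Reduction:      {full_precision / quantized:.2f}x")
```

Under these assumptions the per-group scales add about 0.125 bits per weight, so the reduction is slightly below the nominal 4x.
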
## Limitations & Biases

### Known Limitations
- Slight performance degradation compared to the full-precision model
- Limited to a 4096-token context window
- May require careful memory management on consumer GPUs

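One way to stay inside the 4096-token window listed above is to trim the prompt before generation so that the prompt plus the requested new tokens fits. The helper below is a minimal sketch operating on plain lists of token IDs; `fit_to_context` is a hypothetical name, and keeping the most recent tokens is just one possible policy.

```python
from typing import List

MAX_CONTEXT = 4096  # context window stated in this README


def fit_to_context(
    prompt_ids: List[int],
    max_new_tokens: int,
    max_context: int = MAX_CONTEXT,
) -> List[int]:
    """Keep the most recent prompt tokens so prompt + generation fits the window."""
    if max_new_tokens < 0 or max_new_tokens >= max_context:
        raise ValueError("max_new_tokens must be in [0, max_context)")
    budget = max_context - max_new_tokens  # tokens available for the prompt
    return prompt_ids[-budget:]


# Illustrative token IDs; real IDs would come from the tokenizer
long_prompt = list(range(5000))
trimmed = fit_to_context(long_prompt, max_new_tokens=256)
print(len(trimmed))  # number of prompt tokens kept
```

Prompts that already fit are returned unchanged, so the helper can be applied unconditionally before calling `generate`.
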
### Inherited Biases
- The model carries over biases from the base model
- Users should implement appropriate content filtering
- Regular evaluation is recommended for production deployments

## Citations & Acknowledgements

### Citation

```bibtex
@misc{SuperNovaMediusCMW4A16,
  author = {Edward Kim and Jaro Uljanovs},
  title = {SuperNova Medius Compressed Model W4A16},
  year = {2024},
  howpublished = {\url{https://huggingface.co/arcee-ai/SuperNova-Medius-CM-w4a16}},
}
```

### Acknowledgements
- Original Model: arcee-ai/SuperNova-Medius
- Quantization Tools: LLM Compressor
- Contributors: Edward Kim and Jaro Uljanovs

---

## Version History

- v1.0.0 (2024-03): Initial release
- v1.0.1 (2024-03): Documentation updates