# πŸš€ SuperNova Medius Compressed Model (W4A16)

[![Model Size](https://img.shields.io/badge/Size-Compressed-green)]()
[![Quantization](https://img.shields.io/badge/Quantization-W4A16-blue)]()
[![Max Sequence Length](https://img.shields.io/badge/Max%20Length-4096-orange)]()

> **Model ID**: `arcee-ai/SuperNova-Medius-CM-w4a16`

## πŸ“‹ Table of Contents
- [Overview](#overview)
- [Quick Start](#quick-start)
- [Model Details](#model-details)
- [Usage Guide](#usage-guide)
- [Quantization Process](#quantization-process)
- [Technical Details](#technical-details)
- [Limitations & Biases](#limitations--biases)
- [Citations & Acknowledgements](#citations--acknowledgements)

## πŸ” Overview

SuperNova Medius CM W4A16 is a quantized version of the `arcee-ai/SuperNova-Medius` model, optimized for efficient deployment. Using GPTQ post-training quantization, it achieves a significant size reduction while maintaining near-original performance.

### ✨ Key Features
- 4-bit weight quantization
- 16-bit activations (the A16 in W4A16; activations stay in half precision rather than being quantized)
- 4096 token context window
- Optimized for deployment on consumer hardware (see the serving sketch below)

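Because the checkpoint is saved in the compressed-tensors format, it can also be served directly with vLLM (pinned in the dependency table below). A minimal serving sketch, assuming a single GPU with enough free VRAM:

```python
from vllm import LLM, SamplingParams

# vLLM reads compressed-tensors checkpoints natively; cap the context at 4096
llm = LLM(model="arcee-ai/SuperNova-Medius-CM-w4a16", max_model_len=4096)
params = SamplingParams(temperature=0.7, max_tokens=128)
print(llm.generate(["Hello, how are you?"], params)[0].outputs[0].text)
```
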
## πŸš€ Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
model = AutoModelForCausalLM.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")

# Simple inference
text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## πŸ“Š Model Details

### Specifications
- **Base Model**: arcee-ai/SuperNova-Medius
- **Quantization Method**: GPTQ
- **Maximum Sequence Length**: 4096
- **Calibration Samples**: 1024

### Quantization Parameters
| Parameter | Value |
|-----------|--------|
| Weight Bits | 4 |
| Activation Bits | 16 |
| Ignored Layers | lm_head |
| Dampening Fraction | 0.1 |
| Calibration Dataset | neuralmagic/LLM_compression_calibration |

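To double-check these parameters against the published checkpoint, you can read them back from its config. A small sketch, assuming the export wrote the usual `quantization_config` block into `config.json`, which is the default for compressed-tensors saves:

```python
from transformers import AutoConfig

# Print the quantization scheme recorded in the checkpoint's config.json
config = AutoConfig.from_pretrained("arcee-ai/SuperNova-Medius-CM-w4a16")
print(config.quantization_config)
```
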
## πŸ’» Usage Guide

### Basic Usage
See the [Quick Start](#quick-start) section above.

### Advanced Usage

```python
# Advanced generation mixing beam search and sampling parameters
inputs = tokenizer("Hello, how are you?", return_tensors="pt")
output = model.generate(
    **inputs,
    max_length=100,
    num_beams=4,
    temperature=0.7,
    no_repeat_ngram_size=2,
    do_sample=True
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

### Memory Optimization

```python
import torch
from transformers import AutoModelForCausalLM

# Load with an automatic device map to spread layers across GPUs (and CPU if needed)
model = AutoModelForCausalLM.from_pretrained(
    "arcee-ai/SuperNova-Medius-CM-w4a16",
    device_map="auto",
    torch_dtype=torch.bfloat16
)
```

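For interactive use, tokens can be streamed to stdout as they are generated. A minimal sketch with `transformers.TextStreamer`, reusing `tokenizer`, `model`, and `inputs` from the snippets above:

```python
from transformers import TextStreamer

# Print tokens as they are produced instead of waiting for the full sequence
streamer = TextStreamer(tokenizer, skip_prompt=True)
model.generate(**inputs, max_new_tokens=200, streamer=streamer)
```
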
## βš™οΈ Quantization Process

The checkpoint was produced with a one-shot GPTQ run using `llmcompressor`:

```python
import torch
from datasets import load_dataset
from transformers import AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot
from llmcompressor.transformers.compression.helpers import calculate_offload_device_map

# Configuration
MODEL_ID = "arcee-ai/SuperNova-Medius"
NUM_SAMPLES = 1024
MAX_LENGTH = 4096
SEED = 42

# Calculate a device map, reserving room for GPTQ's Hessian estimates
device_map = calculate_offload_device_map(
    MODEL_ID,
    num_gpus=torch.cuda.device_count(),
    reserve_for_hessians=True,
    torch_dtype=torch.bfloat16
)

# Load model and tokenizer
model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map=device_map,
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Prepare the calibration dataset
ds = load_dataset("neuralmagic/LLM_compression_calibration")
ds = ds["train"].shuffle(seed=SEED).select(range(NUM_SAMPLES))

def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_LENGTH,
        truncation=True,
        add_special_tokens=False
    )

ds = ds.map(tokenize)

# Configure quantization: 4-bit weights, 16-bit activations, lm_head untouched
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
    dampening_frac=0.1
)

# Execute quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    oneshot_device=device_map,
    max_seq_length=MAX_LENGTH,
    num_calibration_samples=NUM_SAMPLES,
    accelerator_config={
        'split_batches': True,
        'dispatch_batches': None,
        'even_batches': True,
        'use_seedable_sampler': True,
        'non_blocking': False,
        'gradient_accumulation_kwargs': None,
        'use_configured_state': False
    }
)

# Save the quantized model in compressed-tensors format
model.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16", save_compressed=True)
tokenizer.save_pretrained("./arcee-ai/SuperNova-Medius-CM-w4a16")
```

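After saving, a quick smoke test confirms that the compressed checkpoint reloads and generates. A sketch reusing the local output path from the script above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Reload the compressed checkpoint and run a single short generation
path = "./arcee-ai/SuperNova-Medius-CM-w4a16"
tok = AutoTokenizer.from_pretrained(path)
m = AutoModelForCausalLM.from_pretrained(path, device_map="auto")
inputs = tok("Hello, how are you?", return_tensors="pt").to(m.device)
print(tok.decode(m.generate(**inputs, max_new_tokens=32)[0], skip_special_tokens=True))
```
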
## πŸ› οΈ Technical Details

### Dependencies
| Package | Version |
|---------|---------|
| Python | 3.9.x |
| torch | 2.5.1 |
| transformers | 4.46.2 |
| llmcompressor | 0.5.0 |
| vllm | 0.6.4 |
| datasets | 3.1.0 |
| huggingface_hub | 0.24.7 |
| compressed-tensors | 0.8.0 |

### Hardware Requirements
- **Minimum**: 8 GB VRAM
- **Recommended**: 16 GB VRAM
- **Optimal**: 24 GB VRAM or multiple GPUs

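These figures line up with a back-of-envelope estimate. A sketch, assuming the base model's roughly 14B parameters (a figure from the base model's card, not stated here):

```python
# 4-bit weights: 0.5 bytes per parameter (assumes ~14B parameters in the base model)
params = 14e9
weight_gb = params * 0.5 / 1e9
print(f"~{weight_gb:.0f} GB for weights alone, before KV cache and runtime overhead")  # ~7 GB
```
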
## ⚠️ Limitations & Biases

### Known Limitations
- Slight performance degradation compared to the full-precision model
- Limited to a 4096 token context window
- May require careful memory management on consumer GPUs

### Inherited Biases
- Carries over biases from the base model
- Users should implement appropriate content filtering
- Regular evaluation is recommended for production deployments

## πŸ“š Citations & Acknowledgements

### Citation

```bibtex
@misc{SuperNovaMediusCMW4A16,
  author = {Edward Kim and Jaro Uljanovs},
  title = {SuperNova Medius Compressed Model W4A16},
  year = {2024},
  howpublished = {\url{https://huggingface.co/arcee-ai/SuperNova-Medius-CM-w4a16}},
}
```

### πŸ‘ Acknowledgements
- Original Model: arcee-ai/SuperNova-Medius
- Quantization Tools: LLM Compressor
- Contributors: Edward Kim and Jaro Uljanovs

---

## πŸ“ Version History

- v1.0.0 (2024-03): Initial release
- v1.0.1 (2024-03): Documentation updates