Update README.md
README.md CHANGED
@@ -34,7 +34,7 @@ This optimization reduces the number of bits per parameter from 16 to 8, reducin
Only the weights and activations of the linear operators within transformer blocks are quantized. Symmetric per-tensor quantization is applied, in which a single linear scaling maps the FP8 representations of the quantized weights and activations.
[AutoFP8](https://github.com/neuralmagic/AutoFP8) is used for quantization with 512 sequences of UltraChat.
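For intuition, here is a minimal PyTorch sketch of what symmetric per-tensor FP8 quantization means. This illustrates the arithmetic only, not AutoFP8's actual implementation, and it assumes the E4M3 FP8 format, whose largest finite value is 448:

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8_per_tensor(x: torch.Tensor):
    # Symmetric: one positive scale for the whole tensor, no zero point.
    scale = x.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale

def dequantize_fp8_per_tensor(x_fp8: torch.Tensor, scale: torch.Tensor):
    # A single linear scaling maps the FP8 representation back to high precision.
    return x_fp8.to(torch.float16) * scale

w = torch.randn(4096, 4096)
w_fp8, s = quantize_fp8_per_tensor(w)
w_hat = dequantize_fp8_per_tensor(w_fp8, s)
print((w - w_hat.float()).abs().max())  # worst-case quantization error
```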

-## Deployment
+<!-- ## Deployment

### Use with vLLM

@@ -65,12 +65,13 @@ generated_text = outputs[0].outputs[0].text
print(generated_text)
```

-vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
+vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details. -->
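As a rough sketch of that serving mode (assuming a server already running locally on vLLM's default port; the model name below is a placeholder for this repository's id), a query via the `openai` client could look like:

```python
# Sketch: querying a vLLM OpenAI-compatible endpoint. The server is started
# separately, e.g. `python -m vllm.entrypoints.openai.api_server --model <model-id>`.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default OpenAI-compatible route
    api_key="EMPTY",                      # vLLM does not check the key by default
)

response = client.chat.completions.create(
    model="<model-id>",  # placeholder: the id the server was launched with
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64,
)
print(response.choices[0].message.content)
```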

## Creation
This model was created by applying [AutoFP8 with calibration samples from ultrachat](https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py), as presented in the code snippet below.
Although AutoFP8 was used for this particular model, Neural Magic is transitioning to [llm-compressor](https://github.com/vllm-project/llm-compressor), which supports several quantization schemes and models not supported by AutoFP8.
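For reference, the llm-compressor route looks roughly like the sketch below. This is an assumption based on that project's documented one-shot API, not how this model was produced, and it uses the simpler dynamic-activation FP8 scheme rather than the static per-tensor scales described above; treat the exact import paths and scheme names as provisional:

```python
# Hypothetical llm-compressor equivalent (this model itself was made with AutoFP8).
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

MODEL_ID = "<base-model-id>"  # placeholder for the unquantized checkpoint

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)

# Quantize weights/activations of Linear layers to FP8, skipping the lm_head.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

model.save_pretrained(MODEL_ID.split("/")[-1] + "-FP8-Dynamic")
```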
+Note that `transformers` must be built from source.

```python
from datasets import load_dataset
@@ -105,6 +106,7 @@ model.save_quantized(quantized_model_dir)

## Evaluation
The model was evaluated on the [OpenLLM](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/383bbd54bc621086e05aa1b030d8d4d5635b25e6) (commit 383bbd54bc621086e05aa1b030d8d4d5635b25e6) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:
+Note that `vllm` must be built from source.
```
lm_eval \
--model vllm \