Update README
README.md
CHANGED
@@ -42,7 +42,7 @@ This model was obtained by quantizing the weights and activations of [Meta-Llama
 This optimization reduces the number of bits per parameter from 16 to 8, reducing the disk size and GPU memory requirements by approximately 50%.
 
 Only the weights and activations of the linear operators within transformers blocks are quantized. Symmetric per-channel quantization is applied, in which a linear scaling per output dimension maps the FP8 representations of the quantized weights and activations. Activations are also quantized on a per-token dynamic basis.
 
-[LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization
+[LLM Compressor](https://github.com/vllm-project/llm-compressor) is used for quantization.
 
 ## Deployment
 
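The scheme the card describes (symmetric per-channel FP8 scales for weights, dynamic per-token scales for activations) can be made concrete with a small sketch. This is a hypothetical illustration of the arithmetic only, not LLM Compressor's implementation; the function names are invented for the example, 448 is the largest finite value in FP8 E4M3, and the `torch.float8_e4m3fn` dtype requires PyTorch >= 2.1.

```python
# Illustrative sketch of FP8 W8A8 quantization arithmetic (not llm-compressor code).
import torch

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def quantize_weight_per_channel(w: torch.Tensor):
    """Symmetric per-channel quantization: one scale per output dimension (row),
    fixed ahead of time from the weight values."""
    scale = w.abs().amax(dim=1, keepdim=True) / FP8_E4M3_MAX
    w_fp8 = (w / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return w_fp8, scale  # dequantize as w_fp8.to(torch.float16) * scale


def quantize_activation_per_token(x: torch.Tensor):
    """Dynamic per-token quantization: one scale per token (row), computed at
    runtime from the observed activations, so no calibration data is needed."""
    scale = x.abs().amax(dim=-1, keepdim=True) / FP8_E4M3_MAX
    x_fp8 = (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_fp8, scale


w = torch.randn(4096, 4096, dtype=torch.float16)   # a linear layer's weight
x = torch.randn(8, 4096, dtype=torch.float16)      # activations for 8 tokens
w_fp8, w_scale = quantize_weight_per_channel(w)
x_fp8, x_scale = quantize_activation_per_token(x)
```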
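For reference, a minimal sketch of how a checkpoint like this can be produced with [LLM Compressor](https://github.com/vllm-project/llm-compressor). The base model id and `output_dir` are assumptions (the card's base-model link is truncated above), and the import paths vary across llm-compressor versions; the `FP8_DYNAMIC` scheme corresponds to static per-channel weight scales with dynamic per-token activation scales, so no calibration dataset is required.

```python
# Sketch only: check the llm-compressor docs for your installed version.
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers import oneshot

# Quantize all Linear layers inside the transformer blocks to FP8,
# leaving the lm_head in its original precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])

oneshot(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed base model id
    recipe=recipe,
    output_dir="Meta-Llama-3-8B-Instruct-FP8-dynamic",  # assumed output path
)
```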