---
tags:
- fp8
- vllm
license: llama3
license_link: https://llama.meta.com/llama3/license/
language:
- en
---
|
|
|
|
|
|
|
# Meta-Llama-3-8B-Instruct-FP8

## Model Overview

- **Model Architecture:** Meta-Llama-3
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
  - **KV cache quantization:** FP8
- **Intended Use Cases:** Intended for commercial and research use in English. Like [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), this model is intended for assistant-like chat.
- **Out-of-scope:** Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- **Release Date:** 6/8/2024
- **Version:** 1.0
- **License(s):** [Llama3](https://llama.meta.com/llama3/license/)
- **Model Developers:** Neural Magic
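One practical payoff of the KV-cache quantization listed above is that the per-token cache footprint is halved relative to FP16. As a rough back-of-envelope sketch (not from this model card), using the public Llama-3-8B configuration values (32 layers, 8 KV heads, head dimension 128):

```python
# Back-of-envelope KV-cache memory per token for Llama-3-8B.
# Config values are from the public Llama-3-8B config
# (num_hidden_layers=32, num_key_value_heads=8, head_dim=128).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128

def kv_cache_bytes_per_token(bytes_per_elem: int) -> int:
    """K and V are each cached per layer, per KV head, per head dim."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_elem

fp16_bytes = kv_cache_bytes_per_token(2)  # FP16: 2 bytes per element
fp8_bytes = kv_cache_bytes_per_token(1)   # FP8:  1 byte per element
print(fp16_bytes // 1024, "KiB/token vs", fp8_bytes // 1024, "KiB/token")
# 128 KiB/token vs 64 KiB/token
```

At a fixed memory budget, this roughly doubles the number of tokens that fit in the cache, which is why FP8 KV cache pairs well with long-context or high-batch serving.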
|
|
|
Quantized version of [Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), with weights, activations, and the KV cache stored in the FP8 data type.
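As a rough illustration of the per-tensor symmetric FP8 scheme (a NumPy sketch, not Neural Magic's actual quantization code), the constant `FP8_E4M3_MAX = 448` is the largest finite value in the OCP E4M3 format, and the rounding emulation below is an approximation of casting to 3 mantissa bits:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude in OCP FP8 E4M3

def quantize_dequantize_fp8(x: np.ndarray):
    """Simulate a per-tensor symmetric FP8 (E4M3) quantize/dequantize
    round trip. Returns the dequantized tensor and the scale."""
    # Per-tensor scale maps the largest magnitude onto the E4M3 range
    # (tiny epsilon guards against an all-zero tensor).
    scale = max(np.abs(x).max(), 1e-12) / FP8_E4M3_MAX
    xs = np.clip(x / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    mag = np.abs(xs)
    # E4M3 has 3 mantissa bits, so representable values are spaced
    # 2**(exponent - 3) apart; subnormals share the minimum exponent -6.
    exp = np.maximum(np.floor(np.log2(np.maximum(mag, 2.0**-6))), -6)
    step = 2.0 ** (exp - 3)
    q = np.round(xs / step) * step  # round to nearest representable value
    return q * scale, scale
```

Values already representable in E4M3 (such as 448, 1.0, or 0) survive the round trip exactly, while everything else incurs at most half an E4M3 step of rounding error at the chosen scale.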
|
|
|
|
|
## Evaluation

The model was evaluated on the GSM8K benchmark (5-shot) with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), using the vLLM backend with the KV cache held in FP8 (`kv_cache_dtype=fp8`):

```
lm_eval --model vllm --model_args pretrained=nm-testing/Meta-Llama-3-8B-Instruct-FP8-K-V,kv_cache_dtype=fp8,add_bos_token=True --tasks gsm8k --num_fewshot 5 --batch_size auto
```

Results:

```
vllm (pretrained=nm-testing/Meta-Llama-3-8B-Instruct-FP8-K-V,kv_cache_dtype=fp8,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version| Filter         |n-shot| Metric    |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7748|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7763|±  |0.0115|
```