metadata
tags:
  - fp8
  - vllm
license: llama3
license_link: https://llama.meta.com/llama3/license/
language:
  - en

Meta-Llama-3-8B-Instruct-FP8

Model Overview

  • Model Architecture: Meta-Llama-3
    • Input: Text
    • Output: Text
  • Model Optimizations:
    • Weight quantization: FP8
    • Activation quantization: FP8
    • KV cache quantization: FP8
  • Intended Use Cases: Intended for commercial and research use in English. Like Meta-Llama-3-8B-Instruct, this model is intended for assistant-like chat.
  • Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
  • Release Date: 6/8/2024
  • Version: 1.0
  • License(s): Llama3
  • Model Developers: Neural Magic

Quantized version of Meta-Llama-3-8B-Instruct.
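
To give a rough sense of what FP8 KV cache quantization saves, the sketch below estimates the per-token cache size at 16-bit and 8-bit precision. The layer count, KV head count, and head dimension are assumptions taken from the public Meta-Llama-3-8B configuration, not from this card, and the estimate ignores any storage for quantization scales.

```python
# Assumed shape values for Meta-Llama-3-8B (from the public base-model config,
# not stated in this model card).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_bytes_per_token(bytes_per_element: int) -> int:
    """Bytes needed to cache keys and values for one token across all layers."""
    # The factor of 2 covers one key vector and one value vector per head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * bytes_per_element


fp16_bytes = kv_cache_bytes_per_token(2)  # 16-bit cache entries
fp8_bytes = kv_cache_bytes_per_token(1)   # 8-bit cache entries

print(f"16-bit KV cache: {fp16_bytes // 1024} KiB per token")
print(f"FP8 KV cache:    {fp8_bytes // 1024} KiB per token")
```

Under these assumptions the FP8 cache takes half the memory per token, so roughly twice as many tokens fit in the same cache budget.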

lm_eval --model vllm --model_args pretrained=nm-testing/Meta-Llama-3-8B-Instruct-FP8-K-V,kv_cache_dtype=fp8,add_bos_token=True --tasks gsm8k --num_fewshot 5 --batch_size auto

vllm (pretrained=nm-testing/Meta-Llama-3-8B-Instruct-FP8-K-V,kv_cache_dtype=fp8,add_bos_token=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7748|±  |0.0115|
|     |       |strict-match    |     5|exact_match|↑  |0.7763|±  |0.0115|
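
The reported standard errors can be turned into approximate confidence intervals. The sketch below uses a normal approximation with z = 1.96, which is an assumption on our part and not something lm_eval reports; the values are copied from the table above.

```python
# (value, stderr) pairs copied from the gsm8k results table.
results = {
    "flexible-extract": (0.7748, 0.0115),
    "strict-match": (0.7763, 0.0115),
}

Z_95 = 1.96  # two-sided 95% quantile of the standard normal (assumed approximation)


def confidence_interval(value: float, stderr: float, z: float = Z_95) -> tuple[float, float]:
    """Return the (low, high) bounds of a symmetric normal-approximation interval."""
    margin = z * stderr
    return value - margin, value + margin


for name, (value, stderr) in results.items():
    low, high = confidence_interval(value, stderr)
    print(f"{name}: {value:.4f} (95% CI {low:.4f} to {high:.4f})")
```

The gap between the two filters (0.0015) is much smaller than either standard error, so the two scores should be read as equivalent.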