Lin-K76 committed — Commit 019cb18 (1 parent: b38cf0b)

Update README.md

Files changed (1): README.md (+55, −1)

README.md CHANGED
@@ -7,7 +7,15 @@ tags:
 # Qwen2-7B-Instruct-FP8
 
 ## Model Overview
-Qwen2-7B-Instruct quantized to FP8 weights and activations using per-tensor quantization, ready for inference with vLLM >= 0.5.0.
+* <h3 style="display: inline;">Model Architecture:</h3> Based on and identical to the Qwen2-7B-Instruct architecture
+* <h3 style="display: inline;">Model Optimizations:</h3> Weights and activations quantized to FP8
+* <h3 style="display: inline;">Release Date:</h3> June 14, 2024
+* <h3 style="display: inline;">Model Developers:</h3> Neural Magic
+
+Qwen2-7B-Instruct quantized to FP8 weights and activations using per-tensor quantization through the [AutoFP8 repository](https://github.com/neuralmagic/AutoFP8), ready for inference with vLLM >= 0.5.0.
+Calibrated with 512 UltraChat samples to achieve 100% performance recovery on the Open LLM Benchmark evaluations.
+Reduces space on disk by ~45%.
+Part of the [FP8 LLMs for vLLM collection](https://huggingface.co/collections/neuralmagic/fp8-llms-for-vllm-666742ed2b78b7ac8df13127).
 
 ## Usage and Creation
 Produced using [AutoFP8 with calibration samples from ultrachat](https://github.com/neuralmagic/AutoFP8/blob/147fa4d9e1a90ef8a93f96fc7d9c33056ddc017a/example_dataset.py).
@@ -37,8 +45,54 @@ model.quantize(examples)
 model.save_quantized(quantized_model_dir)
 ```
 
+Evaluated through vLLM with the following script:
+
+```
+#!/bin/bash
+
+# Example usage:
+# CUDA_VISIBLE_DEVICES=0 ./eval_openllm.sh "neuralmagic/Qwen2-7B-Instruct-FP8" "tensor_parallel_size=1,max_model_len=4096,add_bos_token=True,gpu_memory_utilization=0.7"
+
+export MODEL_DIR=${1}
+export MODEL_ARGS=${2}
+
+declare -A tasks_fewshot=(
+    ["arc_challenge"]=25
+    ["winogrande"]=5
+    ["truthfulqa_mc2"]=0
+    ["hellaswag"]=10
+    ["mmlu"]=5
+    ["gsm8k"]=5
+)
+
+declare -A batch_sizes=(
+    ["arc_challenge"]="auto"
+    ["winogrande"]="auto"
+    ["truthfulqa_mc2"]="auto"
+    ["hellaswag"]="auto"
+    ["mmlu"]=1
+    ["gsm8k"]="auto"
+)
+
+for TASK in "${!tasks_fewshot[@]}"; do
+    NUM_FEWSHOT=${tasks_fewshot[$TASK]}
+    BATCH_SIZE=${batch_sizes[$TASK]}
+    lm_eval --model vllm \
+        --model_args pretrained=$MODEL_DIR,$MODEL_ARGS \
+        --tasks ${TASK} \
+        --num_fewshot ${NUM_FEWSHOT} \
+        --write_out \
+        --show_config \
+        --device cuda \
+        --batch_size ${BATCH_SIZE} \
+        --output_path="results/${TASK}"
+done
+```
+
 ## Evaluation
 
+Evaluated on the Open LLM Leaderboard evaluations through vLLM.
+
 ### Open LLM Leaderboard evaluation scores
 | | Qwen2-7B-Instruct | Qwen2-7B-Instruct-FP8<br>(this model) |
 | :------------------: | :----------------------: | :------------------------------------------------: |
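
The per-tensor quantization the diff refers to can be sketched in a few lines: one scale maps the tensor's absolute maximum onto FP8 E4M3's largest finite value (448). This is a minimal pure-Python illustration — the helper names are hypothetical, and the final cast to actual FP8 bits (handled by AutoFP8/vLLM and the hardware) is deliberately omitted:

```python
# Sketch of per-tensor FP8 (E4M3) scaling. Hypothetical helper names;
# real FP8 rounding/casting is done by the quantization library and GPU.
FP8_E4M3_MAX = 448.0  # largest finite value representable in E4M3

def per_tensor_scale(weights):
    """One scale for the whole tensor: maps the absolute max onto FP8's range."""
    amax = max(abs(w) for w in weights)
    return amax / FP8_E4M3_MAX

def quantize(weights, scale):
    """Divide by the scale and clamp into the representable FP8 range.
    (The cast to real FP8 bits, which introduces rounding error, is omitted.)"""
    return [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, w / scale)) for w in weights]

def dequantize(qweights, scale):
    """Recover approximate original values by multiplying the scale back in."""
    return [q * scale for q in qweights]

w = [0.1, -2.0, 0.5, 1.75]
s = per_tensor_scale(w)   # amax = 2.0, so s = 2.0 / 448
deq = dequantize(quantize(w, s), s)
print(all(abs(a - b) < 1e-6 for a, b in zip(w, deq)))  # True (no rounding modeled)
```

Because the scale is chosen per tensor rather than per channel, a single outlier weight sets the scale for the whole tensor — which is why calibration samples (here, 512 from UltraChat) matter for choosing good activation scales.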