luow-amd, haoyang-amd committed
Commit
52efb3a
1 Parent(s): 62f74b5

Update README.md (#9)


- Update README.md (d0c8391aeee79a2c97670ecd97b5d35df548412d)


Co-authored-by: haoyanli <haoyang-amd@users.noreply.huggingface.co>

Files changed (1): README.md (+6, -4)
README.md CHANGED
@@ -14,7 +14,7 @@ base_model:
 ## Introduction
 This model was created by applying [Quark](https://quark.docs.amd.com/latest/index.html) with calibration samples from the Pile dataset.
 ## Quantization Strategy
-- ***Quantized Layers***: All linear layers excluding "lm_head", "*gate"
+- ***Quantized Layers***: All linear layers excluding "lm_head", "*.gate"
 - ***Weight***: FP8 symmetric per-tensor
 - ***Activation***: FP8 symmetric per-tensor
 - ***KV Cache***: FP8 symmetric per-tensor
@@ -29,16 +29,18 @@ python3 quantize_quark.py \
        --output_dir deepseek-moe-16b-chat-FP8-KV \
        --quant_scheme w_fp8_a_fp8 \
        --kv_cache_dtype fp8 \
-       --num_calib_data 128  \
-       --model_export quark_safetensors
+       --num_calib_data 128 \
+       --model_export quark_safetensors \
+       --no_weight_matrix_merge
 # If model size is too large for single GPU, please use multi GPU instead.
 python3 quantize_quark.py \
        --model_dir $MODEL_DIR \
        --output_dir deepseek-moe-16b-chat-FP8-KV \
        --quant_scheme w_fp8_a_fp8 \
        --kv_cache_dtype fp8 \
-       --num_calib_data 128  \
+       --num_calib_data 128 \
        --model_export quark_safetensors \
+       --no_weight_matrix_merge \
        --multi_gpu
 ```
 ## Deployment
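
The substantive change in the first hunk is the exclusion pattern, "*gate" becoming "*.gate". As a rough illustration of why the dot matters (this assumes shell-style wildcard matching over dotted module names, which the quoting suggests but the diff does not confirm, and the module names below are hypothetical):

```python
import fnmatch

# Hypothetical MoE module names, for illustration only; the real
# checkpoint's names may differ.
modules = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.gate",                 # MoE router
    "model.layers.0.mlp.experts.0.gate_proj",  # expert projection
    "lm_head",
]

exclude = ["lm_head", "*.gate"]  # the pattern after this commit

for name in modules:
    excluded = any(fnmatch.fnmatch(name, pat) for pat in exclude)
    print(f"{name}: {'skip' if excluded else 'quantize'}")
```

Both patterns match a router module such as `...mlp.gate`, but `*.gate` requires `gate` to be a complete dotted component, so it cannot accidentally catch an unrelated module whose name merely ends in `gate`.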
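
All three bullets in the quantization strategy use the same scheme, FP8 symmetric per-tensor: one scale for the entire tensor, zero-point fixed at zero. A minimal PyTorch sketch of that scheme (illustrative only, not Quark's implementation; it assumes the E4M3 format, whose largest finite value is 448):

```python
import torch

FP8_E4M3_MAX = 448.0  # largest finite value in torch.float8_e4m3fn

def quantize_fp8_per_tensor(x: torch.Tensor):
    """Symmetric per-tensor FP8: a single scale for the whole tensor."""
    # Map the tensor's largest magnitude onto the edge of the FP8 range.
    scale = x.abs().max().clamp(min=1e-12) / FP8_E4M3_MAX
    q = (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Recover an approximation of the original tensor for comparison.
    return q.to(torch.float32) * scale

w = torch.randn(1024, 1024)
q, s = quantize_fp8_per_tensor(w)
print("max abs error:", (dequantize(q, s) - w).abs().max().item())
```

Weight scales can be computed directly from the checkpoint, whereas activation and KV-cache scales are static per-tensor values that must be estimated from representative inputs, which is presumably what the 128 Pile calibration samples (`--num_calib_data 128`) are for.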