amd/Mixtral-8x7B-Instruct-v0.1-FP8-KV

bowenbaoamd committed 7 days ago

Commit 1ac52b5 • 1 Parent(s): a2b8b36

Upload README.md with huggingface_hub

Files changed (1)

README.md CHANGED

@@ -25,7 +25,8 @@ python3 quantize_quark.py \
         --kv_cache_dtype fp8 \
         --num_calib_data 128 \
         --model_export quark_safetensors \
-        --no_weight_matrix_merge
+        --no_weight_matrix_merge \
+        --custom_mode fp8
 # If model size is too large for single GPU, please use multi GPU instead.
 python3 quantize_quark.py \
         --model_dir $MODEL_DIR \
@@ -35,7 +36,8 @@ python3 quantize_quark.py \
         --num_calib_data 128 \
         --model_export quark_safetensors \
         --no_weight_matrix_merge \
-        --multi_gpu
+        --multi_gpu \
+        --custom_mode fp8
 ```
 ## Deployment
 Quark has its own export format and allows FP8 quantized models to be efficiently deployed using the vLLM backend(vLLM-compatible).