unsubscribe committed
Commit 406a7cf
1 Parent(s): 003fce4
Files changed (1)
  1. README.md +4 -3
README.md CHANGED
@@ -33,11 +33,11 @@ You can download the pre-quantized 4-bit weight models from LMDeploy's [model zo
 
 Alternatively, you can quantize 16-bit weights to 4-bit weights following the ["4-bit Weight Quantization"](#4-bit-weight-quantization) section, and then perform inference as per the below instructions.
 
-Take the 4-bit Llama-2-7B model from the model zoo as an example:
+Take the 4-bit Llama-2-13B model from the model zoo as an example:
 
 ```shell
 git-lfs install
-git clone https://huggingface.co/lmdeploy/llama2-chat-7b-w4
+git clone https://huggingface.co/lmdeploy/llama2-chat-13b-w4
 ```
 
 As demonstrated in the command below, first convert the model's layout using `turbomind.deploy`, and then you can interact with the AI assistant in the terminal
@@ -47,7 +47,7 @@ As demonstrated in the command below, first convert the model's layout using `tu
 ## Convert the model's layout and store it in the default path, ./workspace.
 python3 -m lmdeploy.serve.turbomind.deploy \
     --model-name llama2 \
-    --model-path ./llama2-chat-7b-w4 \
+    --model-path ./llama2-chat-13b-w4 \
     --model-format awq \
     --group-size 128
 
@@ -104,6 +104,7 @@ LMDeploy employs AWQ algorithm for model weight quantization.
 
 ```shell
 python3 -m lmdeploy.lite.apis.auto_awq \
+    --model $HF_MODEL \
     --w_bits 4 \ # Bit number for weight quantization
     --w_sym False \ # Whether to use symmetric quantization for weights
     --w_group_size 128 \ # Group size for weight quantization statistics
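
For reference, the inference walkthrough after this commit reads end to end as below. The first two steps are exactly what the updated README shows; the final chat command does not appear in this diff and is an assumption based on LMDeploy's CLI of this period:

```shell
# Fetch the pre-quantized 4-bit Llama-2-13B weights referenced by this commit.
git-lfs install
git clone https://huggingface.co/lmdeploy/llama2-chat-13b-w4

# Convert the model's layout and store it in the default path, ./workspace.
python3 -m lmdeploy.serve.turbomind.deploy \
    --model-name llama2 \
    --model-path ./llama2-chat-13b-w4 \
    --model-format awq \
    --group-size 128

# Assumed chat entry point (not part of this diff): interact with the
# converted model in the terminal.
python3 -m lmdeploy.turbomind.chat ./workspace
```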
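
The third hunk adds a `--model` argument to the quantization command. Note that in real shell a `#` comment cannot follow a line-continuation backslash, so a runnable version moves the flag explanations out of the command; `$HF_MODEL` is a placeholder the README leaves to the user, and the path below is purely illustrative. The hunk also ends mid-command (the trailing `\` after `--w_group_size 128`), so the README's full invocation may carry further arguments not shown here:

```shell
# Illustrative placeholder: path to the 16-bit Hugging Face checkpoint to quantize.
export HF_MODEL=./llama-2-13b-chat-hf

python3 -m lmdeploy.lite.apis.auto_awq \
    --model $HF_MODEL \
    --w_bits 4 \
    --w_sym False \
    --w_group_size 128
# --w_bits: bit number for weight quantization
# --w_sym: whether to use symmetric quantization for weights
# --w_group_size: group size for weight quantization statistics
```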