jinxuewen committed on
Commit
ea87219
1 Parent(s): bcc5d26
Files changed (1)
  1. README.md +164 -3
README.md CHANGED
@@ -1,3 +1,164 @@
- ---
- license: apache-2.0
- ---
+
+ # Vicuna-13B Weights (vicuna-weights)
+
+ From FastChat (`fschat`): https://github.com/lm-sys/FastChat.git
+
+ ## Install
+
+ ```bash
+ pip3 install fschat
+ ```
+
+ ## Inference with Command Line Interface
+
+ (Experimental feature: you can specify `--style rich` to enable rich text output and better text streaming quality for some non-ASCII content. This may not work properly on certain terminals.)
+
+ <a href="https://chat.lmsys.org"><img src="assets/screenshot_cli.png" width="70%"></a>
+
+ #### Single GPU
+ The command below requires around 28GB of GPU memory for Vicuna-13B and 14GB of GPU memory for Vicuna-7B.
+ If you do not have enough memory, see the section on running with limited memory below.
+ ```bash
+ python3 -m fastchat.serve.cli --model-path /path/to/vicuna/weights
+ ```
+
+ #### Multiple GPUs
+ You can use model parallelism to aggregate GPU memory from multiple GPUs on the same machine.
+ ```bash
+ python3 -m fastchat.serve.cli --model-path /path/to/vicuna/weights --num-gpus 2
+ ```
+
+ #### CPU Only
+ This runs on the CPU only and does not require a GPU. It requires around 60GB of CPU memory for Vicuna-13B and around 30GB of CPU memory for Vicuna-7B.
+ ```bash
+ python3 -m fastchat.serve.cli --model-path /path/to/vicuna/weights --device cpu
+ ```
+
+ #### Metal Backend (Mac Computers with Apple Silicon or AMD GPUs)
+ Use `--device mps` to enable GPU acceleration on Mac computers (requires torch >= 2.0).
+ Use `--load-8bit` to turn on 8-bit compression.
+ ```bash
+ python3 -m fastchat.serve.cli --model-path /path/to/vicuna/weights --device mps --load-8bit
+ ```
+ Vicuna-7B can run on a 32GB M1 MacBook at 1-2 words per second.
+
+ #### Not Enough Memory or Other Platforms
+ If you do not have enough memory, you can enable 8-bit compression by adding `--load-8bit` to the commands above.
+ This can reduce memory usage by around half, with slightly degraded model quality.
+ It is compatible with the CPU, GPU, and Metal backends.
+ Vicuna-13B with 8-bit compression can run on a single NVIDIA 3090/4080/V100 (16GB) GPU.
+
+ ```bash
+ python3 -m fastchat.serve.cli --model-path /path/to/vicuna/weights --load-8bit
+ ```
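The memory figures quoted in this README are roughly consistent with a weights-only estimate: parameter count times bytes per parameter. The sketch below is a back-of-the-envelope check, not FastChat code. It assumes 16-bit weights on GPU, 32-bit weights on CPU, nominal parameter counts of 7 and 13 billion, and decimal gigabytes; the function name `weight_memory_gb` is hypothetical. Activations and the generation cache are ignored, which is one reason the README's numbers are somewhat higher.

```python
def weight_memory_gb(num_params_billion: float, bits_per_param: int) -> float:
    """Estimate memory for the model weights alone, in decimal GB.

    One billion parameters at 8 bits each take about 1 GB, so the
    estimate is simply billions of parameters times bytes per parameter.
    """
    return num_params_billion * bits_per_param / 8


# Nominal sizes (assumption): 7 and 13 billion parameters.
for name, params in [("Vicuna-7B", 7), ("Vicuna-13B", 13)]:
    gpu_fp16 = weight_memory_gb(params, 16)   # assumed GPU default
    cpu_fp32 = weight_memory_gb(params, 32)   # assumed CPU default
    int8 = weight_memory_gb(params, 8)        # with --load-8bit
    print(f"{name}: fp16 ~{gpu_fp16:.0f} GB, fp32 ~{cpu_fp32:.0f} GB, 8-bit ~{int8:.0f} GB")
```

Under these assumptions, the 8-bit estimate for Vicuna-13B comes in under 16 GB, which matches the claim that it can fit on a 16GB V100.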
+
+ We are also actively exploring more methods to make the model easier to run on more platforms.
+ Contributions and pull requests are welcome.
+
+ ## Serving with Web GUI
+
+ <a href="https://chat.lmsys.org"><img src="assets/screenshot_gui.png" width="70%"></a>
+
+ To serve using the web UI, you need three main components: web servers that interface with users, model workers that host one or more models, and a controller that coordinates the web servers and model workers. Run the following commands in your terminal:
+
+ #### Launch the controller
+ ```bash
+ python3 -m fastchat.serve.controller
+ ```
+
+ This controller manages the distributed workers.
+
+ #### Launch the model worker
+ ```bash
+ python3 -m fastchat.serve.model_worker --model-path /path/to/vicuna/weights
+ ```
+ Wait until the process finishes loading the model and you see "Uvicorn running on ...". You can launch multiple model workers to serve multiple models concurrently. The model worker will connect to the controller automatically.
+
+ To check that your model worker is connected to your controller properly, send a test message using the following command:
+ ```bash
+ python3 -m fastchat.serve.test_message --model-name vicuna-13b
+ ```
+
+ #### Launch the Gradio web server
+ ```bash
+ python3 -m fastchat.serve.gradio_web_server
+ ```
+
+ This is the user interface that users will interact with.
+
+ Once all three components are running, you can open your browser and chat with a model through the web UI.
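The three launch steps above could be scripted as follows. This is only a sketch: the module names and the `--model-path` flag come from the commands above, while the helper names (`build_commands`, `launch_all`) and the fixed startup delay are hypothetical. The delay is a crude stand-in for waiting until the worker prints "Uvicorn running on ...".

```python
import subprocess
import time
from typing import List


def build_commands(model_path: str) -> List[List[str]]:
    """Return the controller, model worker, and web server commands in launch order."""
    return [
        ["python3", "-m", "fastchat.serve.controller"],
        ["python3", "-m", "fastchat.serve.model_worker", "--model-path", model_path],
        ["python3", "-m", "fastchat.serve.gradio_web_server"],
    ]


def launch_all(model_path: str, delay_s: float = 30.0) -> List[subprocess.Popen]:
    """Start each component in order, pausing between launches.

    The pause length is an assumption; loading a large model may take longer.
    """
    processes = []
    for command in build_commands(model_path):
        processes.append(subprocess.Popen(command))
        time.sleep(delay_s)
    return processes
```

Calling `launch_all("/path/to/vicuna/weights")` would start all three processes and return their handles so that they can be terminated later.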
+
+
+ ## API
+
+ ### Hugging Face Generation APIs
+ See [fastchat/serve/huggingface_api.py](fastchat/serve/huggingface_api.py)
+
+ ### OpenAI-compatible RESTful APIs & SDK
+
+ (Experimental. We will keep improving the API and SDK.)
+
+ #### Chat Completion
+
+ Reference: https://platform.openai.com/docs/api-reference/chat/create
+
+ Features and compatibility items still to be implemented:
+
+ - [ ] streaming
+ - [ ] support for some parameters such as `top_p` and `presence_penalty`
+ - [ ] proper error handling (e.g. model not found)
+ - [ ] allowing the return value in the client SDK to be used like a dict
+
+
+ **RESTful API Server**
+
+ First, launch the controller:
+
+ ```bash
+ python3 -m fastchat.serve.controller
+ ```
+
+ Then, launch the model worker(s):
+
+ ```bash
+ python3 -m fastchat.serve.model_worker --model-name 'vicuna-7b-v1.1' --model-path /path/to/vicuna/weights
+ ```
+
+ Finally, launch the RESTful API server:
+
+ ```bash
+ export FASTCHAT_CONTROLLER_URL=http://localhost:21001
+ python3 -m fastchat.serve.api --host localhost --port 8000
+ ```
+
+ Test the API server:
+
+ ```bash
+ curl http://localhost:8000/v1/chat/completions \
+   -H "Content-Type: application/json" \
+   -d '{
+     "model": "vicuna-7b-v1.1",
+     "messages": [{"role": "user", "content": "Hello!"}]
+   }'
+ ```
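The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server from the steps above is listening on `http://localhost:8000` and that the response follows the OpenAI chat completion shape (`choices[0].message.content`), which this README does not confirm. The helper names are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, user_text: str) -> urllib.request.Request:
    """Build a POST request for the /v1/chat/completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of an OpenAI-style response (assumed shape)."""
    return response_body["choices"][0]["message"]["content"]


def chat(base_url: str, model: str, user_text: str) -> str:
    """Send the request and return the reply text; needs a running API server."""
    request = build_chat_request(base_url, model, user_text)
    with urllib.request.urlopen(request) as response:
        return extract_reply(json.load(response))
```

With the server running, `chat("http://localhost:8000", "vicuna-7b-v1.1", "Hello!")` would return the model's reply as a string.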
+
+ **Client SDK**
+
+ Assuming the environment variable `FASTCHAT_BASEURL` is set to the API server URL (e.g., `http://localhost:8000`), you can use the following code to send a request to the API server:
+
+ ```python
+ import os
+ from fastchat import client
+
+ # Point the client at the API server.
+ client.set_baseurl(os.getenv("FASTCHAT_BASEURL"))
+
+ completion = client.ChatCompletion.create(
+     model="vicuna-7b-v1.1",
+     messages=[
+         {"role": "user", "content": "Hello!"}
+     ]
+ )
+
+ print(completion.choices[0].message)
+ ```
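Note that `os.getenv` returns `None` when the variable is unset, so the snippet above would hand `None` to `client.set_baseurl`. One way to guard against that is sketched below; the helper `resolve_base_url` and its default value are hypothetical and not part of the FastChat SDK.

```python
import os
from typing import Mapping, Optional


def resolve_base_url(env: Optional[Mapping[str, str]] = None,
                     default: str = "http://localhost:8000") -> str:
    """Return FASTCHAT_BASEURL if it is set and non-empty, else the default.

    A trailing slash is stripped so that paths can be appended consistently.
    """
    env = os.environ if env is None else env
    value = env.get("FASTCHAT_BASEURL", "").strip()
    return (value or default).rstrip("/")
```

The result could then be passed along as `client.set_baseurl(resolve_base_url())`.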