feat(download-model): add model download at build time
- Dockerfile +8 -0
- README.md +2 -1
- download_model.py +13 -0
- run.sh +0 -2
Dockerfile
CHANGED
@@ -1,5 +1,8 @@
 FROM python:3.12
 
+# Declare your environment variables with the ARG directive
+ARG HF_TOKEN
+
 RUN useradd -m -u 1000 user
 USER user
 ENV PATH="/home/user/.local/bin:$PATH"
@@ -11,6 +14,11 @@ RUN pip install --no-cache-dir -r requirements.txt --extra-index-url https://dow
 
 COPY --chown=user . /app
 
+
+# Download at build time,
+# to ensure during restart we won't have to wait for the download from HF (only wait for docker pull).
+RUN python /app/download_model.py
+
 EXPOSE 7860
 
 #CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "7860"]
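Because the snapshot is baked in during `docker build`, a restarted container should find the files already in the local Hugging Face cache. A minimal sketch of a startup check (hypothetical helper, not part of this commit; assumes the default cache location inside the image):

# verify_cache.py - hypothetical startup check, not part of this commit
from huggingface_hub import snapshot_download

# local_files_only=True resolves the snapshot from the cache populated at
# build time and raises instead of touching the network if it is missing.
path = snapshot_download(
    repo_id="sail/Sailor-4B-Chat",
    revision="89a866a7041e6ec023dd462adeca8e28dd53c83e",
    local_files_only=True,
)
print(f"model already cached at {path}")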
README.md
CHANGED
@@ -15,6 +15,7 @@ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-
 poetry export -f requirements.txt --output requirements.txt --without-hashes
 ```
 
+* The `HUGGING_FACE_HUB_TOKEN` and `HF_TOKEN` environment variables must exist at runtime (use the same value for both; the token must have read permission on the model).
 
 ## VLLM OpenAI Compatible API Server
 
@@ -27,7 +28,7 @@ Fixes:
 
 This `api_server.py` file is an exact copy of https://github.com/vllm-project/vllm/blob/v0.6.4.post1/vllm/entrypoints/openai/api_server.py
 
-
+
 
 ## Documentation about config
 
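To make the README's runtime requirement explicit, a startup script could fail fast when either variable is missing. A small sketch (hypothetical, not part of this commit):

import os

# The README requires both variables at runtime, set to the same token value.
for var in ("HUGGING_FACE_HUB_TOKEN", "HF_TOKEN"):
    if not (os.getenv(var) or "").strip():
        raise RuntimeError(f"{var} must be set at runtime")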
download_model.py
ADDED
@@ -0,0 +1,13 @@
+import os
+from huggingface_hub import snapshot_download
+
+# os.getenv returns None when HF_TOKEN is unset, so default to "" before stripping.
+hf_token: str = (os.getenv("HF_TOKEN") or "").strip()
+if hf_token == "":
+    raise ValueError("HF_TOKEN is empty")
+
+snapshot_download(
+    repo_id="sail/Sailor-4B-Chat",
+    revision="89a866a7041e6ec023dd462adeca8e28dd53c83e",
+    token=hf_token,
+)
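The pinned `revision` here matches the `--revision` flag in run.sh, so the files fetched at build time are exactly the ones the server loads. `snapshot_download` also returns the local snapshot directory, which could be logged for debugging (a sketch, not in the commit):

# Sketch: capture and log the directory the snapshot was materialized into.
local_dir = snapshot_download(
    repo_id="sail/Sailor-4B-Chat",
    revision="89a866a7041e6ec023dd462adeca8e28dd53c83e",
    token=hf_token,
)
print(f"downloaded snapshot to {local_dir}")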
run.sh
CHANGED
@@ -20,8 +20,6 @@ python -u /app/openai_compatible_api_server.py \
     --revision 89a866a7041e6ec023dd462adeca8e28dd53c83e \
     --host 0.0.0.0 \
     --port 7860 \
-    --max-num-batched-tokens 32768 \
-    --max-model-len 32768 \
     --dtype half \
     --enforce-eager \
     --gpu-memory-utilization 0.85
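The surviving flags correspond to same-named engine arguments in vLLM; dropping `--max-num-batched-tokens` and `--max-model-len` lets vLLM fall back to its defaults (the maximum model length is read from the model config). A rough offline equivalent using vLLM's Python API (a sketch of the flag mapping, not how this Space actually launches the server):

from vllm import LLM

# Same knobs as run.sh; max_model_len and max_num_batched_tokens are omitted
# so vLLM falls back to defaults derived from the model config.
llm = LLM(
    model="sail/Sailor-4B-Chat",
    revision="89a866a7041e6ec023dd462adeca8e28dd53c83e",
    dtype="half",
    enforce_eager=True,
    gpu_memory_utilization=0.85,
)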