kishizaki-sci committed on
Commit e74261e · verified · 1 Parent(s): 958af28

Upload inference_vLLM.ipynb

Files changed (1)
  1. inference_vLLM.ipynb +1349 -0
inference_vLLM.ipynb ADDED
@@ -0,0 +1,1349 @@
+ {
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "e0538a90-61d8-4bd0-b2f7-e08e69b32295",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Collecting vllm\n",
+ " Downloading vllm-0.6.4.post1-cp38-abi3-manylinux1_x86_64.whl.metadata (10 kB)\n",
+ "Requirement already satisfied: psutil in /usr/local/lib/python3.11/dist-packages (from vllm) (6.0.0)\n",
+ "Collecting sentencepiece (from vllm)\n",
+ " Downloading sentencepiece-0.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (7.7 kB)\n",
+ "Requirement already satisfied: numpy<2.0.0 in /usr/local/lib/python3.11/dist-packages (from vllm) (1.26.3)\n",
+ "Requirement already satisfied: requests>=2.26.0 in /usr/local/lib/python3.11/dist-packages (from vllm) (2.32.3)\n",
+ "Requirement already satisfied: tqdm in /usr/local/lib/python3.11/dist-packages (from vllm) (4.67.1)\n",
+ "Collecting py-cpuinfo (from vllm)\n",
+ " Downloading py_cpuinfo-9.0.0-py3-none-any.whl.metadata (794 bytes)\n",
+ "Requirement already satisfied: transformers>=4.45.2 in /usr/local/lib/python3.11/dist-packages (from vllm) (4.47.0.dev0)\n",
+ "Requirement already satisfied: tokenizers>=0.19.1 in /usr/local/lib/python3.11/dist-packages (from vllm) (0.20.3)\n",
+ "Collecting protobuf (from vllm)\n",
+ " Downloading protobuf-5.29.1-cp38-abi3-manylinux2014_x86_64.whl.metadata (592 bytes)\n",
+ "Requirement already satisfied: aiohttp in /usr/local/lib/python3.11/dist-packages (from vllm) (3.11.8)\n",
+ "Collecting openai>=1.45.0 (from vllm)\n",
+ " Downloading openai-1.57.0-py3-none-any.whl.metadata (24 kB)\n",
+ "Collecting uvicorn[standard] (from vllm)\n",
+ " Downloading uvicorn-0.32.1-py3-none-any.whl.metadata (6.6 kB)\n",
+ "Collecting pydantic>=2.9 (from vllm)\n",
+ " Downloading pydantic-2.10.3-py3-none-any.whl.metadata (172 kB)\n",
+ "Requirement already satisfied: pillow in /usr/local/lib/python3.11/dist-packages (from vllm) (10.2.0)\n",
+ "Requirement already satisfied: prometheus-client>=0.18.0 in /usr/local/lib/python3.11/dist-packages (from vllm) (0.21.0)\n",
+ "Collecting prometheus-fastapi-instrumentator>=7.0.0 (from vllm)\n",
+ " Downloading prometheus_fastapi_instrumentator-7.0.0-py3-none-any.whl.metadata (13 kB)\n",
+ "Collecting tiktoken>=0.6.0 (from vllm)\n",
+ " Downloading tiktoken-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)\n",
+ "Collecting lm-format-enforcer<0.11,>=0.10.9 (from vllm)\n",
+ " Downloading lm_format_enforcer-0.10.9-py3-none-any.whl.metadata (17 kB)\n",
+ "Collecting outlines<0.1,>=0.0.43 (from vllm)\n",
+ " Downloading outlines-0.0.46-py3-none-any.whl.metadata (15 kB)\n",
+ "Collecting typing-extensions>=4.10 (from vllm)\n",
+ " Downloading typing_extensions-4.12.2-py3-none-any.whl.metadata (3.0 kB)\n",
+ "Requirement already satisfied: filelock>=3.10.4 in /usr/local/lib/python3.11/dist-packages (from vllm) (3.13.1)\n",
+ "Collecting partial-json-parser (from vllm)\n",
+ " Downloading partial_json_parser-0.2.1.1.post4-py3-none-any.whl.metadata (6.2 kB)\n",
+ "Requirement already satisfied: pyzmq in /usr/local/lib/python3.11/dist-packages (from vllm) (24.0.1)\n",
+ "Collecting msgspec (from vllm)\n",
+ " Downloading msgspec-0.18.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.9 kB)\n",
+ "Collecting gguf==0.10.0 (from vllm)\n",
+ " Downloading gguf-0.10.0-py3-none-any.whl.metadata (3.5 kB)\n",
+ "Requirement already satisfied: importlib-metadata in /usr/lib/python3/dist-packages (from vllm) (4.6.4)\n",
+ "Collecting mistral-common>=1.5.0 (from mistral-common[opencv]>=1.5.0->vllm)\n",
+ " Downloading mistral_common-1.5.1-py3-none-any.whl.metadata (4.6 kB)\n",
+ "Requirement already satisfied: pyyaml in /usr/local/lib/python3.11/dist-packages (from vllm) (6.0.2)\n",
+ "Collecting einops (from vllm)\n",
+ " Downloading einops-0.8.0-py3-none-any.whl.metadata (12 kB)\n",
+ "Collecting compressed-tensors==0.8.0 (from vllm)\n",
+ " Downloading compressed_tensors-0.8.0-py3-none-any.whl.metadata (6.8 kB)\n",
+ "Collecting ray>=2.9 (from vllm)\n",
+ " Downloading ray-2.40.0-cp311-cp311-manylinux2014_x86_64.whl.metadata (17 kB)\n",
+ "Collecting nvidia-ml-py>=12.560.30 (from vllm)\n",
+ " Downloading nvidia_ml_py-12.560.30-py3-none-any.whl.metadata (8.6 kB)\n",
+ "Collecting torch==2.5.1 (from vllm)\n",
+ " Downloading torch-2.5.1-cp311-cp311-manylinux1_x86_64.whl.metadata (28 kB)\n",
+ "Collecting torchvision==0.20.1 (from vllm)\n",
+ " Downloading torchvision-0.20.1-cp311-cp311-manylinux1_x86_64.whl.metadata (6.1 kB)\n",
+ "Collecting xformers==0.0.28.post3 (from vllm)\n",
+ " Downloading xformers-0.0.28.post3-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (1.0 kB)\n",
+ "Collecting fastapi!=0.113.*,!=0.114.0,>=0.107.0 (from vllm)\n",
+ " Downloading fastapi-0.115.6-py3-none-any.whl.metadata (27 kB)\n",
+ "Requirement already satisfied: networkx in /usr/local/lib/python3.11/dist-packages (from torch==2.5.1->vllm) (3.2.1)\n",
+ "Requirement already satisfied: jinja2 in /usr/local/lib/python3.11/dist-packages (from torch==2.5.1->vllm) (3.1.3)\n",
+ "Requirement already satisfied: fsspec in /usr/local/lib/python3.11/dist-packages (from torch==2.5.1->vllm) (2024.2.0)\n",
+ "Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch==2.5.1->vllm)\n",
+ " Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\n",
+ "Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch==2.5.1->vllm)\n",
+ " Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\n",
+ "Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch==2.5.1->vllm)\n",
+ " Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)\n",
+ "Requirement already satisfied: nvidia-cudnn-cu12==9.1.0.70 in /usr/local/lib/python3.11/dist-packages (from torch==2.5.1->vllm) (9.1.0.70)\n",
+ "Collecting nvidia-cublas-cu12==12.4.5.8 (from torch==2.5.1->vllm)\n",
+ " Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\n",
+ "Collecting nvidia-cufft-cu12==11.2.1.3 (from torch==2.5.1->vllm)\n",
+ " Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\n",
+ "Collecting nvidia-curand-cu12==10.3.5.147 (from torch==2.5.1->vllm)\n",
+ " Downloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\n",
+ "Collecting nvidia-cusolver-cu12==11.6.1.9 (from torch==2.5.1->vllm)\n",
+ " Downloading nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)\n",
+ "Collecting nvidia-cusparse-cu12==12.3.1.170 (from torch==2.5.1->vllm)\n",
+ " Downloading nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)\n",
+ "Collecting nvidia-nccl-cu12==2.21.5 (from torch==2.5.1->vllm)\n",
+ " Downloading nvidia_nccl_cu12-2.21.5-py3-none-manylinux2014_x86_64.whl.metadata (1.8 kB)\n",
+ "Collecting nvidia-nvtx-cu12==12.4.127 (from torch==2.5.1->vllm)\n",
+ " Downloading nvidia_nvtx_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.7 kB)\n",
+ "Collecting nvidia-nvjitlink-cu12==12.4.127 (from torch==2.5.1->vllm)\n",
+ " Downloading nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)\n",
+ "Collecting triton==3.1.0 (from torch==2.5.1->vllm)\n",
+ " Downloading triton-3.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (1.3 kB)\n",
+ "Collecting sympy==1.13.1 (from torch==2.5.1->vllm)\n",
+ " Downloading sympy-1.13.1-py3-none-any.whl.metadata (12 kB)\n",
+ "Requirement already satisfied: mpmath<1.4,>=1.1.0 in /usr/local/lib/python3.11/dist-packages (from sympy==1.13.1->torch==2.5.1->vllm) (1.3.0)\n",
+ "Collecting starlette<0.42.0,>=0.40.0 (from fastapi!=0.113.*,!=0.114.0,>=0.107.0->vllm)\n",
+ " Downloading starlette-0.41.3-py3-none-any.whl.metadata (6.0 kB)\n",
+ "Collecting interegular>=0.3.2 (from lm-format-enforcer<0.11,>=0.10.9->vllm)\n",
+ " Downloading interegular-0.3.3-py37-none-any.whl.metadata (3.0 kB)\n",
+ "Requirement already satisfied: packaging in /usr/local/lib/python3.11/dist-packages (from lm-format-enforcer<0.11,>=0.10.9->vllm) (24.1)\n",
+ "Requirement already satisfied: jsonschema<5.0.0,>=4.21.1 in /usr/local/lib/python3.11/dist-packages (from mistral-common>=1.5.0->mistral-common[opencv]>=1.5.0->vllm) (4.23.0)\n",
+ "Collecting pillow (from vllm)\n",
+ " Downloading pillow-10.4.0-cp311-cp311-manylinux_2_28_x86_64.whl.metadata (9.2 kB)\n",
+ "Collecting tiktoken>=0.6.0 (from vllm)\n",
+ " Downloading tiktoken-0.7.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)\n",
+ "Collecting opencv-python-headless<5.0.0,>=4.0.0 (from mistral-common[opencv]>=1.5.0->vllm)\n",
+ " Downloading opencv_python_headless-4.10.0.84-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (20 kB)\n",
+ "Requirement already satisfied: anyio<5,>=3.5.0 in /usr/local/lib/python3.11/dist-packages (from openai>=1.45.0->vllm) (4.6.0)\n",
+ "Requirement already satisfied: distro<2,>=1.7.0 in /usr/lib/python3/dist-packages (from openai>=1.45.0->vllm) (1.7.0)\n",
+ "Requirement already satisfied: httpx<1,>=0.23.0 in /usr/local/lib/python3.11/dist-packages (from openai>=1.45.0->vllm) (0.27.2)\n",
+ "Collecting jiter<1,>=0.4.0 (from openai>=1.45.0->vllm)\n",
+ " Downloading jiter-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (5.2 kB)\n",
+ "Requirement already satisfied: sniffio in /usr/local/lib/python3.11/dist-packages (from openai>=1.45.0->vllm) (1.3.1)\n",
+ "Collecting lark (from outlines<0.1,>=0.0.43->vllm)\n",
+ " Downloading lark-1.2.2-py3-none-any.whl.metadata (1.8 kB)\n",
+ "Requirement already satisfied: nest-asyncio in /usr/local/lib/python3.11/dist-packages (from outlines<0.1,>=0.0.43->vllm) (1.6.0)\n",
+ "Collecting cloudpickle (from outlines<0.1,>=0.0.43->vllm)\n",
+ " Downloading cloudpickle-3.1.0-py3-none-any.whl.metadata (7.0 kB)\n",
+ "Collecting diskcache (from outlines<0.1,>=0.0.43->vllm)\n",
+ " Downloading diskcache-5.6.3-py3-none-any.whl.metadata (20 kB)\n",
+ "Collecting numba (from outlines<0.1,>=0.0.43->vllm)\n",
+ " Downloading numba-0.60.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (2.7 kB)\n",
+ "Requirement already satisfied: referencing in /usr/local/lib/python3.11/dist-packages (from outlines<0.1,>=0.0.43->vllm) (0.35.1)\n",
+ "Requirement already satisfied: datasets in /usr/local/lib/python3.11/dist-packages (from outlines<0.1,>=0.0.43->vllm) (3.1.0)\n",
+ "Collecting pycountry (from outlines<0.1,>=0.0.43->vllm)\n",
+ " Downloading pycountry-24.6.1-py3-none-any.whl.metadata (12 kB)\n",
+ "Collecting pyairports (from outlines<0.1,>=0.0.43->vllm)\n",
+ " Downloading pyairports-2.1.1-py3-none-any.whl.metadata (1.7 kB)\n",
+ "Collecting annotated-types>=0.6.0 (from pydantic>=2.9->vllm)\n",
+ " Downloading annotated_types-0.7.0-py3-none-any.whl.metadata (15 kB)\n",
+ "Collecting pydantic-core==2.27.1 (from pydantic>=2.9->vllm)\n",
+ " Downloading pydantic_core-2.27.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.6 kB)\n",
+ "Collecting click>=7.0 (from ray>=2.9->vllm)\n",
+ " Downloading click-8.1.7-py3-none-any.whl.metadata (3.0 kB)\n",
+ "Collecting msgpack<2.0.0,>=1.0.0 (from ray>=2.9->vllm)\n",
+ " Downloading msgpack-1.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (8.4 kB)\n",
+ "Requirement already satisfied: aiosignal in /usr/local/lib/python3.11/dist-packages (from ray>=2.9->vllm) (1.3.1)\n",
+ "Requirement already satisfied: frozenlist in /usr/local/lib/python3.11/dist-packages (from ray>=2.9->vllm) (1.5.0)\n",
+ "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.11/dist-packages (from requests>=2.26.0->vllm) (3.3.2)\n",
+ "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.11/dist-packages (from requests>=2.26.0->vllm) (3.10)\n",
+ "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.11/dist-packages (from requests>=2.26.0->vllm) (2.2.3)\n",
+ "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.11/dist-packages (from requests>=2.26.0->vllm) (2024.8.30)\n",
+ "Requirement already satisfied: regex>=2022.1.18 in /usr/local/lib/python3.11/dist-packages (from tiktoken>=0.6.0->vllm) (2024.11.6)\n",
+ "Requirement already satisfied: huggingface-hub<1.0,>=0.16.4 in /usr/local/lib/python3.11/dist-packages (from tokenizers>=0.19.1->vllm) (0.26.3)\n",
+ "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.11/dist-packages (from transformers>=4.45.2->vllm) (0.4.5)\n",
+ "Requirement already satisfied: aiohappyeyeballs>=2.3.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->vllm) (2.4.3)\n",
+ "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->vllm) (24.2.0)\n",
+ "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.11/dist-packages (from aiohttp->vllm) (6.1.0)\n",
+ "Requirement already satisfied: propcache>=0.2.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->vllm) (0.2.0)\n",
+ "Requirement already satisfied: yarl<2.0,>=1.17.0 in /usr/local/lib/python3.11/dist-packages (from aiohttp->vllm) (1.18.0)\n",
+ "Requirement already satisfied: h11>=0.8 in /usr/local/lib/python3.11/dist-packages (from uvicorn[standard]->vllm) (0.14.0)\n",
+ "Collecting httptools>=0.6.3 (from uvicorn[standard]->vllm)\n",
+ " Downloading httptools-0.6.4-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.6 kB)\n",
+ "Collecting python-dotenv>=0.13 (from uvicorn[standard]->vllm)\n",
+ " Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)\n",
+ "Collecting uvloop!=0.15.0,!=0.15.1,>=0.14.0 (from uvicorn[standard]->vllm)\n",
+ " Downloading uvloop-0.21.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)\n",
+ "Collecting watchfiles>=0.13 (from uvicorn[standard]->vllm)\n",
+ " Downloading watchfiles-1.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)\n",
+ "Collecting websockets>=10.4 (from uvicorn[standard]->vllm)\n",
+ " Downloading websockets-14.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (6.7 kB)\n",
+ "Requirement already satisfied: httpcore==1.* in /usr/local/lib/python3.11/dist-packages (from httpx<1,>=0.23.0->openai>=1.45.0->vllm) (1.0.5)\n",
+ "Requirement already satisfied: jsonschema-specifications>=2023.03.6 in /usr/local/lib/python3.11/dist-packages (from jsonschema<5.0.0,>=4.21.1->mistral-common>=1.5.0->mistral-common[opencv]>=1.5.0->vllm) (2023.12.1)\n",
+ "Requirement already satisfied: rpds-py>=0.7.1 in /usr/local/lib/python3.11/dist-packages (from jsonschema<5.0.0,>=4.21.1->mistral-common>=1.5.0->mistral-common[opencv]>=1.5.0->vllm) (0.20.0)\n",
+ "Requirement already satisfied: pyarrow>=15.0.0 in /usr/local/lib/python3.11/dist-packages (from datasets->outlines<0.1,>=0.0.43->vllm) (18.1.0)\n",
+ "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.11/dist-packages (from datasets->outlines<0.1,>=0.0.43->vllm) (0.3.8)\n",
+ "Requirement already satisfied: pandas in /usr/local/lib/python3.11/dist-packages (from datasets->outlines<0.1,>=0.0.43->vllm) (2.2.3)\n",
+ "Requirement already satisfied: xxhash in /usr/local/lib/python3.11/dist-packages (from datasets->outlines<0.1,>=0.0.43->vllm) (3.5.0)\n",
+ "Requirement already satisfied: multiprocess<0.70.17 in /usr/local/lib/python3.11/dist-packages (from datasets->outlines<0.1,>=0.0.43->vllm) (0.70.16)\n",
+ "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.11/dist-packages (from jinja2->torch==2.5.1->vllm) (2.1.5)\n",
+ "Collecting llvmlite<0.44,>=0.43.0dev0 (from numba->outlines<0.1,>=0.0.43->vllm)\n",
+ " Downloading llvmlite-0.43.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.8 kB)\n",
+ "Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas->datasets->outlines<0.1,>=0.0.43->vllm) (2.9.0.post0)\n",
+ "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas->datasets->outlines<0.1,>=0.0.43->vllm) (2024.2)\n",
+ "Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas->datasets->outlines<0.1,>=0.0.43->vllm) (2024.2)\n",
+ "Requirement already satisfied: six>=1.5 in /usr/lib/python3/dist-packages (from python-dateutil>=2.8.2->pandas->datasets->outlines<0.1,>=0.0.43->vllm) (1.16.0)\n",
+ "Downloading vllm-0.6.4.post1-cp38-abi3-manylinux1_x86_64.whl (198.9 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m198.9/198.9 MB\u001b[0m \u001b[31m117.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading compressed_tensors-0.8.0-py3-none-any.whl (86 kB)\n",
+ "Downloading gguf-0.10.0-py3-none-any.whl (71 kB)\n",
+ "Downloading torch-2.5.1-cp311-cp311-manylinux1_x86_64.whl (906.5 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m906.5/906.5 MB\u001b[0m \u001b[31m85.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading torchvision-0.20.1-cp311-cp311-manylinux1_x86_64.whl (7.2 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m7.2/7.2 MB\u001b[0m \u001b[31m111.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading xformers-0.0.28.post3-cp311-cp311-manylinux_2_28_x86_64.whl (16.7 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m16.7/16.7 MB\u001b[0m \u001b[31m105.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl (363.4 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m363.4/363.4 MB\u001b[0m \u001b[31m178.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (13.8 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m13.8/13.8 MB\u001b[0m \u001b[31m61.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (24.6 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m24.6/24.6 MB\u001b[0m \u001b[31m85.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0ma \u001b[36m0:00:01\u001b[0m\n",
+ "\u001b[?25hDownloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (883 kB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m883.7/883.7 kB\u001b[0m \u001b[31m116.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl (211.5 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m211.5/211.5 MB\u001b[0m \u001b[31m152.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading nvidia_curand_cu12-10.3.5.147-py3-none-manylinux2014_x86_64.whl (56.3 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m56.3/56.3 MB\u001b[0m \u001b[31m127.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading nvidia_cusolver_cu12-11.6.1.9-py3-none-manylinux2014_x86_64.whl (127.9 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m127.9/127.9 MB\u001b[0m \u001b[31m145.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading nvidia_cusparse_cu12-12.3.1.170-py3-none-manylinux2014_x86_64.whl (207.5 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m207.5/207.5 MB\u001b[0m \u001b[31m134.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading nvidia_nccl_cu12-2.21.5-py3-none-manylinux2014_x86_64.whl (188.7 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m188.7/188.7 MB\u001b[0m \u001b[31m142.5 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading nvidia_nvjitlink_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (21.1 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m21.1/21.1 MB\u001b[0m \u001b[31m151.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading nvidia_nvtx_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl (99 kB)\n",
+ "Downloading sympy-1.13.1-py3-none-any.whl (6.2 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.2/6.2 MB\u001b[0m \u001b[31m161.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading triton-3.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (209.5 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m209.5/209.5 MB\u001b[0m \u001b[31m88.6 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading fastapi-0.115.6-py3-none-any.whl (94 kB)\n",
+ "Downloading lm_format_enforcer-0.10.9-py3-none-any.whl (43 kB)\n",
+ "Downloading mistral_common-1.5.1-py3-none-any.whl (6.5 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.5/6.5 MB\u001b[0m \u001b[31m167.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading sentencepiece-0.2.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.3/1.3 MB\u001b[0m \u001b[31m146.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading nvidia_ml_py-12.560.30-py3-none-any.whl (40 kB)\n",
+ "Downloading openai-1.57.0-py3-none-any.whl (389 kB)\n",
+ "Downloading outlines-0.0.46-py3-none-any.whl (101 kB)\n",
+ "Downloading pillow-10.4.0-cp311-cp311-manylinux_2_28_x86_64.whl (4.5 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.5/4.5 MB\u001b[0m \u001b[31m148.7 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading prometheus_fastapi_instrumentator-7.0.0-py3-none-any.whl (19 kB)\n",
+ "Downloading pydantic-2.10.3-py3-none-any.whl (456 kB)\n",
+ "Downloading pydantic_core-2.27.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.1 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m2.1/2.1 MB\u001b[0m \u001b[31m180.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading ray-2.40.0-cp311-cp311-manylinux2014_x86_64.whl (67.0 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m67.0/67.0 MB\u001b[0m \u001b[31m151.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading protobuf-5.29.1-cp38-abi3-manylinux2014_x86_64.whl (319 kB)\n",
+ "Downloading tiktoken-0.7.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.1 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m1.1/1.1 MB\u001b[0m \u001b[31m153.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading typing_extensions-4.12.2-py3-none-any.whl (37 kB)\n",
+ "Downloading einops-0.8.0-py3-none-any.whl (43 kB)\n",
+ "Downloading msgspec-0.18.6-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (209 kB)\n",
+ "Downloading partial_json_parser-0.2.1.1.post4-py3-none-any.whl (9.9 kB)\n",
+ "Downloading py_cpuinfo-9.0.0-py3-none-any.whl (22 kB)\n",
+ "Downloading annotated_types-0.7.0-py3-none-any.whl (13 kB)\n",
+ "Downloading click-8.1.7-py3-none-any.whl (97 kB)\n",
+ "Downloading httptools-0.6.4-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (459 kB)\n",
+ "Downloading interegular-0.3.3-py37-none-any.whl (23 kB)\n",
+ "Downloading jiter-0.8.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (343 kB)\n",
+ "Downloading msgpack-1.1.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (403 kB)\n",
+ "Downloading opencv_python_headless-4.10.0.84-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (49.9 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m49.9/49.9 MB\u001b[0m \u001b[31m117.2 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hDownloading python_dotenv-1.0.1-py3-none-any.whl (19 kB)\n",
+ "Downloading starlette-0.41.3-py3-none-any.whl (73 kB)\n",
+ "Downloading uvloop-0.21.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (4.0 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m4.0/4.0 MB\u001b[0m \u001b[31m160.0 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading watchfiles-1.0.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (442 kB)\n",
+ "Downloading websockets-14.1-cp311-cp311-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (168 kB)\n",
+ "Downloading cloudpickle-3.1.0-py3-none-any.whl (22 kB)\n",
+ "Downloading diskcache-5.6.3-py3-none-any.whl (45 kB)\n",
+ "Downloading lark-1.2.2-py3-none-any.whl (111 kB)\n",
+ "Downloading numba-0.60.0-cp311-cp311-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (3.7 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m3.7/3.7 MB\u001b[0m \u001b[31m132.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading pyairports-2.1.1-py3-none-any.whl (371 kB)\n",
+ "Downloading pycountry-24.6.1-py3-none-any.whl (6.3 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m6.3/6.3 MB\u001b[0m \u001b[31m130.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+ "\u001b[?25hDownloading uvicorn-0.32.1-py3-none-any.whl (63 kB)\n",
+ "Downloading llvmlite-0.43.0-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (43.9 MB)\n",
+ "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m43.9/43.9 MB\u001b[0m \u001b[31m168.3 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m\n",
+ "\u001b[?25hInstalling collected packages: sentencepiece, pyairports, py-cpuinfo, nvidia-ml-py, websockets, uvloop, typing-extensions, triton, sympy, python-dotenv, pycountry, protobuf, pillow, partial-json-parser, opencv-python-headless, nvidia-nvtx-cu12, nvidia-nvjitlink-cu12, nvidia-nccl-cu12, nvidia-curand-cu12, nvidia-cufft-cu12, nvidia-cuda-runtime-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-cupti-cu12, nvidia-cublas-cu12, msgspec, msgpack, llvmlite, lark, jiter, interegular, httptools, gguf, einops, diskcache, cloudpickle, click, annotated-types, watchfiles, uvicorn, tiktoken, starlette, pydantic-core, nvidia-cusparse-cu12, numba, pydantic, prometheus-fastapi-instrumentator, nvidia-cusolver-cu12, torch, ray, openai, mistral-common, lm-format-enforcer, fastapi, xformers, torchvision, outlines, compressed-tensors, vllm\n",
+ " Attempting uninstall: typing-extensions\n",
+ " Found existing installation: typing_extensions 4.9.0\n",
+ " Uninstalling typing_extensions-4.9.0:\n",
+ " Successfully uninstalled typing_extensions-4.9.0\n",
+ " Attempting uninstall: triton\n",
+ " Found existing installation: triton 3.0.0\n",
+ " Uninstalling triton-3.0.0:\n",
+ " Successfully uninstalled triton-3.0.0\n",
+ " Attempting uninstall: sympy\n",
+ " Found existing installation: sympy 1.12\n",
282
+ " Uninstalling sympy-1.12:\n",
283
+ " Successfully uninstalled sympy-1.12\n",
284
+ " Attempting uninstall: pillow\n",
285
+ " Found existing installation: pillow 10.2.0\n",
286
+ " Uninstalling pillow-10.2.0:\n",
287
+ " Successfully uninstalled pillow-10.2.0\n",
288
+ " Attempting uninstall: nvidia-nvtx-cu12\n",
289
+ " Found existing installation: nvidia-nvtx-cu12 12.4.99\n",
290
+ " Uninstalling nvidia-nvtx-cu12-12.4.99:\n",
291
+ " Successfully uninstalled nvidia-nvtx-cu12-12.4.99\n",
292
+ " Attempting uninstall: nvidia-nvjitlink-cu12\n",
293
+ " Found existing installation: nvidia-nvjitlink-cu12 12.4.99\n",
294
+ " Uninstalling nvidia-nvjitlink-cu12-12.4.99:\n",
295
+ " Successfully uninstalled nvidia-nvjitlink-cu12-12.4.99\n",
296
+ " Attempting uninstall: nvidia-nccl-cu12\n",
297
+ " Found existing installation: nvidia-nccl-cu12 2.20.5\n",
298
+ " Uninstalling nvidia-nccl-cu12-2.20.5:\n",
299
+ " Successfully uninstalled nvidia-nccl-cu12-2.20.5\n",
300
+ " Attempting uninstall: nvidia-curand-cu12\n",
301
+ " Found existing installation: nvidia-curand-cu12 10.3.5.119\n",
302
+ " Uninstalling nvidia-curand-cu12-10.3.5.119:\n",
303
+ " Successfully uninstalled nvidia-curand-cu12-10.3.5.119\n",
304
+ " Attempting uninstall: nvidia-cufft-cu12\n",
305
+ " Found existing installation: nvidia-cufft-cu12 11.2.0.44\n",
306
+ " Uninstalling nvidia-cufft-cu12-11.2.0.44:\n",
307
+ " Successfully uninstalled nvidia-cufft-cu12-11.2.0.44\n",
308
+ " Attempting uninstall: nvidia-cuda-runtime-cu12\n",
309
+ " Found existing installation: nvidia-cuda-runtime-cu12 12.4.99\n",
310
+ " Uninstalling nvidia-cuda-runtime-cu12-12.4.99:\n",
311
+ " Successfully uninstalled nvidia-cuda-runtime-cu12-12.4.99\n",
312
+ " Attempting uninstall: nvidia-cuda-nvrtc-cu12\n",
313
+ " Found existing installation: nvidia-cuda-nvrtc-cu12 12.4.99\n",
314
+ " Uninstalling nvidia-cuda-nvrtc-cu12-12.4.99:\n",
315
+ " Successfully uninstalled nvidia-cuda-nvrtc-cu12-12.4.99\n",
316
+ " Attempting uninstall: nvidia-cuda-cupti-cu12\n",
317
+ " Found existing installation: nvidia-cuda-cupti-cu12 12.4.99\n",
318
+ " Uninstalling nvidia-cuda-cupti-cu12-12.4.99:\n",
319
+ " Successfully uninstalled nvidia-cuda-cupti-cu12-12.4.99\n",
320
+ " Attempting uninstall: nvidia-cublas-cu12\n",
321
+ " Found existing installation: nvidia-cublas-cu12 12.4.2.65\n",
322
+ " Uninstalling nvidia-cublas-cu12-12.4.2.65:\n",
323
+ " Successfully uninstalled nvidia-cublas-cu12-12.4.2.65\n",
324
+ " Attempting uninstall: nvidia-cusparse-cu12\n",
325
+ " Found existing installation: nvidia-cusparse-cu12 12.3.0.142\n",
326
+ " Uninstalling nvidia-cusparse-cu12-12.3.0.142:\n",
327
+ " Successfully uninstalled nvidia-cusparse-cu12-12.3.0.142\n",
328
+ " Attempting uninstall: nvidia-cusolver-cu12\n",
329
+ " Found existing installation: nvidia-cusolver-cu12 11.6.0.99\n",
330
+ " Uninstalling nvidia-cusolver-cu12-11.6.0.99:\n",
331
+ " Successfully uninstalled nvidia-cusolver-cu12-11.6.0.99\n",
332
+ " Attempting uninstall: torch\n",
333
+ " Found existing installation: torch 2.4.1+cu124\n",
334
+ " Uninstalling torch-2.4.1+cu124:\n",
335
+ " Successfully uninstalled torch-2.4.1+cu124\n",
336
+ " Attempting uninstall: torchvision\n",
337
+ " Found existing installation: torchvision 0.19.1+cu124\n",
338
+ " Uninstalling torchvision-0.19.1+cu124:\n",
339
+ " Successfully uninstalled torchvision-0.19.1+cu124\n",
340
+ "\u001b[31mERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.\n",
341
+ "torchaudio 2.4.1+cu124 requires torch==2.4.1, but you have torch 2.5.1 which is incompatible.\u001b[0m\u001b[31m\n",
342
+ "\u001b[0mSuccessfully installed annotated-types-0.7.0 click-8.1.7 cloudpickle-3.1.0 compressed-tensors-0.8.0 diskcache-5.6.3 einops-0.8.0 fastapi-0.115.6 gguf-0.10.0 httptools-0.6.4 interegular-0.3.3 jiter-0.8.0 lark-1.2.2 llvmlite-0.43.0 lm-format-enforcer-0.10.9 mistral-common-1.5.1 msgpack-1.1.0 msgspec-0.18.6 numba-0.60.0 nvidia-cublas-cu12-12.4.5.8 nvidia-cuda-cupti-cu12-12.4.127 nvidia-cuda-nvrtc-cu12-12.4.127 nvidia-cuda-runtime-cu12-12.4.127 nvidia-cufft-cu12-11.2.1.3 nvidia-curand-cu12-10.3.5.147 nvidia-cusolver-cu12-11.6.1.9 nvidia-cusparse-cu12-12.3.1.170 nvidia-ml-py-12.560.30 nvidia-nccl-cu12-2.21.5 nvidia-nvjitlink-cu12-12.4.127 nvidia-nvtx-cu12-12.4.127 openai-1.57.0 opencv-python-headless-4.10.0.84 outlines-0.0.46 partial-json-parser-0.2.1.1.post4 pillow-10.4.0 prometheus-fastapi-instrumentator-7.0.0 protobuf-5.29.1 py-cpuinfo-9.0.0 pyairports-2.1.1 pycountry-24.6.1 pydantic-2.10.3 pydantic-core-2.27.1 python-dotenv-1.0.1 ray-2.40.0 sentencepiece-0.2.0 starlette-0.41.3 sympy-1.13.1 tiktoken-0.7.0 torch-2.5.1 torchvision-0.20.1 triton-3.1.0 typing-extensions-4.12.2 uvicorn-0.32.1 uvloop-0.21.0 vllm-0.6.4.post1 watchfiles-1.0.0 websockets-14.1 xformers-0.0.28.post3\n",
343
+ "\u001b[33mWARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager, possibly rendering your system unusable.It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv. Use the --root-user-action option if you know what you are doing and want to suppress this warning.\u001b[0m\u001b[33m\n",
344
+ "\u001b[0m\n",
345
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.2\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m24.3.1\u001b[0m\n",
346
+ "\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpython -m pip install --upgrade pip\u001b[0m\n"
347
+ ]
348
+ }
349
+ ],
350
+ "source": [
351
+ "!pip install vllm"
352
+ ]
353
+ },
354
+ {
355
+ "cell_type": "code",
356
+ "execution_count": 1,
357
+ "id": "e772542e-467c-481a-9128-8364987a1bd9",
358
+ "metadata": {},
359
+ "outputs": [
360
+ {
361
+ "name": "stdout",
362
+ "output_type": "stream",
363
+ "text": [
364
+ "Sun Dec 8 01:39:25 2024 \n",
365
+ "+-----------------------------------------------------------------------------------------+\n",
366
+ "| NVIDIA-SMI 565.57.01 Driver Version: 565.57.01 CUDA Version: 12.7 |\n",
367
+ "|-----------------------------------------+------------------------+----------------------+\n",
368
+ "| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |\n",
369
+ "| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |\n",
370
+ "| | | MIG M. |\n",
371
+ "|=========================================+========================+======================|\n",
372
+ "| 0 NVIDIA H100 NVL On | 00000000:3C:00.0 Off | 0 |\n",
373
+ "| N/A 26C P0 62W / 310W | 1MiB / 95830MiB | 0% Default |\n",
374
+ "| | | Disabled |\n",
375
+ "+-----------------------------------------+------------------------+----------------------+\n",
376
+ "| 1 NVIDIA H100 NVL On | 00000000:AE:00.0 Off | 0 |\n",
377
+ "| N/A 26C P0 59W / 310W | 1MiB / 95830MiB | 0% Default |\n",
378
+ "| | | Disabled |\n",
379
+ "+-----------------------------------------+------------------------+----------------------+\n",
380
+ "| 2 NVIDIA H100 NVL On | 00000000:BD:00.0 Off | 0 |\n",
381
+ "| N/A 24C P0 60W / 310W | 1MiB / 95830MiB | 0% Default |\n",
382
+ "| | | Disabled |\n",
383
+ "+-----------------------------------------+------------------------+----------------------+\n",
384
+ "| 3 NVIDIA H100 NVL On | 00000000:BE:00.0 Off | 0 |\n",
385
+ "| N/A 26C P0 60W / 310W | 1MiB / 95830MiB | 0% Default |\n",
386
+ "| | | Disabled |\n",
387
+ "+-----------------------------------------+------------------------+----------------------+\n",
388
+ " \n",
389
+ "+-----------------------------------------------------------------------------------------+\n",
390
+ "| Processes: |\n",
391
+ "| GPU GI CI PID Type Process name GPU Memory |\n",
392
+ "| ID ID Usage |\n",
393
+ "|=========================================================================================|\n",
394
+ "| No running processes found |\n",
395
+ "+-----------------------------------------------------------------------------------------+\n"
396
+ ]
397
+ }
398
+ ],
399
+ "source": [
400
+ "!nvidia-smi"
401
+ ]
402
+ },
403
+ {
404
+ "cell_type": "code",
405
+ "execution_count": 2,
406
+ "id": "2bf7e331-4686-4c0d-ae0f-72cbb79e2e8c",
407
+ "metadata": {},
408
+ "outputs": [],
409
+ "source": [
410
+ "from vllm import LLM, SamplingParams"
411
+ ]
412
+ },
413
+ {
414
+ "cell_type": "code",
415
+ "execution_count": 3,
416
+ "id": "a51d52bc-d60e-412e-a150-20bc0526d20e",
417
+ "metadata": {},
418
+ "outputs": [
419
+ {
420
+ "data": {
421
+ "application/vnd.jupyter.widget-view+json": {
422
+ "model_id": "7c598964dfdb4818aad022b3d085af8d",
423
+ "version_major": 2,
424
+ "version_minor": 0
425
+ },
426
+ "text/plain": [
427
+ "config.json: 0%| | 0.00/1.25k [00:00<?, ?B/s]"
428
+ ]
429
+ },
430
+ "metadata": {},
431
+ "output_type": "display_data"
432
+ },
433
+ {
434
+ "name": "stdout",
435
+ "output_type": "stream",
436
+ "text": [
437
+ "INFO 12-08 01:39:53 config.py:350] This model supports multiple tasks: {'generate', 'embedding'}. Defaulting to 'generate'.\n",
438
+ "INFO 12-08 01:39:53 awq_marlin.py:113] Detected that the model can run with awq_marlin, however you specified quantization=awq explicitly, so forcing awq. Use quantization=awq_marlin for faster inference\n",
439
+ "WARNING 12-08 01:39:53 config.py:428] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.\n",
440
+ "INFO 12-08 01:39:53 config.py:1020] Defaulting to use mp for distributed inference\n",
441
+ "WARNING 12-08 01:39:53 arg_utils.py:1013] Chunked prefill is enabled by default for models with max_model_len > 32K. Currently, chunked prefill might not work with some features or models. If you encounter any issues, please disable chunked prefill by setting --enable-chunked-prefill=False.\n",
442
+ "INFO 12-08 01:39:53 config.py:1136] Chunked prefill is enabled with max_num_batched_tokens=512.\n",
443
+ "INFO 12-08 01:39:53 llm_engine.py:249] Initializing an LLM engine (v0.6.4.post1) with config: model='kishizaki-sci/Llama-3.1-405B-Instruct-AWQ-4bit-JP-EN', speculative_config=None, tokenizer='kishizaki-sci/Llama-3.1-405B-Instruct-AWQ-4bit-JP-EN', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=kishizaki-sci/Llama-3.1-405B-Instruct-AWQ-4bit-JP-EN, num_scheduler_steps=1, chunked_prefill_enabled=True multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, chat_template_text_format=string, mm_processor_kwargs=None, pooler_config=None)\n"
444
+ ]
445
+ },
446
+ {
447
+ "data": {
448
+ "application/vnd.jupyter.widget-view+json": {
449
+ "model_id": "1f7cc9b8b1b54e97a7638c1fdf2ddcaf",
450
+ "version_major": 2,
451
+ "version_minor": 0
452
+ },
453
+ "text/plain": [
454
+ "tokenizer_config.json: 0%| | 0.00/55.4k [00:00<?, ?B/s]"
455
+ ]
456
+ },
457
+ "metadata": {},
458
+ "output_type": "display_data"
459
+ },
460
+ {
461
+ "data": {
462
+ "application/vnd.jupyter.widget-view+json": {
463
+ "model_id": "ef6601b681a24fd9a2208a04352df88a",
464
+ "version_major": 2,
465
+ "version_minor": 0
466
+ },
467
+ "text/plain": [
468
+ "tokenizer.json: 0%| | 0.00/17.2M [00:00<?, ?B/s]"
469
+ ]
470
+ },
471
+ "metadata": {},
472
+ "output_type": "display_data"
473
+ },
474
+ {
475
+ "data": {
476
+ "application/vnd.jupyter.widget-view+json": {
477
+ "model_id": "06e8fb444a9847878c56f24e9856c9e2",
478
+ "version_major": 2,
479
+ "version_minor": 0
480
+ },
481
+ "text/plain": [
482
+ "special_tokens_map.json: 0%| | 0.00/325 [00:00<?, ?B/s]"
483
+ ]
484
+ },
485
+ "metadata": {},
486
+ "output_type": "display_data"
487
+ },
488
+ {
489
+ "data": {
490
+ "application/vnd.jupyter.widget-view+json": {
491
+ "model_id": "7bef29f785a341a5bada54897d06284e",
492
+ "version_major": 2,
493
+ "version_minor": 0
494
+ },
495
+ "text/plain": [
496
+ "generation_config.json: 0%| | 0.00/182 [00:00<?, ?B/s]"
497
+ ]
498
+ },
499
+ "metadata": {},
500
+ "output_type": "display_data"
501
+ },
502
+ {
503
+ "name": "stdout",
504
+ "output_type": "stream",
505
+ "text": [
506
+ "WARNING 12-08 01:39:57 multiproc_gpu_executor.py:56] Reducing Torch parallelism from 72 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.\n",
507
+ "INFO 12-08 01:39:57 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager\n",
508
+ "\u001b[1;36m(VllmWorkerProcess pid=729)\u001b[0;0m INFO 12-08 01:39:57 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n",
509
+ "INFO 12-08 01:39:57 selector.py:135] Using Flash Attention backend.\n",
510
+ "\u001b[1;36m(VllmWorkerProcess pid=729)\u001b[0;0m \u001b[1;36m(VllmWorkerProcess pid=730)\u001b[0;0m INFO 12-08 01:39:57 selector.py:135] Using Flash Attention backend.\n",
511
+ "\u001b[1;36m(VllmWorkerProcess pid=730)\u001b[0;0m INFO 12-08 01:39:57 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n",
512
+ "INFO 12-08 01:39:57 selector.py:135] Using Flash Attention backend.\n",
513
+ "\u001b[1;36m(VllmWorkerProcess pid=731)\u001b[0;0m INFO 12-08 01:39:57 selector.py:135] Using Flash Attention backend.\n",
514
+ "\u001b[1;36m(VllmWorkerProcess pid=731)\u001b[0;0m INFO 12-08 01:39:57 multiproc_worker_utils.py:215] Worker ready; awaiting tasks\n",
515
+ "INFO 12-08 01:40:00 utils.py:961] Found nccl from library libnccl.so.2\n",
516
+ "\u001b[1;36m(VllmWorkerProcess pid=730)\u001b[0;0m INFO 12-08 01:40:00 pynccl.py:69] vLLM is using nccl==2.21.5\n",
517
+ "\u001b[1;36m(VllmWorkerProcess pid=729)\u001b[0;0m \u001b[1;36m(VllmWorkerProcess pid=731)\u001b[0;0m INFO 12-08 01:40:00 utils.py:961] Found nccl from library libnccl.so.2\n",
518
+ "INFO 12-08 01:40:00 utils.py:961] Found nccl from library libnccl.so.2\n",
519
+ "INFO 12-08 01:40:00 utils.py:961] Found nccl from library libnccl.so.2\n",
520
+ "\u001b[1;36m(VllmWorkerProcess pid=729)\u001b[0;0m \u001b[1;36m(VllmWorkerProcess pid=730)\u001b[0;0m INFO 12-08 01:40:00 pynccl.py:69] vLLM is using nccl==2.21.5\n",
521
+ "INFO 12-08 01:40:00 pynccl.py:69] vLLM is using nccl==2.21.5\n",
522
+ "\u001b[1;36m(VllmWorkerProcess pid=731)\u001b[0;0m INFO 12-08 01:40:00 pynccl.py:69] vLLM is using nccl==2.21.5\n",
523
+ "WARNING 12-08 01:40:01 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.\n",
524
+ "\u001b[1;36m(VllmWorkerProcess pid=729)\u001b[0;0m \u001b[1;36m(VllmWorkerProcess pid=730)\u001b[0;0m WARNING 12-08 01:40:01 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.\n",
525
+ "WARNING 12-08 01:40:01 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.\n",
526
+ "\u001b[1;36m(VllmWorkerProcess pid=731)\u001b[0;0m WARNING 12-08 01:40:01 custom_all_reduce.py:134] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.\n",
527
+ "INFO 12-08 01:40:01 shm_broadcast.py:236] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7fe462e95610>, local_subscribe_port=36659, remote_subscribe_port=None)\n",
528
+ "INFO 12-08 01:40:01 model_runner.py:1072] Starting to load model kishizaki-sci/Llama-3.1-405B-Instruct-AWQ-4bit-JP-EN...\n",
529
+ "\u001b[1;36m(VllmWorkerProcess pid=729)\u001b[0;0m \u001b[1;36m(VllmWorkerProcess pid=730)\u001b[0;0m \u001b[1;36m(VllmWorkerProcess pid=731)\u001b[0;0m INFO 12-08 01:40:01 model_runner.py:1072] Starting to load model kishizaki-sci/Llama-3.1-405B-Instruct-AWQ-4bit-JP-EN...\n",
530
+ "INFO 12-08 01:40:01 model_runner.py:1072] Starting to load model kishizaki-sci/Llama-3.1-405B-Instruct-AWQ-4bit-JP-EN...\n",
531
+ "INFO 12-08 01:40:01 model_runner.py:1072] Starting to load model kishizaki-sci/Llama-3.1-405B-Instruct-AWQ-4bit-JP-EN...\n",
532
+ "INFO 12-08 01:40:02 weight_utils.py:243] Using model weights format ['*.safetensors']\n",
533
+ "\u001b[1;36m(VllmWorkerProcess pid=729)\u001b[0;0m INFO 12-08 01:40:02 weight_utils.py:243] Using model weights format ['*.safetensors']\n",
534
+ "\u001b[1;36m(VllmWorkerProcess pid=730)\u001b[0;0m INFO 12-08 01:40:02 weight_utils.py:243] Using model weights format ['*.safetensors']\n",
535
+ "\u001b[1;36m(VllmWorkerProcess pid=731)\u001b[0;0m INFO 12-08 01:40:02 weight_utils.py:243] Using model weights format ['*.safetensors']\n"
536
+ ]
537
+ },
538
+ {
539
+ "data": {
540
+ "application/vnd.jupyter.widget-view+json": {
541
+ "model_id": "3992e4d9fca34515910b7bf2492a3bc2",
542
+ "version_major": 2,
543
+ "version_minor": 0
544
+ },
545
+ "text/plain": [
546
+ "model-00004-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
547
+ ]
548
+ },
549
+ "metadata": {},
550
+ "output_type": "display_data"
551
+ },
552
+ {
553
+ "data": {
554
+ "application/vnd.jupyter.widget-view+json": {
555
+ "model_id": "3e0eec98a2b1418283e0809b0cf9b7ad",
556
+ "version_major": 2,
557
+ "version_minor": 0
558
+ },
559
+ "text/plain": [
560
+ "model-00002-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
561
+ ]
562
+ },
563
+ "metadata": {},
564
+ "output_type": "display_data"
565
+ },
566
+ {
567
+ "data": {
568
+ "application/vnd.jupyter.widget-view+json": {
569
+ "model_id": "aef1bf777f134def9e8fdbf6038e9b0f",
570
+ "version_major": 2,
571
+ "version_minor": 0
572
+ },
573
+ "text/plain": [
574
+ "model-00006-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
575
+ ]
576
+ },
577
+ "metadata": {},
578
+ "output_type": "display_data"
579
+ },
580
+ {
581
+ "data": {
582
+ "application/vnd.jupyter.widget-view+json": {
583
+ "model_id": "b506d6aed0024cde8f18df0a3e796fbc",
584
+ "version_major": 2,
585
+ "version_minor": 0
586
+ },
587
+ "text/plain": [
588
+ "model-00007-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
589
+ ]
590
+ },
591
+ "metadata": {},
592
+ "output_type": "display_data"
593
+ },
594
+ {
595
+ "data": {
596
+ "application/vnd.jupyter.widget-view+json": {
597
+ "model_id": "db9659785e1b49bdbc77298f8d95a746",
598
+ "version_major": 2,
599
+ "version_minor": 0
600
+ },
601
+ "text/plain": [
602
+ "model-00008-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
603
+ ]
604
+ },
605
+ "metadata": {},
606
+ "output_type": "display_data"
607
+ },
608
+ {
609
+ "data": {
610
+ "application/vnd.jupyter.widget-view+json": {
611
+ "model_id": "4ac31302b8d94e48996d65295561ea38",
612
+ "version_major": 2,
613
+ "version_minor": 0
614
+ },
615
+ "text/plain": [
616
+ "model-00003-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
617
+ ]
618
+ },
619
+ "metadata": {},
620
+ "output_type": "display_data"
621
+ },
622
+ {
623
+ "data": {
624
+ "application/vnd.jupyter.widget-view+json": {
625
+ "model_id": "34494082039e41de94d38168f54e07b4",
626
+ "version_major": 2,
627
+ "version_minor": 0
628
+ },
629
+ "text/plain": [
630
+ "model-00005-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
631
+ ]
632
+ },
633
+ "metadata": {},
634
+ "output_type": "display_data"
635
+ },
636
+ {
637
+ "data": {
638
+ "application/vnd.jupyter.widget-view+json": {
639
+ "model_id": "71affa983f584b91b98bddedbc7f894e",
640
+ "version_major": 2,
641
+ "version_minor": 0
642
+ },
643
+ "text/plain": [
644
+ "model-00001-of-00044.safetensors: 0%| | 0.00/4.95G [00:00<?, ?B/s]"
645
+ ]
646
+ },
647
+ "metadata": {},
648
+ "output_type": "display_data"
649
+ },
650
+ {
651
+ "data": {
652
+ "application/vnd.jupyter.widget-view+json": {
653
+ "model_id": "2e9c94b7b0bc43d8b22785c62d09e264",
654
+ "version_major": 2,
655
+ "version_minor": 0
656
+ },
657
+ "text/plain": [
658
+ "model-00009-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
659
+ ]
660
+ },
661
+ "metadata": {},
662
+ "output_type": "display_data"
663
+ },
664
+ {
665
+ "data": {
666
+ "application/vnd.jupyter.widget-view+json": {
667
+ "model_id": "2fddbf414af2494692fc262abf2832b7",
668
+ "version_major": 2,
669
+ "version_minor": 0
670
+ },
671
+ "text/plain": [
672
+ "model-00010-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
673
+ ]
674
+ },
675
+ "metadata": {},
676
+ "output_type": "display_data"
677
+ },
678
+ {
679
+ "data": {
680
+ "application/vnd.jupyter.widget-view+json": {
681
+ "model_id": "6e72305ba5cf40269479f81dbb25b69e",
682
+ "version_major": 2,
683
+ "version_minor": 0
684
+ },
685
+ "text/plain": [
686
+ "model-00011-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
687
+ ]
688
+ },
689
+ "metadata": {},
690
+ "output_type": "display_data"
691
+ },
692
+ {
693
+ "data": {
694
+ "application/vnd.jupyter.widget-view+json": {
695
+ "model_id": "023fe209088a42a28e7656c3da74343d",
696
+ "version_major": 2,
697
+ "version_minor": 0
698
+ },
699
+ "text/plain": [
700
+ "model-00012-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
701
+ ]
702
+ },
703
+ "metadata": {},
704
+ "output_type": "display_data"
705
+ },
706
+ {
707
+ "data": {
708
+ "application/vnd.jupyter.widget-view+json": {
709
+ "model_id": "e20f953f68274473aaba3a3fe9d86591",
710
+ "version_major": 2,
711
+ "version_minor": 0
712
+ },
713
+ "text/plain": [
714
+ "model-00013-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
715
+ ]
716
+ },
717
+ "metadata": {},
718
+ "output_type": "display_data"
719
+ },
720
+ {
721
+ "data": {
722
+ "application/vnd.jupyter.widget-view+json": {
723
+ "model_id": "70928ce19e7b4aa38313d2369fe0280f",
724
+ "version_major": 2,
725
+ "version_minor": 0
726
+ },
727
+ "text/plain": [
728
+ "model-00014-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
729
+ ]
730
+ },
731
+ "metadata": {},
732
+ "output_type": "display_data"
733
+ },
734
+ {
735
+ "data": {
736
+ "application/vnd.jupyter.widget-view+json": {
737
+ "model_id": "e9f0748b3b954b748cea03d9d7973fd5",
738
+ "version_major": 2,
739
+ "version_minor": 0
740
+ },
741
+ "text/plain": [
742
+ "model-00015-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
743
+ ]
744
+ },
745
+ "metadata": {},
746
+ "output_type": "display_data"
747
+ },
748
+ {
749
+ "data": {
750
+ "application/vnd.jupyter.widget-view+json": {
751
+ "model_id": "56a827b4296d4d79a34ff5b095ca7532",
752
+ "version_major": 2,
753
+ "version_minor": 0
754
+ },
755
+ "text/plain": [
756
+ "model-00016-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
757
+ ]
758
+ },
759
+ "metadata": {},
760
+ "output_type": "display_data"
761
+ },
762
+ {
763
+ "data": {
764
+ "application/vnd.jupyter.widget-view+json": {
765
+ "model_id": "6bb5e0bd0d09458cb21c84f8c8cb2174",
766
+ "version_major": 2,
767
+ "version_minor": 0
768
+ },
769
+ "text/plain": [
770
+ "model-00017-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
771
+ ]
772
+ },
773
+ "metadata": {},
774
+ "output_type": "display_data"
775
+ },
776
+ {
777
+ "data": {
778
+ "application/vnd.jupyter.widget-view+json": {
779
+ "model_id": "d423b03ac7b84064b9fc2c0c86794e31",
780
+ "version_major": 2,
781
+ "version_minor": 0
782
+ },
783
+ "text/plain": [
784
+ "model-00018-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
785
+ ]
786
+ },
787
+ "metadata": {},
788
+ "output_type": "display_data"
789
+ },
790
+ {
791
+ "data": {
792
+ "application/vnd.jupyter.widget-view+json": {
793
+ "model_id": "79956a438155481080f30b669d7aae7b",
794
+ "version_major": 2,
795
+ "version_minor": 0
796
+ },
797
+ "text/plain": [
798
+ "model-00019-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
799
+ ]
800
+ },
801
+ "metadata": {},
802
+ "output_type": "display_data"
803
+ },
804
+ {
805
+ "data": {
806
+ "application/vnd.jupyter.widget-view+json": {
807
+ "model_id": "cc538ae927d2423d8bdf28c437522018",
808
+ "version_major": 2,
809
+ "version_minor": 0
810
+ },
811
+ "text/plain": [
812
+ "model-00020-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
813
+ ]
814
+ },
815
+ "metadata": {},
816
+ "output_type": "display_data"
817
+ },
818
+ {
819
+ "data": {
820
+ "application/vnd.jupyter.widget-view+json": {
821
+ "model_id": "01d605b47f224b81a1d78c46bc512440",
822
+ "version_major": 2,
823
+ "version_minor": 0
824
+ },
825
+ "text/plain": [
826
+ "model-00021-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
827
+ ]
828
+ },
829
+ "metadata": {},
830
+ "output_type": "display_data"
831
+ },
832
+ {
833
+ "data": {
834
+ "application/vnd.jupyter.widget-view+json": {
835
+ "model_id": "e3a40b4984db442b863fb24b9510364b",
836
+ "version_major": 2,
837
+ "version_minor": 0
838
+ },
839
+ "text/plain": [
840
+ "model-00022-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
841
+ ]
842
+ },
843
+ "metadata": {},
844
+ "output_type": "display_data"
845
+ },
846
+ {
847
+ "data": {
848
+ "application/vnd.jupyter.widget-view+json": {
849
+ "model_id": "ccdb98bdf695414b832cd7e7c28c042f",
850
+ "version_major": 2,
851
+ "version_minor": 0
852
+ },
853
+ "text/plain": [
854
+ "model-00023-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
855
+ ]
856
+ },
857
+ "metadata": {},
858
+ "output_type": "display_data"
859
+ },
860
+ {
861
+ "data": {
862
+ "application/vnd.jupyter.widget-view+json": {
863
+ "model_id": "3480c970522041caa1de260c97f3a5aa",
864
+ "version_major": 2,
865
+ "version_minor": 0
866
+ },
867
+ "text/plain": [
868
+ "model-00024-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
869
+ ]
870
+ },
871
+ "metadata": {},
872
+ "output_type": "display_data"
873
+ },
874
+ {
875
+ "data": {
876
+ "application/vnd.jupyter.widget-view+json": {
877
+ "model_id": "135cdeaff5be480ca496d0562acf83cd",
878
+ "version_major": 2,
879
+ "version_minor": 0
880
+ },
881
+ "text/plain": [
882
+ "model-00025-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
883
+ ]
884
+ },
885
+ "metadata": {},
886
+ "output_type": "display_data"
887
+ },
888
+ {
889
+ "data": {
890
+ "application/vnd.jupyter.widget-view+json": {
891
+ "model_id": "d0b44e898f064aa7b43ea8b4d36e97ad",
892
+ "version_major": 2,
893
+ "version_minor": 0
894
+ },
895
+ "text/plain": [
896
+ "model-00026-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
897
+ ]
898
+ },
899
+ "metadata": {},
900
+ "output_type": "display_data"
901
+ },
902
+ {
903
+ "data": {
904
+ "application/vnd.jupyter.widget-view+json": {
905
+ "model_id": "1aa50a58715943caa10db97b0ee5981f",
906
+ "version_major": 2,
907
+ "version_minor": 0
908
+ },
909
+ "text/plain": [
910
+ "model-00027-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
911
+ ]
912
+ },
913
+ "metadata": {},
914
+ "output_type": "display_data"
915
+ },
916
+ {
917
+ "data": {
918
+ "application/vnd.jupyter.widget-view+json": {
919
+ "model_id": "e9de0a30c1534f499465e270e157e9de",
920
+ "version_major": 2,
921
+ "version_minor": 0
922
+ },
923
+ "text/plain": [
924
+ "model-00028-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
925
+ ]
926
+ },
927
+ "metadata": {},
928
+ "output_type": "display_data"
929
+ },
930
+ {
931
+ "data": {
932
+ "application/vnd.jupyter.widget-view+json": {
933
+ "model_id": "445f10e616fd49e1a4868c88fb31c56f",
934
+ "version_major": 2,
935
+ "version_minor": 0
936
+ },
937
+ "text/plain": [
938
+ "model-00029-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
939
+ ]
940
+ },
941
+ "metadata": {},
942
+ "output_type": "display_data"
943
+ },
944
+ {
945
+ "data": {
946
+ "application/vnd.jupyter.widget-view+json": {
947
+ "model_id": "292d0fbfbf8c4145b3a208480eeb6121",
948
+ "version_major": 2,
949
+ "version_minor": 0
950
+ },
951
+ "text/plain": [
952
+ "model-00030-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
953
+ ]
954
+ },
955
+ "metadata": {},
956
+ "output_type": "display_data"
957
+ },
958
+ {
959
+ "data": {
960
+ "application/vnd.jupyter.widget-view+json": {
961
+ "model_id": "43893486f5dc40f290811b94d4c1352d",
962
+ "version_major": 2,
963
+ "version_minor": 0
964
+ },
965
+ "text/plain": [
966
+ "model-00031-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
967
+ ]
968
+ },
969
+ "metadata": {},
970
+ "output_type": "display_data"
971
+ },
972
+ {
973
+ "data": {
974
+ "application/vnd.jupyter.widget-view+json": {
975
+ "model_id": "7291f54de4874b9ebbd2c215a1b5c5ed",
976
+ "version_major": 2,
977
+ "version_minor": 0
978
+ },
979
+ "text/plain": [
980
+ "model-00032-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
981
+ ]
982
+ },
983
+ "metadata": {},
984
+ "output_type": "display_data"
985
+ },
986
+ {
987
+ "data": {
988
+ "application/vnd.jupyter.widget-view+json": {
989
+ "model_id": "461b39ab50e746a1bd34e8f641d75f15",
990
+ "version_major": 2,
991
+ "version_minor": 0
992
+ },
993
+ "text/plain": [
994
+ "model-00033-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
995
+ ]
996
+ },
997
+ "metadata": {},
998
+ "output_type": "display_data"
999
+ },
1000
+ {
1001
+ "data": {
1002
+ "application/vnd.jupyter.widget-view+json": {
1003
+ "model_id": "e98726876f774a92a5a4d5b3bc2b381a",
1004
+ "version_major": 2,
1005
+ "version_minor": 0
1006
+ },
1007
+ "text/plain": [
1008
+ "model-00034-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
1009
+ ]
1010
+ },
1011
+ "metadata": {},
1012
+ "output_type": "display_data"
1013
+ },
1014
+ {
1015
+ "data": {
1016
+ "application/vnd.jupyter.widget-view+json": {
1017
+ "model_id": "d669bf2bd9d04dc881d16f5eb955f599",
1018
+ "version_major": 2,
1019
+ "version_minor": 0
1020
+ },
1021
+ "text/plain": [
1022
+ "model-00035-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
1023
+ ]
1024
+ },
1025
+ "metadata": {},
1026
+ "output_type": "display_data"
1027
+ },
1028
+ {
1029
+ "data": {
1030
+ "application/vnd.jupyter.widget-view+json": {
1031
+ "model_id": "1443fa509c8b4c2a8acaeb0c7bbc3f84",
1032
+ "version_major": 2,
1033
+ "version_minor": 0
1034
+ },
1035
+ "text/plain": [
1036
+ "model-00036-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
1037
+ ]
1038
+ },
1039
+ "metadata": {},
1040
+ "output_type": "display_data"
1041
+ },
1042
+ {
1043
+ "data": {
1044
+ "application/vnd.jupyter.widget-view+json": {
1045
+ "model_id": "9f388beeb6f245568782cf29d8a7b831",
1046
+ "version_major": 2,
1047
+ "version_minor": 0
1048
+ },
1049
+ "text/plain": [
1050
+ "model-00037-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
1051
+ ]
1052
+ },
1053
+ "metadata": {},
1054
+ "output_type": "display_data"
1055
+ },
1056
+ {
1057
+ "data": {
1058
+ "application/vnd.jupyter.widget-view+json": {
1059
+ "model_id": "c6a2ce5c1e564152802f56ec6e4e8a75",
1060
+ "version_major": 2,
1061
+ "version_minor": 0
1062
+ },
1063
+ "text/plain": [
1064
+ "model-00038-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
1065
+ ]
1066
+ },
1067
+ "metadata": {},
1068
+ "output_type": "display_data"
1069
+ },
1070
+ {
1071
+ "data": {
1072
+ "application/vnd.jupyter.widget-view+json": {
1073
+ "model_id": "2335b6093c9f49eeba58349209fbc38d",
1074
+ "version_major": 2,
1075
+ "version_minor": 0
1076
+ },
1077
+ "text/plain": [
1078
+ "model-00039-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
1079
+ ]
1080
+ },
1081
+ "metadata": {},
1082
+ "output_type": "display_data"
1083
+ },
1084
+ {
1085
+ "data": {
1086
+ "application/vnd.jupyter.widget-view+json": {
1087
+ "model_id": "a7b3687060db425ab597c8e103deed46",
1088
+ "version_major": 2,
1089
+ "version_minor": 0
1090
+ },
1091
+ "text/plain": [
1092
+ "model-00040-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
1093
+ ]
1094
+ },
1095
+ "metadata": {},
1096
+ "output_type": "display_data"
1097
+ },
1098
+ {
1099
+ "data": {
1100
+ "application/vnd.jupyter.widget-view+json": {
1101
+ "model_id": "2b561cea2d324c40975c1eb9ea30cc5d",
1102
+ "version_major": 2,
1103
+ "version_minor": 0
1104
+ },
1105
+ "text/plain": [
1106
+ "model-00041-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
1107
+ ]
1108
+ },
1109
+ "metadata": {},
1110
+ "output_type": "display_data"
1111
+ },
1112
+ {
1113
+ "data": {
1114
+ "application/vnd.jupyter.widget-view+json": {
1115
+ "model_id": "b287241358cf48a497101402d0bd693f",
1116
+ "version_major": 2,
1117
+ "version_minor": 0
1118
+ },
1119
+ "text/plain": [
1120
+ "model-00042-of-00044.safetensors: 0%| | 0.00/4.97G [00:00<?, ?B/s]"
1121
+ ]
1122
+ },
1123
+ "metadata": {},
1124
+ "output_type": "display_data"
1125
+ },
1126
+ {
1127
+ "data": {
1128
+ "application/vnd.jupyter.widget-view+json": {
1129
+ "model_id": "cc93269313fc4f7f9d7d1a301373a2f8",
1130
+ "version_major": 2,
1131
+ "version_minor": 0
1132
+ },
1133
+ "text/plain": [
1134
+ "model-00043-of-00044.safetensors: 0%| | 0.00/4.22G [00:00<?, ?B/s]"
1135
+ ]
1136
+ },
1137
+ "metadata": {},
1138
+ "output_type": "display_data"
1139
+ },
1140
+ {
1141
+ "data": {
1142
+ "application/vnd.jupyter.widget-view+json": {
1143
+ "model_id": "7fd37422efe449fbb6ac31fd869cee15",
1144
+ "version_major": 2,
1145
+ "version_minor": 0
1146
+ },
1147
+ "text/plain": [
1148
+ "model-00044-of-00044.safetensors: 0%| | 0.00/4.20G [00:00<?, ?B/s]"
1149
+ ]
1150
+ },
1151
+ "metadata": {},
1152
+ "output_type": "display_data"
1153
+ },
1154
+ {
1155
+ "data": {
1156
+ "application/vnd.jupyter.widget-view+json": {
1157
+ "model_id": "a664a2e9587343f99883157151ba080f",
1158
+ "version_major": 2,
1159
+ "version_minor": 0
1160
+ },
1161
+ "text/plain": [
1162
+ "model.safetensors.index.json: 0%| | 0.00/239k [00:00<?, ?B/s]"
1163
+ ]
1164
+ },
1165
+ "metadata": {},
1166
+ "output_type": "display_data"
1167
+ },
1168
+ {
1169
+ "data": {
1170
+ "application/vnd.jupyter.widget-view+json": {
1171
+ "model_id": "8e2b847d11014313807e518426b573a4",
1172
+ "version_major": 2,
1173
+ "version_minor": 0
1174
+ },
1175
+ "text/plain": [
1176
+ "Loading safetensors checkpoint shards: 0% Completed | 0/44 [00:00<?, ?it/s]\n"
1177
+ ]
1178
+ },
1179
+ "metadata": {},
1180
+ "output_type": "display_data"
1181
+ },
1182
+ {
1183
+ "name": "stdout",
1184
+ "output_type": "stream",
1185
+ "text": [
1186
+ "INFO 12-08 01:52:50 model_runner.py:1077] Loading model weights took 50.6331 GB\n",
1187
+ "\u001b[1;36m(VllmWorkerProcess pid=730)\u001b[0;0m INFO 12-08 01:52:52 model_runner.py:1077] Loading model weights took 50.6331 GB\n",
1188
+ "\u001b[1;36m(VllmWorkerProcess pid=729)\u001b[0;0m INFO 12-08 01:52:52 model_runner.py:1077] Loading model weights took 50.6331 GB\n",
1189
+ "\u001b[1;36m(VllmWorkerProcess pid=731)\u001b[0;0m INFO 12-08 01:52:52 model_runner.py:1077] Loading model weights took 50.6331 GB\n",
1190
+ "\u001b[1;36m(VllmWorkerProcess pid=730)\u001b[0;0m \u001b[1;36m(VllmWorkerProcess pid=729)\u001b[0;0m \u001b[1;36m(VllmWorkerProcess pid=731)\u001b[0;0m INFO 12-08 01:52:54 worker.py:232] Memory profiling results: total_gpu_memory=93.11GiB initial_memory_usage=51.58GiB peak_torch_memory=51.55GiB memory_usage_post_profile=51.82GiB non_torch_memory=1.15GiB kv_cache_size=37.61GiB gpu_memory_utilization=0.97\n",
1191
+ "INFO 12-08 01:52:54 worker.py:232] Memory profiling results: total_gpu_memory=93.11GiB initial_memory_usage=51.51GiB peak_torch_memory=51.55GiB memory_usage_post_profile=51.68GiB non_torch_memory=1.01GiB kv_cache_size=37.75GiB gpu_memory_utilization=0.97\n",
1192
+ "INFO 12-08 01:52:54 worker.py:232] Memory profiling results: total_gpu_memory=93.11GiB initial_memory_usage=51.58GiB peak_torch_memory=51.55GiB memory_usage_post_profile=51.82GiB non_torch_memory=1.15GiB kv_cache_size=37.61GiB gpu_memory_utilization=0.97\n",
1193
+ "INFO 12-08 01:52:54 worker.py:232] Memory profiling results: total_gpu_memory=93.11GiB initial_memory_usage=51.51GiB peak_torch_memory=51.84GiB memory_usage_post_profile=51.68GiB non_torch_memory=1.02GiB kv_cache_size=37.46GiB gpu_memory_utilization=0.97\n",
1194
+ "INFO 12-08 01:52:54 distributed_gpu_executor.py:57] # GPU blocks: 19483, # CPU blocks: 2080\n",
1195
+ "INFO 12-08 01:52:54 distributed_gpu_executor.py:61] Maximum concurrency for 131072 tokens per request: 2.38x\n",
1196
+ "\u001b[1;36m(VllmWorkerProcess pid=730)\u001b[0;0m INFO 12-08 01:52:59 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n",
1197
+ "\u001b[1;36m(VllmWorkerProcess pid=730)\u001b[0;0m INFO 12-08 01:52:59 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n",
1198
+ "INFO 12-08 01:52:59 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n",
1199
+ "INFO 12-08 01:52:59 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n",
1200
+ "\u001b[1;36m(VllmWorkerProcess pid=731)\u001b[0;0m INFO 12-08 01:53:00 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n",
1201
+ "\u001b[1;36m(VllmWorkerProcess pid=731)\u001b[0;0m INFO 12-08 01:53:00 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n",
1202
+ "\u001b[1;36m(VllmWorkerProcess pid=729)\u001b[0;0m INFO 12-08 01:53:00 model_runner.py:1400] Capturing cudagraphs for decoding. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.\n",
1203
+ "\u001b[1;36m(VllmWorkerProcess pid=729)\u001b[0;0m INFO 12-08 01:53:00 model_runner.py:1404] If out-of-memory error occurs during cudagraph capture, consider decreasing `gpu_memory_utilization` or switching to eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.\n",
1204
+ "\u001b[1;36m(VllmWorkerProcess pid=731)\u001b[0;0m INFO 12-08 01:53:44 model_runner.py:1518] Graph capturing finished in 45 secs, took 2.71 GiB\n",
1205
+ "INFO 12-08 01:53:45 model_runner.py:1518] Graph capturing finished in 46 secs, took 2.71 GiB\n",
1206
+ "\u001b[1;36m(VllmWorkerProcess pid=729)\u001b[0;0m INFO 12-08 01:53:45 model_runner.py:1518] Graph capturing finished in 45 secs, took 2.71 GiB\n",
1207
+ "\u001b[1;36m(VllmWorkerProcess pid=730)\u001b[0;0m INFO 12-08 01:53:45 model_runner.py:1518] Graph capturing finished in 46 secs, took 2.71 GiB\n"
1208
+ ]
1209
+ }
1210
+ ],
1211
+ "source": [
1212
+ "llm = LLM(\n",
1213
+ " model=\"kishizaki-sci/Llama-3.1-405B-Instruct-AWQ-4bit-JP-EN\",\n",
1214
+ " tensor_parallel_size=4,\n",
1215
+ " gpu_memory_utilization=0.97,\n",
1216
+ " quantization=\"awq\"\n",
1217
+ ")\n",
1218
+ "tokenizer = llm.get_tokenizer()"
1219
+ ]
1220
+ },
1221
+ {
1222
+ "cell_type": "code",
1223
+ "execution_count": 4,
1224
+ "id": "cc81f387-a06f-4564-a50e-37e367a79422",
1225
+ "metadata": {},
1226
+ "outputs": [],
1227
+ "source": [
1228
+ "DEFAULT_SYSTEM_PROMPT = \"あなたは日本人のアシスタントです。\"\n",
1229
+ "text = \"plotly.graph_objectsを使って散布図を作るサンプルコードを書いてください.\"\n",
1230
+ "\n",
1231
+ "messages = [\n",
1232
+ " {\"role\": \"system\", \"content\": DEFAULT_SYSTEM_PROMPT},\n",
1233
+ " {\"role\": \"user\", \"content\": text},\n",
1234
+ "]\n",
1235
+ "\n",
1236
+ "prompt = tokenizer.apply_chat_template(\n",
1237
+ " messages,\n",
1238
+ " tokenize=False,\n",
1239
+ " add_generation_prompt=True\n",
1240
+ ")\n",
1241
+ "\n",
1242
+ "sampling_params = SamplingParams(\n",
1243
+ " temperature=0.6,\n",
1244
+ " top_p=0.9,\n",
1245
+ " max_tokens=1000\n",
1246
+ ")"
1247
+ ]
1248
+ },
1249
+ {
1250
+ "cell_type": "code",
1251
+ "execution_count": 5,
1252
+ "id": "c74b2d83-12ff-4324-bc84-51e88b3e12b3",
1253
+ "metadata": {},
1254
+ "outputs": [
1255
+ {
1256
+ "name": "stderr",
1257
+ "output_type": "stream",
1258
+ "text": [
1259
+ "Processed prompts: 100%|██████████| 1/1 [00:20<00:00, 20.38s/it, est. speed input: 3.29 toks/s, output: 13.59 toks/s]"
1260
+ ]
1261
+ },
1262
+ {
1263
+ "name": "stdout",
1264
+ "output_type": "stream",
1265
+ "text": [
1266
+ "plotly.graph_objectsを使って散布図を作るサンプルコードを以下に示します。\n",
1267
+ "\n",
1268
+ "```python\n",
1269
+ "import plotly.graph_objects as go\n",
1270
+ "import numpy as np\n",
1271
+ "\n",
1272
+ "# サンプルデータを生成\n",
1273
+ "np.random.seed(0)\n",
1274
+ "x = np.random.randn(100)\n",
1275
+ "y = np.random.randn(100)\n",
1276
+ "\n",
1277
+ "# 散布図を作成\n",
1278
+ "fig = go.Figure(data=[go.Scatter(\n",
1279
+ " x=x,\n",
1280
+ " y=y,\n",
1281
+ " mode='markers',\n",
1282
+ " marker=dict(\n",
1283
+ " size=10,\n",
1284
+ " color='blue',\n",
1285
+ " opacity=0.7\n",
1286
+ " )\n",
1287
+ ")])\n",
1288
+ "\n",
1289
+ "# グラフのタイトルと軸ラベルを設定\n",
1290
+ "fig.update_layout(\n",
1291
+ " title='散布図のサンプル',\n",
1292
+ " xaxis_title='X軸',\n",
1293
+ " yaxis_title='Y軸'\n",
1294
+ ")\n",
1295
+ "\n",
1296
+ "# グラフを表示\n",
1297
+ "fig.show()\n",
1298
+ "```\n",
1299
+ "\n",
1300
+ "このコードでは、numpyを使用してランダムなサンプルデータを生成し、plotly.graph_objectsのScatterオブジェクトを使用して散布図を作成しています。散布図のマーカーのサイズ、色、透明度を設定し、グラフのタイトルと軸ラベルを設定しています。最後に、`fig.show()`を使用してグラフを表示しています。\n",
1301
+ "CPU times: user 19.8 s, sys: 645 ms, total: 20.5 s\n",
1302
+ "Wall time: 20.4 s\n"
1303
+ ]
1304
+ },
1305
+ {
1306
+ "name": "stderr",
1307
+ "output_type": "stream",
1308
+ "text": [
1309
+ "\n"
1310
+ ]
1311
+ }
1312
+ ],
1313
+ "source": [
1314
+ "%%time\n",
1315
+ "outputs = llm.generate(prompt, sampling_params)\n",
1316
+ "print(outputs[0].outputs[0].text)"
1317
+ ]
1318
+ },
1319
+ {
1320
+ "cell_type": "code",
1321
+ "execution_count": null,
1322
+ "id": "1fb4a3d0-10ba-4eda-824d-e774322ddf07",
1323
+ "metadata": {},
1324
+ "outputs": [],
1325
+ "source": []
1326
+ }
1327
+ ],
1328
+ "metadata": {
1329
+ "kernelspec": {
1330
+ "display_name": "Python 3 (ipykernel)",
1331
+ "language": "python",
1332
+ "name": "python3"
1333
+ },
1334
+ "language_info": {
1335
+ "codemirror_mode": {
1336
+ "name": "ipython",
1337
+ "version": 3
1338
+ },
1339
+ "file_extension": ".py",
1340
+ "mimetype": "text/x-python",
1341
+ "name": "python",
1342
+ "nbconvert_exporter": "python",
1343
+ "pygments_lexer": "ipython3",
1344
+ "version": "3.11.10"
1345
+ }
1346
+ },
1347
+ "nbformat": 4,
1348
+ "nbformat_minor": 5
1349
+ }
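The memory figures in the vLLM startup log above can be cross-checked with a little arithmetic. The sketch below is not part of the notebook. It assumes that the KV-cache budget is roughly total_gpu_memory * gpu_memory_utilization minus peak_torch_memory and non_torch_memory, and that each KV-cache block holds 16 tokens; the block size is not printed in the log, so ASSUMED_BLOCK_SIZE is a guess. All other numbers are copied from the log lines.

```python
# Figures copied from the vLLM log lines in the notebook output.
TOTAL_GPU_MEMORY_GIB = 93.11
GPU_MEMORY_UTILIZATION = 0.97
NUM_GPU_BLOCKS = 19483
MAX_MODEL_LEN = 131072  # "Maximum concurrency for 131072 tokens per request"

# Assumption: 16 tokens per KV-cache block (not shown in the log).
ASSUMED_BLOCK_SIZE = 16


def kv_cache_budget_gib(peak_torch_gib: float, non_torch_gib: float) -> float:
    """Estimate the memory left for the KV cache on one GPU.

    Assumed formula: usable budget minus peak torch memory minus
    non-torch memory, all in GiB.
    """
    budget = TOTAL_GPU_MEMORY_GIB * GPU_MEMORY_UTILIZATION
    return budget - peak_torch_gib - non_torch_gib


def max_concurrency(num_blocks: int, block_size: int, max_model_len: int) -> float:
    """Number of full-length requests whose KV cache fits at once."""
    return num_blocks * block_size / max_model_len


# (peak_torch_memory, non_torch_memory, logged kv_cache_size) per log line.
profiles = [
    (51.55, 1.15, 37.61),
    (51.55, 1.01, 37.75),
    (51.84, 1.02, 37.46),
]

estimates = [kv_cache_budget_gib(peak, non_torch) for peak, non_torch, _ in profiles]
for estimate, (_, _, logged) in zip(estimates, profiles):
    print(f"estimated {estimate:.2f} GiB, logged {logged:.2f} GiB")

concurrency = max_concurrency(NUM_GPU_BLOCKS, ASSUMED_BLOCK_SIZE, MAX_MODEL_LEN)
print(f"maximum concurrency: {concurrency:.2f}x")  # log reports 2.38x
```

Under these assumptions the estimates land within about 0.01 GiB of the logged kv_cache_size values, and the concurrency figure matches the logged 2.38x. This suggests, but does not prove, that the assumed formula and block size are the ones in effect for this run.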