Adocados HelloKKMe committed on
Commit 272f31d · verified · 0 Parent(s):

Duplicate from Salesforce/GTA1-32B


Co-authored-by: Yan Yang <HelloKKMe@users.noreply.huggingface.co>

.gitattributes ADDED
@@ -0,0 +1,35 @@
1
+ *.7z filter=lfs diff=lfs merge=lfs -text
2
+ *.arrow filter=lfs diff=lfs merge=lfs -text
3
+ *.bin filter=lfs diff=lfs merge=lfs -text
4
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
5
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
6
+ *.ftz filter=lfs diff=lfs merge=lfs -text
7
+ *.gz filter=lfs diff=lfs merge=lfs -text
8
+ *.h5 filter=lfs diff=lfs merge=lfs -text
9
+ *.joblib filter=lfs diff=lfs merge=lfs -text
10
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
11
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
12
+ *.model filter=lfs diff=lfs merge=lfs -text
13
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
14
+ *.npy filter=lfs diff=lfs merge=lfs -text
15
+ *.npz filter=lfs diff=lfs merge=lfs -text
16
+ *.onnx filter=lfs diff=lfs merge=lfs -text
17
+ *.ot filter=lfs diff=lfs merge=lfs -text
18
+ *.parquet filter=lfs diff=lfs merge=lfs -text
19
+ *.pb filter=lfs diff=lfs merge=lfs -text
20
+ *.pickle filter=lfs diff=lfs merge=lfs -text
21
+ *.pkl filter=lfs diff=lfs merge=lfs -text
22
+ *.pt filter=lfs diff=lfs merge=lfs -text
23
+ *.pth filter=lfs diff=lfs merge=lfs -text
24
+ *.rar filter=lfs diff=lfs merge=lfs -text
25
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
26
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
27
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
28
+ *.tar filter=lfs diff=lfs merge=lfs -text
29
+ *.tflite filter=lfs diff=lfs merge=lfs -text
30
+ *.tgz filter=lfs diff=lfs merge=lfs -text
31
+ *.wasm filter=lfs diff=lfs merge=lfs -text
32
+ *.xz filter=lfs diff=lfs merge=lfs -text
33
+ *.zip filter=lfs diff=lfs merge=lfs -text
34
+ *.zst filter=lfs diff=lfs merge=lfs -text
35
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,777 @@
1
+ ---
2
+ language:
3
+ - en
4
+ license: mit
5
+ metrics:
6
+ - accuracy
7
+ pipeline_tag: image-text-to-text
8
+ tags:
9
+ - VLM
10
+ - Computer-Use-Agent
11
+ - OS-Agent
12
+ - GUI
13
+ - Grounding
14
+ library_name: transformers
15
+ ---
16
+
17
+ # Introduction
18
+
19
+ Reinforcement learning (RL) methods such as GRPO help with grounding because their objective is aligned with the task: they reward successful clicks rather than encouraging long textual Chain-of-Thought (CoT) reasoning. Unlike approaches that rely heavily on verbose CoT, GRPO directly incentivizes actionable, grounded responses. Based on the findings in our [blog](https://huggingface.co/blog/HelloKKMe/grounding-r1), we share state-of-the-art GUI grounding models trained with GRPO.
20
+
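For intuition, the grounding reward can be viewed as a simple hit test on the predicted click. The sketch below is illustrative only; the function name, signature, and bounding-box format are our assumptions, not the actual training code.

```python
# Illustrative sketch of a click-based grounding reward (assumed form, not the
# exact training implementation): the policy is rewarded only when its predicted
# click lands inside the ground-truth bounding box of the target element.
def click_reward(pred_x: float, pred_y: float, gt_box: tuple) -> float:
    """gt_box = (left, top, right, bottom) in the same pixel space as the prediction."""
    left, top, right, bottom = gt_box
    return 1.0 if (left <= pred_x <= right and top <= pred_y <= bottom) else 0.0
```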
21
+ # Grounding Performance
22
+
23
+ We follow the standard evaluation protocol and benchmark our model on three challenging datasets. Our method consistently achieves the best results among all open-source model families. Below are the comparative results:
24
+
25
+ | **Model** | **Size** | **Open Source** | **ScreenSpot-V2** | **ScreenSpotPro** | **OSWORLD-G** | **OSWORLD-G-Refined** |
26
+ |-------------------|:--------:|:---------------:|:-----------------:|:-----------------:|:-----------------:|:-----------------:|
27
+ | OpenAI CUA | — | ❌ | 87.9 | 23.4 | — | — |
28
+ | Claude 3.7 | — | ❌ | 87.6 | 27.7 | — | — |
29
+ | JEDI-7B | 7B | ✅ | 91.7 | 39.5 | 54.1 | — |
30
+ | SE-GUI | 7B | ✅ | 90.3 | 47.0 | — | — |
31
+ | UI-TARS | 7B | ✅ | 91.6 | 35.7 | 47.5 | — |
32
+ | UI-TARS-1.5* | 7B | ✅ | 89.7* | 42.0* | 52.8* | 64.2* |
33
+ | UGround-v1-7B | 7B | ✅ | — | 31.1 | — | 36.4 |
34
+ | Qwen2.5-VL-32B-Instruct | 32B | ✅ | 91.9* | 48.0 | 46.5 | 59.6* |
35
+ | UGround-v1-72B | 72B | ✅ | — | 34.5 | — | — |
36
+ | Qwen2.5-VL-72B-Instruct | 72B | ✅ | 94.00* | 53.3 | — | 62.2* |
37
+ | UI-TARS | 72B | ✅ | 90.3 | 38.1 | — | — |
38
+ | OpenCUA | 7B | ✅ | 92.3 | 50.0 | 55.3 | 68.3* |
39
+ | OpenCUA | 32B | ✅ | 93.4 | 55.3 | 59.6 | 70.2* |
40
+ | GTA1-2507 (Ours) | 7B | ✅ | 92.4 <sub>*(∆ +2.7)*</sub> | 50.1<sub>*(∆ +8.1)*</sub> | 55.1 <sub>*(∆ +2.3)*</sub> | 67.7 <sub>*(∆ +3.5)*</sub> |
41
+ | GTA1 (Ours) | 7B | ✅ | 93.4 <sub>*(∆ +0.1)*</sub> | 55.5<sub>*(∆ +5.5)*</sub> | 60.1<sub>*(∆ +4.8)*</sub> | 68.8<sub>*(∆ +0.5)*</sub> |
42
+ | GTA1 (Ours) | 32B | ✅ | 95.2 <sub>*(∆ +1.8)*</sub> | 63.6<sub>*(∆ +8.3)*</sub> | 65.2 <sub>*(∆ +5.6)*</sub> | 72.2<sub>*(∆ +2.0)*</sub> |
43
+
44
+ > **Note:**
45
+ > - Model size is indicated in billions (B) of parameters.
46
+ > - A dash (—) denotes results that are currently unavailable.
47
+ > - A superscript asterisk (﹡) denotes our evaluated result.
48
+ > - UI-TARS-1.5 7B, OpenCUA-7B, and OpenCUA-32B serve as our baseline models.
49
+ > - ∆ indicates the performance improvement of our model over its baseline.
50
+
51
+
52
+ # Agent Performance
53
+
54
+ ## OSWorld and OSWorld-Verified Benchmarks
55
+
56
+ We evaluate our models on the OSWorld and OSWorld-Verified benchmarks following the standard evaluation protocol. The results demonstrate strong performance across both datasets.
57
+
58
+ | **Agent Model** | **Step** | **OSWorld** | **OSWorld-Verified** |
59
+ |-----------------|:--------:|:-----------:|:-------------------:|
60
+ | **Proprietary Models** |
61
+ | Claude 3.7 Sonnet | 100 | 28.0 | — |
62
+ | OpenAI CUA 4o | 200 | 38.1 | — |
63
+ | UI-TARS-1.5 | 100 | 42.5 | 41.8 |
64
+ | OpenAI CUA o3 | 200 | 42.9 | — |
65
+ | **Open-Source Models** |
66
+ | Aria-UI w/ GPT-4o | 15 | 15.2 | — |
67
+ | Aguvis-72B w/ GPT-4o | 15 | 17.0 | — |
68
+ | UI-TARS-72B-SFT | 50 | 18.8 | — |
69
+ | Agent S w/ Claude-3.5-Sonnet | 15 | 20.5 | — |
70
+ | Agent S w/ GPT-4o | 15 | 20.6 | — |
71
+ | UI-TARS-72B-DPO | 15 | 22.7 | — |
72
+ | UI-TARS-72B-DPO | 50 | 24.6 | — |
73
+ | UI-TARS-1.5-7B | 100 | 26.9 | 27.4 |
74
+ | Jedi-7B w/ o3 | 100 | — | 51.0 |
75
+ | Jedi-7B w/ GPT-4o | 100 | 27.0 | — |
76
+ | Agent S2 w/ Claude-3.7-Sonnet | 50 | 34.5 | — |
77
+ | Agent S2 w/ Gemini-2.5-Pro | 50 | 41.4 | 45.8 |
78
+ | Agent S2.5 w/ o3 | 100 | — | 56.0 |
79
+ | Agent S2.5 w/ GPT-5 | 100 | — | 58.4 |
80
+ | CoAct-1 w/ o3 & o4-mini & OpenAI CUA 4o | 150 | — | 60.8 |
81
+ | GTA1-7B-2507 w/ o3 | 100 | 45.2 | 53.1 |
82
+ | GTA1-7B-2507 w/ GPT-5 | 100 | — | 61.0 |
83
+ | GTA1-32B w/ o3 | 100 | — | 55.4 |
84
+ | GTA1-32B w/ GPT-5 | 100 | — | 63.4 |
85
+
86
+ > **Note:** A dash (—) indicates unavailable results.
87
+
88
+ ## WindowsAgentArena Benchmark
89
+
90
+ We also evaluate our models on the WindowsAgentArena benchmark, demonstrating strong performance in Windows-specific GUI automation tasks.
91
+
92
+ | **Agent Model** | **Step** | **Success Rate** |
93
+ |-----------------|:--------:|:---------------:|
94
+ | Kimi-VL | 15 | 10.4 |
95
+ | WAA | — | 19.5 |
96
+ | Jedi w/ GPT-4o | 100 | 33.7 |
97
+ | GTA1-7B-2507 w/ o3 | 100 | 47.9 |
98
+ | GTA1-7B-2507 w/ GPT-5 | 100 | 49.2 |
99
+ | GTA1-32B w/ o3 | 100 | 51.2 |
100
+ | GTA1-32B w/ GPT-5 | 100 | 50.6 |
101
+
102
+ > **Note:** A dash (—) indicates unavailable results.
103
+
104
+ # Inference
105
+ Below is a code snippet demonstrating how to run inference using a trained model.
106
+
107
+ ```python
108
+ from transformers import AutoTokenizer, AutoImageProcessor
109
+ from transformers.models.qwen2_vl.image_processing_qwen2_vl_fast import smart_resize
110
+ from PIL import Image
111
+ from io import BytesIO
112
+ import base64
113
+ import re
114
+ from vllm import LLM, SamplingParams
115
+
116
+ instruction="click start"
117
+ image_path="example.png"
118
+
119
+ CLICK_REGEXES = [
120
+ # pyautogui.click(x=123, y=456)
121
+ re.compile(r"click\s*\(\s*x\s*=\s*(\d+)\s*,\s*y\s*=\s*(\d+)\s*\)", re.IGNORECASE),
122
+ # pyautogui.click(123, 456) or click(123,456)
123
+ re.compile(r"click\s*\(\s*(\d+)\s*,\s*(\d+)\s*\)", re.IGNORECASE),
124
+ ]
125
+
126
+ def format_message(image_path,instruction):
127
+ SYSTEM_PROMPT = (
128
+ "You are a GUI agent. You are given a task and a screenshot of the screen. "
129
+ "You need to perform a series of pyautogui actions to complete the task."
130
+ )
131
+ messages = [
132
+ {"role": "system", "content": SYSTEM_PROMPT},
133
+ {"role": "user", "content": [
134
+ {"type": "image", "image": image_path},
135
+ {"type": "text", "text": instruction},
136
+ ]},
137
+ ]
138
+ text = prompt_tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
139
+
140
+ text2, n = re.subn(
141
+ r"<\|media_begin\|>.*?<\|media_end\|>",
142
+ "<|vision_start|><|image_pad|><|vision_end|>",
143
+ text,
144
+ flags=re.S
145
+ )
146
+ if n == 0:
147
+ raise RuntimeError("Cannot find <|media_begin|>...<|media_end|> token.")
148
+ return text2
149
+
150
+ def parse_xy_from_text(text: str):
151
+ if "click" not in text.lower():
152
+ return [-1, -1]
153
+ for rx in CLICK_REGEXES:
154
+ m = rx.search(text)
155
+ if m:
156
+ try:
157
+ return int(m.group(1)), int(m.group(2))
158
+ except Exception:
159
+ continue
160
+ return [-1,-1]
161
+
162
+ def convert_pil_image_to_base64(image):
163
+ buffered = BytesIO()
164
+ image.save(buffered, format="PNG")
165
+ return base64.b64encode(buffered.getvalue()).decode()
166
+
167
+ llm = LLM(
168
+ model="Salesforce/GTA1-32B",
169
+ tokenizer="Salesforce/GTA1-32B",
170
+ tokenizer_mode="slow",
171
+ trust_remote_code=True,
172
+ dtype="bfloat16",
173
+ limit_mm_per_prompt={"image": 1},
174
+ tensor_parallel_size=1,
175
+ )
176
+ prompt_tok = AutoTokenizer.from_pretrained("Salesforce/GTA1-32B", trust_remote_code=True)
177
+ sp = SamplingParams(max_tokens=512, temperature=0.0)
178
+ tokenizer = llm.get_tokenizer()
179
+ processor=AutoImageProcessor.from_pretrained("Salesforce/GTA1-32B", trust_remote_code=True)
180
+
181
+ image = Image.open(image_path).convert('RGB')
182
+ resized_height, resized_width = smart_resize(
183
+ image.height,
184
+ image.width,
185
+ factor=processor.patch_size * processor.merge_size,
186
+ min_pixels=processor.min_pixels,
187
+ max_pixels=processor.max_pixels,
188
+ )
189
+ resized_image = image.resize((resized_width, resized_height))
190
+ messages = format_message(image_path, instruction)
191
+ response = llm.generate(
192
+ [{"prompt": messages, "multi_modal_data": {"image": [resized_image]}}],
193
+ sampling_params=sp
194
+ )[0].outputs[0].text
195
+
196
+
197
+ coordinates = parse_xy_from_text(response)
198
+ print(coordinates[0]/resized_width*image.width, coordinates[1]/resized_height*image.height)
199
+ ```
200
+
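As a quick sanity check, the click-parsing helper above accepts both keyword and positional `pyautogui.click` calls. The strings below are hypothetical model outputs used purely for illustration, and the snippet assumes the `parse_xy_from_text` and `CLICK_REGEXES` definitions from the code block above.

```python
# Hypothetical model outputs; both call forms are matched by CLICK_REGEXES.
print(parse_xy_from_text("pyautogui.click(x=320, y=148)"))  # -> (320, 148)
print(parse_xy_from_text("pyautogui.click(320, 148)"))      # -> (320, 148)
print(parse_xy_from_text("no click action here"))           # -> [-1, -1]
```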
201
+ # Model Serving
202
+
203
+ Below is an example script for serving the model.
204
+ ```python
205
+ import torch
206
+ import os
207
+ # -------------------------
208
+ # System / Torch defaults
209
+ # -------------------------
210
+ os.environ.setdefault("TOKENIZERS_PARALLELISM", "false") # avoid CPU oversubscription
211
+ os.environ.setdefault("VLLM_USE_V1", "1")
212
+ os.environ.setdefault("VLLM_ENGINE_IN_BACKGROUND_THREAD", "0")
213
+ import base64
214
+ import re
215
+ from typing import Dict, List, Union
216
+ from PIL import Image
217
+ from io import BytesIO
218
+ import traceback
219
+ import argparse
220
+ import asyncio
221
+ import requests
222
+ import ray
223
+ from ray import serve
224
+ from fastapi import FastAPI
225
+ from transformers import AutoTokenizer
226
+ from vllm import LLM, SamplingParams
227
+ import uuid
228
+
229
+
230
+ N_REPLICAS = 2
231
+
232
+ try:
233
+ torch.backends.cuda.matmul.allow_tf32 = True
234
+ torch.backends.cudnn.benchmark = True
235
+ except Exception:
236
+ pass
237
+
238
+
239
+ # -------------------------
240
+ # IO helpers
241
+ # -------------------------
242
+
243
+ def pil_to_base64(img: Image.Image, format: str = "PNG") -> str:
244
+ buffer = BytesIO()
245
+ img.save(buffer, format=format)
246
+ img_bytes = buffer.getvalue()
247
+ img_b64 = base64.b64encode(img_bytes).decode("utf-8")
248
+ return img_b64
249
+
250
+
251
+ def data_uri_to_pil(data_uri: str) -> Image.Image:
252
+ header, b64_str = data_uri.split(",", 1)
253
+ img_data = base64.b64decode(b64_str)
254
+ buffer = BytesIO(img_data)
255
+ img = Image.open(buffer)
256
+ return img
257
+
258
+
259
+ def extract_images(messages: List[Dict]) -> List[Image.Image]:
260
+ images = []
261
+ for msg in messages:
262
+ if msg.get("role") == "user":
263
+ for content in msg.get("content", []):
264
+ if content.get("type") in ["image", "image_url"]:
265
+ if content["type"] == "image":
266
+ images.append(data_uri_to_pil(content["image"]).convert("RGB"))
267
+ else:
268
+ images.append(data_uri_to_pil(content["image_url"]["url"]).convert("RGB"))
269
+ return images
270
+
271
+
272
+ # -------------------------
273
+ # Prompt builder
274
+ # -------------------------
275
+
276
+ def build_prompt_with_template(tokenizer: AutoTokenizer, messages: List[Dict]) -> str:
277
+ text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
278
+ text2, n = re.subn(
279
+ r"<\|media_begin\|>.*?<\|media_end\|>",
280
+ "<|vision_start|><|image_pad|><|vision_end|>",
281
+ text,
282
+ flags=re.S,
283
+ )
284
+ if n == 0:
285
+ raise RuntimeError("Did not find <|media_begin|>...<|media_end|> block in template.")
286
+ return text2
287
+
288
+ # -------------------------
289
+ # Deployment
290
+ # -------------------------
291
+
292
+ def build_app(model_path: str, num_replicas: int, port: int):
293
+ api = FastAPI(title="GTA1-32B Multi-GPU Service (High-throughput)")
294
+
295
+ @serve.deployment(
296
+ num_replicas=num_replicas,
297
+ ray_actor_options={"num_gpus": 1, "num_cpus": 4},
298
+ max_ongoing_requests=16,
299
+ )
300
+ class GTA1Model:
301
+ def __init__(self, model_path: str):
302
+ gpu_ids = ray.get_gpu_ids()
303
+ self.gpu_id = gpu_ids[0] if gpu_ids else 0
304
+ print(f"🔍 Ray assigned GPU IDs: {gpu_ids}")
305
+ # Initialize vLLM within this replica (Ray sets CUDA_VISIBLE_DEVICES)
306
+ print(f"🔄 Initializing vLLM on GPU {self.gpu_id}[ray id] from {model_path}")
307
+ if not torch.cuda.is_available():
308
+ raise RuntimeError("CUDA is not available")
309
+
310
+ self.llm = LLM(
311
+ model=model_path,
312
+ tokenizer=model_path,
313
+ tokenizer_mode="slow",
314
+ trust_remote_code=True,
315
+ dtype="bfloat16",
316
+ limit_mm_per_prompt={"image": 1},
317
+ max_model_len=32768,
318
+ tensor_parallel_size=1,
319
+ )
320
+ self.vllm_tokenizer = self.llm.get_tokenizer()
321
+ self.hf_tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
322
+ self.model_path = model_path
323
+ self.dtype = "bfloat16"
324
+ print(f"✅ vLLM initialized successfully (Ray GPU Id: {self.gpu_id})")
325
+
326
+ # ------------ batching core ------------
327
+ @serve.batch(max_batch_size=8, batch_wait_timeout_s=0.1) # increase if GPU allows
328
+ async def _generate_batch(self, payload: Union[Dict, List[Dict]]):
329
+ """Build prompts, enforce single image, and call vLLM.generate."""
330
+ if isinstance(payload, dict):
331
+ list_of_payloads = [payload]
332
+ else:
333
+ list_of_payloads = payload
334
+ request_id = uuid.uuid4().hex[:8]
335
+ # --- Build per-sample prompt/image ---
336
+ prompts: List[str] = []
337
+ images_per_req: List[Image.Image] = []
338
+ error_results = []
339
+ early_exit = False
340
+ for p in list_of_payloads:
341
+ try:
342
+ messages = p["messages"]
343
+ imgs = extract_images(messages)
344
+ if len(imgs) != 1:
345
+ raise RuntimeError(f"Exactly one image is required, got {len(imgs)}")
346
+ prompt_text = build_prompt_with_template(self.hf_tokenizer, messages)
347
+ # Sanity check on tokens: 1 <|image_pad|>, no <|media_placeholder|>
348
+ tok = self.vllm_tokenizer
349
+ id_imgpad = tok.encode("<|image_pad|>", add_special_tokens=False)[0]
350
+ id_media = tok.encode("<|media_placeholder|>", add_special_tokens=False)[0]
351
+ ids = tok.encode(prompt_text, add_special_tokens=False)
352
+ if sum(i == id_imgpad for i in ids) != 1 or any(i == id_media for i in ids):
353
+ raise RuntimeError("Prompt media tokens invalid after conversion")
354
+ prompts.append(prompt_text)
355
+ images_per_req.append(imgs[0])
356
+ except Exception as e:
357
+ early_exit = True
358
+ trace = traceback.format_exc()
359
+ error_results.append(
360
+ {
361
+ "response": "",
362
+ "error": {
363
+ "message": str(e),
364
+ "trace": trace,
365
+ 'type_of_payload': str(type(payload)),
366
+ 'type_of_list_of_payloads': str(type(list_of_payloads)),
367
+ 'type_of_p': str(type(p)),
368
+ 'p_keys': str(p.keys()) if isinstance(p, dict) else str(p),
369
+ },
370
+ "usage": {},
371
+ "gpu_id": self.gpu_id
372
+ }
373
+ )
374
+ if early_exit:
375
+ return error_results
376
+ # --- vLLM generation ---
377
+ args_base = list_of_payloads[0]
378
+ sp = SamplingParams(
379
+ max_tokens=args_base.get("max_new_tokens", 512),
380
+ temperature=args_base.get("temperature", 0.0),
381
+ top_p=args_base.get("top_p", 0.9),
382
+ )
383
+
384
+ requests_list = [
385
+ {"prompt": pr, "multi_modal_data": {"image": [im]}}
386
+ for pr, im in zip(prompts, images_per_req)
387
+ ]
388
+
389
+ outs = self.llm.generate(requests_list, sampling_params=sp)
390
+
391
+ tok = self.vllm_tokenizer
392
+ results: List[Dict] = []
393
+ for pr, o in zip(prompts, outs):
394
+ text = o.outputs[0].text if o.outputs else ""
395
+ gen_tokens = len(o.outputs[0].token_ids) if (o.outputs and hasattr(o.outputs[0], 'token_ids')) else None
396
+ prompt_tokens = len(tok.encode(pr, add_special_tokens=False))
397
+ usage = {
398
+ "prompt_tokens": prompt_tokens,
399
+ "generated_tokens": gen_tokens if gen_tokens is not None else None,
400
+ "total_tokens": (prompt_tokens + gen_tokens) if gen_tokens is not None else None,
401
+ }
402
+ results.append({
403
+ "response": text,
404
+ "error": "",
405
+ "usage": usage,
406
+ "gpu_id": self.gpu_id,
407
+ 'bs_size_in_this_request': f"{request_id}:{len(list_of_payloads)}"
408
+ })
409
+
410
+ return results
411
+
412
+ # Exposed single-call entry that joins the batch
413
+ async def call_llm(self, payload: Dict):
414
+ try:
415
+ res = await self._generate_batch(payload)
416
+ return res
417
+ except Exception as e:
418
+ trace = traceback.format_exc()
419
+ return {"response": "", "error": {"message": str(e), "trace": trace}, "usage": {}, "gpu_id": self.gpu_id}
420
+
421
+ def health(self):
422
+ return {
423
+ "status": "ok",
424
+ "gpu_id": self.gpu_id,
425
+ "dtype": self.dtype,
426
+ "model_path": self.model_path,
427
+ }
428
+
429
+ model = GTA1Model.bind(model_path)
430
+
431
+ @serve.deployment(max_ongoing_requests=96)
432
+ @serve.ingress(api)
433
+ class GTA1App:
434
+ def __init__(self, model_handle):
435
+ self.model_deployment = model_handle
436
+
437
+ @api.get("/health")
438
+ async def health_all(self):
439
+ # Calling the same Serve handle N times does not guarantee each call hits a different replica
440
+ attempts = max(8, N_REPLICAS * 4) # oversample
441
+ calls = [self.model_deployment.health.remote() for i in range(attempts)]
442
+ replies = await asyncio.gather(*calls)
443
+ # dedupe by replica_id (or by tuple(gpu_id))
444
+ seen = {}
445
+ for r in replies:
446
+ seen[r.get("gpu_id", f"unknown-{len(seen)}")] = r
447
+ if len(seen) >= N_REPLICAS:
448
+ break
449
+ return {"replicas": list(seen.values())}
450
+
451
+ @api.post("/call_llm")
452
+ async def call_llm(self, req: Dict):
453
+ return await self.model_deployment.call_llm.remote(req)
454
+
455
+ return GTA1App.bind(model)
456
+
457
+
458
+ # -------------------------
459
+ # Main
460
+ # -------------------------
461
+ if __name__ == "__main__":
462
+ parser = argparse.ArgumentParser()
463
+ parser.add_argument("--model_path", type=str, default="Salesforce/GTA1-32B")
464
+ parser.add_argument("--host", type=str, default="0.0.0.0")
465
+ parser.add_argument("--port", type=int, default=3005)
466
+ parser.add_argument("--num_replicas", type=int, default=2)
467
+ args = parser.parse_args()
468
+ N_REPLICAS = args.num_replicas
469
+ ray.init(ignore_reinit_error=True)
470
+
471
+ print(f"🚀 Starting GTA1-32B service on {args.host}:{args.port}")
472
+ serve.start(detached=True, http_options={"host": args.host, "port": args.port})
473
+
474
+ app = build_app(args.model_path, args.num_replicas, args.port)
475
+ serve.run(app, name="GTA1-32B", route_prefix="/")
476
+
477
+ # Quick health sample
478
+ try:
479
+ r = requests.get(f"http://0.0.0.0:{args.port}/health", timeout=5)
480
+ print(r.json())
481
+ except Exception as e:
482
+ print("Health probe failed:", e)
483
+
484
+ ```
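Assuming the script above is saved as `serve_model.py` (the filename is ours for illustration), it can be launched with, for example, `python serve_model.py --model_path Salesforce/GTA1-32B --num_replicas 2 --port 3005`; the client example below then points at `http://localhost:3005`.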
485
+ Here is an example of client usage:
486
+
487
+ ```python
488
+ import argparse
489
+ import base64
490
+ import concurrent.futures
491
+ import json
492
+ import os
493
+ import re
494
+ from typing import Dict, List, Tuple
495
+ from gui_agent.agent.gta1.format_message import encode_numpy_image_to_base64, encode_image_bytes, smart_resize
496
+
497
+ import requests
498
+ from PIL import Image, ImageDraw
499
+
500
+
501
+ def image_file_to_data_uri(image_path: str) -> str:
502
+ if not os.path.exists(image_path):
503
+ raise FileNotFoundError(f"Image not found: {image_path}")
504
+ with open(image_path, "rb") as f:
505
+ b64 = base64.b64encode(f.read()).decode("utf-8")
506
+ # default to png; serverside only requires a data URI header then comma
507
+ return f"data:image/png;base64,{b64}"
508
+
509
+
510
+ def build_messages(image_path: str, instruction: str, system_prompt: str) -> List[Dict]:
511
+ return [
512
+ {"role": "system", "content": system_prompt},
513
+ {
514
+ "role": "user",
515
+ "content": [
516
+ {"type": "image", "image": image_file_to_data_uri(image_path)},
517
+ {"type": "text", "text": instruction},
518
+ ],
519
+ },
520
+ ]
521
+
522
+
523
+ def call_health(base_url: str, timeout: float = 10.0) -> Dict:
524
+ r = requests.get(f"{base_url}/health", timeout=timeout)
525
+ r.raise_for_status()
526
+ return r.json()
527
+
528
+
529
+ def call_single(
530
+ base_url: str,
531
+ image_path: str,
532
+ instruction: str,
533
+ system_prompt: str,
534
+ max_new_tokens: int = 512,
535
+ temperature: float = 0.0,
536
+ top_p: float = 0.9,
537
+ timeout: float = 120.0,
538
+ ) -> List[Dict]:
539
+ payload = {
540
+ "messages": build_messages(image_path, instruction, system_prompt),
541
+ "max_new_tokens": max_new_tokens,
542
+ "temperature": temperature,
543
+ "top_p": top_p,
544
+ }
545
+ r = requests.post(f"{base_url}/call_llm", json=payload, timeout=timeout)
546
+ r.raise_for_status()
547
+ resp = r.json()
548
+ if isinstance(resp, dict):
549
+ return [resp]
550
+ return resp
551
+
552
+
553
+ def call_many_concurrent(
554
+ base_url: str,
555
+ image_path: str,
556
+ instruction: str,
557
+ system_prompt: str,
558
+ num_requests: int,
559
+ concurrency: int,
560
+ max_new_tokens: int = 512,
561
+ temperature: float = 0.0,
562
+ top_p: float = 0.9,
563
+ timeout: float = 120.0,
564
+ ) -> List[List[Dict]]:
565
+ results: List[List[Dict]] = []
566
+
567
+ def _one(i: int) -> List[Dict]:
568
+ # Vary instruction slightly so you can trace requests
569
+ instr = f"{instruction} [req {i+1}/{num_requests}]"
570
+ return call_single(
571
+ base_url,
572
+ image_path,
573
+ instr,
574
+ system_prompt,
575
+ max_new_tokens,
576
+ temperature,
577
+ top_p,
578
+ timeout,
579
+ )
580
+
581
+ with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
582
+ futures = [pool.submit(_one, i) for i in range(num_requests)]
583
+ for fut in concurrent.futures.as_completed(futures):
584
+ results.append(fut.result())
585
+ return results
586
+
587
+
588
+ def pretty_print_response(batch_results: List[Dict]) -> None:
589
+ if isinstance(batch_results, dict):
590
+ batch_results = [batch_results]
591
+ for idx, item in enumerate(batch_results):
592
+ if item.get("error"):
593
+ print(f"[#{idx}] ERROR: {json.dumps(item['error'], ensure_ascii=False)}")
594
+ else:
595
+ usage = item.get("usage", {})
596
+ print(f"[#{idx}] gpu={item.get('gpu_id')} tokens={usage} text=\n{item.get('response','').strip()}\n")
597
+
598
+ CLICK_KWARGS_REGEX = re.compile(r"pyautogui\.click\(\s*x\s*=\s*(\d+)\s*,\s*y\s*=\s*(\d+)\s*\)")
599
+ CLICK_POSARGS_REGEX = re.compile(r"pyautogui\.click\(\s*(\d+)\s*,\s*(\d+)\s*\)")
600
+
601
+ def extract_clicks_from_text(text: str) -> List[Tuple[int, int]]:
602
+ clicks: List[Tuple[int, int]] = []
603
+ for x, y in CLICK_KWARGS_REGEX.findall(text or ""):
604
+ clicks.append((int(x), int(y)))
605
+ for x, y in CLICK_POSARGS_REGEX.findall(text or ""):
606
+ clicks.append((int(x), int(y)))
607
+ return clicks
608
+
609
+ def extract_clicks_from_results(result_items: List[Dict]) -> List[Tuple[int, int]]:
610
+ clicks: List[Tuple[int, int]] = []
611
+ if isinstance(result_items, dict):
612
+ result_items = [result_items]
613
+ for item in result_items:
614
+ if item.get("error"):
615
+ continue
616
+ clicks.extend(extract_clicks_from_text(item.get("response", "")))
617
+ return clicks
618
+
619
+ def compute_resized_dims_for_server_mapping(image_path: str) -> Tuple[int, int, int, int]:
620
+ with Image.open(image_path) as im:
621
+ width, height = im.size
622
+ resized_H, resized_W = smart_resize(
623
+ height,
624
+ width,
625
+ factor=28,
626
+ min_pixels=1000,
627
+ max_pixels=1000000000000,
628
+ )
629
+ return width, height, int(resized_W), int(resized_H)
630
+
631
+ def map_clicks_to_original(clicks_resized: List[Tuple[int, int]],
632
+ original_w: int,
633
+ original_h: int,
634
+ resized_w: int,
635
+ resized_h: int) -> List[Tuple[int, int]]:
636
+ if resized_w == 0 or resized_h == 0:
637
+ return []
638
+ scale_x = original_w / float(resized_w)
639
+ scale_y = original_h / float(resized_h)
640
+ mapped: List[Tuple[int, int]] = []
641
+ for x, y in clicks_resized:
642
+ mapped_x = int(round(x * scale_x))
643
+ mapped_y = int(round(y * scale_y))
644
+ mapped.append((mapped_x, mapped_y))
645
+ return mapped
646
+
647
+ def draw_circles_on_image(image_path: str,
648
+ points: List[Tuple[int, int]],
649
+ output_path: str,
650
+ radius: int = 8,
651
+ color: Tuple[int, int, int] = (255, 0, 0),
652
+ width: int = 3) -> None:
653
+ if not points:
654
+ return
655
+ with Image.open(image_path).convert("RGB") as img:
656
+ drawer = ImageDraw.Draw(img)
657
+ for (x, y) in points:
658
+ left = x - radius
659
+ top = y - radius
660
+ right = x + radius
661
+ bottom = y + radius
662
+ drawer.ellipse([(left, top), (right, bottom)], outline=color, fill=(0,255,0), width=width)
663
+ img.save(output_path)
664
+ print(f"Annotated image saved to: {output_path} (points drawn: {len(points)})")
665
+
666
+ SYSTEM_PROMPT = (
667
+ "You are a GUI agent. You are given a task and a screenshot of the screen. "
668
+ "You need to perform a series of pyautogui actions to complete the task."
669
+ )
670
+ def main():
671
+ parser = argparse.ArgumentParser(description="Examples: single and batched inference against GTA1-32B Ray Serve.")
672
+ parser.add_argument("--host", type=str, default="http://localhost", help="Ray Serve host, e.g. http://localhost or http://IP")
673
+ parser.add_argument("--port", type=int, default=3005, help="Ray Serve port")
674
+ parser.add_argument("--image", type=str, required=False, default="example.jpg", help="Path to input image")
675
+ parser.add_argument("--instruction", type=str, default="click the icon in the bottom row, third from the left", help="User instruction")
676
+ parser.add_argument("--system", type=str, default=SYSTEM_PROMPT)
677
+ parser.add_argument("--mode", type=str, choices=["single", "batch", "health"], default="batch")
678
+ parser.add_argument("--num_requests", type=int, default=8, help="Number of requests in batch mode")
679
+ parser.add_argument("--concurrency", type=int, default=8, help="Max concurrent HTTP calls in batch mode")
680
+ parser.add_argument("--max_new_tokens", type=int, default=512)
681
+ parser.add_argument("--temperature", type=float, default=0.0)
682
+ parser.add_argument("--top_p", type=float, default=0.9)
683
+ parser.add_argument("--timeout", type=float, default=180.0)
684
+ args = parser.parse_args()
685
+
686
+ base_url = f"{args.host}:{args.port}"
687
+
688
+ if args.mode == "health":
689
+ info = call_health(base_url, timeout=10.0)
690
+ print(json.dumps(info, indent=2))
691
+ return
692
+
693
+ if args.mode == "single":
694
+ result_list = call_single(
695
+ base_url=base_url,
696
+ image_path=args.image,
697
+ instruction=args.instruction,
698
+ system_prompt=args.system,
699
+ max_new_tokens=args.max_new_tokens,
700
+ temperature=args.temperature,
701
+ top_p=args.top_p,
702
+ timeout=args.timeout,
703
+ )
704
+ print(result_list)
705
+ pretty_print_response(result_list)
706
+ clicks_resized = extract_clicks_from_results(result_list)
707
+ if clicks_resized:
708
+ orig_w, orig_h, resized_w, resized_h = compute_resized_dims_for_server_mapping(args.image)
709
+ mapped_clicks = map_clicks_to_original(clicks_resized, orig_w, orig_h, resized_w, resized_h)
710
+ out_path = f"ray_serve/annotated.png"
711
+ draw_circles_on_image(args.image, mapped_clicks, out_path)
712
+ return
713
+
714
+ if args.mode == "batch":
715
+ print(f"Submitting {args.num_requests} requests with concurrency={args.concurrency}...")
716
+ batch_outs = call_many_concurrent(
717
+ base_url=base_url,
718
+ image_path=args.image,
719
+ instruction=args.instruction,
720
+ system_prompt=args.system,
721
+ num_requests=args.num_requests,
722
+ concurrency=args.concurrency,
723
+ max_new_tokens=args.max_new_tokens,
724
+ temperature=args.temperature,
725
+ top_p=args.top_p,
726
+ timeout=args.timeout,
727
+ )
728
+ for i, one_result in enumerate(batch_outs):
729
+ print(f"===== Result for request {i+1} =====")
730
+ pretty_print_response(one_result)
731
+ all_clicks_resized: List[Tuple[int, int]] = []
732
+ for one_result in batch_outs:
733
+ all_clicks_resized.extend(extract_clicks_from_results(one_result))
734
+ if all_clicks_resized:
735
+ orig_w, orig_h, resized_w, resized_h = compute_resized_dims_for_server_mapping(args.image)
736
+ mapped_clicks = map_clicks_to_original(all_clicks_resized, orig_w, orig_h, resized_w, resized_h)
737
+ out_path = f"ray_serve/annotated.png"
738
+ draw_circles_on_image(args.image, mapped_clicks, out_path)
739
+ return
740
+
741
+
742
+ if __name__ == "__main__":
743
+ main()
744
+ ```
745
+ ## Ethical Considerations
746
+
747
+ This model is released for research and educational purposes. While our model demonstrates strong performance on GUI benchmarks, users should carefully evaluate its suitability for their specific use cases.
748
+
749
+ **Important Considerations:**
750
+ - **Accuracy Limitations:** Like all AI systems, this model may produce incorrect outputs or fail to accurately identify GUI elements in certain scenarios.
751
+ - **Safety and Security:** Exercise caution when deploying GUI automation agents, especially in production environments where incorrect actions could affect system integrity or data security.
752
+ - **Human Oversight:** We recommend maintaining appropriate human supervision when using this model for automated GUI interactions.
753
+ - **Compliance:** Users are responsible for ensuring their use of this model complies with applicable laws, regulations, and organizational policies.
754
+
755
+ **Recommended Best Practices:**
756
+ - Thoroughly test the model in controlled environments before production deployment
757
+ - Implement safeguards and error handling mechanisms
758
+ - Consider the potential impact of automated actions on user systems and data
759
+ - Regularly monitor and validate model performance in your specific domain
760
+
761
+ For further guidance on use cases, refer to our AUP and AI AUP.
762
+
763
+ ## Citation
764
+
765
+ If you use any GTA model or find it helpful in your research, please cite it as follows:
766
+
767
+ ```bibtex
768
+ @article{yang2025gta1guitesttimescaling,
769
+ title={GTA1: GUI Test-time Scaling Agent},
770
+ author={Yan Yang and Dongxu Li and Yutong Dai and Yuhao Yang and Ziyang Luo and Zirui Zhao and Zhiyuan Hu and Junzhe Huang and Amrita Saha and Zeyuan Chen and Ran Xu and Liyuan Pan and Silvio Savarese and Caiming Xiong and Junnan Li},
771
+ year={2025},
772
+ eprint={2507.05791},
773
+ archivePrefix={arXiv},
774
+ primaryClass={cs.AI},
775
+ url={https://arxiv.org/abs/2507.05791},
776
+ }
777
+ ```
config.bak.json ADDED
@@ -0,0 +1,69 @@
1
+ {
2
+ "architectures": [
3
+ "OpenCUAForConditionalGeneration"
4
+ ],
5
+ "auto_map": {
6
+ "AutoConfig": "configuration_opencua.OpenCUAConfig",
7
+ "AutoModel": "modeling_opencua.OpenCUAForConditionalGeneration",
8
+ "AutoModelForCausalLM": "modeling_opencua.OpenCUAForConditionalGeneration"
9
+ },
10
+ "ignore_index": -100,
11
+ "media_placeholder_token_id": 151664,
12
+ "model_type": "opencua",
13
+ "pad_token_id": 0,
14
+ "text_config": {
15
+ "bos_token_id": 151643,
16
+ "eos_token_id": 151644,
17
+ "head_dim": 128,
18
+ "hidden_act": "silu",
19
+ "hidden_size": 5120,
20
+ "initializer_range": 0.02,
21
+ "intermediate_size": 27648,
22
+ "k_proj_bias": true,
23
+ "max_length": 20,
24
+ "min_length": 0,
25
+ "model_type": "qwen2",
26
+ "num_attention_heads": 40,
27
+ "num_beam_groups": 1,
28
+ "num_beams": 1,
29
+ "num_hidden_layers": 64,
30
+ "num_key_value_heads": 8,
31
+ "pad_token_id": 152063,
32
+ "pretraining_sequence_length": 131072,
33
+ "q_proj_bias": true,
34
+ "rms_norm_eps": 1e-05,
35
+ "rope_theta": 1000000.0,
36
+ "tie_word_embeddings": false,
37
+ "torch_dtype": "bfloat16",
38
+ "use_bfloat16": false,
39
+ "use_cache": true,
40
+ "v_proj_bias": true,
41
+ "vocab_size": 152064
42
+ },
43
+ "tie_word_embeddings": false,
44
+ "torch_dtype": "bfloat16",
45
+ "transformers_version": "4.48.3",
46
+ "vision_config": {
47
+ "depth": 32,
48
+ "fullatt_block_indexes": [
49
+ 7,
50
+ 15,
51
+ 23,
52
+ 31
53
+ ],
54
+ "hidden_act": "silu",
55
+ "hidden_size": 1280,
56
+ "num_heads": 16,
57
+ "in_chans": 3,
58
+ "intermediate_size": 3456,
59
+
60
+ "patch_size": 14,
61
+ "spatial_merge_size": 2,
62
+ "spatial_patch_size": 14,
63
+ "temporal_patch_size": 2,
64
+ "out_hidden_size": 5120,
65
+ "tokens_per_second": 2,
66
+ "window_size": 112
67
+ },
68
+ "vocab_size": 152064
69
+ }
config.json ADDED
@@ -0,0 +1,51 @@
1
+ {
2
+ "architectures": ["Qwen2_5_VLForConditionalGeneration"],
3
+ "model_type": "qwen2_5_vl",
4
+ "transformers_version": "4.49.0",
5
+ "torch_dtype": "bfloat16",
6
+
7
+ "processor_class": "OpenCUAProcessor",
8
+
9
+ "hidden_act": "silu",
10
+ "attention_dropout": 0.0,
11
+ "initializer_range": 0.02,
12
+ "rms_norm_eps": 1e-06,
13
+ "tie_word_embeddings": false,
14
+ "use_cache": true,
15
+
16
+ "vocab_size": 152064,
17
+ "max_position_embeddings": 128000,
18
+ "sliding_window": 32768,
19
+ "use_sliding_window": false,
20
+ "max_window_layers": 64,
21
+
22
+ "rope_scaling": { "type": "default" },
23
+ "rope_theta": 1000000.0,
24
+
25
+ "hidden_size": 5120,
26
+ "intermediate_size": 27648,
27
+ "num_hidden_layers": 64,
28
+ "num_attention_heads": 40,
29
+ "num_key_value_heads": 8,
30
+
31
+ "bos_token_id": 151643,
32
+ "eos_token_id": 151644,
33
+ "pad_token_id": 152063,
34
+
35
+ "vision_start_token_id": 151665,
36
+ "vision_end_token_id": 151666,
37
+ "vision_token_id": 151654,
38
+ "image_token_id": 151667,
39
+ "video_token_id": 151664,
40
+
41
+ "vision_config": {
42
+ "model_type": "qwen2_5_vl",
43
+ "in_chans": 3,
44
+ "hidden_size": 1280,
45
+ "intermediate_size": 3456,
46
+ "out_hidden_size": 5120,
47
+ "spatial_patch_size": 14,
48
+ "tokens_per_second": 2,
49
+ "torch_dtype": "bfloat16"
50
+ }
51
+ }
configuration_opencua.py ADDED
@@ -0,0 +1,37 @@
1
+ from transformers.configuration_utils import PretrainedConfig
2
+ from transformers.models.qwen2_5_vl.configuration_qwen2_5_vl import Qwen2_5_VLVisionConfig
3
+ from transformers.models.qwen2.configuration_qwen2 import Qwen2Config
4
+
5
+
6
+ class OpenCUAConfig(PretrainedConfig):
7
+ """OpenCUA-2.5-32B model configuration.
8
+
9
+ Args:
10
+ vision_config: Configuration for the vision model (Qwen2_5_VLVisionConfig).
11
+ text_config: Configuration for the text model. Qwen2Config
12
+ pad_token_id: The token ID to use for padding.
13
+ """
14
+
15
+ model_type = "opencua"
16
+
17
+ def __init__(
18
+ self,
19
+ vision_config: dict | Qwen2_5_VLVisionConfig | None = None,
20
+ text_config: dict | Qwen2Config | None = None,
21
+ ignore_index: int = -100,
22
+ media_placeholder_token_id: int = 151664,
23
+ pad_token_id: int = 0,
24
+ **kwargs
25
+ ):
26
+ if isinstance(vision_config, dict):
27
+ vision_config = Qwen2_5_VLVisionConfig(**vision_config)
28
+ self.vision_config = vision_config
29
+
30
+ if isinstance(text_config, dict):
31
+ text_config = Qwen2Config(**text_config)
32
+ self.text_config = text_config
33
+
34
+ self.ignore_index = ignore_index
35
+ self.media_placeholder_token_id = media_placeholder_token_id
36
+
37
+ super().__init__(pad_token_id=pad_token_id, **kwargs)
generation_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "max_length": 32768,
3
+ "eos_token_id": 151644
4
+ }
merges.txt ADDED
The diff for this file is too large to render. See raw diff
model-00001-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:61a1a81d27692172b8c4af0dba584d54f9761065e65a8a99800cc2d46334a78d
3
+ size 4932320880
model-00002-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4c166a525e6180f6a33d89c8d1c241b87ab4c14295e77a48d351328abef8babb
3
+ size 4727609720
model-00003-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:4265e2f92162197d911af82da2900280989834d2dffef68b302a5818c4c0ee95
3
+ size 4822749744
model-00004-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:bbc0137d617f58571e5669e20f199ad0fcef7bbe89126aef4de08b4c6a04be5a
3
+ size 4998049568
model-00005-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:a1f6c2afe9fa652bf72633734170e2f8e520ebe6a0b4a291916543cd87f0149e
3
+ size 4883041912
model-00006-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b8258dd0383f72df276a81618b19194bedfb4d9d30eb941b54252a86e203d029
3
+ size 4772902520
model-00007-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f8c4849d1e7e8f76954e18183b1ff2b4560413c5a9df5e305f204c1025582e6b
3
+ size 4966230504
model-00008-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:32869aec221c400372785a300e2285c45664ac0b584576c3db40a9c9ad842359
3
+ size 4800968432
model-00009-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:eadb68817ae3046cc0867c72d8f058e97202c9b2fc79f50f91d8905e1a74b0e4
3
+ size 4931462480
model-00010-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:de45132043dbcfd9728b92fc13cc27166ec3fa2ce4bcac50c97e3196b77ed039
3
+ size 4728824872
model-00011-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:96e16b21960f0a3e5f1de0f8f120c3c1860fd5b8c97f8dda9e784401965d0ed9
3
+ size 4943615728
model-00012-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:ebc7d2855167a5e1b4e1d48734da0fc45d2117638a58a68223b9ec6e4e895f54
3
+ size 4744353832
model-00013-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:7882dd3323b2ab948e7a2c59d8ec7ad5d314f509463cc20049196b7318a6eb50
3
+ size 4817449768
model-00014-of-00014.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f7d75bc05db0c27fda49bc6e417b96ca8f8b07754e0e32570e443c168ad202f6
3
+ size 3835987888
model.args.pt ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:f63a4fc32414b15ac0c96d0d6b4889f0bd29bb7c16bde542c978c0e01dd49beb
3
+ size 25196
model.safetensors.index.json ADDED
@@ -0,0 +1,1168 @@
1
+ {
2
+ "metadata": {
3
+ "total_size": 66905436672
4
+ },
5
+ "weight_map": {
6
+ "lm_head.weight": "model-00008-of-00014.safetensors",
7
+ "model.embed_tokens.weight": "model-00008-of-00014.safetensors",
8
+ "model.layers.0.input_layernorm.weight": "model-00002-of-00014.safetensors",
9
+ "model.layers.0.mlp.down_proj.weight": "model-00002-of-00014.safetensors",
10
+ "model.layers.0.mlp.gate_proj.weight": "model-00004-of-00014.safetensors",
11
+ "model.layers.0.mlp.up_proj.weight": "model-00011-of-00014.safetensors",
12
+ "model.layers.0.post_attention_layernorm.weight": "model-00001-of-00014.safetensors",
13
+ "model.layers.0.self_attn.k_proj.bias": "model-00003-of-00014.safetensors",
14
+ "model.layers.0.self_attn.k_proj.weight": "model-00006-of-00014.safetensors",
15
+ "model.layers.0.self_attn.o_proj.weight": "model-00007-of-00014.safetensors",
16
+ "model.layers.0.self_attn.q_proj.bias": "model-00005-of-00014.safetensors",
17
+ "model.layers.0.self_attn.q_proj.weight": "model-00008-of-00014.safetensors",
18
+ "model.layers.0.self_attn.v_proj.bias": "model-00003-of-00014.safetensors",
19
+ "model.layers.0.self_attn.v_proj.weight": "model-00010-of-00014.safetensors",
20
+ "model.layers.1.input_layernorm.weight": "model-00006-of-00014.safetensors",
21
+ "model.layers.1.mlp.down_proj.weight": "model-00004-of-00014.safetensors",
22
+ "model.layers.1.mlp.gate_proj.weight": "model-00003-of-00014.safetensors",
23
+ "model.layers.1.mlp.up_proj.weight": "model-00003-of-00014.safetensors",
24
+ "model.layers.1.post_attention_layernorm.weight": "model-00009-of-00014.safetensors",
25
+ "model.layers.1.self_attn.k_proj.bias": "model-00006-of-00014.safetensors",
26
+ "model.layers.1.self_attn.k_proj.weight": "model-00011-of-00014.safetensors",
27
+ "model.layers.1.self_attn.o_proj.weight": "model-00013-of-00014.safetensors",
28
+ "model.layers.1.self_attn.q_proj.bias": "model-00002-of-00014.safetensors",
29
+ "model.layers.1.self_attn.q_proj.weight": "model-00004-of-00014.safetensors",
30
+ "model.layers.1.self_attn.v_proj.bias": "model-00001-of-00014.safetensors",
31
+ "model.layers.1.self_attn.v_proj.weight": "model-00014-of-00014.safetensors",
32
+ "model.layers.10.input_layernorm.weight": "model-00006-of-00014.safetensors",
33
+ "model.layers.10.mlp.down_proj.weight": "model-00006-of-00014.safetensors",
34
+ "model.layers.10.mlp.gate_proj.weight": "model-00006-of-00014.safetensors",
35
+ "model.layers.10.mlp.up_proj.weight": "model-00014-of-00014.safetensors",
36
+ "model.layers.10.post_attention_layernorm.weight": "model-00006-of-00014.safetensors",
37
+ "model.layers.10.self_attn.k_proj.bias": "model-00006-of-00014.safetensors",
38
+ "model.layers.10.self_attn.k_proj.weight": "model-00005-of-00014.safetensors",
39
+ "model.layers.10.self_attn.o_proj.weight": "model-00014-of-00014.safetensors",
40
+ "model.layers.10.self_attn.q_proj.bias": "model-00013-of-00014.safetensors",
41
+ "model.layers.10.self_attn.q_proj.weight": "model-00014-of-00014.safetensors",
42
+ "model.layers.10.self_attn.v_proj.bias": "model-00003-of-00014.safetensors",
43
+ "model.layers.10.self_attn.v_proj.weight": "model-00007-of-00014.safetensors",
44
+ "model.layers.11.input_layernorm.weight": "model-00006-of-00014.safetensors",
45
+ "model.layers.11.mlp.down_proj.weight": "model-00013-of-00014.safetensors",
46
+ "model.layers.11.mlp.gate_proj.weight": "model-00013-of-00014.safetensors",
47
+ "model.layers.11.mlp.up_proj.weight": "model-00003-of-00014.safetensors",
48
+ "model.layers.11.post_attention_layernorm.weight": "model-00005-of-00014.safetensors",
49
+ "model.layers.11.self_attn.k_proj.bias": "model-00009-of-00014.safetensors",
50
+ "model.layers.11.self_attn.k_proj.weight": "model-00013-of-00014.safetensors",
51
+ "model.layers.11.self_attn.o_proj.weight": "model-00001-of-00014.safetensors",
52
+ "model.layers.11.self_attn.q_proj.bias": "model-00007-of-00014.safetensors",
53
+ "model.layers.11.self_attn.q_proj.weight": "model-00006-of-00014.safetensors",
54
+ "model.layers.11.self_attn.v_proj.bias": "model-00001-of-00014.safetensors",
55
+ "model.layers.11.self_attn.v_proj.weight": "model-00011-of-00014.safetensors",
56
+ "model.layers.12.input_layernorm.weight": "model-00011-of-00014.safetensors",
57
+ "model.layers.12.mlp.down_proj.weight": "model-00007-of-00014.safetensors",
58
+ "model.layers.12.mlp.gate_proj.weight": "model-00001-of-00014.safetensors",
59
+ "model.layers.12.mlp.up_proj.weight": "model-00007-of-00014.safetensors",
60
+ "model.layers.12.post_attention_layernorm.weight": "model-00003-of-00014.safetensors",
61
+ "model.layers.12.self_attn.k_proj.bias": "model-00013-of-00014.safetensors",
62
+ "model.layers.12.self_attn.k_proj.weight": "model-00013-of-00014.safetensors",
63
+ "model.layers.12.self_attn.o_proj.weight": "model-00014-of-00014.safetensors",
64
+ "model.layers.12.self_attn.q_proj.bias": "model-00001-of-00014.safetensors",
65
+ "model.layers.12.self_attn.q_proj.weight": "model-00001-of-00014.safetensors",
66
+ "model.layers.12.self_attn.v_proj.bias": "model-00010-of-00014.safetensors",
67
+ "model.layers.12.self_attn.v_proj.weight": "model-00009-of-00014.safetensors",
68
+ "model.layers.13.input_layernorm.weight": "model-00010-of-00014.safetensors",
69
+ "model.layers.13.mlp.down_proj.weight": "model-00013-of-00014.safetensors",
70
+ "model.layers.13.mlp.gate_proj.weight": "model-00011-of-00014.safetensors",
71
+ "model.layers.13.mlp.up_proj.weight": "model-00012-of-00014.safetensors",
72
+ "model.layers.13.post_attention_layernorm.weight": "model-00004-of-00014.safetensors",
73
+ "model.layers.13.self_attn.k_proj.bias": "model-00004-of-00014.safetensors",
74
+ "model.layers.13.self_attn.k_proj.weight": "model-00006-of-00014.safetensors",
75
+ "model.layers.13.self_attn.o_proj.weight": "model-00010-of-00014.safetensors",
76
+ "model.layers.13.self_attn.q_proj.bias": "model-00003-of-00014.safetensors",
77
+ "model.layers.13.self_attn.q_proj.weight": "model-00011-of-00014.safetensors",
78
+ "model.layers.13.self_attn.v_proj.bias": "model-00012-of-00014.safetensors",
79
+ "model.layers.13.self_attn.v_proj.weight": "model-00012-of-00014.safetensors",
80
+ "model.layers.14.input_layernorm.weight": "model-00001-of-00014.safetensors",
81
+ "model.layers.14.mlp.down_proj.weight": "model-00010-of-00014.safetensors",
82
+ "model.layers.14.mlp.gate_proj.weight": "model-00010-of-00014.safetensors",
83
+ "model.layers.14.mlp.up_proj.weight": "model-00009-of-00014.safetensors",
84
+ "model.layers.14.post_attention_layernorm.weight": "model-00003-of-00014.safetensors",
85
+ "model.layers.14.self_attn.k_proj.bias": "model-00011-of-00014.safetensors",
86
+ "model.layers.14.self_attn.k_proj.weight": "model-00006-of-00014.safetensors",
87
+ "model.layers.14.self_attn.o_proj.weight": "model-00010-of-00014.safetensors",
88
+ "model.layers.14.self_attn.q_proj.bias": "model-00013-of-00014.safetensors",
89
+ "model.layers.14.self_attn.q_proj.weight": "model-00008-of-00014.safetensors",
90
+ "model.layers.14.self_attn.v_proj.bias": "model-00009-of-00014.safetensors",
91
+ "model.layers.14.self_attn.v_proj.weight": "model-00012-of-00014.safetensors",
92
+ "model.layers.15.input_layernorm.weight": "model-00011-of-00014.safetensors",
93
+ "model.layers.15.mlp.down_proj.weight": "model-00005-of-00014.safetensors",
94
+ "model.layers.15.mlp.gate_proj.weight": "model-00007-of-00014.safetensors",
95
+ "model.layers.15.mlp.up_proj.weight": "model-00001-of-00014.safetensors",
96
+ "model.layers.15.post_attention_layernorm.weight": "model-00004-of-00014.safetensors",
97
+ "model.layers.15.self_attn.k_proj.bias": "model-00010-of-00014.safetensors",
98
+ "model.layers.15.self_attn.k_proj.weight": "model-00013-of-00014.safetensors",
99
+ "model.layers.15.self_attn.o_proj.weight": "model-00001-of-00014.safetensors",
100
+ "model.layers.15.self_attn.q_proj.bias": "model-00006-of-00014.safetensors",
101
+ "model.layers.15.self_attn.q_proj.weight": "model-00002-of-00014.safetensors",
102
+ "model.layers.15.self_attn.v_proj.bias": "model-00003-of-00014.safetensors",
103
+ "model.layers.15.self_attn.v_proj.weight": "model-00004-of-00014.safetensors",
104
+ "model.layers.16.input_layernorm.weight": "model-00013-of-00014.safetensors",
105
+ "model.layers.16.mlp.down_proj.weight": "model-00004-of-00014.safetensors",
106
+ "model.layers.16.mlp.gate_proj.weight": "model-00005-of-00014.safetensors",
107
+ "model.layers.16.mlp.up_proj.weight": "model-00002-of-00014.safetensors",
108
+ "model.layers.16.post_attention_layernorm.weight": "model-00011-of-00014.safetensors",
109
+ "model.layers.16.self_attn.k_proj.bias": "model-00009-of-00014.safetensors",
110
+ "model.layers.16.self_attn.k_proj.weight": "model-00011-of-00014.safetensors",
111
+ "model.layers.16.self_attn.o_proj.weight": "model-00010-of-00014.safetensors",
112
+ "model.layers.16.self_attn.q_proj.bias": "model-00008-of-00014.safetensors",
113
+ "model.layers.16.self_attn.q_proj.weight": "model-00012-of-00014.safetensors",
114
+ "model.layers.16.self_attn.v_proj.bias": "model-00004-of-00014.safetensors",
115
+ "model.layers.16.self_attn.v_proj.weight": "model-00003-of-00014.safetensors",
116
+ "model.layers.17.input_layernorm.weight": "model-00001-of-00014.safetensors",
117
+ "model.layers.17.mlp.down_proj.weight": "model-00012-of-00014.safetensors",
118
+ "model.layers.17.mlp.gate_proj.weight": "model-00001-of-00014.safetensors",
119
+ "model.layers.17.mlp.up_proj.weight": "model-00011-of-00014.safetensors",
120
+ "model.layers.17.post_attention_layernorm.weight": "model-00010-of-00014.safetensors",
121
+ "model.layers.17.self_attn.k_proj.bias": "model-00011-of-00014.safetensors",
122
+ "model.layers.17.self_attn.k_proj.weight": "model-00010-of-00014.safetensors",
123
+ "model.layers.17.self_attn.o_proj.weight": "model-00010-of-00014.safetensors",
124
+ "model.layers.17.self_attn.q_proj.bias": "model-00001-of-00014.safetensors",
125
+ "model.layers.17.self_attn.q_proj.weight": "model-00002-of-00014.safetensors",
126
+ "model.layers.17.self_attn.v_proj.bias": "model-00013-of-00014.safetensors",
127
+ "model.layers.17.self_attn.v_proj.weight": "model-00003-of-00014.safetensors",
128
+ "model.layers.18.input_layernorm.weight": "model-00010-of-00014.safetensors",
129
+ "model.layers.18.mlp.down_proj.weight": "model-00007-of-00014.safetensors",
130
+ "model.layers.18.mlp.gate_proj.weight": "model-00003-of-00014.safetensors",
131
+ "model.layers.18.mlp.up_proj.weight": "model-00014-of-00014.safetensors",
132
+ "model.layers.18.post_attention_layernorm.weight": "model-00004-of-00014.safetensors",
133
+ "model.layers.18.self_attn.k_proj.bias": "model-00004-of-00014.safetensors",
134
+ "model.layers.18.self_attn.k_proj.weight": "model-00001-of-00014.safetensors",
135
+ "model.layers.18.self_attn.o_proj.weight": "model-00010-of-00014.safetensors",
136
+ "model.layers.18.self_attn.q_proj.bias": "model-00002-of-00014.safetensors",
137
+ "model.layers.18.self_attn.q_proj.weight": "model-00006-of-00014.safetensors",
138
+ "model.layers.18.self_attn.v_proj.bias": "model-00006-of-00014.safetensors",
139
+ "model.layers.18.self_attn.v_proj.weight": "model-00007-of-00014.safetensors",
140
+ "model.layers.19.input_layernorm.weight": "model-00001-of-00014.safetensors",
141
+ "model.layers.19.mlp.down_proj.weight": "model-00003-of-00014.safetensors",
142
+ "model.layers.19.mlp.gate_proj.weight": "model-00002-of-00014.safetensors",
143
+ "model.layers.19.mlp.up_proj.weight": "model-00013-of-00014.safetensors",
144
+ "model.layers.19.post_attention_layernorm.weight": "model-00009-of-00014.safetensors",
145
+ "model.layers.19.self_attn.k_proj.bias": "model-00005-of-00014.safetensors",
146
+ "model.layers.19.self_attn.k_proj.weight": "model-00003-of-00014.safetensors",
147
+ "model.layers.19.self_attn.o_proj.weight": "model-00006-of-00014.safetensors",
148
+ "model.layers.19.self_attn.q_proj.bias": "model-00010-of-00014.safetensors",
149
+ "model.layers.19.self_attn.q_proj.weight": "model-00014-of-00014.safetensors",
150
+ "model.layers.19.self_attn.v_proj.bias": "model-00012-of-00014.safetensors",
151
+ "model.layers.19.self_attn.v_proj.weight": "model-00011-of-00014.safetensors",
152
+ "model.layers.2.input_layernorm.weight": "model-00011-of-00014.safetensors",
153
+ "model.layers.2.mlp.down_proj.weight": "model-00003-of-00014.safetensors",
154
+ "model.layers.2.mlp.gate_proj.weight": "model-00012-of-00014.safetensors",
155
+ "model.layers.2.mlp.up_proj.weight": "model-00010-of-00014.safetensors",
156
+ "model.layers.2.post_attention_layernorm.weight": "model-00014-of-00014.safetensors",
157
+ "model.layers.2.self_attn.k_proj.bias": "model-00010-of-00014.safetensors",
158
+ "model.layers.2.self_attn.k_proj.weight": "model-00003-of-00014.safetensors",
159
+ "model.layers.2.self_attn.o_proj.weight": "model-00003-of-00014.safetensors",
160
+ "model.layers.2.self_attn.q_proj.bias": "model-00011-of-00014.safetensors",
161
+ "model.layers.2.self_attn.q_proj.weight": "model-00003-of-00014.safetensors",
162
+ "model.layers.2.self_attn.v_proj.bias": "model-00003-of-00014.safetensors",
163
+ "model.layers.2.self_attn.v_proj.weight": "model-00004-of-00014.safetensors",
164
+ "model.layers.20.input_layernorm.weight": "model-00002-of-00014.safetensors",
165
+ "model.layers.20.mlp.down_proj.weight": "model-00014-of-00014.safetensors",
166
+ "model.layers.20.mlp.gate_proj.weight": "model-00011-of-00014.safetensors",
167
+ "model.layers.20.mlp.up_proj.weight": "model-00004-of-00014.safetensors",
168
+ "model.layers.20.post_attention_layernorm.weight": "model-00002-of-00014.safetensors",
169
+ "model.layers.20.self_attn.k_proj.bias": "model-00005-of-00014.safetensors",
170
+ "model.layers.20.self_attn.k_proj.weight": "model-00009-of-00014.safetensors",
171
+ "model.layers.20.self_attn.o_proj.weight": "model-00001-of-00014.safetensors",
172
+ "model.layers.20.self_attn.q_proj.bias": "model-00003-of-00014.safetensors",
173
+ "model.layers.20.self_attn.q_proj.weight": "model-00005-of-00014.safetensors",
174
+ "model.layers.20.self_attn.v_proj.bias": "model-00012-of-00014.safetensors",
175
+ "model.layers.20.self_attn.v_proj.weight": "model-00009-of-00014.safetensors",
176
+ "model.layers.21.input_layernorm.weight": "model-00002-of-00014.safetensors",
177
+ "model.layers.21.mlp.down_proj.weight": "model-00007-of-00014.safetensors",
178
+ "model.layers.21.mlp.gate_proj.weight": "model-00007-of-00014.safetensors",
179
+ "model.layers.21.mlp.up_proj.weight": "model-00014-of-00014.safetensors",
180
+ "model.layers.21.post_attention_layernorm.weight": "model-00010-of-00014.safetensors",
181
+ "model.layers.21.self_attn.k_proj.bias": "model-00011-of-00014.safetensors",
182
+ "model.layers.21.self_attn.k_proj.weight": "model-00009-of-00014.safetensors",
183
+ "model.layers.21.self_attn.o_proj.weight": "model-00009-of-00014.safetensors",
184
+ "model.layers.21.self_attn.q_proj.bias": "model-00007-of-00014.safetensors",
185
+ "model.layers.21.self_attn.q_proj.weight": "model-00013-of-00014.safetensors",
186
+ "model.layers.21.self_attn.v_proj.bias": "model-00004-of-00014.safetensors",
187
+ "model.layers.21.self_attn.v_proj.weight": "model-00003-of-00014.safetensors",
188
+ "model.layers.22.input_layernorm.weight": "model-00003-of-00014.safetensors",
189
+ "model.layers.22.mlp.down_proj.weight": "model-00013-of-00014.safetensors",
190
+ "model.layers.22.mlp.gate_proj.weight": "model-00013-of-00014.safetensors",
191
+ "model.layers.22.mlp.up_proj.weight": "model-00007-of-00014.safetensors",
192
+ "model.layers.22.post_attention_layernorm.weight": "model-00003-of-00014.safetensors",
193
+ "model.layers.22.self_attn.k_proj.bias": "model-00010-of-00014.safetensors",
194
+ "model.layers.22.self_attn.k_proj.weight": "model-00011-of-00014.safetensors",
195
+ "model.layers.22.self_attn.o_proj.weight": "model-00013-of-00014.safetensors",
196
+ "model.layers.22.self_attn.q_proj.bias": "model-00005-of-00014.safetensors",
197
+ "model.layers.22.self_attn.q_proj.weight": "model-00011-of-00014.safetensors",
198
+ "model.layers.22.self_attn.v_proj.bias": "model-00003-of-00014.safetensors",
199
+ "model.layers.22.self_attn.v_proj.weight": "model-00010-of-00014.safetensors",
200
+ "model.layers.23.input_layernorm.weight": "model-00007-of-00014.safetensors",
201
+ "model.layers.23.mlp.down_proj.weight": "model-00004-of-00014.safetensors",
202
+ "model.layers.23.mlp.gate_proj.weight": "model-00008-of-00014.safetensors",
203
+ "model.layers.23.mlp.up_proj.weight": "model-00013-of-00014.safetensors",
204
+ "model.layers.23.post_attention_layernorm.weight": "model-00006-of-00014.safetensors",
205
+ "model.layers.23.self_attn.k_proj.bias": "model-00001-of-00014.safetensors",
206
+ "model.layers.23.self_attn.k_proj.weight": "model-00001-of-00014.safetensors",
207
+ "model.layers.23.self_attn.o_proj.weight": "model-00001-of-00014.safetensors",
208
+ "model.layers.23.self_attn.q_proj.bias": "model-00010-of-00014.safetensors",
209
+ "model.layers.23.self_attn.q_proj.weight": "model-00003-of-00014.safetensors",
210
+ "model.layers.23.self_attn.v_proj.bias": "model-00006-of-00014.safetensors",
211
+ "model.layers.23.self_attn.v_proj.weight": "model-00013-of-00014.safetensors",
212
+ "model.layers.24.input_layernorm.weight": "model-00007-of-00014.safetensors",
213
+ "model.layers.24.mlp.down_proj.weight": "model-00010-of-00014.safetensors",
214
+ "model.layers.24.mlp.gate_proj.weight": "model-00008-of-00014.safetensors",
215
+ "model.layers.24.mlp.up_proj.weight": "model-00006-of-00014.safetensors",
216
+ "model.layers.24.post_attention_layernorm.weight": "model-00007-of-00014.safetensors",
217
+ "model.layers.24.self_attn.k_proj.bias": "model-00006-of-00014.safetensors",
218
+ "model.layers.24.self_attn.k_proj.weight": "model-00012-of-00014.safetensors",
219
+ "model.layers.24.self_attn.o_proj.weight": "model-00010-of-00014.safetensors",
220
+ "model.layers.24.self_attn.q_proj.bias": "model-00010-of-00014.safetensors",
221
+ "model.layers.24.self_attn.q_proj.weight": "model-00003-of-00014.safetensors",
222
+ "model.layers.24.self_attn.v_proj.bias": "model-00012-of-00014.safetensors",
223
+ "model.layers.24.self_attn.v_proj.weight": "model-00010-of-00014.safetensors",
224
+ "model.layers.25.input_layernorm.weight": "model-00009-of-00014.safetensors",
225
+ "model.layers.25.mlp.down_proj.weight": "model-00007-of-00014.safetensors",
226
+ "model.layers.25.mlp.gate_proj.weight": "model-00010-of-00014.safetensors",
227
+ "model.layers.25.mlp.up_proj.weight": "model-00013-of-00014.safetensors",
228
+ "model.layers.25.post_attention_layernorm.weight": "model-00004-of-00014.safetensors",
229
+ "model.layers.25.self_attn.k_proj.bias": "model-00001-of-00014.safetensors",
230
+ "model.layers.25.self_attn.k_proj.weight": "model-00001-of-00014.safetensors",
231
+ "model.layers.25.self_attn.o_proj.weight": "model-00007-of-00014.safetensors",
232
+ "model.layers.25.self_attn.q_proj.bias": "model-00006-of-00014.safetensors",
233
+ "model.layers.25.self_attn.q_proj.weight": "model-00004-of-00014.safetensors",
234
+ "model.layers.25.self_attn.v_proj.bias": "model-00006-of-00014.safetensors",
235
+ "model.layers.25.self_attn.v_proj.weight": "model-00010-of-00014.safetensors",
236
+ "model.layers.26.input_layernorm.weight": "model-00009-of-00014.safetensors",
237
+ "model.layers.26.mlp.down_proj.weight": "model-00009-of-00014.safetensors",
238
+ "model.layers.26.mlp.gate_proj.weight": "model-00004-of-00014.safetensors",
239
+ "model.layers.26.mlp.up_proj.weight": "model-00012-of-00014.safetensors",
240
+ "model.layers.26.post_attention_layernorm.weight": "model-00013-of-00014.safetensors",
241
+ "model.layers.26.self_attn.k_proj.bias": "model-00012-of-00014.safetensors",
242
+ "model.layers.26.self_attn.k_proj.weight": "model-00010-of-00014.safetensors",
243
+ "model.layers.26.self_attn.o_proj.weight": "model-00012-of-00014.safetensors",
244
+ "model.layers.26.self_attn.q_proj.bias": "model-00008-of-00014.safetensors",
245
+ "model.layers.26.self_attn.q_proj.weight": "model-00013-of-00014.safetensors",
246
+ "model.layers.26.self_attn.v_proj.bias": "model-00009-of-00014.safetensors",
247
+ "model.layers.26.self_attn.v_proj.weight": "model-00001-of-00014.safetensors",
248
+ "model.layers.27.input_layernorm.weight": "model-00006-of-00014.safetensors",
249
+ "model.layers.27.mlp.down_proj.weight": "model-00004-of-00014.safetensors",
250
+ "model.layers.27.mlp.gate_proj.weight": "model-00009-of-00014.safetensors",
251
+ "model.layers.27.mlp.up_proj.weight": "model-00005-of-00014.safetensors",
252
+ "model.layers.27.post_attention_layernorm.weight": "model-00009-of-00014.safetensors",
253
+ "model.layers.27.self_attn.k_proj.bias": "model-00012-of-00014.safetensors",
254
+ "model.layers.27.self_attn.k_proj.weight": "model-00002-of-00014.safetensors",
255
+ "model.layers.27.self_attn.o_proj.weight": "model-00012-of-00014.safetensors",
256
+ "model.layers.27.self_attn.q_proj.bias": "model-00005-of-00014.safetensors",
257
+ "model.layers.27.self_attn.q_proj.weight": "model-00009-of-00014.safetensors",
258
+ "model.layers.27.self_attn.v_proj.bias": "model-00009-of-00014.safetensors",
259
+ "model.layers.27.self_attn.v_proj.weight": "model-00001-of-00014.safetensors",
260
+ "model.layers.28.input_layernorm.weight": "model-00001-of-00014.safetensors",
261
+ "model.layers.28.mlp.down_proj.weight": "model-00010-of-00014.safetensors",
262
+ "model.layers.28.mlp.gate_proj.weight": "model-00009-of-00014.safetensors",
263
+ "model.layers.28.mlp.up_proj.weight": "model-00007-of-00014.safetensors",
264
+ "model.layers.28.post_attention_layernorm.weight": "model-00013-of-00014.safetensors",
265
+ "model.layers.28.self_attn.k_proj.bias": "model-00010-of-00014.safetensors",
266
+ "model.layers.28.self_attn.k_proj.weight": "model-00006-of-00014.safetensors",
267
+ "model.layers.28.self_attn.o_proj.weight": "model-00003-of-00014.safetensors",
268
+ "model.layers.28.self_attn.q_proj.bias": "model-00007-of-00014.safetensors",
269
+ "model.layers.28.self_attn.q_proj.weight": "model-00011-of-00014.safetensors",
270
+ "model.layers.28.self_attn.v_proj.bias": "model-00005-of-00014.safetensors",
271
+ "model.layers.28.self_attn.v_proj.weight": "model-00012-of-00014.safetensors",
272
+ "model.layers.29.input_layernorm.weight": "model-00003-of-00014.safetensors",
273
+ "model.layers.29.mlp.down_proj.weight": "model-00003-of-00014.safetensors",
274
+ "model.layers.29.mlp.gate_proj.weight": "model-00005-of-00014.safetensors",
275
+ "model.layers.29.mlp.up_proj.weight": "model-00009-of-00014.safetensors",
276
+ "model.layers.29.post_attention_layernorm.weight": "model-00001-of-00014.safetensors",
277
+ "model.layers.29.self_attn.k_proj.bias": "model-00011-of-00014.safetensors",
278
+ "model.layers.29.self_attn.k_proj.weight": "model-00003-of-00014.safetensors",
279
+ "model.layers.29.self_attn.o_proj.weight": "model-00014-of-00014.safetensors",
280
+ "model.layers.29.self_attn.q_proj.bias": "model-00013-of-00014.safetensors",
281
+ "model.layers.29.self_attn.q_proj.weight": "model-00002-of-00014.safetensors",
282
+ "model.layers.29.self_attn.v_proj.bias": "model-00014-of-00014.safetensors",
283
+ "model.layers.29.self_attn.v_proj.weight": "model-00005-of-00014.safetensors",
284
+ "model.layers.3.input_layernorm.weight": "model-00008-of-00014.safetensors",
285
+ "model.layers.3.mlp.down_proj.weight": "model-00003-of-00014.safetensors",
286
+ "model.layers.3.mlp.gate_proj.weight": "model-00002-of-00014.safetensors",
287
+ "model.layers.3.mlp.up_proj.weight": "model-00010-of-00014.safetensors",
288
+ "model.layers.3.post_attention_layernorm.weight": "model-00010-of-00014.safetensors",
289
+ "model.layers.3.self_attn.k_proj.bias": "model-00012-of-00014.safetensors",
290
+ "model.layers.3.self_attn.k_proj.weight": "model-00001-of-00014.safetensors",
291
+ "model.layers.3.self_attn.o_proj.weight": "model-00001-of-00014.safetensors",
292
+ "model.layers.3.self_attn.q_proj.bias": "model-00007-of-00014.safetensors",
293
+ "model.layers.3.self_attn.q_proj.weight": "model-00009-of-00014.safetensors",
294
+ "model.layers.3.self_attn.v_proj.bias": "model-00002-of-00014.safetensors",
295
+ "model.layers.3.self_attn.v_proj.weight": "model-00011-of-00014.safetensors",
296
+ "model.layers.30.input_layernorm.weight": "model-00006-of-00014.safetensors",
297
+ "model.layers.30.mlp.down_proj.weight": "model-00004-of-00014.safetensors",
298
+ "model.layers.30.mlp.gate_proj.weight": "model-00001-of-00014.safetensors",
299
+ "model.layers.30.mlp.up_proj.weight": "model-00004-of-00014.safetensors",
300
+ "model.layers.30.post_attention_layernorm.weight": "model-00002-of-00014.safetensors",
301
+ "model.layers.30.self_attn.k_proj.bias": "model-00014-of-00014.safetensors",
302
+ "model.layers.30.self_attn.k_proj.weight": "model-00007-of-00014.safetensors",
303
+ "model.layers.30.self_attn.o_proj.weight": "model-00004-of-00014.safetensors",
304
+ "model.layers.30.self_attn.q_proj.bias": "model-00004-of-00014.safetensors",
305
+ "model.layers.30.self_attn.q_proj.weight": "model-00003-of-00014.safetensors",
306
+ "model.layers.30.self_attn.v_proj.bias": "model-00005-of-00014.safetensors",
307
+ "model.layers.30.self_attn.v_proj.weight": "model-00008-of-00014.safetensors",
308
+ "model.layers.31.input_layernorm.weight": "model-00013-of-00014.safetensors",
309
+ "model.layers.31.mlp.down_proj.weight": "model-00006-of-00014.safetensors",
310
+ "model.layers.31.mlp.gate_proj.weight": "model-00009-of-00014.safetensors",
311
+ "model.layers.31.mlp.up_proj.weight": "model-00011-of-00014.safetensors",
312
+ "model.layers.31.post_attention_layernorm.weight": "model-00013-of-00014.safetensors",
313
+ "model.layers.31.self_attn.k_proj.bias": "model-00010-of-00014.safetensors",
314
+ "model.layers.31.self_attn.k_proj.weight": "model-00008-of-00014.safetensors",
315
+ "model.layers.31.self_attn.o_proj.weight": "model-00005-of-00014.safetensors",
316
+ "model.layers.31.self_attn.q_proj.bias": "model-00003-of-00014.safetensors",
317
+ "model.layers.31.self_attn.q_proj.weight": "model-00005-of-00014.safetensors",
318
+ "model.layers.31.self_attn.v_proj.bias": "model-00007-of-00014.safetensors",
319
+ "model.layers.31.self_attn.v_proj.weight": "model-00008-of-00014.safetensors",
320
+ "model.layers.32.input_layernorm.weight": "model-00011-of-00014.safetensors",
321
+ "model.layers.32.mlp.down_proj.weight": "model-00011-of-00014.safetensors",
322
+ "model.layers.32.mlp.gate_proj.weight": "model-00006-of-00014.safetensors",
323
+ "model.layers.32.mlp.up_proj.weight": "model-00005-of-00014.safetensors",
324
+ "model.layers.32.post_attention_layernorm.weight": "model-00005-of-00014.safetensors",
325
+ "model.layers.32.self_attn.k_proj.bias": "model-00004-of-00014.safetensors",
326
+ "model.layers.32.self_attn.k_proj.weight": "model-00005-of-00014.safetensors",
327
+ "model.layers.32.self_attn.o_proj.weight": "model-00002-of-00014.safetensors",
328
+ "model.layers.32.self_attn.q_proj.bias": "model-00010-of-00014.safetensors",
329
+ "model.layers.32.self_attn.q_proj.weight": "model-00014-of-00014.safetensors",
330
+ "model.layers.32.self_attn.v_proj.bias": "model-00004-of-00014.safetensors",
331
+ "model.layers.32.self_attn.v_proj.weight": "model-00003-of-00014.safetensors",
332
+ "model.layers.33.input_layernorm.weight": "model-00010-of-00014.safetensors",
333
+ "model.layers.33.mlp.down_proj.weight": "model-00012-of-00014.safetensors",
334
+ "model.layers.33.mlp.gate_proj.weight": "model-00001-of-00014.safetensors",
335
+ "model.layers.33.mlp.up_proj.weight": "model-00003-of-00014.safetensors",
336
+ "model.layers.33.post_attention_layernorm.weight": "model-00013-of-00014.safetensors",
337
+ "model.layers.33.self_attn.k_proj.bias": "model-00010-of-00014.safetensors",
338
+ "model.layers.33.self_attn.k_proj.weight": "model-00010-of-00014.safetensors",
339
+ "model.layers.33.self_attn.o_proj.weight": "model-00001-of-00014.safetensors",
340
+ "model.layers.33.self_attn.q_proj.bias": "model-00005-of-00014.safetensors",
341
+ "model.layers.33.self_attn.q_proj.weight": "model-00004-of-00014.safetensors",
342
+ "model.layers.33.self_attn.v_proj.bias": "model-00004-of-00014.safetensors",
343
+ "model.layers.33.self_attn.v_proj.weight": "model-00014-of-00014.safetensors",
344
+ "model.layers.34.input_layernorm.weight": "model-00009-of-00014.safetensors",
345
+ "model.layers.34.mlp.down_proj.weight": "model-00001-of-00014.safetensors",
346
+ "model.layers.34.mlp.gate_proj.weight": "model-00006-of-00014.safetensors",
347
+ "model.layers.34.mlp.up_proj.weight": "model-00001-of-00014.safetensors",
348
+ "model.layers.34.post_attention_layernorm.weight": "model-00012-of-00014.safetensors",
349
+ "model.layers.34.self_attn.k_proj.bias": "model-00013-of-00014.safetensors",
350
+ "model.layers.34.self_attn.k_proj.weight": "model-00003-of-00014.safetensors",
351
+ "model.layers.34.self_attn.o_proj.weight": "model-00001-of-00014.safetensors",
352
+ "model.layers.34.self_attn.q_proj.bias": "model-00009-of-00014.safetensors",
353
+ "model.layers.34.self_attn.q_proj.weight": "model-00013-of-00014.safetensors",
354
+ "model.layers.34.self_attn.v_proj.bias": "model-00006-of-00014.safetensors",
355
+ "model.layers.34.self_attn.v_proj.weight": "model-00007-of-00014.safetensors",
356
+ "model.layers.35.input_layernorm.weight": "model-00005-of-00014.safetensors",
357
+ "model.layers.35.mlp.down_proj.weight": "model-00009-of-00014.safetensors",
358
+ "model.layers.35.mlp.gate_proj.weight": "model-00012-of-00014.safetensors",
359
+ "model.layers.35.mlp.up_proj.weight": "model-00011-of-00014.safetensors",
360
+ "model.layers.35.post_attention_layernorm.weight": "model-00001-of-00014.safetensors",
361
+ "model.layers.35.self_attn.k_proj.bias": "model-00011-of-00014.safetensors",
362
+ "model.layers.35.self_attn.k_proj.weight": "model-00007-of-00014.safetensors",
363
+ "model.layers.35.self_attn.o_proj.weight": "model-00004-of-00014.safetensors",
364
+ "model.layers.35.self_attn.q_proj.bias": "model-00007-of-00014.safetensors",
365
+ "model.layers.35.self_attn.q_proj.weight": "model-00009-of-00014.safetensors",
366
+ "model.layers.35.self_attn.v_proj.bias": "model-00009-of-00014.safetensors",
367
+ "model.layers.35.self_attn.v_proj.weight": "model-00004-of-00014.safetensors",
368
+ "model.layers.36.input_layernorm.weight": "model-00010-of-00014.safetensors",
369
+ "model.layers.36.mlp.down_proj.weight": "model-00006-of-00014.safetensors",
370
+ "model.layers.36.mlp.gate_proj.weight": "model-00006-of-00014.safetensors",
371
+ "model.layers.36.mlp.up_proj.weight": "model-00002-of-00014.safetensors",
372
+ "model.layers.36.post_attention_layernorm.weight": "model-00011-of-00014.safetensors",
373
+ "model.layers.36.self_attn.k_proj.bias": "model-00007-of-00014.safetensors",
374
+ "model.layers.36.self_attn.k_proj.weight": "model-00014-of-00014.safetensors",
375
+ "model.layers.36.self_attn.o_proj.weight": "model-00002-of-00014.safetensors",
376
+ "model.layers.36.self_attn.q_proj.bias": "model-00009-of-00014.safetensors",
377
+ "model.layers.36.self_attn.q_proj.weight": "model-00010-of-00014.safetensors",
378
+ "model.layers.36.self_attn.v_proj.bias": "model-00004-of-00014.safetensors",
379
+ "model.layers.36.self_attn.v_proj.weight": "model-00012-of-00014.safetensors",
380
+ "model.layers.37.input_layernorm.weight": "model-00012-of-00014.safetensors",
381
+ "model.layers.37.mlp.down_proj.weight": "model-00004-of-00014.safetensors",
382
+ "model.layers.37.mlp.gate_proj.weight": "model-00005-of-00014.safetensors",
383
+ "model.layers.37.mlp.up_proj.weight": "model-00009-of-00014.safetensors",
384
+ "model.layers.37.post_attention_layernorm.weight": "model-00011-of-00014.safetensors",
385
+ "model.layers.37.self_attn.k_proj.bias": "model-00008-of-00014.safetensors",
386
+ "model.layers.37.self_attn.k_proj.weight": "model-00010-of-00014.safetensors",
387
+ "model.layers.37.self_attn.o_proj.weight": "model-00006-of-00014.safetensors",
388
+ "model.layers.37.self_attn.q_proj.bias": "model-00008-of-00014.safetensors",
389
+ "model.layers.37.self_attn.q_proj.weight": "model-00008-of-00014.safetensors",
390
+ "model.layers.37.self_attn.v_proj.bias": "model-00010-of-00014.safetensors",
391
+ "model.layers.37.self_attn.v_proj.weight": "model-00009-of-00014.safetensors",
392
+ "model.layers.38.input_layernorm.weight": "model-00010-of-00014.safetensors",
393
+ "model.layers.38.mlp.down_proj.weight": "model-00005-of-00014.safetensors",
394
+ "model.layers.38.mlp.gate_proj.weight": "model-00013-of-00014.safetensors",
395
+ "model.layers.38.mlp.up_proj.weight": "model-00002-of-00014.safetensors",
396
+ "model.layers.38.post_attention_layernorm.weight": "model-00007-of-00014.safetensors",
397
+ "model.layers.38.self_attn.k_proj.bias": "model-00008-of-00014.safetensors",
398
+ "model.layers.38.self_attn.k_proj.weight": "model-00009-of-00014.safetensors",
399
+ "model.layers.38.self_attn.o_proj.weight": "model-00003-of-00014.safetensors",
400
+ "model.layers.38.self_attn.q_proj.bias": "model-00010-of-00014.safetensors",
401
+ "model.layers.38.self_attn.q_proj.weight": "model-00010-of-00014.safetensors",
402
+ "model.layers.38.self_attn.v_proj.bias": "model-00001-of-00014.safetensors",
403
+ "model.layers.38.self_attn.v_proj.weight": "model-00003-of-00014.safetensors",
404
+ "model.layers.39.input_layernorm.weight": "model-00004-of-00014.safetensors",
405
+ "model.layers.39.mlp.down_proj.weight": "model-00003-of-00014.safetensors",
406
+ "model.layers.39.mlp.gate_proj.weight": "model-00014-of-00014.safetensors",
407
+ "model.layers.39.mlp.up_proj.weight": "model-00004-of-00014.safetensors",
408
+ "model.layers.39.post_attention_layernorm.weight": "model-00013-of-00014.safetensors",
409
+ "model.layers.39.self_attn.k_proj.bias": "model-00008-of-00014.safetensors",
410
+ "model.layers.39.self_attn.k_proj.weight": "model-00006-of-00014.safetensors",
411
+ "model.layers.39.self_attn.o_proj.weight": "model-00001-of-00014.safetensors",
412
+ "model.layers.39.self_attn.q_proj.bias": "model-00011-of-00014.safetensors",
413
+ "model.layers.39.self_attn.q_proj.weight": "model-00001-of-00014.safetensors",
414
+ "model.layers.39.self_attn.v_proj.bias": "model-00004-of-00014.safetensors",
415
+ "model.layers.39.self_attn.v_proj.weight": "model-00002-of-00014.safetensors",
416
+ "model.layers.4.input_layernorm.weight": "model-00005-of-00014.safetensors",
417
+ "model.layers.4.mlp.down_proj.weight": "model-00009-of-00014.safetensors",
418
+ "model.layers.4.mlp.gate_proj.weight": "model-00001-of-00014.safetensors",
419
+ "model.layers.4.mlp.up_proj.weight": "model-00013-of-00014.safetensors",
420
+ "model.layers.4.post_attention_layernorm.weight": "model-00010-of-00014.safetensors",
421
+ "model.layers.4.self_attn.k_proj.bias": "model-00003-of-00014.safetensors",
422
+ "model.layers.4.self_attn.k_proj.weight": "model-00004-of-00014.safetensors",
423
+ "model.layers.4.self_attn.o_proj.weight": "model-00005-of-00014.safetensors",
424
+ "model.layers.4.self_attn.q_proj.bias": "model-00012-of-00014.safetensors",
425
+ "model.layers.4.self_attn.q_proj.weight": "model-00014-of-00014.safetensors",
426
+ "model.layers.4.self_attn.v_proj.bias": "model-00006-of-00014.safetensors",
427
+ "model.layers.4.self_attn.v_proj.weight": "model-00010-of-00014.safetensors",
428
+ "model.layers.40.input_layernorm.weight": "model-00001-of-00014.safetensors",
429
+ "model.layers.40.mlp.down_proj.weight": "model-00004-of-00014.safetensors",
430
+ "model.layers.40.mlp.gate_proj.weight": "model-00004-of-00014.safetensors",
431
+ "model.layers.40.mlp.up_proj.weight": "model-00008-of-00014.safetensors",
432
+ "model.layers.40.post_attention_layernorm.weight": "model-00005-of-00014.safetensors",
433
+ "model.layers.40.self_attn.k_proj.bias": "model-00010-of-00014.safetensors",
434
+ "model.layers.40.self_attn.k_proj.weight": "model-00004-of-00014.safetensors",
435
+ "model.layers.40.self_attn.o_proj.weight": "model-00012-of-00014.safetensors",
436
+ "model.layers.40.self_attn.q_proj.bias": "model-00012-of-00014.safetensors",
437
+ "model.layers.40.self_attn.q_proj.weight": "model-00001-of-00014.safetensors",
438
+ "model.layers.40.self_attn.v_proj.bias": "model-00003-of-00014.safetensors",
439
+ "model.layers.40.self_attn.v_proj.weight": "model-00006-of-00014.safetensors",
440
+ "model.layers.41.input_layernorm.weight": "model-00006-of-00014.safetensors",
441
+ "model.layers.41.mlp.down_proj.weight": "model-00009-of-00014.safetensors",
442
+ "model.layers.41.mlp.gate_proj.weight": "model-00012-of-00014.safetensors",
443
+ "model.layers.41.mlp.up_proj.weight": "model-00002-of-00014.safetensors",
444
+ "model.layers.41.post_attention_layernorm.weight": "model-00010-of-00014.safetensors",
445
+ "model.layers.41.self_attn.k_proj.bias": "model-00001-of-00014.safetensors",
446
+ "model.layers.41.self_attn.k_proj.weight": "model-00007-of-00014.safetensors",
447
+ "model.layers.41.self_attn.o_proj.weight": "model-00004-of-00014.safetensors",
448
+ "model.layers.41.self_attn.q_proj.bias": "model-00001-of-00014.safetensors",
449
+ "model.layers.41.self_attn.q_proj.weight": "model-00005-of-00014.safetensors",
450
+ "model.layers.41.self_attn.v_proj.bias": "model-00012-of-00014.safetensors",
451
+ "model.layers.41.self_attn.v_proj.weight": "model-00001-of-00014.safetensors",
452
+ "model.layers.42.input_layernorm.weight": "model-00001-of-00014.safetensors",
453
+ "model.layers.42.mlp.down_proj.weight": "model-00011-of-00014.safetensors",
454
+ "model.layers.42.mlp.gate_proj.weight": "model-00005-of-00014.safetensors",
455
+ "model.layers.42.mlp.up_proj.weight": "model-00001-of-00014.safetensors",
456
+ "model.layers.42.post_attention_layernorm.weight": "model-00003-of-00014.safetensors",
457
+ "model.layers.42.self_attn.k_proj.bias": "model-00003-of-00014.safetensors",
458
+ "model.layers.42.self_attn.k_proj.weight": "model-00006-of-00014.safetensors",
459
+ "model.layers.42.self_attn.o_proj.weight": "model-00002-of-00014.safetensors",
460
+ "model.layers.42.self_attn.q_proj.bias": "model-00010-of-00014.safetensors",
461
+ "model.layers.42.self_attn.q_proj.weight": "model-00006-of-00014.safetensors",
462
+ "model.layers.42.self_attn.v_proj.bias": "model-00010-of-00014.safetensors",
463
+ "model.layers.42.self_attn.v_proj.weight": "model-00001-of-00014.safetensors",
464
+ "model.layers.43.input_layernorm.weight": "model-00006-of-00014.safetensors",
465
+ "model.layers.43.mlp.down_proj.weight": "model-00010-of-00014.safetensors",
466
+ "model.layers.43.mlp.gate_proj.weight": "model-00012-of-00014.safetensors",
467
+ "model.layers.43.mlp.up_proj.weight": "model-00010-of-00014.safetensors",
468
+ "model.layers.43.post_attention_layernorm.weight": "model-00011-of-00014.safetensors",
469
+ "model.layers.43.self_attn.k_proj.bias": "model-00010-of-00014.safetensors",
470
+ "model.layers.43.self_attn.k_proj.weight": "model-00005-of-00014.safetensors",
471
+ "model.layers.43.self_attn.o_proj.weight": "model-00010-of-00014.safetensors",
472
+ "model.layers.43.self_attn.q_proj.bias": "model-00007-of-00014.safetensors",
473
+ "model.layers.43.self_attn.q_proj.weight": "model-00003-of-00014.safetensors",
474
+ "model.layers.43.self_attn.v_proj.bias": "model-00009-of-00014.safetensors",
475
+ "model.layers.43.self_attn.v_proj.weight": "model-00006-of-00014.safetensors",
476
+ "model.layers.44.input_layernorm.weight": "model-00001-of-00014.safetensors",
477
+ "model.layers.44.mlp.down_proj.weight": "model-00001-of-00014.safetensors",
478
+ "model.layers.44.mlp.gate_proj.weight": "model-00013-of-00014.safetensors",
479
+ "model.layers.44.mlp.up_proj.weight": "model-00014-of-00014.safetensors",
480
+ "model.layers.44.post_attention_layernorm.weight": "model-00014-of-00014.safetensors",
481
+ "model.layers.44.self_attn.k_proj.bias": "model-00007-of-00014.safetensors",
482
+ "model.layers.44.self_attn.k_proj.weight": "model-00013-of-00014.safetensors",
483
+ "model.layers.44.self_attn.o_proj.weight": "model-00007-of-00014.safetensors",
484
+ "model.layers.44.self_attn.q_proj.bias": "model-00011-of-00014.safetensors",
485
+ "model.layers.44.self_attn.q_proj.weight": "model-00012-of-00014.safetensors",
486
+ "model.layers.44.self_attn.v_proj.bias": "model-00011-of-00014.safetensors",
487
+ "model.layers.44.self_attn.v_proj.weight": "model-00013-of-00014.safetensors",
488
+ "model.layers.45.input_layernorm.weight": "model-00001-of-00014.safetensors",
489
+ "model.layers.45.mlp.down_proj.weight": "model-00009-of-00014.safetensors",
490
+ "model.layers.45.mlp.gate_proj.weight": "model-00010-of-00014.safetensors",
491
+ "model.layers.45.mlp.up_proj.weight": "model-00004-of-00014.safetensors",
492
+ "model.layers.45.post_attention_layernorm.weight": "model-00004-of-00014.safetensors",
493
+ "model.layers.45.self_attn.k_proj.bias": "model-00005-of-00014.safetensors",
494
+ "model.layers.45.self_attn.k_proj.weight": "model-00010-of-00014.safetensors",
495
+ "model.layers.45.self_attn.o_proj.weight": "model-00013-of-00014.safetensors",
496
+ "model.layers.45.self_attn.q_proj.bias": "model-00003-of-00014.safetensors",
497
+ "model.layers.45.self_attn.q_proj.weight": "model-00003-of-00014.safetensors",
498
+ "model.layers.45.self_attn.v_proj.bias": "model-00009-of-00014.safetensors",
499
+ "model.layers.45.self_attn.v_proj.weight": "model-00004-of-00014.safetensors",
500
+ "model.layers.46.input_layernorm.weight": "model-00001-of-00014.safetensors",
501
+ "model.layers.46.mlp.down_proj.weight": "model-00001-of-00014.safetensors",
502
+ "model.layers.46.mlp.gate_proj.weight": "model-00005-of-00014.safetensors",
503
+ "model.layers.46.mlp.up_proj.weight": "model-00009-of-00014.safetensors",
504
+ "model.layers.46.post_attention_layernorm.weight": "model-00001-of-00014.safetensors",
505
+ "model.layers.46.self_attn.k_proj.bias": "model-00012-of-00014.safetensors",
506
+ "model.layers.46.self_attn.k_proj.weight": "model-00013-of-00014.safetensors",
507
+ "model.layers.46.self_attn.o_proj.weight": "model-00001-of-00014.safetensors",
508
+ "model.layers.46.self_attn.q_proj.bias": "model-00010-of-00014.safetensors",
509
+ "model.layers.46.self_attn.q_proj.weight": "model-00004-of-00014.safetensors",
510
+ "model.layers.46.self_attn.v_proj.bias": "model-00002-of-00014.safetensors",
511
+ "model.layers.46.self_attn.v_proj.weight": "model-00003-of-00014.safetensors",
512
+ "model.layers.47.input_layernorm.weight": "model-00002-of-00014.safetensors",
513
+ "model.layers.47.mlp.down_proj.weight": "model-00008-of-00014.safetensors",
514
+ "model.layers.47.mlp.gate_proj.weight": "model-00002-of-00014.safetensors",
515
+ "model.layers.47.mlp.up_proj.weight": "model-00013-of-00014.safetensors",
516
+ "model.layers.47.post_attention_layernorm.weight": "model-00004-of-00014.safetensors",
517
+ "model.layers.47.self_attn.k_proj.bias": "model-00009-of-00014.safetensors",
518
+ "model.layers.47.self_attn.k_proj.weight": "model-00012-of-00014.safetensors",
519
+ "model.layers.47.self_attn.o_proj.weight": "model-00005-of-00014.safetensors",
520
+ "model.layers.47.self_attn.q_proj.bias": "model-00003-of-00014.safetensors",
521
+ "model.layers.47.self_attn.q_proj.weight": "model-00005-of-00014.safetensors",
522
+ "model.layers.47.self_attn.v_proj.bias": "model-00009-of-00014.safetensors",
523
+ "model.layers.47.self_attn.v_proj.weight": "model-00005-of-00014.safetensors",
524
+ "model.layers.48.input_layernorm.weight": "model-00008-of-00014.safetensors",
525
+ "model.layers.48.mlp.down_proj.weight": "model-00009-of-00014.safetensors",
526
+ "model.layers.48.mlp.gate_proj.weight": "model-00002-of-00014.safetensors",
527
+ "model.layers.48.mlp.up_proj.weight": "model-00012-of-00014.safetensors",
528
+ "model.layers.48.post_attention_layernorm.weight": "model-00008-of-00014.safetensors",
529
+ "model.layers.48.self_attn.k_proj.bias": "model-00013-of-00014.safetensors",
530
+ "model.layers.48.self_attn.k_proj.weight": "model-00013-of-00014.safetensors",
531
+ "model.layers.48.self_attn.o_proj.weight": "model-00003-of-00014.safetensors",
532
+ "model.layers.48.self_attn.q_proj.bias": "model-00007-of-00014.safetensors",
533
+ "model.layers.48.self_attn.q_proj.weight": "model-00001-of-00014.safetensors",
534
+ "model.layers.48.self_attn.v_proj.bias": "model-00012-of-00014.safetensors",
535
+ "model.layers.48.self_attn.v_proj.weight": "model-00013-of-00014.safetensors",
536
+ "model.layers.49.input_layernorm.weight": "model-00014-of-00014.safetensors",
537
+ "model.layers.49.mlp.down_proj.weight": "model-00012-of-00014.safetensors",
538
+ "model.layers.49.mlp.gate_proj.weight": "model-00003-of-00014.safetensors",
539
+ "model.layers.49.mlp.up_proj.weight": "model-00002-of-00014.safetensors",
540
+ "model.layers.49.post_attention_layernorm.weight": "model-00006-of-00014.safetensors",
541
+ "model.layers.49.self_attn.k_proj.bias": "model-00010-of-00014.safetensors",
542
+ "model.layers.49.self_attn.k_proj.weight": "model-00014-of-00014.safetensors",
543
+ "model.layers.49.self_attn.o_proj.weight": "model-00002-of-00014.safetensors",
544
+ "model.layers.49.self_attn.q_proj.bias": "model-00009-of-00014.safetensors",
545
+ "model.layers.49.self_attn.q_proj.weight": "model-00013-of-00014.safetensors",
546
+ "model.layers.49.self_attn.v_proj.bias": "model-00002-of-00014.safetensors",
547
+ "model.layers.49.self_attn.v_proj.weight": "model-00003-of-00014.safetensors",
548
+ "model.layers.5.input_layernorm.weight": "model-00004-of-00014.safetensors",
549
+ "model.layers.5.mlp.down_proj.weight": "model-00014-of-00014.safetensors",
550
+ "model.layers.5.mlp.gate_proj.weight": "model-00007-of-00014.safetensors",
551
+ "model.layers.5.mlp.up_proj.weight": "model-00014-of-00014.safetensors",
552
+ "model.layers.5.post_attention_layernorm.weight": "model-00011-of-00014.safetensors",
553
+ "model.layers.5.self_attn.k_proj.bias": "model-00003-of-00014.safetensors",
554
+ "model.layers.5.self_attn.k_proj.weight": "model-00011-of-00014.safetensors",
555
+ "model.layers.5.self_attn.o_proj.weight": "model-00004-of-00014.safetensors",
556
+ "model.layers.5.self_attn.q_proj.bias": "model-00008-of-00014.safetensors",
557
+ "model.layers.5.self_attn.q_proj.weight": "model-00004-of-00014.safetensors",
558
+ "model.layers.5.self_attn.v_proj.bias": "model-00014-of-00014.safetensors",
559
+ "model.layers.5.self_attn.v_proj.weight": "model-00004-of-00014.safetensors",
560
+ "model.layers.50.input_layernorm.weight": "model-00002-of-00014.safetensors",
561
+ "model.layers.50.mlp.down_proj.weight": "model-00006-of-00014.safetensors",
562
+ "model.layers.50.mlp.gate_proj.weight": "model-00002-of-00014.safetensors",
563
+ "model.layers.50.mlp.up_proj.weight": "model-00007-of-00014.safetensors",
564
+ "model.layers.50.post_attention_layernorm.weight": "model-00003-of-00014.safetensors",
565
+ "model.layers.50.self_attn.k_proj.bias": "model-00011-of-00014.safetensors",
566
+ "model.layers.50.self_attn.k_proj.weight": "model-00008-of-00014.safetensors",
567
+ "model.layers.50.self_attn.o_proj.weight": "model-00014-of-00014.safetensors",
568
+ "model.layers.50.self_attn.q_proj.bias": "model-00008-of-00014.safetensors",
569
+ "model.layers.50.self_attn.q_proj.weight": "model-00012-of-00014.safetensors",
570
+ "model.layers.50.self_attn.v_proj.bias": "model-00005-of-00014.safetensors",
571
+ "model.layers.50.self_attn.v_proj.weight": "model-00001-of-00014.safetensors",
572
+ "model.layers.51.input_layernorm.weight": "model-00003-of-00014.safetensors",
573
+ "model.layers.51.mlp.down_proj.weight": "model-00002-of-00014.safetensors",
574
+ "model.layers.51.mlp.gate_proj.weight": "model-00006-of-00014.safetensors",
575
+ "model.layers.51.mlp.up_proj.weight": "model-00010-of-00014.safetensors",
576
+ "model.layers.51.post_attention_layernorm.weight": "model-00010-of-00014.safetensors",
577
+ "model.layers.51.self_attn.k_proj.bias": "model-00001-of-00014.safetensors",
578
+ "model.layers.51.self_attn.k_proj.weight": "model-00012-of-00014.safetensors",
579
+ "model.layers.51.self_attn.o_proj.weight": "model-00014-of-00014.safetensors",
580
+ "model.layers.51.self_attn.q_proj.bias": "model-00006-of-00014.safetensors",
581
+ "model.layers.51.self_attn.q_proj.weight": "model-00005-of-00014.safetensors",
582
+ "model.layers.51.self_attn.v_proj.bias": "model-00001-of-00014.safetensors",
583
+ "model.layers.51.self_attn.v_proj.weight": "model-00012-of-00014.safetensors",
584
+ "model.layers.52.input_layernorm.weight": "model-00013-of-00014.safetensors",
585
+ "model.layers.52.mlp.down_proj.weight": "model-00012-of-00014.safetensors",
586
+ "model.layers.52.mlp.gate_proj.weight": "model-00005-of-00014.safetensors",
587
+ "model.layers.52.mlp.up_proj.weight": "model-00009-of-00014.safetensors",
588
+ "model.layers.52.post_attention_layernorm.weight": "model-00001-of-00014.safetensors",
589
+ "model.layers.52.self_attn.k_proj.bias": "model-00011-of-00014.safetensors",
590
+ "model.layers.52.self_attn.k_proj.weight": "model-00013-of-00014.safetensors",
591
+ "model.layers.52.self_attn.o_proj.weight": "model-00001-of-00014.safetensors",
592
+ "model.layers.52.self_attn.q_proj.bias": "model-00012-of-00014.safetensors",
593
+ "model.layers.52.self_attn.q_proj.weight": "model-00012-of-00014.safetensors",
594
+ "model.layers.52.self_attn.v_proj.bias": "model-00008-of-00014.safetensors",
595
+ "model.layers.52.self_attn.v_proj.weight": "model-00003-of-00014.safetensors",
596
+ "model.layers.53.input_layernorm.weight": "model-00011-of-00014.safetensors",
597
+ "model.layers.53.mlp.down_proj.weight": "model-00013-of-00014.safetensors",
598
+ "model.layers.53.mlp.gate_proj.weight": "model-00006-of-00014.safetensors",
599
+ "model.layers.53.mlp.up_proj.weight": "model-00002-of-00014.safetensors",
600
+ "model.layers.53.post_attention_layernorm.weight": "model-00007-of-00014.safetensors",
601
+ "model.layers.53.self_attn.k_proj.bias": "model-00008-of-00014.safetensors",
602
+ "model.layers.53.self_attn.k_proj.weight": "model-00012-of-00014.safetensors",
603
+ "model.layers.53.self_attn.o_proj.weight": "model-00012-of-00014.safetensors",
604
+ "model.layers.53.self_attn.q_proj.bias": "model-00014-of-00014.safetensors",
605
+ "model.layers.53.self_attn.q_proj.weight": "model-00012-of-00014.safetensors",
606
+ "model.layers.53.self_attn.v_proj.bias": "model-00011-of-00014.safetensors",
607
+ "model.layers.53.self_attn.v_proj.weight": "model-00013-of-00014.safetensors",
608
+ "model.layers.54.input_layernorm.weight": "model-00013-of-00014.safetensors",
609
+ "model.layers.54.mlp.down_proj.weight": "model-00002-of-00014.safetensors",
610
+ "model.layers.54.mlp.gate_proj.weight": "model-00014-of-00014.safetensors",
611
+ "model.layers.54.mlp.up_proj.weight": "model-00005-of-00014.safetensors",
612
+ "model.layers.54.post_attention_layernorm.weight": "model-00010-of-00014.safetensors",
613
+ "model.layers.54.self_attn.k_proj.bias": "model-00004-of-00014.safetensors",
614
+ "model.layers.54.self_attn.k_proj.weight": "model-00011-of-00014.safetensors",
615
+ "model.layers.54.self_attn.o_proj.weight": "model-00012-of-00014.safetensors",
616
+ "model.layers.54.self_attn.q_proj.bias": "model-00006-of-00014.safetensors",
617
+ "model.layers.54.self_attn.q_proj.weight": "model-00001-of-00014.safetensors",
618
+ "model.layers.54.self_attn.v_proj.bias": "model-00001-of-00014.safetensors",
619
+ "model.layers.54.self_attn.v_proj.weight": "model-00008-of-00014.safetensors",
620
+ "model.layers.55.input_layernorm.weight": "model-00014-of-00014.safetensors",
621
+ "model.layers.55.mlp.down_proj.weight": "model-00012-of-00014.safetensors",
622
+ "model.layers.55.mlp.gate_proj.weight": "model-00001-of-00014.safetensors",
623
+ "model.layers.55.mlp.up_proj.weight": "model-00003-of-00014.safetensors",
624
+ "model.layers.55.post_attention_layernorm.weight": "model-00004-of-00014.safetensors",
625
+ "model.layers.55.self_attn.k_proj.bias": "model-00001-of-00014.safetensors",
626
+ "model.layers.55.self_attn.k_proj.weight": "model-00004-of-00014.safetensors",
627
+ "model.layers.55.self_attn.o_proj.weight": "model-00012-of-00014.safetensors",
628
+ "model.layers.55.self_attn.q_proj.bias": "model-00012-of-00014.safetensors",
629
+ "model.layers.55.self_attn.q_proj.weight": "model-00013-of-00014.safetensors",
630
+ "model.layers.55.self_attn.v_proj.bias": "model-00001-of-00014.safetensors",
631
+ "model.layers.55.self_attn.v_proj.weight": "model-00001-of-00014.safetensors",
632
+ "model.layers.56.input_layernorm.weight": "model-00002-of-00014.safetensors",
633
+ "model.layers.56.mlp.down_proj.weight": "model-00011-of-00014.safetensors",
634
+ "model.layers.56.mlp.gate_proj.weight": "model-00007-of-00014.safetensors",
635
+ "model.layers.56.mlp.up_proj.weight": "model-00010-of-00014.safetensors",
636
+ "model.layers.56.post_attention_layernorm.weight": "model-00014-of-00014.safetensors",
637
+ "model.layers.56.self_attn.k_proj.bias": "model-00001-of-00014.safetensors",
638
+ "model.layers.56.self_attn.k_proj.weight": "model-00007-of-00014.safetensors",
639
+ "model.layers.56.self_attn.o_proj.weight": "model-00004-of-00014.safetensors",
640
+ "model.layers.56.self_attn.q_proj.bias": "model-00001-of-00014.safetensors",
641
+ "model.layers.56.self_attn.q_proj.weight": "model-00009-of-00014.safetensors",
642
+ "model.layers.56.self_attn.v_proj.bias": "model-00013-of-00014.safetensors",
643
+ "model.layers.56.self_attn.v_proj.weight": "model-00003-of-00014.safetensors",
644
+ "model.layers.57.input_layernorm.weight": "model-00010-of-00014.safetensors",
645
+ "model.layers.57.mlp.down_proj.weight": "model-00014-of-00014.safetensors",
646
+ "model.layers.57.mlp.gate_proj.weight": "model-00011-of-00014.safetensors",
647
+ "model.layers.57.mlp.up_proj.weight": "model-00002-of-00014.safetensors",
648
+ "model.layers.57.post_attention_layernorm.weight": "model-00001-of-00014.safetensors",
649
+ "model.layers.57.self_attn.k_proj.bias": "model-00005-of-00014.safetensors",
650
+ "model.layers.57.self_attn.k_proj.weight": "model-00009-of-00014.safetensors",
651
+ "model.layers.57.self_attn.o_proj.weight": "model-00004-of-00014.safetensors",
652
+ "model.layers.57.self_attn.q_proj.bias": "model-00009-of-00014.safetensors",
653
+ "model.layers.57.self_attn.q_proj.weight": "model-00009-of-00014.safetensors",
654
+ "model.layers.57.self_attn.v_proj.bias": "model-00005-of-00014.safetensors",
655
+ "model.layers.57.self_attn.v_proj.weight": "model-00007-of-00014.safetensors",
656
+ "model.layers.58.input_layernorm.weight": "model-00014-of-00014.safetensors",
657
+ "model.layers.58.mlp.down_proj.weight": "model-00006-of-00014.safetensors",
658
+ "model.layers.58.mlp.gate_proj.weight": "model-00001-of-00014.safetensors",
659
+ "model.layers.58.mlp.up_proj.weight": "model-00004-of-00014.safetensors",
660
+ "model.layers.58.post_attention_layernorm.weight": "model-00004-of-00014.safetensors",
661
+ "model.layers.58.self_attn.k_proj.bias": "model-00006-of-00014.safetensors",
662
+ "model.layers.58.self_attn.k_proj.weight": "model-00001-of-00014.safetensors",
663
+ "model.layers.58.self_attn.o_proj.weight": "model-00005-of-00014.safetensors",
664
+ "model.layers.58.self_attn.q_proj.bias": "model-00012-of-00014.safetensors",
665
+ "model.layers.58.self_attn.q_proj.weight": "model-00001-of-00014.safetensors",
666
+ "model.layers.58.self_attn.v_proj.bias": "model-00010-of-00014.safetensors",
667
+ "model.layers.58.self_attn.v_proj.weight": "model-00005-of-00014.safetensors",
668
+ "model.layers.59.input_layernorm.weight": "model-00001-of-00014.safetensors",
669
+ "model.layers.59.mlp.down_proj.weight": "model-00006-of-00014.safetensors",
670
+ "model.layers.59.mlp.gate_proj.weight": "model-00007-of-00014.safetensors",
671
+ "model.layers.59.mlp.up_proj.weight": "model-00007-of-00014.safetensors",
672
+ "model.layers.59.post_attention_layernorm.weight": "model-00004-of-00014.safetensors",
673
+ "model.layers.59.self_attn.k_proj.bias": "model-00004-of-00014.safetensors",
674
+ "model.layers.59.self_attn.k_proj.weight": "model-00011-of-00014.safetensors",
675
+ "model.layers.59.self_attn.o_proj.weight": "model-00011-of-00014.safetensors",
676
+ "model.layers.59.self_attn.q_proj.bias": "model-00010-of-00014.safetensors",
677
+ "model.layers.59.self_attn.q_proj.weight": "model-00013-of-00014.safetensors",
678
+ "model.layers.59.self_attn.v_proj.bias": "model-00010-of-00014.safetensors",
679
+ "model.layers.59.self_attn.v_proj.weight": "model-00002-of-00014.safetensors",
680
+ "model.layers.6.input_layernorm.weight": "model-00012-of-00014.safetensors",
681
+ "model.layers.6.mlp.down_proj.weight": "model-00003-of-00014.safetensors",
682
+ "model.layers.6.mlp.gate_proj.weight": "model-00011-of-00014.safetensors",
683
+ "model.layers.6.mlp.up_proj.weight": "model-00011-of-00014.safetensors",
684
+ "model.layers.6.post_attention_layernorm.weight": "model-00006-of-00014.safetensors",
685
+ "model.layers.6.self_attn.k_proj.bias": "model-00011-of-00014.safetensors",
686
+ "model.layers.6.self_attn.k_proj.weight": "model-00004-of-00014.safetensors",
687
+ "model.layers.6.self_attn.o_proj.weight": "model-00014-of-00014.safetensors",
688
+ "model.layers.6.self_attn.q_proj.bias": "model-00004-of-00014.safetensors",
689
+ "model.layers.6.self_attn.q_proj.weight": "model-00014-of-00014.safetensors",
690
+ "model.layers.6.self_attn.v_proj.bias": "model-00001-of-00014.safetensors",
691
+ "model.layers.6.self_attn.v_proj.weight": "model-00001-of-00014.safetensors",
692
+ "model.layers.60.input_layernorm.weight": "model-00003-of-00014.safetensors",
693
+ "model.layers.60.mlp.down_proj.weight": "model-00005-of-00014.safetensors",
694
+ "model.layers.60.mlp.gate_proj.weight": "model-00006-of-00014.safetensors",
695
+ "model.layers.60.mlp.up_proj.weight": "model-00005-of-00014.safetensors",
696
+ "model.layers.60.post_attention_layernorm.weight": "model-00009-of-00014.safetensors",
697
+ "model.layers.60.self_attn.k_proj.bias": "model-00005-of-00014.safetensors",
698
+ "model.layers.60.self_attn.k_proj.weight": "model-00011-of-00014.safetensors",
699
+ "model.layers.60.self_attn.o_proj.weight": "model-00008-of-00014.safetensors",
700
+ "model.layers.60.self_attn.q_proj.bias": "model-00010-of-00014.safetensors",
701
+ "model.layers.60.self_attn.q_proj.weight": "model-00001-of-00014.safetensors",
702
+ "model.layers.60.self_attn.v_proj.bias": "model-00009-of-00014.safetensors",
703
+ "model.layers.60.self_attn.v_proj.weight": "model-00003-of-00014.safetensors",
704
+ "model.layers.61.input_layernorm.weight": "model-00002-of-00014.safetensors",
705
+ "model.layers.61.mlp.down_proj.weight": "model-00009-of-00014.safetensors",
706
+ "model.layers.61.mlp.gate_proj.weight": "model-00005-of-00014.safetensors",
707
+ "model.layers.61.mlp.up_proj.weight": "model-00011-of-00014.safetensors",
708
+ "model.layers.61.post_attention_layernorm.weight": "model-00014-of-00014.safetensors",
709
+ "model.layers.61.self_attn.k_proj.bias": "model-00012-of-00014.safetensors",
710
+ "model.layers.61.self_attn.k_proj.weight": "model-00001-of-00014.safetensors",
711
+ "model.layers.61.self_attn.o_proj.weight": "model-00009-of-00014.safetensors",
712
+ "model.layers.61.self_attn.q_proj.bias": "model-00005-of-00014.safetensors",
713
+ "model.layers.61.self_attn.q_proj.weight": "model-00009-of-00014.safetensors",
714
+ "model.layers.61.self_attn.v_proj.bias": "model-00004-of-00014.safetensors",
715
+ "model.layers.61.self_attn.v_proj.weight": "model-00013-of-00014.safetensors",
716
+ "model.layers.62.input_layernorm.weight": "model-00013-of-00014.safetensors",
717
+ "model.layers.62.mlp.down_proj.weight": "model-00007-of-00014.safetensors",
718
+ "model.layers.62.mlp.gate_proj.weight": "model-00012-of-00014.safetensors",
719
+ "model.layers.62.mlp.up_proj.weight": "model-00011-of-00014.safetensors",
720
+ "model.layers.62.post_attention_layernorm.weight": "model-00004-of-00014.safetensors",
721
+ "model.layers.62.self_attn.k_proj.bias": "model-00004-of-00014.safetensors",
722
+ "model.layers.62.self_attn.k_proj.weight": "model-00004-of-00014.safetensors",
723
+ "model.layers.62.self_attn.o_proj.weight": "model-00007-of-00014.safetensors",
724
+ "model.layers.62.self_attn.q_proj.bias": "model-00004-of-00014.safetensors",
725
+ "model.layers.62.self_attn.q_proj.weight": "model-00002-of-00014.safetensors",
726
+ "model.layers.62.self_attn.v_proj.bias": "model-00007-of-00014.safetensors",
727
+ "model.layers.62.self_attn.v_proj.weight": "model-00006-of-00014.safetensors",
728
+ "model.layers.63.input_layernorm.weight": "model-00008-of-00014.safetensors",
729
+ "model.layers.63.mlp.down_proj.weight": "model-00007-of-00014.safetensors",
730
+ "model.layers.63.mlp.gate_proj.weight": "model-00013-of-00014.safetensors",
731
+ "model.layers.63.mlp.up_proj.weight": "model-00008-of-00014.safetensors",
732
+ "model.layers.63.post_attention_layernorm.weight": "model-00006-of-00014.safetensors",
733
+ "model.layers.63.self_attn.k_proj.bias": "model-00005-of-00014.safetensors",
734
+ "model.layers.63.self_attn.k_proj.weight": "model-00005-of-00014.safetensors",
735
+ "model.layers.63.self_attn.o_proj.weight": "model-00001-of-00014.safetensors",
736
+ "model.layers.63.self_attn.q_proj.bias": "model-00007-of-00014.safetensors",
737
+ "model.layers.63.self_attn.q_proj.weight": "model-00007-of-00014.safetensors",
738
+ "model.layers.63.self_attn.v_proj.bias": "model-00010-of-00014.safetensors",
739
+ "model.layers.63.self_attn.v_proj.weight": "model-00001-of-00014.safetensors",
740
+ "model.layers.7.input_layernorm.weight": "model-00001-of-00014.safetensors",
741
+ "model.layers.7.mlp.down_proj.weight": "model-00011-of-00014.safetensors",
742
+ "model.layers.7.mlp.gate_proj.weight": "model-00010-of-00014.safetensors",
743
+ "model.layers.7.mlp.up_proj.weight": "model-00006-of-00014.safetensors",
744
+ "model.layers.7.post_attention_layernorm.weight": "model-00009-of-00014.safetensors",
745
+ "model.layers.7.self_attn.k_proj.bias": "model-00007-of-00014.safetensors",
746
+ "model.layers.7.self_attn.k_proj.weight": "model-00006-of-00014.safetensors",
747
+ "model.layers.7.self_attn.o_proj.weight": "model-00013-of-00014.safetensors",
748
+ "model.layers.7.self_attn.q_proj.bias": "model-00001-of-00014.safetensors",
749
+ "model.layers.7.self_attn.q_proj.weight": "model-00005-of-00014.safetensors",
750
+ "model.layers.7.self_attn.v_proj.bias": "model-00007-of-00014.safetensors",
751
+ "model.layers.7.self_attn.v_proj.weight": "model-00010-of-00014.safetensors",
752
+ "model.layers.8.input_layernorm.weight": "model-00001-of-00014.safetensors",
753
+ "model.layers.8.mlp.down_proj.weight": "model-00014-of-00014.safetensors",
754
+ "model.layers.8.mlp.gate_proj.weight": "model-00005-of-00014.safetensors",
755
+ "model.layers.8.mlp.up_proj.weight": "model-00010-of-00014.safetensors",
756
+ "model.layers.8.post_attention_layernorm.weight": "model-00011-of-00014.safetensors",
757
+ "model.layers.8.self_attn.k_proj.bias": "model-00001-of-00014.safetensors",
758
+ "model.layers.8.self_attn.k_proj.weight": "model-00014-of-00014.safetensors",
759
+ "model.layers.8.self_attn.o_proj.weight": "model-00006-of-00014.safetensors",
760
+ "model.layers.8.self_attn.q_proj.bias": "model-00007-of-00014.safetensors",
761
+ "model.layers.8.self_attn.q_proj.weight": "model-00003-of-00014.safetensors",
762
+ "model.layers.8.self_attn.v_proj.bias": "model-00004-of-00014.safetensors",
763
+ "model.layers.8.self_attn.v_proj.weight": "model-00013-of-00014.safetensors",
764
+ "model.layers.9.input_layernorm.weight": "model-00012-of-00014.safetensors",
765
+ "model.layers.9.mlp.down_proj.weight": "model-00012-of-00014.safetensors",
766
+ "model.layers.9.mlp.gate_proj.weight": "model-00011-of-00014.safetensors",
767
+ "model.layers.9.mlp.up_proj.weight": "model-00003-of-00014.safetensors",
768
+ "model.layers.9.post_attention_layernorm.weight": "model-00007-of-00014.safetensors",
769
+ "model.layers.9.self_attn.k_proj.bias": "model-00011-of-00014.safetensors",
770
+ "model.layers.9.self_attn.k_proj.weight": "model-00009-of-00014.safetensors",
771
+ "model.layers.9.self_attn.o_proj.weight": "model-00013-of-00014.safetensors",
772
+ "model.layers.9.self_attn.q_proj.bias": "model-00010-of-00014.safetensors",
773
+ "model.layers.9.self_attn.q_proj.weight": "model-00009-of-00014.safetensors",
774
+ "model.layers.9.self_attn.v_proj.bias": "model-00011-of-00014.safetensors",
775
+ "model.layers.9.self_attn.v_proj.weight": "model-00012-of-00014.safetensors",
776
+ "model.norm.weight": "model-00003-of-00014.safetensors",
777
+ "visual.blocks.0.attn.proj.bias": "model-00004-of-00014.safetensors",
778
+ "visual.blocks.0.attn.proj.weight": "model-00001-of-00014.safetensors",
779
+ "visual.blocks.0.attn.qkv.bias": "model-00004-of-00014.safetensors",
780
+ "visual.blocks.0.attn.qkv.weight": "model-00003-of-00014.safetensors",
781
+ "visual.blocks.0.mlp.down_proj.bias": "model-00013-of-00014.safetensors",
782
+ "visual.blocks.0.mlp.down_proj.weight": "model-00010-of-00014.safetensors",
783
+ "visual.blocks.0.mlp.gate_proj.bias": "model-00005-of-00014.safetensors",
784
+ "visual.blocks.0.mlp.gate_proj.weight": "model-00001-of-00014.safetensors",
785
+ "visual.blocks.0.mlp.up_proj.bias": "model-00002-of-00014.safetensors",
786
+ "visual.blocks.0.mlp.up_proj.weight": "model-00012-of-00014.safetensors",
787
+ "visual.blocks.0.norm1.weight": "model-00009-of-00014.safetensors",
788
+ "visual.blocks.0.norm2.weight": "model-00001-of-00014.safetensors",
789
+ "visual.blocks.1.attn.proj.bias": "model-00001-of-00014.safetensors",
790
+ "visual.blocks.1.attn.proj.weight": "model-00001-of-00014.safetensors",
791
+ "visual.blocks.1.attn.qkv.bias": "model-00003-of-00014.safetensors",
792
+ "visual.blocks.1.attn.qkv.weight": "model-00014-of-00014.safetensors",
793
+ "visual.blocks.1.mlp.down_proj.bias": "model-00002-of-00014.safetensors",
794
+ "visual.blocks.1.mlp.down_proj.weight": "model-00014-of-00014.safetensors",
795
+ "visual.blocks.1.mlp.gate_proj.bias": "model-00010-of-00014.safetensors",
796
+ "visual.blocks.1.mlp.gate_proj.weight": "model-00001-of-00014.safetensors",
797
+ "visual.blocks.1.mlp.up_proj.bias": "model-00014-of-00014.safetensors",
798
+ "visual.blocks.1.mlp.up_proj.weight": "model-00003-of-00014.safetensors",
799
+ "visual.blocks.1.norm1.weight": "model-00010-of-00014.safetensors",
800
+ "visual.blocks.1.norm2.weight": "model-00004-of-00014.safetensors",
801
+ "visual.blocks.10.attn.proj.bias": "model-00014-of-00014.safetensors",
802
+ "visual.blocks.10.attn.proj.weight": "model-00001-of-00014.safetensors",
803
+ "visual.blocks.10.attn.qkv.bias": "model-00010-of-00014.safetensors",
804
+ "visual.blocks.10.attn.qkv.weight": "model-00013-of-00014.safetensors",
805
+ "visual.blocks.10.mlp.down_proj.bias": "model-00012-of-00014.safetensors",
806
+ "visual.blocks.10.mlp.down_proj.weight": "model-00014-of-00014.safetensors",
807
+ "visual.blocks.10.mlp.gate_proj.bias": "model-00014-of-00014.safetensors",
808
+ "visual.blocks.10.mlp.gate_proj.weight": "model-00010-of-00014.safetensors",
809
+ "visual.blocks.10.mlp.up_proj.bias": "model-00001-of-00014.safetensors",
810
+ "visual.blocks.10.mlp.up_proj.weight": "model-00013-of-00014.safetensors",
811
+ "visual.blocks.10.norm1.weight": "model-00010-of-00014.safetensors",
812
+ "visual.blocks.10.norm2.weight": "model-00014-of-00014.safetensors",
813
+ "visual.blocks.11.attn.proj.bias": "model-00006-of-00014.safetensors",
814
+ "visual.blocks.11.attn.proj.weight": "model-00011-of-00014.safetensors",
815
+ "visual.blocks.11.attn.qkv.bias": "model-00006-of-00014.safetensors",
816
+ "visual.blocks.11.attn.qkv.weight": "model-00001-of-00014.safetensors",
817
+ "visual.blocks.11.mlp.down_proj.bias": "model-00009-of-00014.safetensors",
818
+ "visual.blocks.11.mlp.down_proj.weight": "model-00006-of-00014.safetensors",
819
+ "visual.blocks.11.mlp.gate_proj.bias": "model-00011-of-00014.safetensors",
820
+ "visual.blocks.11.mlp.gate_proj.weight": "model-00006-of-00014.safetensors",
821
+ "visual.blocks.11.mlp.up_proj.bias": "model-00006-of-00014.safetensors",
822
+ "visual.blocks.11.mlp.up_proj.weight": "model-00003-of-00014.safetensors",
823
+ "visual.blocks.11.norm1.weight": "model-00003-of-00014.safetensors",
824
+ "visual.blocks.11.norm2.weight": "model-00010-of-00014.safetensors",
825
+ "visual.blocks.12.attn.proj.bias": "model-00014-of-00014.safetensors",
826
+ "visual.blocks.12.attn.proj.weight": "model-00013-of-00014.safetensors",
827
+ "visual.blocks.12.attn.qkv.bias": "model-00003-of-00014.safetensors",
828
+ "visual.blocks.12.attn.qkv.weight": "model-00010-of-00014.safetensors",
829
+ "visual.blocks.12.mlp.down_proj.bias": "model-00001-of-00014.safetensors",
830
+ "visual.blocks.12.mlp.down_proj.weight": "model-00007-of-00014.safetensors",
831
+ "visual.blocks.12.mlp.gate_proj.bias": "model-00005-of-00014.safetensors",
832
+ "visual.blocks.12.mlp.gate_proj.weight": "model-00008-of-00014.safetensors",
833
+ "visual.blocks.12.mlp.up_proj.bias": "model-00001-of-00014.safetensors",
834
+ "visual.blocks.12.mlp.up_proj.weight": "model-00010-of-00014.safetensors",
835
+ "visual.blocks.12.norm1.weight": "model-00013-of-00014.safetensors",
836
+ "visual.blocks.12.norm2.weight": "model-00003-of-00014.safetensors",
837
+ "visual.blocks.13.attn.proj.bias": "model-00012-of-00014.safetensors",
838
+ "visual.blocks.13.attn.proj.weight": "model-00003-of-00014.safetensors",
839
+ "visual.blocks.13.attn.qkv.bias": "model-00008-of-00014.safetensors",
840
+ "visual.blocks.13.attn.qkv.weight": "model-00003-of-00014.safetensors",
841
+ "visual.blocks.13.mlp.down_proj.bias": "model-00014-of-00014.safetensors",
842
+ "visual.blocks.13.mlp.down_proj.weight": "model-00010-of-00014.safetensors",
843
+ "visual.blocks.13.mlp.gate_proj.bias": "model-00012-of-00014.safetensors",
844
+ "visual.blocks.13.mlp.gate_proj.weight": "model-00009-of-00014.safetensors",
845
+ "visual.blocks.13.mlp.up_proj.bias": "model-00013-of-00014.safetensors",
846
+ "visual.blocks.13.mlp.up_proj.weight": "model-00001-of-00014.safetensors",
847
+ "visual.blocks.13.norm1.weight": "model-00003-of-00014.safetensors",
848
+ "visual.blocks.13.norm2.weight": "model-00014-of-00014.safetensors",
849
+ "visual.blocks.14.attn.proj.bias": "model-00004-of-00014.safetensors",
850
+ "visual.blocks.14.attn.proj.weight": "model-00013-of-00014.safetensors",
851
+ "visual.blocks.14.attn.qkv.bias": "model-00009-of-00014.safetensors",
852
+ "visual.blocks.14.attn.qkv.weight": "model-00001-of-00014.safetensors",
853
+ "visual.blocks.14.mlp.down_proj.bias": "model-00001-of-00014.safetensors",
854
+ "visual.blocks.14.mlp.down_proj.weight": "model-00010-of-00014.safetensors",
855
+ "visual.blocks.14.mlp.gate_proj.bias": "model-00009-of-00014.safetensors",
856
+ "visual.blocks.14.mlp.gate_proj.weight": "model-00013-of-00014.safetensors",
857
+ "visual.blocks.14.mlp.up_proj.bias": "model-00001-of-00014.safetensors",
858
+ "visual.blocks.14.mlp.up_proj.weight": "model-00012-of-00014.safetensors",
859
+ "visual.blocks.14.norm1.weight": "model-00003-of-00014.safetensors",
860
+ "visual.blocks.14.norm2.weight": "model-00011-of-00014.safetensors",
861
+ "visual.blocks.15.attn.proj.bias": "model-00010-of-00014.safetensors",
862
+ "visual.blocks.15.attn.proj.weight": "model-00009-of-00014.safetensors",
863
+ "visual.blocks.15.attn.qkv.bias": "model-00006-of-00014.safetensors",
864
+ "visual.blocks.15.attn.qkv.weight": "model-00009-of-00014.safetensors",
865
+ "visual.blocks.15.mlp.down_proj.bias": "model-00003-of-00014.safetensors",
866
+ "visual.blocks.15.mlp.down_proj.weight": "model-00009-of-00014.safetensors",
867
+ "visual.blocks.15.mlp.gate_proj.bias": "model-00005-of-00014.safetensors",
868
+ "visual.blocks.15.mlp.gate_proj.weight": "model-00002-of-00014.safetensors",
869
+ "visual.blocks.15.mlp.up_proj.bias": "model-00012-of-00014.safetensors",
870
+ "visual.blocks.15.mlp.up_proj.weight": "model-00012-of-00014.safetensors",
871
+ "visual.blocks.15.norm1.weight": "model-00010-of-00014.safetensors",
872
+ "visual.blocks.15.norm2.weight": "model-00012-of-00014.safetensors",
873
+ "visual.blocks.16.attn.proj.bias": "model-00001-of-00014.safetensors",
874
+ "visual.blocks.16.attn.proj.weight": "model-00012-of-00014.safetensors",
875
+ "visual.blocks.16.attn.qkv.bias": "model-00009-of-00014.safetensors",
876
+ "visual.blocks.16.attn.qkv.weight": "model-00011-of-00014.safetensors",
877
+ "visual.blocks.16.mlp.down_proj.bias": "model-00004-of-00014.safetensors",
878
+ "visual.blocks.16.mlp.down_proj.weight": "model-00005-of-00014.safetensors",
879
+ "visual.blocks.16.mlp.gate_proj.bias": "model-00013-of-00014.safetensors",
880
+ "visual.blocks.16.mlp.gate_proj.weight": "model-00012-of-00014.safetensors",
881
+ "visual.blocks.16.mlp.up_proj.bias": "model-00007-of-00014.safetensors",
882
+ "visual.blocks.16.mlp.up_proj.weight": "model-00004-of-00014.safetensors",
883
+ "visual.blocks.16.norm1.weight": "model-00001-of-00014.safetensors",
884
+ "visual.blocks.16.norm2.weight": "model-00014-of-00014.safetensors",
885
+ "visual.blocks.17.attn.proj.bias": "model-00004-of-00014.safetensors",
886
+ "visual.blocks.17.attn.proj.weight": "model-00004-of-00014.safetensors",
887
+ "visual.blocks.17.attn.qkv.bias": "model-00013-of-00014.safetensors",
888
+ "visual.blocks.17.attn.qkv.weight": "model-00012-of-00014.safetensors",
889
+ "visual.blocks.17.mlp.down_proj.bias": "model-00003-of-00014.safetensors",
890
+ "visual.blocks.17.mlp.down_proj.weight": "model-00010-of-00014.safetensors",
891
+ "visual.blocks.17.mlp.gate_proj.bias": "model-00009-of-00014.safetensors",
892
+ "visual.blocks.17.mlp.gate_proj.weight": "model-00007-of-00014.safetensors",
893
+ "visual.blocks.17.mlp.up_proj.bias": "model-00010-of-00014.safetensors",
894
+ "visual.blocks.17.mlp.up_proj.weight": "model-00004-of-00014.safetensors",
895
+ "visual.blocks.17.norm1.weight": "model-00001-of-00014.safetensors",
896
+ "visual.blocks.17.norm2.weight": "model-00007-of-00014.safetensors",
897
+ "visual.blocks.18.attn.proj.bias": "model-00009-of-00014.safetensors",
898
+ "visual.blocks.18.attn.proj.weight": "model-00003-of-00014.safetensors",
899
+ "visual.blocks.18.attn.qkv.bias": "model-00011-of-00014.safetensors",
900
+ "visual.blocks.18.attn.qkv.weight": "model-00012-of-00014.safetensors",
901
+ "visual.blocks.18.mlp.down_proj.bias": "model-00006-of-00014.safetensors",
902
+ "visual.blocks.18.mlp.down_proj.weight": "model-00013-of-00014.safetensors",
903
+ "visual.blocks.18.mlp.gate_proj.bias": "model-00004-of-00014.safetensors",
904
+ "visual.blocks.18.mlp.gate_proj.weight": "model-00014-of-00014.safetensors",
905
+ "visual.blocks.18.mlp.up_proj.bias": "model-00008-of-00014.safetensors",
906
+ "visual.blocks.18.mlp.up_proj.weight": "model-00003-of-00014.safetensors",
907
+ "visual.blocks.18.norm1.weight": "model-00014-of-00014.safetensors",
908
+ "visual.blocks.18.norm2.weight": "model-00003-of-00014.safetensors",
909
+ "visual.blocks.19.attn.proj.bias": "model-00003-of-00014.safetensors",
910
+ "visual.blocks.19.attn.proj.weight": "model-00001-of-00014.safetensors",
911
+ "visual.blocks.19.attn.qkv.bias": "model-00014-of-00014.safetensors",
912
+ "visual.blocks.19.attn.qkv.weight": "model-00004-of-00014.safetensors",
913
+ "visual.blocks.19.mlp.down_proj.bias": "model-00010-of-00014.safetensors",
914
+ "visual.blocks.19.mlp.down_proj.weight": "model-00009-of-00014.safetensors",
915
+ "visual.blocks.19.mlp.gate_proj.bias": "model-00010-of-00014.safetensors",
916
+ "visual.blocks.19.mlp.gate_proj.weight": "model-00003-of-00014.safetensors",
917
+ "visual.blocks.19.mlp.up_proj.bias": "model-00001-of-00014.safetensors",
918
+ "visual.blocks.19.mlp.up_proj.weight": "model-00014-of-00014.safetensors",
919
+ "visual.blocks.19.norm1.weight": "model-00006-of-00014.safetensors",
920
+ "visual.blocks.19.norm2.weight": "model-00005-of-00014.safetensors",
921
+ "visual.blocks.2.attn.proj.bias": "model-00014-of-00014.safetensors",
922
+ "visual.blocks.2.attn.proj.weight": "model-00004-of-00014.safetensors",
923
+ "visual.blocks.2.attn.qkv.bias": "model-00005-of-00014.safetensors",
924
+ "visual.blocks.2.attn.qkv.weight": "model-00007-of-00014.safetensors",
925
+ "visual.blocks.2.mlp.down_proj.bias": "model-00010-of-00014.safetensors",
926
+ "visual.blocks.2.mlp.down_proj.weight": "model-00009-of-00014.safetensors",
927
+ "visual.blocks.2.mlp.gate_proj.bias": "model-00014-of-00014.safetensors",
928
+ "visual.blocks.2.mlp.gate_proj.weight": "model-00005-of-00014.safetensors",
929
+ "visual.blocks.2.mlp.up_proj.bias": "model-00001-of-00014.safetensors",
930
+ "visual.blocks.2.mlp.up_proj.weight": "model-00012-of-00014.safetensors",
931
+ "visual.blocks.2.norm1.weight": "model-00008-of-00014.safetensors",
932
+ "visual.blocks.2.norm2.weight": "model-00010-of-00014.safetensors",
933
+ "visual.blocks.20.attn.proj.bias": "model-00010-of-00014.safetensors",
934
+ "visual.blocks.20.attn.proj.weight": "model-00001-of-00014.safetensors",
935
+ "visual.blocks.20.attn.qkv.bias": "model-00010-of-00014.safetensors",
936
+ "visual.blocks.20.attn.qkv.weight": "model-00003-of-00014.safetensors",
937
+ "visual.blocks.20.mlp.down_proj.bias": "model-00012-of-00014.safetensors",
938
+ "visual.blocks.20.mlp.down_proj.weight": "model-00010-of-00014.safetensors",
939
+ "visual.blocks.20.mlp.gate_proj.bias": "model-00005-of-00014.safetensors",
940
+ "visual.blocks.20.mlp.gate_proj.weight": "model-00011-of-00014.safetensors",
941
+ "visual.blocks.20.mlp.up_proj.bias": "model-00001-of-00014.safetensors",
942
+ "visual.blocks.20.mlp.up_proj.weight": "model-00001-of-00014.safetensors",
943
+ "visual.blocks.20.norm1.weight": "model-00005-of-00014.safetensors",
944
+ "visual.blocks.20.norm2.weight": "model-00006-of-00014.safetensors",
945
+ "visual.blocks.21.attn.proj.bias": "model-00009-of-00014.safetensors",
946
+ "visual.blocks.21.attn.proj.weight": "model-00001-of-00014.safetensors",
947
+ "visual.blocks.21.attn.qkv.bias": "model-00007-of-00014.safetensors",
948
+ "visual.blocks.21.attn.qkv.weight": "model-00007-of-00014.safetensors",
949
+ "visual.blocks.21.mlp.down_proj.bias": "model-00013-of-00014.safetensors",
950
+ "visual.blocks.21.mlp.down_proj.weight": "model-00001-of-00014.safetensors",
951
+ "visual.blocks.21.mlp.gate_proj.bias": "model-00010-of-00014.safetensors",
952
+ "visual.blocks.21.mlp.gate_proj.weight": "model-00006-of-00014.safetensors",
953
+ "visual.blocks.21.mlp.up_proj.bias": "model-00010-of-00014.safetensors",
954
+ "visual.blocks.21.mlp.up_proj.weight": "model-00011-of-00014.safetensors",
955
+ "visual.blocks.21.norm1.weight": "model-00010-of-00014.safetensors",
956
+ "visual.blocks.21.norm2.weight": "model-00001-of-00014.safetensors",
957
+ "visual.blocks.22.attn.proj.bias": "model-00006-of-00014.safetensors",
958
+ "visual.blocks.22.attn.proj.weight": "model-00010-of-00014.safetensors",
959
+ "visual.blocks.22.attn.qkv.bias": "model-00007-of-00014.safetensors",
960
+ "visual.blocks.22.attn.qkv.weight": "model-00010-of-00014.safetensors",
961
+ "visual.blocks.22.mlp.down_proj.bias": "model-00005-of-00014.safetensors",
962
+ "visual.blocks.22.mlp.down_proj.weight": "model-00011-of-00014.safetensors",
963
+ "visual.blocks.22.mlp.gate_proj.bias": "model-00001-of-00014.safetensors",
964
+ "visual.blocks.22.mlp.gate_proj.weight": "model-00010-of-00014.safetensors",
965
+ "visual.blocks.22.mlp.up_proj.bias": "model-00014-of-00014.safetensors",
966
+ "visual.blocks.22.mlp.up_proj.weight": "model-00010-of-00014.safetensors",
967
+ "visual.blocks.22.norm1.weight": "model-00007-of-00014.safetensors",
968
+ "visual.blocks.22.norm2.weight": "model-00001-of-00014.safetensors",
969
+ "visual.blocks.23.attn.proj.bias": "model-00010-of-00014.safetensors",
970
+ "visual.blocks.23.attn.proj.weight": "model-00013-of-00014.safetensors",
971
+ "visual.blocks.23.attn.qkv.bias": "model-00007-of-00014.safetensors",
972
+ "visual.blocks.23.attn.qkv.weight": "model-00010-of-00014.safetensors",
973
+ "visual.blocks.23.mlp.down_proj.bias": "model-00003-of-00014.safetensors",
974
+ "visual.blocks.23.mlp.down_proj.weight": "model-00004-of-00014.safetensors",
975
+ "visual.blocks.23.mlp.gate_proj.bias": "model-00007-of-00014.safetensors",
976
+ "visual.blocks.23.mlp.gate_proj.weight": "model-00007-of-00014.safetensors",
977
+ "visual.blocks.23.mlp.up_proj.bias": "model-00004-of-00014.safetensors",
978
+ "visual.blocks.23.mlp.up_proj.weight": "model-00003-of-00014.safetensors",
979
+ "visual.blocks.23.norm1.weight": "model-00011-of-00014.safetensors",
980
+ "visual.blocks.23.norm2.weight": "model-00005-of-00014.safetensors",
981
+ "visual.blocks.24.attn.proj.bias": "model-00008-of-00014.safetensors",
982
+ "visual.blocks.24.attn.proj.weight": "model-00002-of-00014.safetensors",
983
+ "visual.blocks.24.attn.qkv.bias": "model-00012-of-00014.safetensors",
984
+ "visual.blocks.24.attn.qkv.weight": "model-00007-of-00014.safetensors",
985
+ "visual.blocks.24.mlp.down_proj.bias": "model-00003-of-00014.safetensors",
986
+ "visual.blocks.24.mlp.down_proj.weight": "model-00013-of-00014.safetensors",
987
+ "visual.blocks.24.mlp.gate_proj.bias": "model-00005-of-00014.safetensors",
988
+ "visual.blocks.24.mlp.gate_proj.weight": "model-00001-of-00014.safetensors",
989
+ "visual.blocks.24.mlp.up_proj.bias": "model-00013-of-00014.safetensors",
990
+ "visual.blocks.24.mlp.up_proj.weight": "model-00009-of-00014.safetensors",
991
+ "visual.blocks.24.norm1.weight": "model-00004-of-00014.safetensors",
992
+ "visual.blocks.24.norm2.weight": "model-00012-of-00014.safetensors",
993
+ "visual.blocks.25.attn.proj.bias": "model-00005-of-00014.safetensors",
994
+ "visual.blocks.25.attn.proj.weight": "model-00010-of-00014.safetensors",
995
+ "visual.blocks.25.attn.qkv.bias": "model-00007-of-00014.safetensors",
996
+ "visual.blocks.25.attn.qkv.weight": "model-00003-of-00014.safetensors",
997
+ "visual.blocks.25.mlp.down_proj.bias": "model-00006-of-00014.safetensors",
998
+ "visual.blocks.25.mlp.down_proj.weight": "model-00010-of-00014.safetensors",
999
+ "visual.blocks.25.mlp.gate_proj.bias": "model-00002-of-00014.safetensors",
1000
+ "visual.blocks.25.mlp.gate_proj.weight": "model-00009-of-00014.safetensors",
1001
+ "visual.blocks.25.mlp.up_proj.bias": "model-00004-of-00014.safetensors",
1002
+ "visual.blocks.25.mlp.up_proj.weight": "model-00010-of-00014.safetensors",
1003
+ "visual.blocks.25.norm1.weight": "model-00014-of-00014.safetensors",
1004
+ "visual.blocks.25.norm2.weight": "model-00010-of-00014.safetensors",
1005
+ "visual.blocks.26.attn.proj.bias": "model-00010-of-00014.safetensors",
1006
+ "visual.blocks.26.attn.proj.weight": "model-00009-of-00014.safetensors",
1007
+ "visual.blocks.26.attn.qkv.bias": "model-00003-of-00014.safetensors",
1008
+ "visual.blocks.26.attn.qkv.weight": "model-00011-of-00014.safetensors",
1009
+ "visual.blocks.26.mlp.down_proj.bias": "model-00009-of-00014.safetensors",
1010
+ "visual.blocks.26.mlp.down_proj.weight": "model-00001-of-00014.safetensors",
1011
+ "visual.blocks.26.mlp.gate_proj.bias": "model-00012-of-00014.safetensors",
1012
+ "visual.blocks.26.mlp.gate_proj.weight": "model-00009-of-00014.safetensors",
1013
+ "visual.blocks.26.mlp.up_proj.bias": "model-00011-of-00014.safetensors",
1014
+ "visual.blocks.26.mlp.up_proj.weight": "model-00005-of-00014.safetensors",
1015
+ "visual.blocks.26.norm1.weight": "model-00005-of-00014.safetensors",
1016
+ "visual.blocks.26.norm2.weight": "model-00006-of-00014.safetensors",
1017
+ "visual.blocks.27.attn.proj.bias": "model-00007-of-00014.safetensors",
1018
+ "visual.blocks.27.attn.proj.weight": "model-00001-of-00014.safetensors",
1019
+ "visual.blocks.27.attn.qkv.bias": "model-00005-of-00014.safetensors",
1020
+ "visual.blocks.27.attn.qkv.weight": "model-00012-of-00014.safetensors",
1021
+ "visual.blocks.27.mlp.down_proj.bias": "model-00010-of-00014.safetensors",
1022
+ "visual.blocks.27.mlp.down_proj.weight": "model-00011-of-00014.safetensors",
1023
+ "visual.blocks.27.mlp.gate_proj.bias": "model-00001-of-00014.safetensors",
1024
+ "visual.blocks.27.mlp.gate_proj.weight": "model-00011-of-00014.safetensors",
1025
+ "visual.blocks.27.mlp.up_proj.bias": "model-00003-of-00014.safetensors",
1026
+ "visual.blocks.27.mlp.up_proj.weight": "model-00001-of-00014.safetensors",
1027
+ "visual.blocks.27.norm1.weight": "model-00004-of-00014.safetensors",
1028
+ "visual.blocks.27.norm2.weight": "model-00002-of-00014.safetensors",
1029
+ "visual.blocks.28.attn.proj.bias": "model-00006-of-00014.safetensors",
1030
+ "visual.blocks.28.attn.proj.weight": "model-00009-of-00014.safetensors",
1031
+ "visual.blocks.28.attn.qkv.bias": "model-00010-of-00014.safetensors",
1032
+ "visual.blocks.28.attn.qkv.weight": "model-00014-of-00014.safetensors",
1033
+ "visual.blocks.28.mlp.down_proj.bias": "model-00001-of-00014.safetensors",
1034
+ "visual.blocks.28.mlp.down_proj.weight": "model-00010-of-00014.safetensors",
1035
+ "visual.blocks.28.mlp.gate_proj.bias": "model-00013-of-00014.safetensors",
1036
+ "visual.blocks.28.mlp.gate_proj.weight": "model-00012-of-00014.safetensors",
1037
+ "visual.blocks.28.mlp.up_proj.bias": "model-00002-of-00014.safetensors",
1038
+ "visual.blocks.28.mlp.up_proj.weight": "model-00001-of-00014.safetensors",
1039
+ "visual.blocks.28.norm1.weight": "model-00003-of-00014.safetensors",
1040
+ "visual.blocks.28.norm2.weight": "model-00013-of-00014.safetensors",
1041
+ "visual.blocks.29.attn.proj.bias": "model-00001-of-00014.safetensors",
1042
+ "visual.blocks.29.attn.proj.weight": "model-00013-of-00014.safetensors",
1043
+ "visual.blocks.29.attn.qkv.bias": "model-00012-of-00014.safetensors",
1044
+ "visual.blocks.29.attn.qkv.weight": "model-00011-of-00014.safetensors",
1045
+ "visual.blocks.29.mlp.down_proj.bias": "model-00008-of-00014.safetensors",
1046
+ "visual.blocks.29.mlp.down_proj.weight": "model-00010-of-00014.safetensors",
1047
+ "visual.blocks.29.mlp.gate_proj.bias": "model-00007-of-00014.safetensors",
1048
+ "visual.blocks.29.mlp.gate_proj.weight": "model-00010-of-00014.safetensors",
1049
+ "visual.blocks.29.mlp.up_proj.bias": "model-00006-of-00014.safetensors",
1050
+ "visual.blocks.29.mlp.up_proj.weight": "model-00011-of-00014.safetensors",
1051
+ "visual.blocks.29.norm1.weight": "model-00013-of-00014.safetensors",
1052
+ "visual.blocks.29.norm2.weight": "model-00004-of-00014.safetensors",
1053
+ "visual.blocks.3.attn.proj.bias": "model-00002-of-00014.safetensors",
1054
+ "visual.blocks.3.attn.proj.weight": "model-00001-of-00014.safetensors",
1055
+ "visual.blocks.3.attn.qkv.bias": "model-00001-of-00014.safetensors",
1056
+ "visual.blocks.3.attn.qkv.weight": "model-00007-of-00014.safetensors",
1057
+ "visual.blocks.3.mlp.down_proj.bias": "model-00006-of-00014.safetensors",
1058
+ "visual.blocks.3.mlp.down_proj.weight": "model-00006-of-00014.safetensors",
1059
+ "visual.blocks.3.mlp.gate_proj.bias": "model-00011-of-00014.safetensors",
1060
+ "visual.blocks.3.mlp.gate_proj.weight": "model-00002-of-00014.safetensors",
1061
+ "visual.blocks.3.mlp.up_proj.bias": "model-00001-of-00014.safetensors",
1062
+ "visual.blocks.3.mlp.up_proj.weight": "model-00001-of-00014.safetensors",
1063
+ "visual.blocks.3.norm1.weight": "model-00003-of-00014.safetensors",
1064
+ "visual.blocks.3.norm2.weight": "model-00004-of-00014.safetensors",
1065
+ "visual.blocks.30.attn.proj.bias": "model-00003-of-00014.safetensors",
1066
+ "visual.blocks.30.attn.proj.weight": "model-00006-of-00014.safetensors",
1067
+ "visual.blocks.30.attn.qkv.bias": "model-00003-of-00014.safetensors",
1068
+ "visual.blocks.30.attn.qkv.weight": "model-00003-of-00014.safetensors",
1069
+ "visual.blocks.30.mlp.down_proj.bias": "model-00001-of-00014.safetensors",
1070
+ "visual.blocks.30.mlp.down_proj.weight": "model-00003-of-00014.safetensors",
1071
+ "visual.blocks.30.mlp.gate_proj.bias": "model-00005-of-00014.safetensors",
1072
+ "visual.blocks.30.mlp.gate_proj.weight": "model-00007-of-00014.safetensors",
1073
+ "visual.blocks.30.mlp.up_proj.bias": "model-00010-of-00014.safetensors",
1074
+ "visual.blocks.30.mlp.up_proj.weight": "model-00001-of-00014.safetensors",
1075
+ "visual.blocks.30.norm1.weight": "model-00002-of-00014.safetensors",
1076
+ "visual.blocks.30.norm2.weight": "model-00001-of-00014.safetensors",
1077
+ "visual.blocks.31.attn.proj.bias": "model-00013-of-00014.safetensors",
1078
+ "visual.blocks.31.attn.proj.weight": "model-00007-of-00014.safetensors",
1079
+ "visual.blocks.31.attn.qkv.bias": "model-00002-of-00014.safetensors",
1080
+ "visual.blocks.31.attn.qkv.weight": "model-00014-of-00014.safetensors",
1081
+ "visual.blocks.31.mlp.down_proj.bias": "model-00013-of-00014.safetensors",
1082
+ "visual.blocks.31.mlp.down_proj.weight": "model-00003-of-00014.safetensors",
1083
+ "visual.blocks.31.mlp.gate_proj.bias": "model-00010-of-00014.safetensors",
1084
+ "visual.blocks.31.mlp.gate_proj.weight": "model-00009-of-00014.safetensors",
1085
+ "visual.blocks.31.mlp.up_proj.bias": "model-00002-of-00014.safetensors",
1086
+ "visual.blocks.31.mlp.up_proj.weight": "model-00003-of-00014.safetensors",
1087
+ "visual.blocks.31.norm1.weight": "model-00001-of-00014.safetensors",
1088
+ "visual.blocks.31.norm2.weight": "model-00001-of-00014.safetensors",
1089
+ "visual.blocks.4.attn.proj.bias": "model-00013-of-00014.safetensors",
1090
+ "visual.blocks.4.attn.proj.weight": "model-00011-of-00014.safetensors",
1091
+ "visual.blocks.4.attn.qkv.bias": "model-00013-of-00014.safetensors",
1092
+ "visual.blocks.4.attn.qkv.weight": "model-00009-of-00014.safetensors",
1093
+ "visual.blocks.4.mlp.down_proj.bias": "model-00001-of-00014.safetensors",
1094
+ "visual.blocks.4.mlp.down_proj.weight": "model-00001-of-00014.safetensors",
1095
+ "visual.blocks.4.mlp.gate_proj.bias": "model-00012-of-00014.safetensors",
1096
+ "visual.blocks.4.mlp.gate_proj.weight": "model-00014-of-00014.safetensors",
1097
+ "visual.blocks.4.mlp.up_proj.bias": "model-00007-of-00014.safetensors",
1098
+ "visual.blocks.4.mlp.up_proj.weight": "model-00004-of-00014.safetensors",
1099
+ "visual.blocks.4.norm1.weight": "model-00005-of-00014.safetensors",
1100
+ "visual.blocks.4.norm2.weight": "model-00001-of-00014.safetensors",
1101
+ "visual.blocks.5.attn.proj.bias": "model-00003-of-00014.safetensors",
1102
+ "visual.blocks.5.attn.proj.weight": "model-00006-of-00014.safetensors",
1103
+ "visual.blocks.5.attn.qkv.bias": "model-00012-of-00014.safetensors",
1104
+ "visual.blocks.5.attn.qkv.weight": "model-00014-of-00014.safetensors",
1105
+ "visual.blocks.5.mlp.down_proj.bias": "model-00010-of-00014.safetensors",
1106
+ "visual.blocks.5.mlp.down_proj.weight": "model-00002-of-00014.safetensors",
1107
+ "visual.blocks.5.mlp.gate_proj.bias": "model-00014-of-00014.safetensors",
1108
+ "visual.blocks.5.mlp.gate_proj.weight": "model-00004-of-00014.safetensors",
1109
+ "visual.blocks.5.mlp.up_proj.bias": "model-00001-of-00014.safetensors",
1110
+ "visual.blocks.5.mlp.up_proj.weight": "model-00010-of-00014.safetensors",
1111
+ "visual.blocks.5.norm1.weight": "model-00009-of-00014.safetensors",
1112
+ "visual.blocks.5.norm2.weight": "model-00011-of-00014.safetensors",
1113
+ "visual.blocks.6.attn.proj.bias": "model-00001-of-00014.safetensors",
1114
+ "visual.blocks.6.attn.proj.weight": "model-00013-of-00014.safetensors",
1115
+ "visual.blocks.6.attn.qkv.bias": "model-00010-of-00014.safetensors",
1116
+ "visual.blocks.6.attn.qkv.weight": "model-00011-of-00014.safetensors",
1117
+ "visual.blocks.6.mlp.down_proj.bias": "model-00011-of-00014.safetensors",
1118
+ "visual.blocks.6.mlp.down_proj.weight": "model-00014-of-00014.safetensors",
1119
+ "visual.blocks.6.mlp.gate_proj.bias": "model-00011-of-00014.safetensors",
1120
+ "visual.blocks.6.mlp.gate_proj.weight": "model-00009-of-00014.safetensors",
1121
+ "visual.blocks.6.mlp.up_proj.bias": "model-00005-of-00014.safetensors",
1122
+ "visual.blocks.6.mlp.up_proj.weight": "model-00013-of-00014.safetensors",
1123
+ "visual.blocks.6.norm1.weight": "model-00003-of-00014.safetensors",
1124
+ "visual.blocks.6.norm2.weight": "model-00004-of-00014.safetensors",
1125
+ "visual.blocks.7.attn.proj.bias": "model-00011-of-00014.safetensors",
1126
+ "visual.blocks.7.attn.proj.weight": "model-00007-of-00014.safetensors",
1127
+ "visual.blocks.7.attn.qkv.bias": "model-00004-of-00014.safetensors",
1128
+ "visual.blocks.7.attn.qkv.weight": "model-00009-of-00014.safetensors",
1129
+ "visual.blocks.7.mlp.down_proj.bias": "model-00005-of-00014.safetensors",
1130
+ "visual.blocks.7.mlp.down_proj.weight": "model-00013-of-00014.safetensors",
1131
+ "visual.blocks.7.mlp.gate_proj.bias": "model-00009-of-00014.safetensors",
1132
+ "visual.blocks.7.mlp.gate_proj.weight": "model-00001-of-00014.safetensors",
1133
+ "visual.blocks.7.mlp.up_proj.bias": "model-00013-of-00014.safetensors",
1134
+ "visual.blocks.7.mlp.up_proj.weight": "model-00003-of-00014.safetensors",
1135
+ "visual.blocks.7.norm1.weight": "model-00014-of-00014.safetensors",
1136
+ "visual.blocks.7.norm2.weight": "model-00004-of-00014.safetensors",
1137
+ "visual.blocks.8.attn.proj.bias": "model-00004-of-00014.safetensors",
1138
+ "visual.blocks.8.attn.proj.weight": "model-00006-of-00014.safetensors",
1139
+ "visual.blocks.8.attn.qkv.bias": "model-00004-of-00014.safetensors",
1140
+ "visual.blocks.8.attn.qkv.weight": "model-00012-of-00014.safetensors",
1141
+ "visual.blocks.8.mlp.down_proj.bias": "model-00005-of-00014.safetensors",
1142
+ "visual.blocks.8.mlp.down_proj.weight": "model-00005-of-00014.safetensors",
1143
+ "visual.blocks.8.mlp.gate_proj.bias": "model-00003-of-00014.safetensors",
1144
+ "visual.blocks.8.mlp.gate_proj.weight": "model-00013-of-00014.safetensors",
1145
+ "visual.blocks.8.mlp.up_proj.bias": "model-00004-of-00014.safetensors",
1146
+ "visual.blocks.8.mlp.up_proj.weight": "model-00004-of-00014.safetensors",
1147
+ "visual.blocks.8.norm1.weight": "model-00012-of-00014.safetensors",
1148
+ "visual.blocks.8.norm2.weight": "model-00005-of-00014.safetensors",
1149
+ "visual.blocks.9.attn.proj.bias": "model-00010-of-00014.safetensors",
1150
+ "visual.blocks.9.attn.proj.weight": "model-00012-of-00014.safetensors",
1151
+ "visual.blocks.9.attn.qkv.bias": "model-00003-of-00014.safetensors",
1152
+ "visual.blocks.9.attn.qkv.weight": "model-00009-of-00014.safetensors",
1153
+ "visual.blocks.9.mlp.down_proj.bias": "model-00010-of-00014.safetensors",
1154
+ "visual.blocks.9.mlp.down_proj.weight": "model-00006-of-00014.safetensors",
1155
+ "visual.blocks.9.mlp.gate_proj.bias": "model-00001-of-00014.safetensors",
1156
+ "visual.blocks.9.mlp.gate_proj.weight": "model-00007-of-00014.safetensors",
1157
+ "visual.blocks.9.mlp.up_proj.bias": "model-00003-of-00014.safetensors",
1158
+ "visual.blocks.9.mlp.up_proj.weight": "model-00010-of-00014.safetensors",
1159
+ "visual.blocks.9.norm1.weight": "model-00014-of-00014.safetensors",
1160
+ "visual.blocks.9.norm2.weight": "model-00013-of-00014.safetensors",
1161
+ "visual.merger.ln_q.weight": "model-00012-of-00014.safetensors",
1162
+ "visual.merger.mlp.0.bias": "model-00011-of-00014.safetensors",
1163
+ "visual.merger.mlp.0.weight": "model-00013-of-00014.safetensors",
1164
+ "visual.merger.mlp.2.bias": "model-00012-of-00014.safetensors",
1165
+ "visual.merger.mlp.2.weight": "model-00006-of-00014.safetensors",
1166
+ "visual.patch_embed.proj.weight": "model-00005-of-00014.safetensors"
1167
+ }
1168
+ }
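
The `weight_map` above is the standard safetensors sharding index: each parameter name points at the shard file that stores it, so a loader never has to scan all fourteen shards. A minimal lookup sketch (a hypothetical helper, not part of this repository; it only reads the index shown above):

```python
import json

def shard_for(param_name: str, index_path: str = "model.safetensors.index.json") -> str:
    """Return the shard file that stores `param_name`, according to the index above."""
    with open(index_path) as f:
        index = json.load(f)
    return index["weight_map"][param_name]

# Example: shard_for("visual.merger.ln_q.weight") -> "model-00012-of-00014.safetensors"
```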
modeling_opencua.py ADDED
@@ -0,0 +1,449 @@
1
+ # ------------------------------------------------------------------------------
2
+ # OpenCUA‑7B Model
3
+ #
4
+ # This implementation is adapted from the Qwen2.5-VL reference code in
5
+ # Hugging Face Transformers v4.53.0:
6
+ # https://github.com/huggingface/transformers/tree/v4.53.0/src/transformers/models/qwen2_5_vl
7
+ #
8
+ # Checkpoint used for weight initialisation:
9
+ # "Qwen/Qwen2.5-VL-32B-Instruct" – https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct
10
+ #
11
+ # Key modifications
12
+ # -----------------
13
+ # • Replaced Multimodal Rotary Position Embedding (M‑RoPE) with 1‑D RoPE for
14
+ # compatibility with OpenCUA training settings.
15
+ # • Wrapped vision encoder and language model into a single
16
+ # `OpenCUAForConditionalGeneration` class.
17
+ # • Simplified weight initialisation — this file targets inference / fine‑tuning,
18
+ # not training from scratch.
19
+ #
20
+ # Copyright (c) 2025 XLANG Lab, The University of Hong Kong
21
+ #
22
+ # Permission is hereby granted, free of charge, to any person obtaining a copy
23
+ # of this software and associated documentation files (the “Software”), to deal
24
+ # in the Software without restriction, including without limitation the rights
25
+ # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
26
+ # copies of the Software, and to permit persons to whom the Software is
27
+ # furnished to do so, subject to the following conditions:
28
+ #
29
+ # The above copyright notice and this permission notice shall be included in all
30
+ # copies or substantial portions of the Software.
31
+ #
32
+ # THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
33
+ # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
34
+ # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
35
+ # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
36
+ # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
37
+ # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
38
+ # SOFTWARE.
39
+ #
40
+ # ------------------------------------------------------------------------------
41
+ # Prohibited Uses & Additional Disclaimer
42
+ # ---------------------------------------
43
+ # • The Software may **not** be used for any purpose or activity that violates
44
+ # applicable laws or regulations in any jurisdiction.
45
+ # • The authors, contributors, and copyright holders are **not responsible**
46
+ # for any illegal, unethical, or harmful use of the Software, nor for any
47
+ # direct or indirect damages resulting from such use.
48
+ # • Use of the “OpenCUA” name, logo, or trademarks does **not** imply any
49
+ # endorsement or affiliation unless a separate written permission is obtained.
50
+
51
+ import torch
52
+ import torch.nn as nn
53
+ from transformers.cache_utils import Cache
54
+ from transformers.modeling_utils import PreTrainedModel
55
+ from transformers.models.llava.modeling_llava import LlavaCausalLMOutputWithPast
56
+
57
+ from .configuration_opencua import OpenCUAConfig
58
+ from transformers.models.qwen2_5_vl.modeling_qwen2_5_vl import Qwen2_5_VisionTransformerPretrainedModel
59
+ from transformers.models.qwen2.modeling_qwen2 import Qwen2ForCausalLM
60
+
61
+
62
+ class OpenCUAPreTrainedModel(PreTrainedModel):
63
+ config_class = OpenCUAConfig
64
+ base_model_prefix = "model"
65
+ _no_split_modules = ["Qwen2_5_VisionTransformerPretrainedModel"]
66
+ _skip_keys_device_placement = "past_key_values"
67
+ _supports_flash_attn_2 = True
68
+
69
+ def _init_weights(self, module):
70
+ # important: this ported version of Llava isn't meant for training from scratch - only
71
+ # inference and fine-tuning - so the proper init weights code has been removed - the original codebase
72
+ # https://github.com/haotian-liu/LLaVA/tree/main/llava should serve for that purpose
73
+ std = (
74
+ self.config.initializer_range
75
+ if hasattr(self.config, "initializer_range")
76
+ else self.config.text_config.initializer_range
77
+ )
78
+
79
+ if hasattr(module, "class_embedding"):
80
+ module.class_embedding.data.normal_(mean=0.0, std=std)
81
+
82
+ if isinstance(module, (nn.Linear, nn.Conv2d)):
83
+ module.weight.data.normal_(mean=0.0, std=std)
84
+ if module.bias is not None:
85
+ module.bias.data.zero_()
86
+ elif isinstance(module, nn.Embedding):
87
+ module.weight.data.normal_(mean=0.0, std=std)
88
+ if module.padding_idx is not None:
89
+ module.weight.data[module.padding_idx].zero_()
90
+
91
+ @property
92
+ def _supports_sdpa(self):
93
+ """
94
+ Retrieve language_model's attribute to check whether the model supports
95
+ SDPA or not.
96
+ """
97
+ return self.language_model._supports_sdpa
98
+
99
+
100
+ class OpenCUAForConditionalGeneration(OpenCUAPreTrainedModel):
101
+
102
+ def __init__(self, config: OpenCUAConfig):
103
+ super().__init__(config)
104
+ self.vision_tower = Qwen2_5_VisionTransformerPretrainedModel(config.vision_config)
105
+ self.language_model = Qwen2ForCausalLM(config.text_config)
106
+ self.post_init()
107
+
108
+ def get_input_embeddings(self):
109
+ return self.language_model.get_input_embeddings()
110
+
111
+ def set_input_embeddings(self, value):
112
+ self.language_model.set_input_embeddings(value)
113
+
114
+ def get_output_embeddings(self):
115
+ return self.language_model.get_output_embeddings()
116
+
117
+ def set_output_embeddings(self, new_embeddings):
118
+ self.language_model.set_output_embeddings(new_embeddings)
119
+
120
+ def set_decoder(self, decoder):
121
+ self.language_model.set_decoder(decoder)
122
+
123
+ def get_decoder(self):
124
+ return self.language_model.get_decoder()
125
+
126
+ def tie_weights(self):
127
+ return self.language_model.tie_weights()
128
+
129
+ def resize_token_embeddings(self, new_num_tokens: int | None = None, pad_to_multiple_of=None) -> nn.Embedding:
130
+ model_embeds = self.language_model.resize_token_embeddings(
131
+ new_num_tokens, pad_to_multiple_of)
132
+ # update vocab size
133
+ self.config.text_config.vocab_size = model_embeds.num_embeddings
134
+ self.vocab_size = model_embeds.num_embeddings
135
+ return model_embeds
136
+
137
+ def _merge_input_ids_with_image_features(
138
+ self,
139
+ image_features: torch.Tensor,
140
+ feature_lengths: list[int],
141
+ inputs_embeds: torch.Tensor,
142
+ input_ids: torch.Tensor,
143
+ attention_mask: torch.Tensor,
144
+ labels: torch.Tensor | None = None):
145
+ """
146
+ Args:
147
+ image_features (:obj:`torch.Tensor` of shape :obj:`(num_image_tokens, embed_dim)`):
148
+ The image features to merge with the input embeddings.
149
+ feature_lengths: the length of image feature.
150
+ inputs_embeds (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length, embed_dim)`):
151
+ The input embeddings.
152
+ input_ids (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`):
153
+ The input ids.
154
+ attention_mask (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`):
155
+ The attention mask.
156
+ labels (:obj:`torch.Tensor` of shape :obj:`(batch_size, sequence_length)`, *optional*):
157
+ The labels.
158
+ """
159
+
160
+ image_token_index: int = self.config.media_placeholder_token_id
161
+ pad_token_id: int = self.config.pad_token_id
162
+ ignore_index: int = self.config.ignore_index
163
+
164
+ _, embed_dim = image_features.shape
165
+
166
+ batch_size, sequence_length = input_ids.shape
167
+ left_padding = not torch.sum(
168
+ input_ids[:, -1] == torch.tensor(pad_token_id))
169
+
170
+ # 1. Create a mask to know where special image tokens are
171
+ _token_occupation_table = torch.ones_like(input_ids.flatten())
172
+ _token_occupation_table[input_ids.flatten() == image_token_index] = \
173
+ torch.tensor(feature_lengths,
174
+ dtype=torch.long, device=input_ids.device)
175
+ _token_occupation_table = _token_occupation_table.reshape(
176
+ input_ids.shape)
177
+
178
+ max_embed_dim = _token_occupation_table.sum(-1).max().item()
179
+ assert max_embed_dim >= sequence_length, (
180
+ f"The maximum embedding dimension ({max_embed_dim}) is less than the sequence length ({sequence_length})"
181
+ )
182
+ batch_indices, non_image_indices = torch.where(input_ids != image_token_index)
183
+
184
+ # 2. Compute the positions where text should be written
185
+ # Calculate new positions for text tokens in merged image-text sequence.
186
+ new_token_positions = torch.cumsum(_token_occupation_table, -1) - 1
187
+ nb_image_pad = max_embed_dim - 1 - new_token_positions[:, -1]
188
+ if left_padding:
189
+ new_token_positions += nb_image_pad[:, None] # offset for left padding
190
+ text_to_overwrite = new_token_positions[batch_indices, non_image_indices]
191
+
192
+ # 3. Create the full embedding, already padded to the maximum position
193
+ final_embedding = torch.zeros(
194
+ batch_size, max_embed_dim, embed_dim, dtype=inputs_embeds.dtype, device=inputs_embeds.device
195
+ )
196
+ final_attention_mask = torch.zeros(
197
+ batch_size, max_embed_dim, dtype=attention_mask.dtype, device=inputs_embeds.device
198
+ )
199
+ if labels is not None:
200
+ final_labels = torch.full(
201
+ (batch_size, max_embed_dim), ignore_index, dtype=input_ids.dtype, device=input_ids.device
202
+ )
203
+ # In case the Vision model or the Language model has been offloaded to CPU, we need to manually
204
+ # set the corresponding tensors into their correct target device.
205
+ target_device = inputs_embeds.device
206
+ batch_indices, non_image_indices, text_to_overwrite = (
207
+ batch_indices.to(target_device),
208
+ non_image_indices.to(target_device),
209
+ text_to_overwrite.to(target_device),
210
+ )
211
+ attention_mask = attention_mask.to(target_device)
212
+
213
+ # 4. Fill the embeddings based on the mask.
214
+ final_embedding[batch_indices, text_to_overwrite] = inputs_embeds[batch_indices, non_image_indices]
215
+ final_attention_mask[batch_indices, text_to_overwrite] = attention_mask[batch_indices, non_image_indices]
216
+ if labels is not None:
217
+ final_labels[batch_indices, text_to_overwrite] = labels[batch_indices, non_image_indices]
218
+
219
+ # 5. Fill the embeddings corresponding to the images. Anything that is not `text_positions` needs filling (#29835)
220
+ image_to_overwrite = torch.full(
221
+ (batch_size, max_embed_dim), True, dtype=torch.bool, device=inputs_embeds.device
222
+ )
223
+ image_to_overwrite[batch_indices, text_to_overwrite] = False
224
+ image_to_overwrite &= image_to_overwrite.cumsum(-1) - 1 >= nb_image_pad[:, None].to(target_device)
225
+
226
+ if image_to_overwrite.sum() != image_features.shape[:-1].numel():
227
+ raise ValueError(
228
+ f"The inputs provided to the model are wrong. The number of image tokens is {image_to_overwrite.sum()} while"
229
+ f" the number of image features given to the model is {image_features.shape[:-1].numel()}. "
230
+ "This prevents correct indexing and breaks batch generation."
231
+ )
232
+
233
+ final_embedding[image_to_overwrite] = image_features.contiguous().reshape(-1, embed_dim).to(target_device)
234
+ final_attention_mask |= image_to_overwrite
235
+ position_ids = (final_attention_mask.cumsum(-1) - 1).masked_fill_((final_attention_mask == 0), 1)
236
+
237
+ # 6. Mask out the embedding at padding positions, as we later use the past_key_value value to determine the non-attended tokens.
238
+ batch_indices, pad_indices = torch.where(input_ids == pad_token_id)
239
+ indices_to_mask = new_token_positions[batch_indices, pad_indices]
240
+
241
+ final_embedding[batch_indices, indices_to_mask] = 0
242
+
243
+ if labels is None:
244
+ final_labels = None
245
+
246
+ return final_embedding, final_attention_mask, final_labels, position_ids
247
+
248
+ def _extract_image_features(self,
249
+ pixel_values: torch.FloatTensor | list[torch.FloatTensor],
250
+ grid_thws: torch.FloatTensor,
251
+ ):
252
+ """
253
+ Args:
254
+ pixel_values (:obj:`torch.FloatTensor` of shape :obj:`(sum_num_image_tokens, channels)`):
255
+ The pixel values of the images processed by image processor.
256
+ grid_thws: (B,3)
257
+
258
+ Returns:
259
+ selected_image_feature (:obj:`torch.FloatTensor` of shape :obj:`(num_image_tokens, embed_dim)`):
260
+ The selected image features to use as input to the projector head.
261
+
262
+ """
263
+
264
+ assert len(grid_thws.shape)==2 and grid_thws.shape[1]==3, f"grid_thws must be a 2D tensor with shape (batched, 3), but got {grid_thws.shape}"
265
+ if isinstance(pixel_values, list):
266
+ pixel_values = torch.cat(pixel_values, dim=0)
267
+ image_features_ = self.vision_tower(pixel_values, grid_thw=grid_thws)
268
+ image_features_list = []
269
+ start_idx = 0
270
+ for i, grid_thw in enumerate(grid_thws):
271
+ end_idx = start_idx + (grid_thw[0] * grid_thw[1] * grid_thw[2]) // 4
272
+ image_features_list.append(image_features_[start_idx:end_idx, :])
273
+ start_idx = end_idx
274
+
275
+ selected_image_feature = torch.cat(image_features_list, dim=0)
276
+ feature_lengths = [x.size(0) for x in image_features_list]
277
+ return selected_image_feature, feature_lengths
278
+
279
+ def forward(
280
+ self,
281
+ input_ids: torch.LongTensor | None = None,
282
+ pixel_values: torch.FloatTensor | list[torch.FloatTensor] | None = None,
283
+ grid_thws: torch.Tensor = None,
284
+ attention_mask: torch.Tensor | None = None,
285
+ position_ids: torch.LongTensor | None = None,
286
+ past_key_values: list[torch.FloatTensor] | None = None,
287
+ inputs_embeds: torch.FloatTensor | None = None,
288
+ labels: torch.LongTensor | None = None,
289
+ use_cache: bool | None = None,
290
+ output_attentions: bool | None = None,
291
+ output_hidden_states: bool | None = None,
292
+ return_dict: bool | None = None,
293
+ ) -> tuple | LlavaCausalLMOutputWithPast:
294
+ r"""
295
+ Args:
296
+ labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
297
+ Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
298
+ config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
299
+ (masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
300
+
301
+ """
302
+
303
+ output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
304
+ output_hidden_states = (
305
+ output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
306
+ )
307
+ return_dict = return_dict if return_dict is not None else self.config.use_return_dict
308
+ if inputs_embeds is None:
309
+ # 1. Extract the input embeddings
310
+ inputs_embeds = self.get_input_embeddings()(input_ids)
311
+ # 2. Merge text and images
312
+ if pixel_values is not None and len(pixel_values) > 0 and input_ids.shape[1] != 1:
313
+ image_feature, feature_lengths = self._extract_image_features(
314
+ pixel_values, grid_thws)
315
+
316
+ inputs_embeds = inputs_embeds.to(image_feature.dtype) # num_tokens, embed_dim
317
+ inputs_embeds, attention_mask, labels, position_ids = \
318
+ self._merge_input_ids_with_image_features(image_feature, feature_lengths, inputs_embeds, input_ids, attention_mask, labels
319
+ )
320
+ # In case input_ids.shape[1] == 1 & pixel_values==None & past_key_values != None, we are in the case of
321
+ # generation with cache
322
+ elif past_key_values is not None and pixel_values is not None and input_ids.shape[1] == 1:
323
+ # Retrieve the first layer to inspect the logits and mask out the hidden states
324
+ # that are set to 0
325
+ first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]
326
+
327
+ # Sum all dimensions of head_dim (-2) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
328
+ batch_index, non_attended_tokens = torch.where(first_layer_past_key_value.float().sum(-2) == 0)
329
+
330
+ # Get the target length
331
+ target_length = input_ids.shape[1]
332
+ past_length = first_layer_past_key_value.shape[-1]
333
+
334
+ extended_attention_mask = torch.ones(
335
+ (attention_mask.shape[0], past_length),
336
+ dtype=attention_mask.dtype,
337
+ device=attention_mask.device,
338
+ )
339
+
340
+ # Filter out only the tokens that can be un-attended, this can happen
341
+ # if one uses Llava + Fused modules where the cache on the
342
+ # first iteration is already big enough, or if one passes custom cache
343
+ valid_indices = non_attended_tokens < extended_attention_mask.size(-1)
344
+ new_batch_index = batch_index[valid_indices]
345
+ new_non_attended_tokens = non_attended_tokens[valid_indices]
346
+
347
+ # Zero-out the places where we don't need to attend
348
+ extended_attention_mask[new_batch_index, new_non_attended_tokens] = 0
349
+
350
+ attention_mask = torch.cat((extended_attention_mask, attention_mask[:, -target_length:]), dim=1)
351
+ position_ids = torch.sum(attention_mask, dim=1).unsqueeze(-1) - 1
352
+
353
+ outputs = self.language_model(
354
+ attention_mask=attention_mask,
355
+ position_ids=position_ids,
356
+ past_key_values=past_key_values,
357
+ inputs_embeds=inputs_embeds,
358
+ use_cache=use_cache,
359
+ output_attentions=output_attentions,
360
+ output_hidden_states=output_hidden_states,
361
+ return_dict=return_dict,
362
+ )
363
+
364
+ logits = outputs[0]
365
+
366
+ loss = None
367
+ if labels is not None:
368
+ # Shift so that tokens < n predict n
369
+ if attention_mask is not None:
370
+ shift_attention_mask = attention_mask[..., 1:]
371
+ shift_logits = logits[..., :-1, :][shift_attention_mask.to(logits.device) != 0].contiguous()
372
+ shift_labels = labels[..., 1:][shift_attention_mask.to(labels.device) != 0].contiguous()
373
+ else:
374
+ shift_logits = logits[..., :-1, :].contiguous()
375
+ shift_labels = labels[..., 1:].contiguous()
376
+ # Flatten the tokens
377
+ loss_fct = nn.CrossEntropyLoss()
378
+ loss = loss_fct(
379
+ shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1).to(shift_logits.device)
380
+ )
381
+
382
+ if not return_dict:
383
+ output = (logits,) + outputs[1:]
384
+ return (loss,) + output if loss is not None else output
385
+
386
+ return LlavaCausalLMOutputWithPast(
387
+ loss=loss,
388
+ logits=logits,
389
+ past_key_values=outputs.past_key_values,
390
+ hidden_states=outputs.hidden_states,
391
+ attentions=outputs.attentions,
392
+ )
393
+
394
+ def prepare_inputs_for_generation(
395
+ self, input_ids, past_key_values=None, inputs_embeds=None, pixel_values=None, grid_thws=None, attention_mask=None, **kwargs
396
+ ):
397
+ if past_key_values is not None:
398
+ if isinstance(past_key_values, Cache):
399
+ cache_length = past_key_values.get_seq_length()
400
+ past_length = past_key_values.seen_tokens
401
+ else:
402
+ cache_length = past_length = past_key_values[0][0].shape[2]
403
+
404
+ # Keep only the unprocessed tokens:
405
+ # 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
406
+ # some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
407
+ # input)
408
+ if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
409
+ input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
410
+ # 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
411
+ # input_ids based on the past_length.
412
+ elif past_length < input_ids.shape[1]:
413
+ input_ids = input_ids[:, past_length:]
414
+ # 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
415
+ elif self.config.media_placeholder_token_id in input_ids:
416
+ input_ids = input_ids[:, input_ids.shape[1] - 1 :]
417
+ # If the cache has seen more tokens than it can hold, then the cache has a size limit. Let's discard the
418
+ # older attention values, as their corresponding values are not part of the input.
419
+ if cache_length < past_length and attention_mask is not None:
420
+ attention_mask = attention_mask[:, -(cache_length + input_ids.shape[1]) :]
421
+
422
+ position_ids = kwargs.get("position_ids", None)
423
+ if attention_mask is not None and position_ids is None:
424
+ # create position_ids on the fly for batch generation
425
+ position_ids = attention_mask.long().cumsum(-1) - 1
426
+ position_ids.masked_fill_(attention_mask == 0, 1)
427
+ if past_key_values:
428
+ position_ids = position_ids[:, -input_ids.shape[1] :]
429
+
430
+ # if `inputs_embeds` are passed, we only want to use them in the 1st generation step
431
+ if inputs_embeds is not None and past_key_values is None:
432
+ model_inputs = {"inputs_embeds": inputs_embeds}
433
+ else:
434
+ model_inputs = {"input_ids": input_ids}
435
+
436
+ model_inputs.update(
437
+ {
438
+ "position_ids": position_ids,
439
+ "past_key_values": past_key_values,
440
+ "use_cache": kwargs.get("use_cache"),
441
+ "attention_mask": attention_mask,
442
+ "pixel_values": pixel_values,
443
+ "grid_thws": grid_thws,
444
+ }
445
+ )
446
+ return model_inputs
447
+
448
+ def _reorder_cache(self, *args, **kwargs):
449
+ return self.language_model._reorder_cache(*args, **kwargs)
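
For orientation, a minimal loading sketch for the class defined above. This is an assumption-laden example rather than the repository's documented API: the repository id below is a placeholder, and it presumes that the repo's `config.json` registers `OpenCUAConfig` and `OpenCUAForConditionalGeneration` through `auto_map` so that `trust_remote_code` loading can resolve them.

```python
import torch
from transformers import AutoConfig, AutoModel

repo_id = "path/to/this-repo"  # placeholder, replace with the actual repository id

# Assumption: auto_map in config.json points AutoModel at OpenCUAForConditionalGeneration.
config = AutoConfig.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo_id,
    config=config,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
).eval()
```

At generation time, `forward` expects `pixel_values` and `grid_thws` alongside `input_ids`, as defined in the file above; `prepare_inputs_for_generation` carries both through the cache-aware decoding path.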
preprocessor_config.json ADDED
@@ -0,0 +1,18 @@
1
+ {
2
+ "min_pixels": 3136,
3
+ "max_pixels": 12845056,
4
+ "patch_size": 14,
5
+ "temporal_patch_size": 2,
6
+ "merge_size": 2,
7
+ "image_mean": [
8
+ 0.48145466,
9
+ 0.4578275,
10
+ 0.40821073
11
+ ],
12
+ "image_std": [
13
+ 0.26862954,
14
+ 0.26130258,
15
+ 0.27577711
16
+ ],
17
+ "image_processor_type": "Qwen2VLImageProcessor"
18
+ }
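
The pixel bounds above translate directly into a visual-token budget: with `patch_size` 14 and a 2×2 spatial merge, each merged visual token covers a 28×28-pixel area. A rough check of the implied limits (assuming the usual Qwen2-VL resize-to-pixel-budget behaviour; exact counts depend on aspect-ratio rounding):

```python
patch_size, merge_size = 14, 2
min_pixels, max_pixels = 3136, 12845056

pixels_per_token = (patch_size * merge_size) ** 2  # 784 pixels per merged visual token
print(min_pixels // pixels_per_token)              # 4 tokens at the minimum resolution
print(max_pixels // pixels_per_token)              # 16384 tokens at the maximum resolution
```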
processing_opencua.py ADDED
@@ -0,0 +1,93 @@
1
+ # processing_opencua.py
2
+ import torch
3
+ from typing import List, Dict, Any, Union
4
+ from PIL import Image
5
+ from transformers.processing_utils import ProcessorMixin, BatchFeature
6
+ from transformers import AutoTokenizer, AutoImageProcessor
7
+
8
+ PLACEHOLDER = "<|media_placeholder|>"
9
+
10
+ class OpenCUAProcessor(ProcessorMixin):
11
+ attributes = ["image_processor", "tokenizer", "image_token_id", "merge_size"]
12
+
13
+ def __init__(self, image_processor, tokenizer, image_token_id: int = 151664, merge_size: int = 2, **kwargs):
14
+ self.image_processor = image_processor
15
+ self.tokenizer = tokenizer
16
+ self.image_token_id = image_token_id
17
+ self.merge_size = getattr(image_processor, "merge_size", merge_size)
18
+
19
+ @classmethod
20
+ def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
21
+ trust = kwargs.get("trust_remote_code", True)
22
+ # Prefer this repo's TikTokenV3; fall back to AutoTokenizer on failure (only used for initialization / as a placeholder)
23
+ try:
24
+ from tokenization_opencua import TikTokenV3
25
+ tok = TikTokenV3.from_pretrained(pretrained_model_name_or_path, trust_remote_code=trust)
26
+ except Exception:
27
+ tok = AutoTokenizer.from_pretrained(pretrained_model_name_or_path, trust_remote_code=trust)
28
+ imgproc = AutoImageProcessor.from_pretrained(pretrained_model_name_or_path, trust_remote_code=trust)
29
+ return cls(imgproc, tok, **kwargs)
30
+
31
+ def apply_chat_template(self, messages: List[Dict[str, Any]], **kwargs) -> Union[str, List[int]]:
32
+ return self.tokenizer.apply_chat_template(messages, **kwargs)
33
+
34
+ # The methods below serve the Hugging Face code path; vLLM initialization only needs this class to instantiate successfully
35
+ def __call__(self, *args, **kwargs) -> BatchFeature:
36
+ # Return a minimal structure so an unexpected call does not crash
37
+ data = {"input_ids": torch.zeros(1, 1, dtype=torch.long)}
38
+ return BatchFeature(data=data)
39
+
40
+ # Optional helper for your own scripts
41
+ def prepare_vllm_inputs(self, messages, images, add_generation_prompt=True):
42
+ text = self.apply_chat_template(messages, tokenize=False, add_generation_prompt=add_generation_prompt)
43
+ proc = self.image_processor(images=images, return_tensors="pt")
44
+ grid = torch.as_tensor(proc["image_grid_thw"])
45
+ merge = getattr(self, "merge_size", 2)
46
+ for thw in grid:
47
+ num = int((thw[0] * thw[1] * thw[2]) // (merge ** 2))
48
+ text = text.replace(PLACEHOLDER, PLACEHOLDER * num, 1)
49
+ return text, images
50
+
51
+
52
+
53
+ # # processing_opencua.py
54
+ # from transformers import Qwen2_5_VLProcessor, AutoTokenizer, AutoImageProcessor
55
+
56
+ # class OpenCUAProcessor(Qwen2_5_VLProcessor):
57
+ # # A string is enough here, but we load it manually in from_pretrained to avoid string-based reflection
58
+ # tokenizer_class = "TikTokenV3"
59
+
60
+ # @classmethod
61
+ # def from_pretrained(cls, pretrained_model_name_or_path, **kwargs):
62
+ # # Make sure remote code is allowed
63
+ # trust_remote_code = kwargs.get("trust_remote_code", False)
64
+
65
+ # # 1) Load the tokenizer manually (resolved via tokenizer_config.json in the model directory -> TikTokenV3 + tokenization_opencua.py)
66
+ # tokenizer = AutoTokenizer.from_pretrained(
67
+ # pretrained_model_name_or_path,
68
+ # trust_remote_code=trust_remote_code,
69
+ # )
70
+
71
+ # # 2) Load the image processor manually (keeping Qwen2VLImageProcessor)
72
+ # image_processor = AutoImageProcessor.from_pretrained(
73
+ # pretrained_model_name_or_path,
74
+ # trust_remote_code=trust_remote_code,
75
+ # )
76
+
77
+ # # 3) Fetch the chat_template if the tokenizer provides one
78
+ # chat_template = getattr(tokenizer, 'chat_template', None)
79
+
80
+ # # 4) Build and return a Qwen2.5-VL Processor instance, passing the chat_template through
81
+ # processor = cls(image_processor=image_processor, tokenizer=tokenizer, chat_template=chat_template)
82
+
83
+ # # 5) Add the attributes vLLM expects
84
+ # # These token IDs must match the definitions in tokenizer_config.json
85
+ # processor.image_token = "<|media_placeholder|>" # use OpenCUA's media placeholder
86
+ # processor.video_token = "<|media_placeholder|>" # videos use the same placeholder
87
+
88
+ # # Add the token IDs (taken from tokenizer_config.json)
89
+ # vocab = tokenizer.get_vocab()
90
+ # processor.image_token_id = vocab.get("<|media_placeholder|>", 151664) # defaults to 151664
91
+ # processor.video_token_id = vocab.get("<|media_placeholder|>", 151664) # videos use the same ID
92
+
93
+ # return processor
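
As a concrete illustration of the placeholder expansion in `prepare_vllm_inputs` above: each image contributes `t * h * w // merge_size**2` copies of `<|media_placeholder|>`, matching the per-image feature lengths computed by `_extract_image_features` in modeling_opencua.py. The grid values below are made-up examples, not outputs for any particular image:

```python
merge_size = 2

def num_placeholders(grid_thw):
    t, h, w = grid_thw
    return (t * h * w) // (merge_size ** 2)

print(num_placeholders((1, 64, 92)))  # 1472 placeholder tokens for this hypothetical grid
print(num_placeholders((1, 16, 16)))  # 64
```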
processor_config.json ADDED
@@ -0,0 +1,4 @@
1
+ {
2
+ "processor_class": "Qwen2VLProcessor"
3
+ }
4
+
qwen.tiktoken ADDED
The diff for this file is too large to render. See raw diff
 
tiktoken.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:b2b1b8dfb5cc5f024bafc373121c6aba3f66f9a5a0269e243470a1de16a33186
3
+ size 2561218
tokenization_opencua.py ADDED
@@ -0,0 +1,379 @@
1
+ import os
2
+ import tiktoken
3
+
4
+ from logging import getLogger
5
+ from pathlib import Path
6
+ from typing import (
7
+ cast,
8
+ Tuple,
9
+ Dict,
10
+ Iterator,
11
+ List,
12
+ Union,
13
+ Optional,
14
+ )
15
+ from shutil import copyfile
16
+ from tiktoken.load import load_tiktoken_bpe
17
+ from tokenizers import AddedToken
18
+ from transformers.tokenization_utils import PreTrainedTokenizer
19
+ from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode
20
+
21
+ # Import Qwen2Tokenizer so it can be used as a base class
22
+ try:
23
+ from transformers.models.qwen2.tokenization_qwen2 import Qwen2Tokenizer
24
+ QWEN2_AVAILABLE = True
25
+ except ImportError:
26
+ QWEN2_AVAILABLE = False
27
+ Qwen2Tokenizer = PreTrainedTokenizer
28
+
29
+
30
+ logger = getLogger(__name__)
31
+ VOCAB_FILES_NAMES = {"vocab_file": "tiktoken.model"}
32
+
33
+ class TikTokenTokenizer(PreTrainedTokenizer):
34
+ """
35
+ Tokenizing and encoding/decoding text using the Tiktoken tokenizer. See megatron/tokenizer/tiktoken_tokenizer.py.
36
+
37
+ This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to
38
+ this superclass for more information regarding those methods.
39
+
40
+ Args:
41
+ vocab_file (`str`):
42
+ The path to the Tiktoken model file.
43
+ bos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[BOS]"`):
44
+ The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.
45
+ eos_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[EOS]"`):
46
+ The end of sequence token.
47
+ unk_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[UNK]"`):
48
+ The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this
49
+ token instead. The second to last item in special_tokens.
50
+ pad_token (`str` or `tokenizers.AddedToken`, *optional*, defaults to `"[PAD]"`):
51
+ The token used for padding, for example when batching sequences of different lengths.
52
+ additional_special_tokens (list of `str`, *optional*):
53
+ A tuple or a list of additional tokens, which will be marked as `special`, meaning that they will be
54
+ skipped when decoding if `skip_special_tokens` is set to `True`.
55
+ """
56
+
57
+ vocab_files_names = VOCAB_FILES_NAMES
58
+
59
+ model_input_names = ["input_ids", "attention_mask"]
60
+
61
+ special_tokens: Dict[str, int]
62
+
63
+ num_reserved_special_tokens = 256
64
+
65
+ pat_str = "|".join(
66
+ [
67
+ r"""[\p{Han}]+""",
68
+ r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]*[\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
69
+ r"""[^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]+[\p{Ll}\p{Lm}\p{Lo}\p{M}&&[^\p{Han}]]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?""",
70
+ r"""\p{N}{1,3}""",
71
+ r""" ?[^\s\p{L}\p{N}]+[\r\n]*""",
72
+ r"""\s*[\r\n]+""",
73
+ r"""\s+(?!\S)""",
74
+ r"""\s+""",
75
+ ]
76
+ )
77
+
78
+ def __init__(
79
+ self,
80
+ vocab_file,
81
+ bos_token: Union[str, AddedToken] = "[BOS]",
82
+ eos_token: Union[str, AddedToken] = "[EOS]",
83
+ unk_token: Union[str, AddedToken, None] = None,
84
+ pad_token: Union[str, AddedToken, None] = None,
85
+ additional_special_tokens: Optional[List[str]] = None,
86
+ added_tokens_decoder: Optional[dict] = None,
87
+ **kwargs,
88
+ ):
89
+ assert os.path.isfile(vocab_file), vocab_file
90
+
91
+ if additional_special_tokens is None:
92
+ # dumping mode
93
+ used_special_tokens = [
94
+ "<|im_end|>",
95
+ "<|im_user|>",
96
+ "<|im_assistant|>",
97
+ "<|reserved_token_0|>",
98
+ "<|start_header_id|>",
99
+ "<|end_header_id|>",
100
+ "<|reserved_token_1|>",
101
+ "[EOT]",
102
+ "<|im_system|>",
103
+ "<|reserved_token_2|>",
104
+ "<|reserved_token_3|>",
105
+ "<|reserved_token_4|>",
106
+ "<|reserved_token_5|>",
107
+ "<|reserved_token_6|>",
108
+ "<|reserved_token_7|>",
109
+ "<|im_middle|>",
110
+ "<|media_begin|>",
111
+ "<|media_content|>",
112
+ "<|media_end|>",
113
+ "<|media_placeholder|>",
114
+ # Add the tokens required by standard Qwen2.5-VL
115
+ "<|vision_start|>",
116
+ "<|vision_end|>",
117
+ "<|image_pad|>",
118
+ "<|video_pad|>",
119
+ ]
120
+ used_reserved_tokens = 12 # originally 8 reserved tokens, plus 4 new vision-related tokens
121
+ last_reserved_token_id = self.num_reserved_special_tokens - 4 - len(used_special_tokens) + used_reserved_tokens - 1
122
+ additional_special_tokens = used_special_tokens + [
123
+ f"<|reserved_token_{i}|>"
124
+ for i in range(used_reserved_tokens, last_reserved_token_id + 1)
125
+ ]
126
+ # num_reserved_special_tokens = additional_special_tokens + BOS + EOS + unk_token + pad_token
127
+ assert len(additional_special_tokens) + 4 == self.num_reserved_special_tokens, f"additional_special_tokens num: {len(additional_special_tokens)} is not correct"
128
+ # we assume that the instance is under initialization and unk_token and pad_token should be automatically inferred
129
+ if unk_token is not None:
130
+ raise ValueError("unk_token should not be set in dumping mode when additional_special_tokens is None")
131
+ if pad_token is not None:
132
+ raise ValueError("pad_token should not be set in dumping mode when additional_special_tokens is None")
133
+ # last two reserved tokens
134
+ unk_token = "[UNK]"
135
+ pad_token = "[PAD]"
136
+
137
+ logger.info(f"adding unk_token: {unk_token} and pad_token: {pad_token}")
138
+ self.additional_special_tokens = additional_special_tokens
139
+ special_tokens = [str(bos_token), str(eos_token)] + additional_special_tokens + [str(unk_token), str(pad_token)]
140
+
141
+ self.vocab_file = vocab_file
142
+ mergeable_ranks = load_tiktoken_bpe(vocab_file)
143
+ num_base_tokens = len(mergeable_ranks)
144
+ self.special_tokens = {
145
+ token: num_base_tokens + i for i, token in enumerate(special_tokens)
146
+ }
147
+ else:
148
+ self.additional_special_tokens = additional_special_tokens
149
+ special_tokens_mapping = {
150
+ i: added_tokens_decoder[i].content for i in added_tokens_decoder
151
+ }
152
+
153
+ self.vocab_file = vocab_file
154
+ mergeable_ranks = load_tiktoken_bpe(vocab_file)
155
+ num_base_tokens = len(mergeable_ranks)
156
+ self.special_tokens = {
157
+ special_tokens_mapping.get(i, f"<|reserved_token_{i}|>"): i
158
+ for i in range(
159
+ num_base_tokens, num_base_tokens + self.num_reserved_special_tokens + 2
160
+ )
161
+ }
162
+
163
+
164
+
165
+ self.model = tiktoken.Encoding(
166
+ name=Path(vocab_file).name,
167
+ pat_str=self.pat_str,
168
+ mergeable_ranks=mergeable_ranks,
169
+ special_tokens=self.special_tokens,
170
+ )
171
+ logger.info(f"Reloaded tiktoken model from {vocab_file}")
172
+
173
+ self.n_words: int = self.model.n_vocab
174
+ # BOS / EOS token IDs
175
+ self.bos_id: int = self.special_tokens[str(bos_token)]
176
+ self.eos_id: int = self.special_tokens[str(eos_token)]
177
+
178
+ logger.info(
179
+ f"#words: {self.n_words} - BOS ID: {self.bos_id} - EOS ID: {self.eos_id}"
180
+ )
181
+
182
+ self.pad_id: int = self.special_tokens[str(pad_token)]
183
+ self.unk_id: int = self.special_tokens[str(unk_token)]
184
+ self.byte_encoder = bytes_to_unicode()
185
+ self.byte_decoder = {v: k for k, v in self.byte_encoder.items()}
186
+
187
+ self.decoder = {}
188
+ for i in range(self.n_words):
189
+ # Taken from https://gist.github.com/xenova/a452a6474428de0182b17605a98631ee
190
+ decoding = ''.join([
191
+ self.byte_encoder[ord(char)] for char in
192
+ self.model.decode_single_token_bytes(i).decode('latin-1')
193
+ ])
194
+ self.decoder[i] = decoding
195
+
196
+ self.encoder = {}
197
+ for i in range(self.n_words):
198
+ if i in self.decoder:
199
+ self.encoder[self.decoder[i]] = i
200
+
201
+ super().__init__(
202
+ bos_token=bos_token,
203
+ eos_token=eos_token,
204
+ unk_token=unk_token,
205
+ pad_token=pad_token,
206
+ additional_special_tokens=self.additional_special_tokens,
207
+ **kwargs,
208
+ )
209
+ self.all_special_ids_set = set(self.all_special_ids)
210
+
211
+ def encode(
212
+ self,
213
+ text: str,
214
+ allow_special_tokens = True,
215
+ **kwargs
216
+ ) -> List[int]:
217
+ """
218
+ Encodes a string into a list of token IDs.
219
+
220
+ Args:
221
+ text (str): The input string to be encoded.
222
+
223
+ Returns:
224
+ list[int]: A list of token IDs.
225
+ """
226
+ # If there are other kwargs, we should call super().encode because there is a lot of code
227
+ # to handle those args. super().encode will eventually call _tokenize and _convert_token_to_id.
228
+ # NOTE: our encode method is not compatible with the super().encode method,
229
+ # e.g. split_special_tokens' default is True in our encode method.
230
+ if len(kwargs) > 0:
231
+ logger.warning(f"Calling super().encode with {kwargs}")
232
+ return super().encode(text, **kwargs)
233
+
234
+ assert type(text) is str
235
+
236
+ # The tiktoken tokenizer can handle <=400k chars without
237
+ # pyo3_runtime.PanicException.
238
+ TIKTOKEN_MAX_ENCODE_CHARS = 400_000
239
+
240
+ # https://github.com/openai/tiktoken/issues/195
241
+ # Here we iterate over subsequences and split if we exceed the limit
242
+ # of max consecutive non-whitespace or whitespace characters.
243
+ MAX_NO_WHITESPACES_CHARS = 25_000
244
+
245
+ texts = self.pre_tokenizer_process(text)
246
+
247
+ all_substrs = []
248
+ for text in texts:
249
+ substrs = (
250
+ substr
251
+ for i in range(0, len(text), TIKTOKEN_MAX_ENCODE_CHARS)
252
+ for substr in self._split_whitespaces_or_nonwhitespaces(
253
+ text[i: i + TIKTOKEN_MAX_ENCODE_CHARS], MAX_NO_WHITESPACES_CHARS
254
+ )
255
+ )
256
+ all_substrs.extend(substrs)
257
+
258
+ t: List[int] = []
259
+ for substr in all_substrs:
260
+ if allow_special_tokens:
261
+ t.extend(
262
+ self.model.encode(
263
+ substr,
264
+ allowed_special="all",
265
+ )
266
+ )
267
+ else:
268
+ t.extend(
269
+ self.model.encode(
270
+ substr,
271
+ disallowed_special=(),
272
+ )
273
+ )
274
+
275
+ return t
276
+
277
+ def decode(
278
+ self,
279
+ token_ids: Union[int, List[int]],
280
+ **kwargs
281
+ ) -> str:
282
+ """
283
+ Decodes a list of token IDs into a string.
284
+
285
+ Args:
286
+ token_ids (List[int]): The list of token IDs to be decoded.
287
+
288
+ Returns:
289
+ str: The decoded string.
290
+ """
291
+ # If there are other kwargs, we should call super().decode because there is a lot of code
292
+ # to handle those args. super().decode will eventually call convert_tokens_to_string and _convert_id_to_token.
293
+ if len(kwargs) > 0:
294
+ return super().decode(token_ids, **kwargs)
295
+
296
+ if type(token_ids) is int:
297
+ token_ids = [token_ids]
298
+
299
+ return self.model.decode(cast(List[int], token_ids))
300
+
301
+ @staticmethod
302
+ def _split_whitespaces_or_nonwhitespaces(
303
+ s: str, max_consecutive_slice_len: int
304
+ ) -> Iterator[str]:
305
+ """
306
+ Splits the string `s` so that each substring contains no more than `max_consecutive_slice_len`
307
+ consecutive whitespaces or consecutive non-whitespaces.
308
+ """
309
+ current_slice_len = 0
310
+ current_slice_is_space = s[0].isspace() if len(s) > 0 else False
311
+ slice_start = 0
312
+
313
+ for i in range(len(s)):
314
+ is_now_space = s[i].isspace()
315
+
316
+ if current_slice_is_space ^ is_now_space:
317
+ current_slice_len = 1
318
+ current_slice_is_space = is_now_space
319
+ else:
320
+ current_slice_len += 1
321
+ if current_slice_len > max_consecutive_slice_len:
322
+ yield s[slice_start:i]
323
+ slice_start = i
324
+ current_slice_len = 1
325
+ yield s[slice_start:]
326
+
327
+ def pre_tokenizer_process(self, text: str) -> List[str]:
328
+ """
329
+ pre-tokenizes the input text into a list of tokens.
330
+ This method is used to split the input text into smaller chunks for internal processing.
331
+ """
332
+ return [text]
333
+
334
+
335
+ """ ----- Below are the abstract methods required by PreTrainedTokenizer ----- """
336
+ @property
337
+ def vocab_size(self) -> int:
338
+ return self.n_words
339
+
340
+ def get_vocab(self) -> Dict[str, int]:
341
+ return self.encoder
342
+
343
+ def _tokenize(self, text: str, **kwargs) -> List[str]:
344
+ return [
345
+ self.decoder[t]
346
+ for t in self.encode(text)
347
+ ]
348
+
349
+ def _convert_token_to_id(self, token: str) -> int:
350
+ return self.encoder.get(token, self.unk_id)
351
+
352
+ def _convert_id_to_token(self, index: int) -> str:
353
+ return self.decoder.get(index)
354
+
355
+ @staticmethod
356
+ def clean_up_tokenization(out_string: str) -> str:
357
+ return out_string
358
+
359
+ def convert_tokens_to_string(self, tokens: List[str]) -> str:
360
+ text = ''.join(tokens)
361
+ text = bytearray([self.byte_decoder[c] for c in text]).decode('utf-8', 'replace')
362
+ return text
363
+
364
+ def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
365
+ if not os.path.isdir(save_directory):
366
+ raise ValueError(f"vocabulary path ({save_directory}) should be a directory")
367
+ out_vocab_file = os.path.join(
368
+ save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
369
+ )
370
+
371
+ if os.path.abspath(self.vocab_file) != os.path.abspath(out_vocab_file) and os.path.isfile(self.vocab_file):
372
+ copyfile(self.vocab_file, out_vocab_file)
373
+
374
+ return (out_vocab_file,)
375
+
376
+
377
+ class TikTokenV3(TikTokenTokenizer):
378
+ num_reserved_special_tokens = 293 + 128
379
+ pat_str = "(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
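A minimal usage sketch for the tokenizer module above: the repository path is a placeholder, and `trust_remote_code=True` is needed because the `TikTokenV3` class is resolved through the `auto_map` entry in `tokenizer_config.json` below rather than a built-in `transformers` class.

```python
from transformers import AutoTokenizer

# Placeholder: point this at a local checkout of this repo or its Hub id.
REPO = "path/to/this/repo"

# trust_remote_code=True lets AutoTokenizer import tokenization_opencua.TikTokenV3
# via the auto_map entry in tokenizer_config.json.
tokenizer = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)

# encode() goes straight through the tiktoken Encoding (special tokens allowed),
# and decode() round-trips the ids back through tiktoken.
ids = tokenizer.encode("Hello, world!")
print(ids)
print(tokenizer.decode(ids))
```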
tokenizer_config.json ADDED
@@ -0,0 +1,58 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "151643": {"content": "[BOS]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
4
+ "151644": {"content": "[EOS]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
5
+ "151645": {"content": "<|im_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
6
+ "151646": {"content": "<|im_user|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
7
+ "151647": {"content": "<|im_assistant|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
8
+ "151648": {"content": "<|reserved_token_0|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
9
+ "151649": {"content": "<|start_header_id|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
10
+ "151650": {"content": "<|end_header_id|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
11
+ "151651": {"content": "<|reserved_token_1|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
12
+ "151652": {"content": "[EOT]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
13
+ "151653": {"content": "<|im_system|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
14
+ "151654": {"content": "<|reserved_token_2|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
15
+ "151655": {"content": "<|reserved_token_3|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
16
+ "151656": {"content": "<|reserved_token_4|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
17
+ "151657": {"content": "<|reserved_token_5|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
18
+ "151658": {"content": "<|reserved_token_6|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
19
+ "151659": {"content": "<|reserved_token_7|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
20
+ "151660": {"content": "<|im_middle|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
21
+ "151661": {"content": "<|media_begin|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
22
+ "151662": {"content": "<|media_content|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
23
+ "151663": {"content": "<|media_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
24
+ "151664": {"content": "<|media_placeholder|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
25
+
26
+ "151665": {"content": "<|vision_start|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
27
+ "151666": {"content": "<|vision_end|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
28
+ "151667": {"content": "<|image_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
29
+ "151668": {"content": "<|video_pad|>", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
30
+
31
+ "152062": {"content": "[UNK]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true},
32
+ "152063": {"content": "[PAD]", "lstrip": false, "normalized": false, "rstrip": false, "single_word": false, "special": true}
33
+ },
34
+
35
+ "additional_special_tokens": [
36
+ "<|im_end|>", "<|im_user|>", "<|im_assistant|>",
37
+ "<|reserved_token_0|>", "<|start_header_id|>", "<|end_header_id|>",
38
+ "<|reserved_token_1|>", "[EOT]", "<|im_system|>",
39
+ "<|reserved_token_2|>", "<|reserved_token_3|>", "<|reserved_token_4|>",
40
+ "<|reserved_token_5|>", "<|reserved_token_6|>", "<|reserved_token_7|>",
41
+ "<|im_middle|>",
42
+ "<|media_begin|>", "<|media_content|>", "<|media_end|>", "<|media_placeholder|>",
43
+ "<|vision_start|>", "<|vision_end|>", "<|image_pad|>", "<|video_pad|>"
44
+ ],
45
+
46
+ "bos_token": "[BOS]",
47
+ "clean_up_tokenization_spaces": false,
48
+ "eos_token": "[EOS]",
49
+ "extra_special_tokens": {},
50
+ "chat_template": "{%- for message in messages -%}{%- if loop.first and messages[0]['role'] != 'system' -%}{{'<|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>'}}{%- endif -%}{%- if message['role'] == 'system' -%}{{'<|im_system|>'}}{%- endif -%}{%- if message['role'] == 'user' -%}{{'<|im_user|>'}}{%- endif -%}{%- if message['role'] == 'assistant' -%}{{'<|im_assistant|>'}}{%- endif -%}{{- message['role'] -}}{{'<|im_middle|>'}}{%- if message['content'] is string -%}{{- message['content'] + '<|im_end|>' -}}{%- else -%}{%- for content in message['content'] -%}{%- if content['type'] == 'image' or 'image' in content or 'image_url' in content -%}{{'<|media_begin|>image<|media_content|><|media_placeholder|><|media_end|>'}}{%- else -%}{{content['text']}}{%- endif -%}{%- endfor -%}{{'<|im_end|>'}}{%- endif -%}{%- endfor -%}{%- if add_generation_prompt -%}{{'<|im_assistant|>assistant<|im_middle|>'}}{%- endif -%}",
51
+ "model_max_length": 1000000000000000019884624838656,
52
+ "pad_token": "[PAD]",
53
+ "tokenizer_class": "TikTokenV3",
54
+ "unk_token": "[UNK]",
55
+ "auto_map": {
56
+ "AutoTokenizer": ["tokenization_opencua.TikTokenV3", null]
57
+ }
58
+ }
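To see how the `chat_template` above stitches the special tokens together, here is a hedged sketch of a single image-plus-text user turn; the repository path and image filename are placeholders, and the expected output is reconstructed from the template string itself.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/this/repo", trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "screenshot.png"},       # rendered as the <|media_*|> block
            {"type": "text", "text": "Click the Save button."},
        ],
    },
]

# tokenize=False returns the raw prompt string built by the Jinja template.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
print(prompt)
# One continuous string (wrapped here for readability):
#   <|im_system|>system<|im_middle|>You are a helpful assistant<|im_end|>
#   <|im_user|>user<|im_middle|><|media_begin|>image<|media_content|><|media_placeholder|><|media_end|>
#   Click the Save button.<|im_end|><|im_assistant|>assistant<|im_middle|>
```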
vocab.json ADDED
The diff for this file is too large to render. See raw diff