wlsaidhi committed on
Commit 0762ece · verified · 1 Parent(s): 0f82c72

Upload folder using huggingface_hub

.gitattributes CHANGED
@@ -34,3 +34,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 text_encoder/gemma/tokenizer.json filter=lfs diff=lfs merge=lfs -text
+tokenizer/tokenizer.json filter=lfs diff=lfs merge=lfs -text
model_index.json CHANGED
@@ -33,5 +33,6 @@
     "diffusers",
     "LTX2Vocoder"
   ],
-  "fastvideo_refine_lora_path": "FastVideo/LTX2-Distilled-LoRA"
+  "fastvideo_refine_lora_path": "FastVideo/LTX2-Distilled-LoRA",
+  "gemma_model_path": "text_encoder/gemma"
 }
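The merged result of this hunk can be sanity-checked offline. The fragment below is reconstructed from the diff and contains only the two keys visible here, not the full model_index.json:

```python
import json

# Fragment of model_index.json after this commit (only the keys visible in the diff).
model_index_fragment = json.loads("""
{
  "fastvideo_refine_lora_path": "FastVideo/LTX2-Distilled-LoRA",
  "gemma_model_path": "text_encoder/gemma"
}
""")

# The new key points at the Gemma text encoder bundled inside this repo,
# so the trailing comma added to the previous line is required for valid JSON.
print(model_index_fragment["gemma_model_path"])  # text_encoder/gemma
```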
text_encoder/gemma/.gitattributes CHANGED
@@ -33,10 +33,10 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
-tokenizer.model filter=lfs diff=lfs merge=lfs -text
+tokenizer.json filter=lfs diff=lfs merge=lfs -text
 model-00001-of-00005.safetensors filter=lfs diff=lfs merge=lfs -text
 model-00002-of-00005.safetensors filter=lfs diff=lfs merge=lfs -text
 model-00003-of-00005.safetensors filter=lfs diff=lfs merge=lfs -text
 model-00004-of-00005.safetensors filter=lfs diff=lfs merge=lfs -text
 model-00005-of-00005.safetensors filter=lfs diff=lfs merge=lfs -text
-tokenizer.json filter=lfs diff=lfs merge=lfs -text
+tokenizer.model filter=lfs diff=lfs merge=lfs -text
text_encoder/gemma/README.md CHANGED
@@ -1,33 +1,19 @@
 ---
-base_model: google/gemma-3-12b-it
 license: gemma
-tags:
-- gemma3
-- gemma
-- google
-pipeline_tag: image-text-to-text
 library_name: transformers
+pipeline_tag: image-text-to-text
 extra_gated_heading: Access Gemma on Hugging Face
-extra_gated_prompt: >-
-  To access Gemma on Hugging Face, you’re required to review and agree to
-  Google’s usage license. To do this, please ensure you’re logged in to Hugging
+extra_gated_prompt: To access Gemma on Hugging Face, you’re required to review and
+  agree to Google’s usage license. To do this, please ensure you’re logged in to Hugging
   Face and click below. Requests are processed immediately.
 extra_gated_button_content: Acknowledge license
+base_model: google/gemma-3-12b-pt
 ---
 
 # Gemma 3 model card
 
 **Model Page**: [Gemma](https://ai.google.dev/gemma/docs/core)
 
-> [!Note]
-> This repository corresponds to the 12B **instruction-tuned** version of the Gemma 3 model using Quantization Aware Training (QAT).
->
-> **The checkpoint in this repository is unquantized, please make sure to quantize with Q4_0 with your favorite tool**
->
-> Thanks to QAT, the model is able to preserve similar quality as `bfloat16` while significantly reducing the memory requirements
-> to load the model.
-
-
 **Resources and Technical Documentation**:
 
 * [Gemma 3 Technical Report][g3-tech-report]
@@ -72,6 +58,107 @@ for everyone.
   question, analysis of image content, or a summary of a document
 - Total output context of 8192 tokens
 
+### Usage
+
+Below are some code snippets to help you quickly get started with running the model. First, install the Transformers library. Gemma 3 is supported starting from transformers 4.50.0.
+
+```sh
+$ pip install -U transformers
+```
+
+Then, copy the snippet from the section that is relevant for your use case.
+
+#### Running with the `pipeline` API
+
+You can initialize the model and processor for inference with `pipeline` as follows.
+
+```python
+from transformers import pipeline
+import torch
+
+pipe = pipeline(
+    "image-text-to-text",
+    model="google/gemma-3-12b-it",
+    device="cuda",
+    torch_dtype=torch.bfloat16
+)
+```
+
+With instruction-tuned models, you need to use chat templates to process your inputs first. Then, you can pass the result to the pipeline.
+
+```python
+messages = [
+    {
+        "role": "system",
+        "content": [{"type": "text", "text": "You are a helpful assistant."}]
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
+            {"type": "text", "text": "What animal is on the candy?"}
+        ]
+    }
+]
+
+output = pipe(text=messages, max_new_tokens=200)
+print(output[0]["generated_text"][-1]["content"])
+# Okay, let's take a look!
+# Based on the image, the animal on the candy is a **turtle**.
+# You can see the shell shape and the head and legs.
+```
+
+#### Running the model on a single / multi GPU
+
+```python
+# pip install accelerate
+
+from transformers import AutoProcessor, Gemma3ForConditionalGeneration
+from PIL import Image
+import requests
+import torch
+
+model_id = "google/gemma-3-12b-it"
+
+model = Gemma3ForConditionalGeneration.from_pretrained(
+    model_id, device_map="auto"
+).eval()
+
+processor = AutoProcessor.from_pretrained(model_id)
+
+messages = [
+    {
+        "role": "system",
+        "content": [{"type": "text", "text": "You are a helpful assistant."}]
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
+            {"type": "text", "text": "Describe this image in detail."}
+        ]
+    }
+]
+
+inputs = processor.apply_chat_template(
+    messages, add_generation_prompt=True, tokenize=True,
+    return_dict=True, return_tensors="pt"
+).to(model.device, dtype=torch.bfloat16)
+
+input_len = inputs["input_ids"].shape[-1]
+
+with torch.inference_mode():
+    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
+    generation = generation[0][input_len:]
+
+decoded = processor.decode(generation, skip_special_tokens=True)
+print(decoded)
+
+# **Overall Impression:** The image is a close-up shot of a vibrant garden scene,
+# focusing on a cluster of pink cosmos flowers and a busy bumblebee.
+# It has a slightly soft, natural feel, likely captured in daylight.
+```
+
 ### Citation
 
 ```none
@@ -170,10 +257,6 @@ development workflow."*
 
 ## Evaluation
 
-> [!Note]
-> The evaluation in this section correspond to the original checkpoint, not the QAT checkpoint.
->
-
 Model evaluation metrics and results.
 
 ### Benchmark Results
text_encoder/gemma/config.json CHANGED
@@ -13,50 +13,28 @@
   "mm_tokens_per_image": 256,
   "model_type": "gemma3",
   "text_config": {
-    "attention_bias": false,
-    "attention_dropout": 0.0,
-    "attn_logit_softcapping": null,
-    "cache_implementation": "hybrid",
-    "final_logit_softcapping": null,
-    "head_dim": 256,
-    "hidden_activation": "gelu_pytorch_tanh",
     "hidden_size": 3840,
-    "initializer_range": 0.02,
     "intermediate_size": 15360,
-    "max_position_embeddings": 131072,
     "model_type": "gemma3_text",
     "num_attention_heads": 16,
     "num_hidden_layers": 48,
     "num_key_value_heads": 8,
-    "query_pre_attn_scalar": 256,
-    "rms_norm_eps": 1e-06,
-    "rope_local_base_freq": 10000,
     "rope_scaling": {
       "factor": 8.0,
       "rope_type": "linear"
     },
-    "rope_theta": 1000000,
-    "sliding_window": 1024,
-    "sliding_window_pattern": 6,
-    "torch_dtype": "bfloat16",
-    "use_cache": true,
-    "vocab_size": 262208
+    "sliding_window": 1024
   },
   "torch_dtype": "bfloat16",
-  "transformers_version": "4.52.0.dev0",
+  "transformers_version": "4.50.0.dev0",
   "vision_config": {
-    "attention_dropout": 0.0,
-    "hidden_act": "gelu_pytorch_tanh",
     "hidden_size": 1152,
     "image_size": 896,
     "intermediate_size": 4304,
-    "layer_norm_eps": 1e-06,
     "model_type": "siglip_vision_model",
     "num_attention_heads": 16,
-    "num_channels": 3,
     "num_hidden_layers": 27,
     "patch_size": 14,
-    "torch_dtype": "bfloat16",
     "vision_use_head": false
   }
 }
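This config.json hunk mostly deletes keys (plausibly ones that match library defaults, though the commit doesn't say so). A quick way to confirm that a slimmed config and the original agree on everything that remains is a one-sided dict diff; the helper below is a generic sketch, not a FastVideo or transformers utility, and the stand-in dicts use only values visible in the hunk:

```python
def removed_or_changed(old: dict, new: dict) -> dict:
    """Report keys whose values differ between old and new.
    A key absent from `new` shows up as (old_value, None)."""
    out = {}
    for key, old_val in old.items():
        new_val = new.get(key)
        if isinstance(old_val, dict) and isinstance(new_val, dict):
            nested = removed_or_changed(old_val, new_val)
            if nested:
                out[key] = nested
        elif new_val != old_val:
            out[key] = (old_val, new_val)
    return out

# Tiny stand-ins mirroring the shape of the hunk above (values from the diff).
old_text_config = {"head_dim": 256, "hidden_size": 3840, "sliding_window": 1024}
new_text_config = {"hidden_size": 3840, "sliding_window": 1024}

print(removed_or_changed(old_text_config, new_text_config))
# {'head_dim': (256, None)}
```

An empty result means the new config only dropped keys or kept identical values, never silently changed one.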
text_encoder/gemma/generation_config.json CHANGED
@@ -1,11 +1,13 @@
 {
+  "bos_token_id": 2,
   "cache_implementation": "hybrid",
   "do_sample": true,
   "eos_token_id": [
     1,
     106
   ],
+  "pad_token_id": 0,
   "top_k": 64,
   "top_p": 0.95,
-  "transformers_version": "4.52.0.dev0"
+  "transformers_version": "4.50.0.dev0"
 }
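The two ids added here matter for batched generation: sequences are prefixed with `bos_token_id` and shorter prompts are padded out with `pad_token_id`. A toy sketch of that preprocessing (this is not transformers code; only the two id values come from the hunk above):

```python
BOS_ID = 2   # "bos_token_id" added in this commit
PAD_ID = 0   # "pad_token_id" added in this commit

def pad_batch(sequences):
    """Prepend BOS to each sequence, then right-pad with PAD to a common length."""
    with_bos = [[BOS_ID] + seq for seq in sequences]
    max_len = max(len(seq) for seq in with_bos)
    return [seq + [PAD_ID] * (max_len - len(seq)) for seq in with_bos]

batch = pad_batch([[10, 11, 12], [20]])
print(batch)  # [[2, 10, 11, 12], [2, 20, 0, 0]]
```

Without an explicit `pad_token_id` in generation_config.json, batched callers would have to supply one themselves, which is why adding it here is a usability fix rather than a behavior change.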
text_encoder/gemma/model-00001-of-00005.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e6fb899db428481aafb45a20130457df6e247e7cb03b7d9f01ee4bc2a9a08138
+oid sha256:4847447e92599833e8dbaa3067cd201c3bb5c052efa91f11ba891e43234f7832
 size 4979902192
text_encoder/gemma/model-00002-of-00005.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:d251e7fe9799d529405ddb61705a44cd700bd30a8b66a8d44ae26ddf8365dbc6
+oid sha256:891bd54eed03cba9ee1e705533a02a8217fcc29f356e4a1f53e5fd0d178883ad
 size 4931296592
text_encoder/gemma/model-00003-of-00005.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0684ef801385f0669a0b3e4ab160c50877efdbfa40eb97788595985de2743e78
+oid sha256:7cee411d9d57324e50ce064a192cc5a858276d508611b12fc599e0c9767112e0
 size 4931296656
text_encoder/gemma/model-00004-of-00005.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:b4b964e6526f81ccfa625c900b72ce92d5e0fd2debb75998763038ad06b9c541
+oid sha256:8bc75a29a730c9e743cad013feda3b0991a913fafe787c58a1c6e20afad97723
 size 4931296656
text_encoder/gemma/model-00005-of-00005.safetensors CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4ef2de8f93e165b4e02425769fc566000b0674256ef0c3a27b23a0d45eb12088
+oid sha256:ed14bd4908c98fed9f61e8cd410167e0846de9abd78e0452ab092072e5d9252d
 size 4601000928
text_encoder/gemma/tokenizer.json CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:7d4046bf0505a327dd5a0abbb427ecd4fc82f99c2ceaa170bc61ecde12809b0c
-size 33384570
+oid sha256:4667f2089529e8e7657cfb6d1c19910ae71ff5f28aa7ab2ff2763330affad795
+size 33384568
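Each of the pointer files above records only an `oid sha256` and a `size`; the actual blob lives in LFS storage. A downloaded blob can be checked against its pointer with a few lines. This is a generic sketch of the pointer format shown above; the blob here is a stand-in, not a real file from this repo:

```python
import hashlib

def parse_lfs_pointer(text: str) -> dict:
    """Parse a git-lfs pointer file into {'oid': <hex digest>, 'size': <bytes>}."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {
        "oid": fields["oid"].split(":", 1)[1],  # strip the "sha256:" prefix
        "size": int(fields["size"]),
    }

def blob_matches(pointer: dict, blob: bytes) -> bool:
    """A blob is valid iff both its length and sha256 digest match the pointer."""
    return (len(blob) == pointer["size"]
            and hashlib.sha256(blob).hexdigest() == pointer["oid"])

# Stand-in blob and a pointer constructed to match it.
blob = b"hello lfs"
pointer_text = (
    "version https://git-lfs.github.com/spec/v1\n"
    f"oid sha256:{hashlib.sha256(blob).hexdigest()}\n"
    f"size {len(blob)}\n"
)
pointer = parse_lfs_pointer(pointer_text)
print(blob_matches(pointer, blob))  # True
```

Checking size first is cheap and catches truncated downloads before the full hash is computed.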
text_encoder/gemma/tokenizer_config.json CHANGED
@@ -2160,7 +2160,7 @@
       "normalized": false,
       "rstrip": false,
       "single_word": false,
-      "special": false
+      "special": true
     },
     "256000": {
       "content": "<end_of_image>",
@@ -2168,7 +2168,7 @@
       "normalized": false,
       "rstrip": false,
       "single_word": false,
-      "special": false
+      "special": true
     },
     "256001": {
       "content": "<unused99>",
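Flipping `"special": false` to `true` for the two image-marker tokens means decoders now treat them as special tokens, so decoding with `skip_special_tokens=True` drops them from the output. The toy decoder below illustrates the behavior; the mini-vocab and ids are made up for illustration, not the real Gemma tokenizer:

```python
# Hypothetical mini-vocab: token id -> (text, is_special).
# Only the two image markers are flagged special, mirroring this hunk.
VOCAB = {
    255999: ("<start_of_image>", True),
    256000: ("<end_of_image>", True),
    100: ("cat", False),
}

def decode(ids, skip_special_tokens=False):
    """Join token texts, optionally dropping tokens flagged as special."""
    pieces = []
    for i in ids:
        text, special = VOCAB[i]
        if special and skip_special_tokens:
            continue
        pieces.append(text)
    return " ".join(pieces)

print(decode([255999, 100, 256000]))                           # <start_of_image> cat <end_of_image>
print(decode([255999, 100, 256000], skip_special_tokens=True)) # cat
```

With the old `"special": false` setting, the raw marker strings would have leaked into decoded text even when callers asked to skip special tokens.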
tokenizer/added_tokens.json ADDED
@@ -0,0 +1,3 @@
+{
+  "<image_soft_token>": 262144
+}
tokenizer/special_tokens_map.json ADDED
@@ -0,0 +1,33 @@
+{
+  "boi_token": "<start_of_image>",
+  "bos_token": {
+    "content": "<bos>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "eoi_token": "<end_of_image>",
+  "eos_token": {
+    "content": "<eos>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "image_token": "<image_soft_token>",
+  "pad_token": {
+    "content": "<pad>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  },
+  "unk_token": {
+    "content": "<unk>",
+    "lstrip": false,
+    "normalized": false,
+    "rstrip": false,
+    "single_word": false
+  }
+}
tokenizer/tokenizer.json ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:4667f2089529e8e7657cfb6d1c19910ae71ff5f28aa7ab2ff2763330affad795
+size 33384568
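Worth noting: this new top-level `tokenizer/tokenizer.json` pointer carries the same oid and size as the updated `text_encoder/gemma/tokenizer.json`, i.e. the commit publishes one blob under two paths, and git-lfs stores the content-addressed blob only once. That can be confirmed directly from the pointers in this diff:

```python
# LFS sha256 oids copied verbatim from the two pointer diffs in this commit.
gemma_tokenizer_oid = "4667f2089529e8e7657cfb6d1c19910ae71ff5f28aa7ab2ff2763330affad795"      # text_encoder/gemma/tokenizer.json (updated)
top_level_tokenizer_oid = "4667f2089529e8e7657cfb6d1c19910ae71ff5f28aa7ab2ff2763330affad795"  # tokenizer/tokenizer.json (added)

# Same content address implies byte-identical files.
print(gemma_tokenizer_oid == top_level_tokenizer_oid)  # True
```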
tokenizer/tokenizer.model ADDED
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:1299c11d7cf632ef3b4e11937501358ada021bbdf7c47638d13c0ee982f2e79c
+size 4689074
tokenizer/tokenizer_config.json ADDED
The diff for this file is too large to render. See raw diff