musashihinck committed
Commit 22045a9 • 1 Parent(s): f8bf4da

Updating preprocessor config to LlavaProcessor.py

Files changed (2)
  1. README.md +12 -18
  2. preprocessor_config.json +1 -1
README.md CHANGED

@@ -1,6 +1,6 @@
  ---
  language:
- - en
+ - en
  license_name: gemma-terms
  license_link: https://ai.google.dev/gemma/terms
  ---

@@ -19,18 +19,17 @@ Preprint: [arxiv.org/abs/2404.01331](https://arxiv.org/abs/2404.01331)

  The model has been finetuned for multimodal benchmark evaluations, but can also be used as a multimodal chatbot.

-
  ## Bias, Risks, and Limitations

  This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.

-
  ## How to Get Started with the Model

  Currently using `llava-gemma` requires a [modified preprocessor](https://huggingface.co/Intel/llava-gemma-2b/blob/main/processing_llavagemma.py).

- For example usage, see [`usage.py`](/usage.py) or the following code block:
+ _We are currently working on modifying the `LlavaProcessor` class to streamline usage (see [PR #30030](https://github.com/huggingface/transformers/pull/30030)), expect updates soon._
+
+ For current usage, see [`usage.py`](/usage.py) or the following code block:

-
  ```python
  import requests

@@ -62,7 +61,7 @@ url = "https://www.ilankelman.org/stopsigns/australia.jpg"
  image = Image.open(requests.get(url, stream=True).raw)
  inputs = processor(text=prompt, images=image, return_tensors="pt")
  inputs = {k: v.to('cuda') for k, v in inputs.items()}
-
+
  # Generate
  generate_ids = model.generate(**inputs, max_length=30)
  output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]

@@ -70,14 +69,10 @@ print(output)

  ```

-
-
-
  ## Training Details

  The `llava-gemma-2b` model was trained on 8 Gaudi 2 accelerators.

-
  ### Training Data

  The model was trained using the LLaVA-v1.5 data mixture.

@@ -89,14 +84,13 @@ This is listed as follows:
  - 450K academic-task-oriented VQA data mixture.
  - 40K ShareGPT data.

-
  ## Evaluation

- | LM Backbone​ | Vision Model​ | Pretrained Connector​ | GQA​ | MME​ cognition​ | MME​ perception​ | MM-Vet​ | POPE accuracy​ | POPE​ F1​ | VQAv2​ | TextVQA​ | ScienceQA​ Image​ | MMVP​ |
- | ------------ | ------------- | --------------------- | ------ | ---------------- | ----------------- | ------- | ------------------ | ------------ | ------ | -------- | -------------------- | ------ |
- | gemma-2b-it​ | CLIP​ | Yes​ | 0.531​ | 236.071​ | 1130.492​ | 17.706​ | 0.850​ | 0.839​ | 70.65​ | 28.06​ | 0.564​ | 0.287​ |
- | gemma-2b-it​ | CLIP​ | No​ | 0.481​ | 247.857​ | 934.611​ | 13.119​ | 0.784​ | 0.762​ | 61.74​ | ​ | 0.549​ | 0.180​ |
- | gemma-7b-it​ | CLIP​ | Yes​ | 0.472​ | 253.571​ | 894.910​ | 18.165​ | 0.848​ | 0.829​ | 68.7​ | ​ | 0.625​ | 0.327​ |
- | gemma-7b-it​ | CLIP​ | No​ | 0.472​ | 278.214​ | 857.274​ | 19.083​ | 0.782​ | 0.734​ | 65.09​ | ​ | 0.636​ | 0.240​ |
- | gemma-2b-it​ | DinoV2​ | Yes​ | 0.587​ | 307.143​ | 1132.970​ | 19.128​ | 0.853​ | 0.838​ | 71.37​ | 12.53​ | 0.555​ | 0.227​ |
- | gemma-2b-it​ | DinoV2​ | No​ | 0.501​ | 308.929​ | 959.351​ | 14.541​ | 0.793​ | 0.772​ | 61.65​ | 11.1​ | 0.568​ | 0.180​ |
+ | LM Backbone | Vision Model | Pretrained Connector | GQA | MME cognition | MME perception | MM-Vet | POPE accuracy | POPE F1 | VQAv2 | TextVQA | ScienceQA Image | MMVP |
+ | ----------- | ------------ | -------------------- | ----- | ------------- | -------------- | ------ | ------------- | ------- | ----- | ------- | --------------- | ----- |
+ | gemma-2b-it | CLIP | Yes | 0.531 | 236.071 | 1130.492 | 17.706 | 0.850 | 0.839 | 70.65 | 28.06 | 0.564 | 0.287 |
+ | gemma-2b-it | CLIP | No | 0.481 | 247.857 | 934.611 | 13.119 | 0.784 | 0.762 | 61.74 | | 0.549 | 0.180 |
+ | gemma-7b-it | CLIP | Yes | 0.472 | 253.571 | 894.910 | 18.165 | 0.848 | 0.829 | 68.7 | | 0.625 | 0.327 |
+ | gemma-7b-it | CLIP | No | 0.472 | 278.214 | 857.274 | 19.083 | 0.782 | 0.734 | 65.09 | | 0.636 | 0.240 |
+ | gemma-2b-it | DinoV2 | Yes | 0.587 | 307.143 | 1132.970 | 19.128 | 0.853 | 0.838 | 71.37 | 12.53 | 0.555 | 0.227 |
+ | gemma-2b-it | DinoV2 | No | 0.501 | 308.929 | 959.351 | 14.541 | 0.793 | 0.772 | 61.65 | 11.1 | 0.568 | 0.180 |
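The README's usage example only surfaces in fragments in the hunks above. For reference, a minimal end-to-end sketch consistent with those fragments follows. The setup lines (the checkpoint name, construction of the repo's `LlavaGemmaProcessor` from `processing_llavagemma.py`, and the chat-template prompt) are not visible in this diff and are assumptions; the image, `inputs`, and generate lines appear verbatim in the hunks.

```python
# Minimal sketch assembled from the diff fragments above. Setup lines are
# assumptions; only the image/inputs/generate lines are shown in the diff.
import requests
from PIL import Image
from transformers import AutoTokenizer, CLIPImageProcessor, LlavaForConditionalGeneration

# Assumption: processing_llavagemma.py from the repo is on the Python path.
from processing_llavagemma import LlavaGemmaProcessor

checkpoint = "Intel/llava-gemma-2b"  # assumption: this repo's checkpoint

model = LlavaForConditionalGeneration.from_pretrained(checkpoint).to('cuda')
processor = LlavaGemmaProcessor(
    tokenizer=AutoTokenizer.from_pretrained(checkpoint),
    image_processor=CLIPImageProcessor.from_pretrained(checkpoint),
)

# Assumption: Gemma-style chat template with an <image> placeholder.
prompt = processor.tokenizer.apply_chat_template(
    [{"role": "user", "content": "<image>\nWhat's the content of the image?"}],
    tokenize=False,
    add_generation_prompt=True,
)

# These lines appear verbatim in the diff above.
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors="pt")
inputs = {k: v.to('cuda') for k, v in inputs.items()}

# Generate
generate_ids = model.generate(**inputs, max_length=30)
output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(output)
```

Once the `LlavaProcessor` changes referenced in PR #30030 land upstream, the custom processor import should no longer be necessary.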
preprocessor_config.json CHANGED

@@ -36,7 +36,7 @@
    0.26130258,
    0.27577711
  ],
- "processor_class": "LlavaGemmaProcessor",
+ "processor_class": "LlavaProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "size": {