Files changed (1)
  1. README.md +95 -43
README.md CHANGED
@@ -2,58 +2,78 @@
2
  license: other
3
  license_name: intel-research-use-license
4
  license_link: LICENSE
5
  ---
6
 
7
- # LLaVA-Llama3 Model Card
8
 
9
- _This model card corresponds to the instruction tuned 8B version of the model with the CLIP-based vision encoder._
10
 
11
 
12
- ## Overview
13
 
14
- `llava-llama-3-8b` is a large multimodal model (LMM) trained using the [LLaVA-v1.5 framework](https://arxiv.org/abs/2310.03744) with the 8-billion parameter [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B) model as language backbone.
15
 
16
- ## Uses
17
 
18
- The model has been finetuned for multimodal benchmark evaluations, but can also be used as a multimodal chatbot.
19
 
20
- ## Bias, Risks, and Limitations
21
 
22
- This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.
23
-
24
- ## Training Details
25
-
26
- The `llava-llama-3-8b` model was trained on a 4 node cluster with a total of 32 Gaudi 2 accelerators.
27
-
28
- ### Training Data
29
-
30
- The model was trained using the LLaVA-v1.5 data mixture.
31
-
32
- This is listed as follows:
33
-
34
- - 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
35
- - 158K GPT-generated multimodal instruction-following data.
36
- - 450K academic-task-oriented VQA data mixture.
37
- - 40K ShareGPT data.
38
-
39
- ## Evaluation
40
-
41
- | Model | Metrics |
42
- |----------|------------------|
43
- | ScienceQA| 72.9797 |
44
- | MMVet | 31.9725 |
45
- | llavaw | 56.9/61.9/73.6/65.7 |
46
- | Pope Acc | 87.33, F1 86.5 |
47
- | GQA | 60.6138 |
48
- | MMVP | 36 |
49
-
50
- ## License
51
- The weights are released under the Intel Research Use License Agreement (see LICENSE file)
52
- All usage code is licensed Apache 2.0
53
-
54
- ## Usage
55
-
56
- Please note, we only provide the trained weights difference and do not provide a copy of the base [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B) model. Any use of these weights requires a separate download of the base model.
57
 
58
  ```python
59
  # Copyright 2024 Intel Corporation
@@ -135,4 +155,36 @@ inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
135
  generate_ids = model.generate(**inputs, max_length=30)
136
  output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
137
  print(output)
138
- ```
 
2
  license: other
3
  license_name: intel-research-use-license
4
  license_link: LICENSE
5
+ tags:
6
+ - intel
7
+ - gaudi
8
+ - LLM
9
+ results:
10
+ - task:
11
+     type: Large Language Model
12
+     name: Large Language Model
13
+   metrics:
14
+   - type: GQA
15
+     name: GQA
16
+     value: 60.6138
17
+   - type: MMVP
18
+     name: MMVP
19
+     value: 36
20
+   - type: Pope Acc
21
+     name: Pope Acc
22
+     value: 87.33
23
+   - type: Pope F1
24
+     name: Pope F1
25
+     value: 86.5
26
+   - type: MMVet
27
+     name: MMVet
28
+     value: 31.9725
29
+   - type: ScienceQA
30
+     name: ScienceQA
31
+     value: 72.9797
32
+   - type: llavaw (1)
33
+     name: llavaw
34
+     value: 56.9
35
+   - type: llavaw (2)
36
+     name: llavaw
37
+     value: 61.9
38
+   - type: llavaw (3)
39
+     name: llavaw
40
+     value: 73.6
41
+   - type: llavaw (4)
42
+     name: llavaw
43
+     value: 65.7
44
+
45
+ library_name: transformers
46
+ pipeline_tag: image-text-to-text
47
  ---
48
 
49
+ ## Model Details: LLaVA-llama-3-8B
50
 
51
+ `llava-llama-3-8b` is a large multimodal model (LMM) trained using the [LLaVA-v1.5 framework](https://arxiv.org/abs/2310.03744), with the 8-billion parameter [`meta-llama/Meta-Llama-3-8B-Instruct`](https://huggingface.co/meta-llama/Meta-Llama-3-8B) model as the language backbone and a CLIP-based vision encoder.
52
 
53
+ | Model Details | Description |
54
+ | ----------- | ----------- |
55
+ | Authors | Intel: [Musashi Hinck*](https://huggingface.co/musashihinck), [Matthew L. Olson*](https://huggingface.co/matthewlyleolson), [Vasudev Lal](https://huggingface.co/vasudevlal) |
56
+ | Date | May 2024 |
57
+ | Version | 1 |
58
+ | Type | Large multimodal model (LMM) |
59
+ | Paper or Other Resources | [LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model](https://arxiv.org/abs/2404.01331) |
60
+ | License | [Intel Research Use License](https://huggingface.co/Intel/llava-llama-3-8b/blob/main/LICENSE) for the weights; all usage code is licensed under Apache 2.0 |
61
+ | Questions or Comments | [Community Tab](https://huggingface.co/Intel/llava-llama-3-8b/discussions) and [Intel DevHub Discord](https://discord.gg/rv2Gp55UJQ)|
62
 
63
+ This model card was created by [Eduardo Alvarez](https://huggingface.co/eduardo-alvarez) and the authors listed above.
64
 
65
+ ## Intended Use
66
 
67
+ | Intended Use | Description |
68
+ | ----------- | ----------- |
69
+ | Primary intended uses | The model has been finetuned for multimodal benchmark evaluations, but can also be used as a multimodal chatbot. |
70
+ | Primary intended users | Anyone using or evaluating multimodal models. |
71
+ | Out-of-scope uses | This model is not intended for uses that require high levels of factuality, high-stakes situations, mental health or medical applications, generating misinformation or disinformation, impersonating others, facilitating or inciting harassment or violence, or any use that could lead to the violation of a human right under the UN Declaration of Human Rights. |
72
 
 
73
 
74
+ ### How to use
75
 
76
+ Please note that we provide only the weight difference (delta) relative to the base meta-llama/Meta-Llama-3-8B-Instruct model, not a copy of the base model itself. Using these weights requires downloading the base model separately.
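As a rough illustration of what "weight difference" means here, the sketch below adds a delta to a set of base weights, parameter by parameter. It is a framework-free toy under stated assumptions: plain Python lists stand in for tensors, and the names `Weights` and `apply_weight_delta` are hypothetical, not part of this repository or of `transformers`. The code in this README may combine the weights differently.

```python
from typing import Dict, List

# Hypothetical stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def apply_weight_delta(base: Weights, delta: Weights) -> Weights:
    """Return a new mapping holding base + delta for every parameter."""
    # Both mappings must describe exactly the same set of parameters.
    if set(base) != set(delta):
        raise ValueError("base and delta must contain the same parameter names")

    merged: Weights = {}
    for name, base_values in base.items():
        delta_values = delta[name]
        # A length check plays the role of a tensor shape check in this toy version.
        if len(base_values) != len(delta_values):
            raise ValueError(f"shape mismatch for parameter '{name}'")
        merged[name] = [b + d for b, d in zip(base_values, delta_values)]
    return merged


# Toy values only; real checkpoints hold billions of parameters.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
delta = {"layer.weight": [0.5, -1.0], "layer.bias": [0.25]}
print(apply_weight_delta(base, delta))
```

Checking names and shapes before adding catches the case where the delta is paired with the wrong base checkpoint, which would otherwise produce weights that load but behave unpredictably.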
77
 
78
  ```python
79
  # Copyright 2024 Intel Corporation
 
155
  generate_ids = model.generate(**inputs, max_length=30)
156
  output = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
157
  print(output)
158
+ ```
159
+
160
+ ## Factors
161
+
162
+ | Factors | Description |
163
+ | ----------- | ----------- |
164
+ | Environment | Trained on a 4 node cluster with a total of 32 Gaudi 2 accelerators |
165
+ | Card Prompts | Model training and deployment on alternate hardware and software will change model performance |
166
+
167
+ ## Training Data
168
+
169
+ The model was trained using the LLaVA-v1.5 data mixture, which consists of:
170
+
171
+ - 558K filtered image-text pairs from LAION/CC/SBU, captioned by BLIP.
172
+ - 158K GPT-generated multimodal instruction-following data.
173
+ - 450K academic-task-oriented VQA data mixture.
174
+ - 40K ShareGPT data.
175
+
176
+ ## Ethical Considerations
177
+
178
+ Intel is committed to respecting human rights and avoiding causing or contributing to adverse impacts on human rights. See [Intel’s Global Human Rights Principles](https://www.intel.com/content/dam/www/central-libraries/us/en/documents/policy-human-rights.pdf). Intel’s products and software are intended only to be used in applications that do not cause or contribute to adverse impacts on human rights.
179
+
180
+ | Ethical Considerations | Description |
181
+ | ----------- | ----------- |
182
+ | Data | The model was trained using the LLaVA-v1.5 data mixture as described above. |
183
+ | Human life | The model is not intended to inform decisions central to human life or flourishing. |
184
+ | Mitigations | No additional risk mitigation strategies were considered during model development. |
185
+ | Risks and harms | This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm. |
186
+ | Use cases | - |
187
+
188
+ ## Caveats and Recommendations
189
+
190
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. This model has not been assessed for harm or biases, and should not be used for sensitive applications where it may cause harm.