Text model not being loaded with Flash Attention 2 (#27)
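
For context on the title, here is a minimal, hedged sketch of how Flash Attention 2 is usually requested when loading Idefics2 with `transformers`. The `attn_implementation` argument and the config attributes checked below are standard `transformers` behavior, not something defined by this PR, and the snippet is a hypothetical repro, not code from the repository:

```python
# Hypothetical repro sketch for the issue title: request Flash Attention 2 and
# check whether the text backbone actually picked it up. Assumes a CUDA GPU,
# `flash-attn` installed, and a recent `transformers` release.
import torch
from transformers import AutoModelForVision2Seq

checkpoint = "HuggingFaceM4/idefics2-8b"

model = AutoModelForVision2Seq.from_pretrained(
    checkpoint,
    torch_dtype=torch.float16,                # FA2 requires fp16/bf16 weights
    attn_implementation="flash_attention_2",  # requested for the whole model
).to("cuda")

# If the text model silently falls back to the default attention, these two
# values will not both read "flash_attention_2".
print(getattr(model.config, "_attn_implementation", None))
print(getattr(model.config.text_config, "_attn_implementation", None))
```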
Files changed (2)
  1. README.md +3 -73
  2. processor_config.json +0 -1
README.md CHANGED
@@ -18,8 +18,6 @@ datasets:
  - camel-ai/math
  - AtlasUnified/atlas-math-sets
  - tiedong/goat
- - Lin-Chen/ShareGPT4V
- - jxu124/llava_conversation_58k
  language:
  - en
  tags:
@@ -41,7 +39,7 @@ Idefics2 is an open multimodal model that accepts arbitrary sequences of image a
  We release under the Apache 2.0 license 2 checkpoints:
  - [idefics2-8b-base](https://huggingface.co/HuggingFaceM4/idefics2-8b-base): the base model
  - [idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b): the base model fine-tuned on a mixture of supervised and instruction datasets (text-only and multimodal datasets)
- - [idefics2-8b-chatty](https://huggingface.co/HuggingFaceM4/idefics2-8b-chatty): `idefics2-8b` further fine-tuned on long conservation
+ - idefics2-8b-chatty (coming soon): `idefics2-8b` further fine-tuned on long conservation

  # Model Summary

@@ -53,8 +51,7 @@ We release under the Apache 2.0 license 2 checkpoints:
  - **Resources for more information:**
    - Description of [OBELICS](https://huggingface.co/datasets/HuggingFaceM4/OBELICS): [OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
  ](https://huggingface.co/papers/2306.16527)
-   - Paper: [What matters when building vision-language models?
- ](https://huggingface.co/papers/2405.02246)
+   - Paper: Coming soon


  # Uses
@@ -218,39 +215,6 @@ print(generated_texts)

  </details>

- **Text generation inference**
-
- Idefics2 is integrated into [TGI](https://github.com/huggingface/text-generation-inference) and we host API endpoints for both `idefics2-8b` and `idefics2-8b-chatty`.
-
- Multiple images can be passed on with the markdown syntax (`![](IMAGE_URL)`) and no spaces are required before and after. The dialogue utterances can be separated with `<end_of_utterance>\n` followed by `User:` or `Assistant:`. `User:` is followed by a space if the following characters are real text (no space if followed by an image).
-
- <details><summary>Click to expand.</summary>
-
- ```python
- from text_generation import Client
-
- API_TOKEN="<YOUR_API_TOKEN>"
- API_URL = "https://api-inference.huggingface.co/models/HuggingFaceM4/idefics2-8b-chatty"
-
- # System prompt used in the playground for `idefics2-8b-chatty`
- SYSTEM_PROMPT = "System: The following is a conversation between Idefics2, a highly knowledgeable and intelligent visual AI assistant created by Hugging Face, referred to as Assistant, and a human user called User. In the following interactions, User and Assistant will converse in natural language, and Assistant will do its best to answer User’s questions. Assistant has the ability to perceive images and reason about them, but it cannot generate images. Assistant was built to be respectful, polite and inclusive. It knows a lot, and always tells the truth. When prompted with an image, it does not make up facts.<end_of_utterance>\nAssistant: Hello, I'm Idefics2, Huggingface's latest multimodal assistant. How can I help you?<end_of_utterance>\n"
- QUERY = "User:![](https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg)Describe this image.<end_of_utterance>\nAssistant:"
-
- client = Client(
-     base_url=API_URL,
-     headers={"x-use-cache": "0", "Authorization": f"Bearer {API_TOKEN}"},
- )
- generation_args = {
-     "max_new_tokens": 512,
-     "repetition_penalty": 1.1,
-     "do_sample": False,
- }
- generated_text = client.generate(prompt=SYSTEM_PROMPT + QUERY, **generation_args)
- generated_text
- ```
-
- </details>
-
  # Model optimizations

  If your GPU allows, we first recommend loading (and running inference) in half precision (`torch.float16` or `torch.bfloat16`).
@@ -327,7 +291,7 @@ Fusing can be de-activated by removing `quantization_config` in the call to `fro
  It is also possible to load Idefics2 in 4bits with `bitsandbytes`. To do so, make sure that you have `accelerate` and `bitsandbytes` installed.

  ```diff
- + from transformers import BitsAndBytesConfig
+ + from transformer import BitsAndBytesConfig

  quantization_config = BitsAndBytesConfig(
      load_in_4bit=True,
@@ -419,27 +383,6 @@ Alongside this evaluation, we also computed the classification accuracy on FairF
  - Despite our efforts in filtering the training data, we found a small proportion of content that is not suitable for all audiences. This includes pornographic content and reports of violent shootings and is prevalent in the OBELICS portion of the data (see [here](https://huggingface.co/datasets/HuggingFaceM4/OBELICS#content-warnings) for more details). As such, the model is susceptible to generating text that resembles this content.
  - We note that we know relatively little about the composition of the pre-trained LM backbone, which makes it difficult to link inherited limitations or problematic behaviors to their data.

- **Red-teaming**
-
- In the context of a **[Red-Teaming](https://huggingface.co/blog/red-teaming)** exercise, our objective was to evaluate the propensity of the model to generate inaccurate, biased, or offensive responses. We evaluated [idefics2-8b-chatty](https://huggingface.co/HuggingFaceM4/idefics2-8b-chatty).
-
- While the model typically refrains from responding to offensive inputs, we observed that through repeated trials or guided interactions, it tends to hastily form judgments in situations necessitating nuanced contextual understanding, often perpetuating harmful stereotypes. Noteworthy instances include:
- - Speculating or passing judgments, or perpetuating historical disparities on individuals' professions, social status, or insurance eligibility based solely on visual cues (e.g., age, attire, gender, facial expressions).
- - Generating content that promotes online harassment or offensive memes reinforcing harmful associations from a portrait, or from a benign image.
- - Assuming emotional states or mental conditions based on outward appearances.
- - Evaluating individuals' attractiveness solely based on their visual appearance.
-
- Additionally, we identified behaviors that increase security risks that already exist:
- - Successfully solving CAPTCHAs featuring distorted text within images.
- - Developing phishing schemes from screenshots of legitimate websites to deceive users into divulging their credentials.
- - Crafting step-by-step guides on constructing small-scale explosives using readily available chemicals from common supermarkets or manipulating firearms to do maximum damage.
-
- It's important to note that these security concerns are currently limited by the model's occasional inability to accurately read text within images.
-
- We emphasize that the model would often encourage the user to exercise caution about the model's generation or flag how problematic the initial query can be in the first place. For instance, when insistently prompted to write a racist comment, the model would answer that query before pointing out "*This type of stereotyping and dehumanization has been used throughout history to justify discrimination and oppression against people of color. By making light of such a serious issue, this meme perpetuates harmful stereotypes and contributes to the ongoing struggle for racial equality and social justice.*".
-
- However, certain formulations can circumvent (i.e. "jail-break") these cautionary prompts, emphasizing the need for critical thinking and discretion when engaging with the model's outputs. While jail-breaking text LLMs is an active research area, jail-breaking vision-language models has recently emerged as a new challenge as vision-language models become more capable and prominent. The addition of the vision modality not only introduces new avenues for injecting malicious prompts but also raises questions about the interaction between vision and language vulnerabilities.
-

  # Misuse and Out-of-scope use

@@ -475,17 +418,4 @@ The model is built on top of two pre-trained models: [google/siglip-so400m-patch
      archivePrefix={arXiv},
      primaryClass={cs.IR}
  }
-
- @misc{laurençon2024matters,
-       title={What matters when building vision-language models?},
-       author={Hugo Laurençon and Léo Tronchon and Matthieu Cord and Victor Sanh},
-       year={2024},
-       eprint={2405.02246},
-       archivePrefix={arXiv},
-       primaryClass={cs.CV}
- }
  ```
-
- # Acknowledgements
-
- We thank @yjernite, @sasha, @meg, @giadap, @jack-kumar, and @frimelle, who provided help to red-team the model.
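
The last two README hunks above touch the model-optimization instructions (half-precision loading and the import line inside the 4-bit `bitsandbytes` snippet). For reference, a self-contained sketch of that 4-bit loading path, assuming the standard `transformers`/`bitsandbytes` APIs; the NF4 and compute-dtype settings are common choices, not values copied from the hunk:

```python
# Sketch of 4-bit loading as described in the README's "Model optimizations"
# section; requires `accelerate` and `bitsandbytes` to be installed.
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor, BitsAndBytesConfig

checkpoint = "HuggingFaceM4/idefics2-8b"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NF4 quantization of the frozen weights
    bnb_4bit_compute_dtype=torch.float16,  # compute in half precision
)

processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForVision2Seq.from_pretrained(
    checkpoint,
    quantization_config=quantization_config,
    device_map="auto",  # let accelerate place the quantized weights
)
```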
 
processor_config.json CHANGED
@@ -1,5 +1,4 @@
  {
- "chat_template": "{% for message in messages %}{{message['role'].capitalize()}}{% if message['content'][0]['type'] == 'image' %}{{':'}}{% else %}{{': '}}{% endif %}{% for line in message['content'] %}{% if line['type'] == 'text' %}{{line['text']}}{% elif line['type'] == 'image' %}{{ '<image>' }}{% endif %}{% endfor %}<end_of_utterance>\n{% endfor %}{% if add_generation_prompt %}{{ 'Assistant:' }}{% endif %}",
  "image_seq_len": 64,
  "processor_class": "Idefics2Processor"
  }
 
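
The key removed above is the Jinja chat template that `processor.apply_chat_template` renders into the `User:`/`Assistant:` prompt format. A hedged usage sketch of what consumes that template (standard `transformers` API; with the template dropped from `processor_config.json` and not supplied elsewhere, this call would have no template to render with):

```python
# Illustration of what the removed chat_template is used for. The message
# structure matches what the template iterates over: a role plus a list of
# image/text content parts.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ],
    }
]

# With the template present, this renders roughly to:
# "User:<image>What do we see in this image?<end_of_utterance>\nAssistant:"
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt)
```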