We released a resource that might come in handy: The Cauldron.
The Cauldron is a massive, manually curated collection of 50 vision-language datasets for instruction fine-tuning: 3.6M images and 30.3M query/answer pairs.
It covers a wide variety of downstream uses: visual question answering on natural images, OCR, document/chart/figure/table understanding, textbook/academic questions, reasoning, captioning, spotting differences between two images, and screenshot-to-code. Dataset: HuggingFaceM4/the_cauldron
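As a sketch of how query/answer pairs from such a collection might be flattened into training text for instruction fine-tuning, here is a minimal example. The field names ("texts", "user", "assistant") and the helper function are illustrative assumptions, not the dataset's documented schema; in practice one would load the records via the Hugging Face `datasets` library using the HuggingFaceM4/the_cauldron identifier.

```python
# Illustrative sketch: flattening a Cauldron-style record into one
# chat-format training string. The record layout below is an assumption
# for illustration, not the dataset's guaranteed schema.

def flatten_record(record):
    """Join a list of query/answer turns into a single training string."""
    parts = []
    for turn in record["texts"]:
        parts.append(f"User: {turn['user']}")
        parts.append(f"Assistant: {turn['assistant']}")
    return "\n".join(parts)

# Hypothetical example record with one Q/A turn about an image.
example = {
    "texts": [
        {"user": "What is shown in the image?",
         "assistant": "A bar chart comparing model accuracy."}
    ]
}

print(flatten_record(example))
```

Each image would be paired with the resulting string by the training pipeline; multi-turn records simply produce longer strings.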
- Strong 8B-parameter model: often on par with open 30B counterparts.
- Open license: Apache 2.0.
- Strong improvement over Idefics1: +12 points on VQAv2 and +30 points on TextVQA, with 10x fewer parameters.
- Better data: boosting OCR capabilities with 6TB of documents to transcribe, and improving QA capabilities on charts/figures/diagrams.
- Transparent training data: inspect and build upon all the data (tens of TB) we trained on.
- More natural image processing: incorporating strategies to treat images at their native resolution and native aspect ratio.
- High-resolution images: image resolutions up to 980 x 980, with strategies that allow trading computational efficiency for performance.
- 2 checkpoints: releasing both the base checkpoint and the instruction fine-tuned checkpoint. A chat version is to come.