Leyo (Leo Tronchon)

upvoted an article 8 months ago

Article

Docmatix - a huge dataset for Document Visual Question Answering

Jul 18, 2024

• 72

upvoted an article 9 months ago

Article

Introducing Idefics2: A Powerful 8B Vision-Language Model for the community

Apr 15, 2024

• 174

upvoted an article 10 months ago

Article

Multimodal Augmentation for Documents: Recovering “Comprehension” in “Reading and Comprehension” task

May 16, 2024

• 17

upvoted a paper 10 months ago

What matters when building vision-language models?

Paper • 2405.02246 • Published May 3, 2024 • 102

upvoted a collection 11 months ago

Idefics2 🐶

Collection

Idefics2-8B is a foundation vision-language model. In this collection, you will find the models, datasets and demo related to its creation. • 11 items • Updated May 6, 2024 • 91

upvoted a paper 12 months ago

Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

Paper • 2403.09029 • Published Mar 14, 2024 • 55

upvoted 2 papers about 1 year ago

PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter

Paper • 2402.10896 • Published Feb 16, 2024 • 16

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Paper • 2312.14238 • Published Dec 21, 2023 • 20

upvoted 10 papers over 1 year ago

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

Paper • 2308.01907 • Published Aug 3, 2023 • 12

Retentive Network: A Successor to Transformer for Large Language Models

Paper • 2307.08621 • Published Jul 17, 2023 • 170

Generative Pretraining in Multimodality

Paper • 2307.05222 • Published Jul 11, 2023 • 22

OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

Paper • 2306.16527 • Published Jun 21, 2023 • 47

Leo Tronchon PRO

Leyo's activity

ConvNets Match Vision Transformers at Scale

FP8-LM: Training FP8 Large Language Models

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

Vision Transformers Need Registers

Small-scale proxies for large-scale Transformer training instabilities

Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
