πŸ–ΌοΈπŸ“ OneEncoder: A Unified Text & Image Model

OneEncoder is a lightweight framework for cross-modal alignment, focusing on efficiently integrating text and images (with future extensions to other modalities). Unlike traditional methods relying on massive modality-specific encoders, OneEncoder progressively aligns different data types, making it cost-effective and performant even on small paired datasets.
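
To make the progressive-alignment idea concrete, below is a minimal PyTorch sketch of a small shared projection reused on top of frozen modality-specific encoders. All module names, dimensions, and the architecture itself are illustrative assumptions, not the actual OneEncoder implementation (see the GitHub repo under Resources for that).

```python
# Illustrative sketch only: a shared lightweight projection reused across
# frozen modality-specific encoders. Names and sizes are hypothetical and
# do not reflect the actual OneEncoder code.
import torch
import torch.nn as nn

class UniversalProjection(nn.Module):
    """Small shared module mapping any modality's features to a common space."""
    def __init__(self, in_dim: int = 768, out_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)

# Frozen text/image backbones produce features; only the lightweight shared
# part is trained, which is why adding a modality avoids full retraining.
up = UniversalProjection()
text_feats = torch.randn(4, 768)   # placeholder for frozen text-encoder output
image_feats = torch.randn(4, 768)  # placeholder for frozen image-encoder output
text_emb, image_emb = up(text_feats), up(image_feats)
```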

πŸš€ Key Features

βœ… Multimodal Alignment: Initially supports text & image, with extensions to other modalities (audio, video, X-ray; see Resources below).
βœ… Lightweight & Efficient: Avoids full retraining when adding new modalities.
βœ… Superior Performance: Outperforms methods that rely on large modality-specific datasets, despite being trained on small paired datasets.

🎯 Applications

  • Visual Question Answering (VQA)
  • Image-Text Retrieval
  • Multimodal Content Understanding

πŸ“„ Research Paper

πŸ“œ arXiv: OneEncoder: Progressive Cross-Modal Alignment

πŸ“Œ Resources

πŸ”— GitHub Repo: OneEncoder
πŸš€ Hugging Face Demo: OneEncoder Retriever
πŸ““ Demo Notebook: OneEncoder Demos
πŸ”Š OneEncoder for Text, Image & Audio: HF Model
πŸ”Š OneEncoder for Text, Image & Video: HF Model
πŸ”Š OneEncoder for Text, Image & X-ray: HF Model

πŸ“ Authors

πŸ“Œ Bilal FAYE, Hanane AZZAG, Mustapha LEBBAH, Djamel BOUCHAFFRA

Note: This model was trained with temperature=2.5 and addition as the fusion operation.
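
As an illustration of where these two settings typically enter such a model, the sketch below shows a symmetric contrastive loss whose similarity logits are scaled by a temperature, alongside element-wise addition as a fusion operation. This is a generic, assumed formulation, not the actual OneEncoder training code.

```python
# Illustrative sketch of temperature scaling and additive fusion in
# contrastive alignment; not the actual training code for this model.
import torch
import torch.nn.functional as F

def contrastive_loss(text_emb, image_emb, temperature: float = 2.5):
    """Symmetric cross-entropy over cosine-similarity logits scaled by temperature."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature
    targets = torch.arange(logits.size(0))
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def fuse(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Addition as the fusion operation: element-wise sum of two feature tensors."""
    return a + b

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```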
