metadata
library_name: dualtowervlm
license: mit
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- dual-tower
- research
DualTowerVLM is a dual-tower Vision-Language Model (VLM) architecture that processes images and text through separate towers before combining their representations.
For more information, check out the repository.
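The dual-tower idea described above can be illustrated with a toy sketch. This is not the DualTowerVLM implementation: the class `ToyDualTower`, its parameters, and the fusion step (plain sequence concatenation) are hypothetical choices made for illustration, since this card does not specify how the two representations are combined. It uses only the Python standard library.

```python
import random


def make_matrix(rows, cols, rng):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def apply_linear(weights, vec):
    """Multiply a (out_dim x in_dim) weight matrix by an in_dim vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


class ToyDualTower:
    """Hypothetical toy model: two independent towers, then a fusion step."""

    def __init__(self, patch_dim=4, vocab_size=10, hidden_dim=3, seed=0):
        rng = random.Random(seed)
        # Vision tower: a single linear projection of each image patch.
        self.vision_proj = make_matrix(hidden_dim, patch_dim, rng)
        # Text tower: an embedding table indexed by token ID.
        self.token_emb = make_matrix(vocab_size, hidden_dim, rng)

    def encode_image(self, patches):
        return [apply_linear(self.vision_proj, p) for p in patches]

    def encode_text(self, token_ids):
        return [self.token_emb[t] for t in token_ids]

    def forward(self, patches, token_ids):
        # Each modality is encoded by its own tower; the outputs are then
        # combined. Concatenation along the sequence axis is an assumption.
        return self.encode_image(patches) + self.encode_text(token_ids)


model = ToyDualTower()
patches = [[0.5, -0.2, 0.1, 0.9], [0.0, 0.3, -0.7, 0.4]]  # 2 patches of width 4
token_ids = [1, 4, 7]                                      # 3 text tokens
fused = model.forward(patches, token_ids)
print(len(fused), len(fused[0]))  # 5 positions, each of width 3
```

The point of the sketch is only the data flow: neither tower sees the other modality until the final combination step.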
Usage:
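# Assumes the repository root is on the Python path so that `models` is importable.
from models.dual_tower.dual_tower import DualTowerVLM
from models.config import VLMConfig

# Default configuration (not passed to the from_pretrained call below).
cfg = VLMConfig()
# Load the pretrained checkpoint by its model ID.
model = DualTowerVLM.from_pretrained("patrickamadeus/dt-cococaps")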