Text Generation
Transformers
PyTorch
English
llava

Model Summary

We propose Lenna, a language-enhanced reasoning detection assistant that uses the robust multimodal feature representations of multimodal large language models (MLLMs) while preserving location information for detection. This is achieved by adding a token to the MLLM vocabulary that carries no explicit semantic content but serves as a prompt for the detector to identify the corresponding position. To evaluate Lenna's reasoning capability, we construct the ReasonDet dataset, which measures performance on reasoning-based detection.
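The idea of an added vocabulary token that marks where the detector should look can be illustrated with a toy, dependency-free sketch. Everything here is an assumption made for illustration: the token name `<DET>`, the tiny vocabulary, the random vectors standing in for MLLM hidden states, and the choice to hand the hidden state at the token's position to the detector. It is not Lenna's actual implementation.

```python
import random

# Toy vocabulary standing in for the MLLM tokenizer (hypothetical contents).
vocab = {"<pad>": 0, "the": 1, "dog": 2, "is": 3, "here": 4}

# Hypothetical name for the added token; the model card does not name it.
DET_TOKEN = "<DET>"


def add_token(vocab, token):
    """Append a token to the vocabulary (if missing) and return its id."""
    if token not in vocab:
        vocab[token] = len(vocab)
    return vocab[token]


def encode(words, vocab):
    """Map a list of already-split words to their vocabulary ids."""
    return [vocab[w] for w in words]


def det_positions(token_ids, det_id):
    """Return the sequence positions at which the added token occurs."""
    return [i for i, t in enumerate(token_ids) if t == det_id]


det_id = add_token(vocab, DET_TOKEN)
token_ids = encode(["the", "dog", "is", "here", DET_TOKEN], vocab)

# Stand-in for last-layer hidden states: one random vector per token.
rng = random.Random(0)
hidden_dim = 8
hidden_states = [
    [rng.gauss(0.0, 1.0) for _ in range(hidden_dim)] for _ in token_ids
]

# Assumption: the vector at each added-token position is what gets passed
# to the detector as its prompt.
positions = det_positions(token_ids, det_id)
det_prompts = [hidden_states[i] for i in positions]
```

In this sketch the added token has no entry in the original vocabulary and therefore no prior meaning; its only role is to mark the position whose representation is forwarded to the detector.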

Model Sources

How to Get Started with the Model

Model weights can be loaded with Hugging Face Transformers. Examples can be found on GitHub.
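As a starting point, the checkpoint files can be fetched from the Hub before following the GitHub examples. The sketch below only downloads the repository; it assumes the repo id `mtgv/Lenna-7B` and that the `huggingface_hub` package is installed. `download_lenna` is a hypothetical helper name, and the sketch does not instantiate the model, since the model card does not say which model class to use.

```python
def download_lenna(repo_id="mtgv/Lenna-7B", local_dir=None):
    """Download all files of the model repository and return the local path.

    Hypothetical helper: the repo id is an assumption, and network access
    plus the huggingface_hub package are required at call time.
    """
    # Imported lazily so that defining the helper needs no extra packages.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir)
```

Calling `download_lenna()` returns the directory holding the downloaded files, which can then be passed to the loading code from the project's examples.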

Datasets used to train mtgv/Lenna-7B