license: llama3
datasets:
- google/wit
- coastalcph/multi_eurlex
language:
- it
base_model:
- meta-llama/Meta-Llama-3-8B
- openai/clip-vit-large-patch14-336
Model Card for LLaVA-NDiNO_pt
Model description
LLaVA-NDiNO is a family of Large Vision Language Models (LVLMs) trained for the Italian language.
LLaVA-NDiNO_pt is a pre-trained model that has been trained over three different types of image-text data:
- Wikipedia Image-Text Sections: Wikipedia image together with the text section in which the image appears
- Wikipedia Image-Text Captions: Wikipedia image together with its caption
- OCR PDF Documents: text in PDF documents extracted using Tesseract from MultiEurlex
If you are interested in more details regarding the training procedure, you can find the code we used at the following link:
Repository: https://github.com/swapUniba/LLaVA-NDiNO
Developed by: Elio Musacchio, Lucia Siciliani, Pierpaolo Basile, Giovanni Semeraro
Funded by: PNRR project FAIR - Future AI Research
Compute infrastructure: Leonardo supercomputer
Model type: LLaMA 3 + CLIP
Language(s) (NLP): Italian
License: Llama 3 Community License
Example usage
The model is not intended to be used without fine-tuning. It is recommended to further train it using the LLaVA-NeXT codebase.
Citation
@inproceedings{musacchioLLaVANDiNO,
title={LLaVA-NDiNO: Empowering LLMs with Multimodality for the Italian Language},
author={Musacchio, Elio and Siciliani, Lucia and Basile, Pierpaolo and Semeraro, Giovanni},
booktitle={Proceedings of the Eighth Workshop on Natural Language for Artificial Intelligence (NL4AI 2024) co-located with 23th International Conference of the Italian Association for Artificial Intelligence (AI*IA 2024)},
year={2024}
}