metadata

library_name: transformers
datasets:
  - ucsahin/Turkish-VLM-Mix-Benchmark
language:
  - tr
pipeline_tag: image-text-to-text
license: apache-2.0

English

🎉 Introducing TraVisionLM: The First of Its Kind! 🚀

🌟 This is the very first fast and small (only 875M parameters) visual language model on Hugging Face that responds to Turkish instructions given an image input! 🌟

✨ Built to be compatible with the Transformers library, TraVisionLM is a breeze to load, fine-tune, and use for lightning-fast inference, all without needing any external libraries! ⚡️

Ready to experience the Turkish visual language model? Let's go! 🇹🇷🖼️🤖

Türkçe

🎉 TraVisionLM: Türünün İlk Örneği! 🚀

🌟 Türkçe görsel dil modelinin ilk hızlı ve küçük (sadece 875M parametre) versiyonu! Bir görüntü ve Türkçe talimat verildiğinde Türkçe yanıt üretir! 🌟

✨ Transformers kütüphanesi ile uyumlu olarak geliştirilen TraVisionLM modeli ile, yükleme, eğitme ve dış kütüphaneler kullanmadan hızlı sonuçlar almak çok kolay! ⚡️

Türkçe görsel dil modelini deneyimlemeye hazır mısınız? Hadi başlayalım! 🇹🇷🖼️🤖

Model Details

English

This model is a multimodal large language model that combines SigLIP as its vision encoder with GPT2-large as its language model. A vision projector connects the two modalities. Its architecture closely resembles PaliGemma, with some adjustments to the vision projector and the causal language modeling.
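To make the visual token budget concrete, the sketch below counts the patch tokens implied by the encoder name used in this card (patch size 16, input resolution 256). It assumes the projector emits one token per image patch, which is not stated above, and the function name is hypothetical.

```python
def num_image_tokens(image_size: int = 256, patch_size: int = 16) -> int:
    """Number of patch tokens a ViT-style encoder produces for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    return patches_per_side * patches_per_side


# siglip-base-patch16-256: a 16 x 16 grid of patches
image_tokens = num_image_tokens(256, 16)
print(image_tokens)  # 256
```

Under that assumption, every image contributes the same fixed number of tokens to the sequence the language model sees, regardless of the prompt.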

Here's the summary of the development process:

Unimodal pretraining
- In this stage, instead of pretraining both modalities from scratch, I leverage the image encoder from google/siglip-base-patch16-256-multilingual and the language model from ytu-ce-cosmos/turkish-gpt2-large.
Feature Alignment
- Following the LLaVA training recipe, I train only the vision projector using 500K image-text pairs to align visual and textual features.
Task Specific Training
- The aligned model undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering, using over 1M image-prompt-completion triplets.
Finetuning on Downstream Tasks
- Finally, the model is fine-tuned for object detection to demonstrate its versatility in various downstream tasks. Explore the fine-tuned model for object detection at ucsahin/TraVisionLM-Object-Detection-ft for more details.
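The feature alignment stage above can be pictured as a filter over parameter names: only the projector weights stay trainable while both backbones are frozen. This is a minimal sketch, and the module prefixes (vision_model, vision_projector, language_model) are hypothetical names chosen for illustration, not taken from the actual modeling code.

```python
from typing import Dict, Iterable


def trainable_mask(param_names: Iterable[str], trainable_prefixes: Iterable[str]) -> Dict[str, bool]:
    """Map each parameter name to True if it should receive gradient updates."""
    prefixes = tuple(trainable_prefixes)
    return {name: name.startswith(prefixes) for name in param_names}


# Hypothetical parameter names, for illustration only.
names = [
    "vision_model.encoder.layers.0.weight",
    "vision_projector.linear.weight",
    "language_model.h.0.attn.weight",
]

# Feature alignment: train the projector only, keep both backbones frozen.
mask = trainable_mask(names, ["vision_projector"])
for name, trainable in mask.items():
    print(f"{name}: {'train' if trainable else 'frozen'}")
```

With a PyTorch model, the same mask would typically be applied by setting requires_grad on each entry of model.named_parameters().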

Türkçe

Bu model, SigLIP görsel kodlayıcısını ve GPT2-large dil modelini birleştiren çok modlu büyük bir dil modelidir. Görsel projektör, iki modaliteyi bir araya getirir. Mimarisi, PaliGemma ile yakından benzerlik gösterir, ancak görsel projektör ve neden-sonuç dil modellemesinde bazı uyarlamalar yapılmıştır.

Geliştirme sürecinin özeti:

Tek Modalite Ön Eğitimi
- Bu aşamada, her iki modaliteyi sıfırdan eğitmek yerine, google/siglip-base-patch16-256-multilingual modelinin görsel kodlayıcısını ve ytu-ce-cosmos/turkish-gpt2-large modelinin dil kodlayıcısını kullanıyorum.
Özellik Uyarlama
- LLaVA eğitim tarifesi izlenerek, sadece görsel projektörü 500K görüntü-metin çiftleri ile eğiterek görsel ve metin özelliklerini uyumlu hale getiriyorum.
Görev Spesifik Eğitim
- Bu adımda, uyumlulaştırılmış model, kısa açıklama, detaylı açıklama ve basit görsel soru cevaplama gibi görevler için daha fazla eğitilmiştir; 1M'den fazla resim-istek-tamamlanma üçlüsünden oluşan veri seti kullanılmıştır.
İndirgeme Görevlerinde İnce Ayar
- Son olarak, modelin çeşitli görevlerdeki çok yönlülüğünü göstermek amacıyla nesne tespiti için ince ayarı yapılmıştır. Nesne tespiti için ince ayar yapılmış modele detaylar için ucsahin/TraVisionLM-Object-Detection-ft adresinden ulaşabilirsiniz.

Model Description

Developed by: ucsahin
Model type: Image-Text-to-Text
Language(s) (NLP): Turkish
License: Apache license 2.0

Model Sources

Repository: https://huggingface.co/ucsahin/TraVisionLM-base
Paper: More info on this later.

Friendly Reminder:

First of all, thanks for your interest if you plan to use this model. I developed this model to primarily show that you can build

Kullanıcılar için Önemli Bir Hatırlatma:

Uses

Below are the scenarios where the TraVisionLM visual language model can be used directly or indirectly for various tasks. Also, don't forget to check out the section on out-of-scope uses.

Direct Use

Short Captioning

You can give the model task instructions like "Açıkla", "Kısaca açıkla", "Görseli özetle", "Çok kısa özetle", etc. for this task. The model will generate a short description of the image you provide. The usage code with the Transformers library is shared in the "How to Get Started with the Model" section.

Important reminder: The model tends to hallucinate less for this task. You can try adjusting the generation parameters to produce the most useful answer for your needs.

Detailed Captioning

You can give the model task instructions like "Detaylı açıkla", "Çok detaylı açıkla", "Görseli detaylı anlat", "Görseli çok detaylı anlat" etc., for this task. The model will generate a very detailed description of the image you provide.

Important reminder: The model tends to hallucinate more for this task. Although it generally produces responses related to the image, it may provide details and information that are not present in the image. You can try adjusting the generation parameters to produce the most useful answer for your needs.

Visual Question Answering

You can ask the model open-ended questions like "Resmin odağında ne var?", "Görselde adam ne yapıyor?", "Kaç zürafa var?", "Görselle ilgili ne söylenir?", "Görseldeki *obje* ne renk?" etc., for this task. The model will generate responses that complement your question.

Important reminder: The model tends to hallucinate more for this task. Although it generally produces responses related to the image and the question, it may provide details and information that are not present in the image. You can try adjusting the generation parameters to produce the most useful answer for your needs.
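The generation parameters mentioned in the reminders above are ordinary keyword arguments to generate. The presets below are an illustrative sketch, not tuned values from the author: the sampling preset copies the values used in this card's usage code, while the greedy preset, its max_new_tokens value, and the preset names are assumptions.

```python
# Keyword arguments intended for model.generate(**inputs, **preset).
GENERATION_PRESETS = {
    # Deterministic decoding; one option to try if sampled answers drift from the image.
    "greedy": {
        "max_new_tokens": 128,
        "do_sample": False,
        "repetition_penalty": 1.2,
    },
    # Values taken from the usage example in this card.
    "sampling": {
        "max_new_tokens": 512,
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0.9,
        "top_k": 50,
        "repetition_penalty": 1.2,
    },
}


def get_preset(name: str) -> dict:
    """Return a copy so callers can tweak values without mutating the presets."""
    return dict(GENERATION_PRESETS[name])


preset = get_preset("sampling")
preset["temperature"] = 0.3  # lower temperature makes sampling more conservative
print(preset)
```

Whether a given preset reduces hallucination for a particular image is something to check empirically.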

Downstream Use

(Video-Text-to-Text) The model can be adapted for a question-answering task related to your videos. By sampling video frames and generating answers for each frame, the model can be used without any changes to the architecture.
(Image/Text Retrieval conditioned on Text/Image) For the task of most relevant image retrieval conditioned on text or vice versa, the model can be used directly without any modifications.
(Fine-tuning) For all other tasks that support the model's architecture, such as visual classification, the model can be fine-tuned using the Transformers library. For an example, check out ucsahin/TraVisionLM-Object-Detection-ft.
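For the video use case listed above, frame sampling could look like the sketch below, which picks evenly spaced frame indices. The function name and the choice of uniform sampling are assumptions; each selected frame would then be passed to the model as a separate image together with the same question.

```python
from typing import List


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick up to num_samples evenly spaced frame indices from a video."""
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    step = total_frames / num_samples
    # Take the middle frame of each equal-length segment.
    return [int(step * i + step / 2) for i in range(num_samples)]


print(sample_frame_indices(total_frames=300, num_samples=4))  # [37, 112, 187, 262]
```

Because each frame is answered independently, the per-frame answers still need to be aggregated (for example by majority vote), which this sketch does not cover.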

As time permits, I plan to share more applications for these indirect uses. Meanwhile, I eagerly await support or collaboration requests from the community 🤝💪

Out-of-Scope Use

This model is not suitable for the following scenarios:

Although the model can answer simple questions related to your images, it is not suitable for multi-turn complex chat scenarios. Past information is not retained; the model does not use previously asked questions as context. However, you can easily train the model for this task by preparing a chat template accordingly.
The model does not accept multiple image inputs. For instance, it is not suitable for answering questions that compare two different images. Modifications to the architecture would be necessary to add this feature. For such a model, you can check HuggingFaceM4/idefics2-8b (English only).
The model has not been trained for tasks such as character and text recognition (OCR), segmentation, and multi-object detection. To achieve acceptable performance in these tasks, visual language models like google/paligemma-3b-pt-224 and microsoft/Florence-2-large have been trained on billions of documents and images.

Türkçe: Kullanım Alanları

Aşağıda TraVisionLM görsel dil modelinin, hangi görevler için doğrudan ve dolaylı kullanılabileceği durumlar verilmiştir. Ayrıca alan dışı kullanımlar kısmına da göz atmayı unutmayın.

Doğrudan Kullanım Alanları

Kısa Açıklama

Bu görev için modele "Açıkla", "Kısaca açıkla", "Görseli özetle", "Çok kısa özetle" ve benzeri görev talimatları verebilirsiniz. Model verdiğiniz resmin kısa bir açıklamasını yapacaktır. Aşağıda modelin Transformer kütüphanesiyle kullanım kodları paylaşılmıştır.

Önemli hatırlatma: Model bu görev için daha az halüsinasyon görmektedir. Kullanırken üretim parametrelerini değiştirerek işinize en çok yarayacak cevabı ürettirmeyi deneyebilirsiniz.

Detaylı Açıklama

Bu görev için modele "Detaylı açıkla", "Çok detaylı açıkla", "Görseli detaylı anlat", "Görseli çok detaylı anlat" ve benzeri görev talimatları verebilirsiniz. Model verdiğiniz resmin çok detaylı bir açıklamasını yapacaktır. Aşağıda modelin Transformer kütüphanesiyle kullanım kodları paylaşılmıştır.

Önemli hatırlatma: Model bu görev için genellikle fazla halüsinasyon görmektedir. Genel olarak resimle alakalı cevaplar üretse de, resimde olmayan detaylar ve bilgiler verebilmektedir. Kullanırken üretim parametrelerini değiştirerek işinize en çok yarayacak cevabı ürettirmeyi deneyebilirsiniz.

Görsel Soru Cevaplama

Bu görev için modele "Resmin odağında ne var?", "Görselde adam ne yapıyor?", "Kaç zürafa var?", "Görselle ilgili ne söylenir?", "Görseldeki *obje* ne renk?" ve benzeri ucu açık sorular sorabilirsiniz. Model sorunuzu tamamlayacak cevaplar üretecektir. Aşağıda modelin Transformer kütüphanesiyle kullanım kodları paylaşılmıştır.

Önemli hatırlatma: Model bu görev için genellikle fazla halüsinasyon görebilmektedir. Genel olarak resimle ve sorulan soruyla alakalı cevaplar üretse de, resimde olmayan detaylar ve bilgiler verebilmektedir. Kullanırken üretim parametrelerini değiştirerek işinize en çok yarayacak cevabı ürettirmeyi deneyebilirsiniz.

Dolaylı Kullanım Alanları

(Video-Text-to-Text) Model videolarınızla ilgili soru cevap görevi için adapte edilebilir. Mimariye hiçbir değişiklik yapmadan, video kareleri örneklenerek, her bir kare üzerinden modele cevap ürettirilebilir.
(Retrieval) Metne dayalı en uygun görüntü alma görevi için model, herhangi bir değişiklik yapılmadan doğrudan kullanılabilir.
(Finetuning) Model mimarisini destekleyen görsel sınıflandırma gibi geri kalan bütün görevler için model Transformers kütüphanesiyle uyumlu bir şekilde eğitilebilir. Bir örnek için ucsahin/TraVisionLM-Object-Detection-ft adresine bakabilirsiniz.

Zaman buldukça bu dolaylı kullanım uygulamaları ile paylaşımlar yapmayı planlıyorum. Bu sürede topluluktan da destek ya da işbirliği isteklerini dört gözle bekliyorum 🤝💪

Alan-dışı Kullanımlar

Bu modelin aşağıdaki senaryolar için kullanımı uygun değildir:

Model, resimlerinizle ilgili basit sorulara cevap verse de, çok turlu kompleks chat senaryoları için uygun değildir. Geçmiş bilgisi tutulmamaktadır, model daha önce sorduğunuz soruları kontekst olarak kullanmamaktadır. Fakat bu görev için, bir chat şablonu hazırlayıp bu doğrultuda modeli kolayca eğitebilirsiniz.
Model çoklu görsel girdi kabul etmemektedir. Örneğin, iki farklı resmi karşılaştıran sorulara cevap vermeye uygun değildir. Bu özelliği kazandırmak için mimariye değişiklikler yapmak gerekmektedir. Bu tarz bir model için HuggingFaceM4/idefics2-8b (sadece ingilizce) modeline bakabilirsiniz.
Model, karakter ve yazı tanıma (OCR), segmentasyon ve çoklu obje tespit etme görevleri için eğitilmemiştir. Bu görevlerde kabul edilebilir başarılar alabilmek için google/paligemma-3b-pt-224 ve microsoft/Florence-2-large gibi görsel dil modelleri milyarlarca doküman ve resimle eğitilmiştir.

How to Get Started with the Model

With Transformers, you can load the model and run inference as follows:

IMPORTANT NOTE: The TraVisionLM model is not yet integrated into the Transformers library, so you need to set trust_remote_code=True when loading the model. This downloads the configuration_travisionlm.py, modeling_travisionlm.py and processing_travisionlm.py files from the repo. You can inspect these files under the Files and Versions tab and pin a specific version if you have any concerns about malicious code.
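Pinning can be done with the revision argument of from_pretrained, which accepts a branch name, tag, or commit hash. This is a minimal sketch: the helper name is hypothetical, and the commit hash is a placeholder that you would copy from the Files and Versions tab.

```python
def pinned_load_kwargs(revision: str) -> dict:
    """Build from_pretrained keyword arguments that pin the repo to one revision."""
    if not revision:
        raise ValueError("revision must be a non-empty branch, tag, or commit hash")
    return {"trust_remote_code": True, "revision": revision}


# Placeholder value: replace with a real commit hash from the repository.
kwargs = pinned_load_kwargs("<commit-hash>")
print(kwargs)

# Usage (requires transformers and network access):
# from transformers import AutoModelForCausalLM, AutoProcessor
# model = AutoModelForCausalLM.from_pretrained("ucsahin/TraVisionLM-base", **kwargs)
# processor = AutoProcessor.from_pretrained("ucsahin/TraVisionLM-base", **kwargs)
```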

from transformers import AutoModelForCausalLM, AutoProcessor
import torch
import requests 
from PIL import Image

model = AutoModelForCausalLM.from_pretrained('ucsahin/TraVisionLM-base', trust_remote_code=True, device_map="cuda")
# you can also load the model in bfloat16 or float16
# model = AutoModelForCausalLM.from_pretrained('ucsahin/TraVisionLM-base', trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda")
processor = AutoProcessor.from_pretrained('ucsahin/TraVisionLM-base', trust_remote_code=True)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

prompt = "Açıkla"  # short caption
# prompt = "Detaylı açıkla"  # detailed caption
# prompt = "Araba ne renktir?" # visual qa
# prompt = "Resmin odak noktası nedir?" # visual qa
# prompt = "Araba nerede duruyor?" # visual qa

inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.9, top_k=50, repetition_penalty=1.2)

output_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print("Model response: ", output_text)
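The decoded text contains the prompt followed by the answer, as the sample output of the batch example in this section shows. If only the answer is needed, the prompt can be stripped afterwards. This is a small sketch with a hypothetical helper, assuming the decoded text begins with the exact prompt string.

```python
def strip_prompt(decoded: str, prompt: str) -> str:
    """Remove the echoed prompt from the start of a decoded generation, if present."""
    if decoded.startswith(prompt):
        decoded = decoded[len(prompt):]
    return decoded.strip()


decoded = "Araba nerede duruyor?\nAraba, sarı bir binanın yanında sokakta park edilmiş."
print(strip_prompt(decoded, "Araba nerede duruyor?"))
# Araba, sarı bir binanın yanında sokakta park edilmiş.
```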

You can also perform batch inference very easily as follows:

from transformers import AutoModelForCausalLM, AutoProcessor
import torch
import requests 
from PIL import Image

model = AutoModelForCausalLM.from_pretrained('ucsahin/TraVisionLM-base', trust_remote_code=True, device_map="cuda")
# you can also load the model in bfloat16 or float16
# model = AutoModelForCausalLM.from_pretrained('ucsahin/TraVisionLM-base', trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda")
processor = AutoProcessor.from_pretrained('ucsahin/TraVisionLM-base', trust_remote_code=True)

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")

prompt_list = [
  'Açıkla',
  'Detaylı açıkla',
  'Araba nerede duruyor?',
  'Arabanın rengi nedir?',
]

inputs = processor(text=prompt_list, images=len(prompt_list)*[image], padding="longest", return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.9, top_k=50, repetition_penalty=1.2)

output_text_list = processor.batch_decode(outputs, skip_special_tokens=True)

for output_text in output_text_list:
  print(f"Model response: {output_text}\n\n\n")

"""
Model response: Açıkla
Bir binanın önünde, sokakta park halindeki mavi bir Volkswagen Beetle.



Model response: Detaylı açıkla
Bu görüntüde, bir taş döşeli sokakta park edilmiş yeşil ve mavi bir Volkswagen Beetle bulunmaktadır. Arka planda iki sarı bina vardır. Araba kameraya doğru bakmaktadır. Görüntü net odaklanmıştır ve renkler canlıdır. Görsel tarzı gerçekçidir.



Model response: Araba nerede duruyor?
Araba, sarı bir binanın yanında sokakta park edilmiş.



Model response: Arabanın rengi nedir?
Araba turkuaz veya limon yeşili renktedir.
"""

Training Details

Training Data

I plan to release the multimodal Turkish data used to train the model, but it is currently in a very messy format. Until then, to give a sense of the dataset and to contribute to the open-source community, I am releasing the evaluation portion of the dataset at ucsahin/Turkish-VLM-Mix-Benchmark.

The dataset consists predominantly of Turkish translations of well-known English multimodal datasets. More information on this will be shared in the future.

Training Procedure

The following training hyperparameters are used in the feature alignment and task-specific training stages, respectively:

Feature Alignment

Data size	Global Batch Size	Learning Rate	Epochs	Max Length	Weight Decay
500K	128	1e-3	1	1024	0

Task Specific Training

Data size	Global Batch Size	Learning Rate	Epochs	Max Length	Weight Decay
1.1M	128	2e-5	3	1024	0
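The tables above imply an approximate number of optimizer steps per stage. This is a back-of-the-envelope sketch, assuming the data sizes are exactly 500,000 and 1,100,000 samples (the card gives rounded figures) and that a partial final batch counts as one step.

```python
import math


def optimizer_steps(num_samples: int, global_batch_size: int, epochs: int) -> int:
    """Total optimizer steps when each step consumes one global batch."""
    steps_per_epoch = math.ceil(num_samples / global_batch_size)
    return steps_per_epoch * epochs


print(optimizer_steps(500_000, 128, 1))    # 3907  (feature alignment)
print(optimizer_steps(1_100_000, 128, 3))  # 25782 (task specific training)
```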

Evaluation

This section will be updated after I get some evaluation results on the ucsahin/Turkish-VLM-Mix-Benchmark.

Testing Data, Factors & Metrics

Testing Data

During training, I used ucsahin/Turkish-VLM-Mix-Benchmark as the evaluation split.

Compute Infrastructure

The following compute resources are used in the feature alignment and task-specific training stages, respectively:

Feature Alignment

1xA100(40GB), took approximately 4 GPU hours.

Task Specific Training

1xH100(80GB), took approximately 18 GPU hours.
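Combined with the data sizes and epoch counts from the training procedure, these GPU hours give a rough throughput estimate. This sketch assumes exact data sizes of 500,000 and 1,100,000 samples and, since each stage ran on a single GPU, treats GPU hours as wall-clock hours; the figures are approximations, not measured values.

```python
def samples_per_second(num_samples: int, epochs: int, gpu_hours: float) -> float:
    """Average number of training samples processed per second."""
    return num_samples * epochs / (gpu_hours * 3600)


print(round(samples_per_second(500_000, 1, 4), 1))     # 34.7 (feature alignment, A100)
print(round(samples_per_second(1_100_000, 3, 18), 1))  # 50.9 (task specific training, H100)
```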

Citation

I am releasing TraVisionLM under the Apache 2.0 License. To the best of my knowledge after thorough research, this should comply with the licenses of the datasets and the unimodal vision and language models used during development.

However, if I receive any feedback indicating otherwise, I will promptly update the licensing information as needed.

If you use the TraVisionLM model in your research, work, or personal projects, please acknowledge this repository. 🙏

Finally, I reserve the right to publish this work in an academic setting if it reaches a mature state. In that case, I will provide the appropriate citations here so that any future work can appropriately cite it.

Model Card Contact

If you have questions or suggestions regarding the model, I would prefer that you reach me directly via Hugging Face (e.g., by opening an issue). If you have something specific in mind or ideas for collaboration on future projects, reach me at sahin.umitcan@gmail.com.

Modelle ilgili sorularınız veya önerileriniz varsa, doğrudan bana Hugging Face üzerinden (örneğin, bir issue açarak) ulaşmanızı tercih ederim. Diğer konular veya gelecekteki projelerde işbirliği için herhangi bir fikriniz varsa, bana sahin.umitcan@gmail.com adresinden ulaşabilirsiniz.