ucsahin/TraVisionLM-base

@@ -97,24 +97,45 @@ First of all, thanks for your interest if you plan to use this model. I develope
 ## Uses
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
 ### Direct Use
 <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
 ### Downstream Use [optional]
 <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
 ### Out-of-Scope Use
 <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
 ## Türkçe: Kullanım Alanları
@@ -122,13 +143,25 @@ First of all, thanks for your interest if you plan to use this model. I develope
 Aşağıda TraVisionLM görsel dil modelinin, hangi görevler için doğrudan ve dolaylı kullanılabileceği durumlar verilmiştir. Ayrıca alan dışı kullanımlar kısmına da göz atmayı unutmayın.
 ### Doğrudan Kullanım Alanları
  - **Kısa Açıklama**
  - **Detaylı Açıklama**
  - **Görsel Soru Cevaplama**
 ### Dolaylı Kullanım Alanları
  - (*Video-Text-to-Text*) Model videolarınızla ilgili soru cevap görevi için adapte edilebilir. Mimariye hiçbir değişiklik yapmadan, video kareleri örneklenerek, her bir kare üzerinden modele cevap ürettirilebilir.
  - (*Retrieval*) Metne dayalı en uygun görüntü alma görevi için model, herhangi bir değişiklik yapılmadan doğrudan kullanılabilir.
@@ -140,103 +173,161 @@ Aşağıda TraVisionLM görsel dil modelinin, hangi görevler için doğrudan ve
 Bu modelin aşağıdaki senaryolar için kullanımı uygun değildir:
  - Model, resimlerinizle ilgili basit sorulara cevap verse de, çok turlu kompleks chat senaryoları için uygun değildir. Geçmiş bilgisi tutulmamaktadır, model daha önce sorduğunuz soruları kontekst olarak kullanmamaktadır. Fakat bu görev için, bir chat şablonu hazırlayıp bu doğrultuda modeli kolayca eğitebilirsiniz.
  - Model çoklu görsel girdi kabul etmemektedir. Örneğin, iki farklı resmi karşılaştıran sorulara cevap vermeye uygun değildir. Bu özelliği kazandırmak için mimariye değişiklikler yapmak gerekmektedir. Bu tarz bir model için [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b) (sadece ingilizce) modeline bakabilirsiniz.
- - Model, karakter ve yazı tanıma (OCR), segmentasyon ve çoklu obje tanıma görevleri için eğitilmemiştir. Bu görevlerde kabul edilebilir başarılar alabilmek için [google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224) ve [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large) gibi görsel dil modelleri milyarlarca doküman ve resimle eğitilmiştir.
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-More information will come
-### Model Architecture and Objective
-[More Information Needed]
 ### Compute Infrastructure
-[More Information Needed]
-## Citation
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
 ## Model Card Contact

 ## Uses
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+Below are the scenarios where the TraVisionLM visual language model can be used directly or indirectly for various tasks. Also, don't forget to check out the section on out-of-scope uses.
 ### Direct Use
 <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+ - **Short Captioning**
+You can give the model task instructions such as ```"Açıkla", "Kısaca açıkla", "Görseli özetle", "Çok kısa özetle"```. The model will generate a short description of the image you provide. Usage code with the Transformers library is shared below.
+*Important reminder:* The model tends to hallucinate less for this task. You can try adjusting the generation parameters to produce the most useful answer for your needs.
+- **Detailed Captioning**
+You can give the model task instructions like ```"Detaylı açıkla", "Çok detaylı açıkla", "Görseli detaylı anlat", "Görseli çok detaylı anlat"``` etc., for this task. The model will generate a very detailed description of the image you provide.
+*Important reminder:*  The model tends to hallucinate more for this task. Although it generally produces responses related to the image, it may provide details and information that are not present in the image. You can try adjusting the generation parameters to produce the most useful answer for your needs.
+- **Visual Question Answering**
+You can ask the model open-ended questions such as ```"Resmin odağında ne var?", "Görselde adam ne yapıyor?", "Kaç zürafa var?", "Görselle ilgili ne söylenir?", "Görseldeki *obje* ne renk?"```. The model will generate a response that answers your question.
+*Important reminder:*  The model tends to hallucinate more for this task. Although it generally produces responses related to the image and the question, it may provide details and information that are not present in the image. You can try adjusting the generation parameters to produce the most useful answer for your needs.
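The reminders above suggest tuning the generation parameters. As a minimal sketch, the presets below group keyword arguments for `model.generate` by decoding style. The values in the `sampled` preset are the ones used in the getting-started code of this card; the `greedy` preset, the preset names, and the helper `generation_kwargs` are assumptions made for illustration, not settings recommended by the author.

```python
# Hypothetical presets; only the "sampled" values come from this model card.
GENERATION_PRESETS = {
    "sampled": {
        "max_new_tokens": 512,
        "do_sample": True,
        "temperature": 0.6,
        "top_p": 0.9,
        "top_k": 50,
        "repetition_penalty": 1.2,
    },
    # Assumption: deterministic decoding with a shorter budget may help
    # when invented details are a concern. This has not been measured here.
    "greedy": {
        "max_new_tokens": 128,
        "do_sample": False,
        "repetition_penalty": 1.2,
    },
}


def generation_kwargs(preset: str, **overrides) -> dict:
    """Return a copy of a preset with optional per-call overrides."""
    kwargs = dict(GENERATION_PRESETS[preset])
    kwargs.update(overrides)
    return kwargs


# Usage (needs the model and inputs from the getting-started code):
# outputs = model.generate(**inputs, **generation_kwargs("greedy"))
```

Keeping the presets in one place makes it easier to compare outputs for the same image and prompt under different decoding settings.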
 ### Downstream Use [optional]
 <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+- (Video-Text-to-Text) The model can be adapted for a question-answering task related to your videos. By sampling video frames and generating answers for each frame, the model can be used without any changes to the architecture.
+- (Image/Text Retrieval conditioned on Text/Image) To retrieve the most relevant image for a given text, or the most relevant text for a given image, the model can be used directly without any modifications.
+- (Fine-tuning) For all other tasks that the model's architecture supports, such as visual classification, the model can be fine-tuned using the Transformers library. For an example, check out [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft).
+*As time permits, I plan to share more applications for these indirect uses. Meanwhile, I eagerly await support or collaboration requests from the community.* 🤝💪
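For the video use case above, the frame sampling step could look like the sketch below. It only chooses which frame indices to read; decoding the frames into images and passing them to the model is not shown. The function name `sample_frame_indices` and the choice of taking the middle frame of each segment are assumptions for illustration, not the author's method.

```python
from typing import List


def sample_frame_indices(num_frames: int, num_samples: int) -> List[int]:
    """Pick up to num_samples evenly spaced frame indices from a video."""
    if num_frames <= 0 or num_samples <= 0:
        return []
    # Never ask for more frames than the video contains.
    num_samples = min(num_samples, num_frames)
    step = num_frames / num_samples
    # Take the middle frame of each equal-length segment.
    return [int(step * i + step / 2) for i in range(num_samples)]


print(sample_frame_indices(100, 4))
```

Each selected frame can then be paired with the same question, so that the model produces one answer per frame.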
 ### Out-of-Scope Use
 <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+This model is not suitable for the following scenarios:
+- Although the model can answer simple questions related to your images, it is not suitable for multi-turn complex chat scenarios. Past information is not retained; the model does not use previously asked questions as context. However, you can easily train the model for this task by preparing a chat template accordingly.
+- The model does not accept multiple image inputs. For instance, it is not suitable for answering questions that compare two different images. Modifications to the architecture would be necessary to add this feature. For such a model, you can check [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b) (English only).
+- The model has not been trained for tasks such as character and text recognition (OCR), segmentation, and multi-object detection. To achieve acceptable performance in these tasks, visual language models like [google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224) and [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large) have been trained on billions of documents and images.
 ## Türkçe: Kullanım Alanları
 Aşağıda TraVisionLM görsel dil modelinin, hangi görevler için doğrudan ve dolaylı kullanılabileceği durumlar verilmiştir. Ayrıca alan dışı kullanımlar kısmına da göz atmayı unutmayın.
 ### Doğrudan Kullanım Alanları
  - **Kısa Açıklama**
+Bu görev için modele ```"Açıkla", "Kısaca açıkla", "Görseli özetle", "Çok kısa özetle"``` ve benzeri görev talimatları verebilirsiniz. Model verdiğiniz resmin kısa bir açıklamasını yapacaktır. Aşağıda modelin Transformer kütüphanesiyle kullanım kodları paylaşılmıştır.
+*Önemli hatırlatma:* Model bu görev için daha az halüsinasyon görmektedir. Kullanırken üretim parametrelerini değiştirerek işinize en çok yarayacak cevabı ürettirmeyi deneyebilirsiniz.
  - **Detaylı Açıklama**
+Bu görev için modele ```"Detaylı açıkla", "Çok detaylı açıkla", "Görseli detaylı anlat", "Görseli çok detaylı anlat"``` ve benzeri görev talimatları verebilirsiniz. Model verdiğiniz resmin çok detaylı bir açıklamasını yapacaktır. Aşağıda modelin Transformer kütüphanesiyle kullanım kodları paylaşılmıştır.
+*Önemli hatırlatma:* Model bu görev için genellikle fazla halüsinasyon görmektedir. Genel olarak resimle alakalı cevaplar üretse de, resimde olmayan detaylar ve bilgiler verebilmektedir. Kullanırken üretim parametrelerini değiştirerek işinize en çok yarayacak cevabı ürettirmeyi deneyebilirsiniz.
  - **Görsel Soru Cevaplama**
+Bu görev için modele ```"Resmin odağında ne var?", "Görselde adam ne yapıyor?", "Kaç zürafa var?", "Görselle ilgili ne söylenir?", "Görseldeki *obje* ne renk?"``` ve benzeri ucu açık sorular sorabilirsiniz. Model sorunuzu tamamlayacak cevaplar üretecektir. Aşağıda modelin Transformer kütüphanesiyle kullanım kodları paylaşılmıştır.
+*Önemli hatırlatma:* Model bu görev için genellikle fazla halüsinasyon görebilmektedir. Genel olarak resimle ve sorulan soruyla alakalı cevaplar üretse de, resimde olmayan detaylar ve bilgiler verebilmektedir. Kullanırken üretim parametrelerini değiştirerek işinize en çok yarayacak cevabı ürettirmeyi deneyebilirsiniz.
 ### Dolaylı Kullanım Alanları
  - (*Video-Text-to-Text*) Model videolarınızla ilgili soru cevap görevi için adapte edilebilir. Mimariye hiçbir değişiklik yapmadan, video kareleri örneklenerek, her bir kare üzerinden modele cevap ürettirilebilir.
  - (*Retrieval*) Metne dayalı en uygun görüntü alma görevi için model, herhangi bir değişiklik yapılmadan doğrudan kullanılabilir.
 Bu modelin aşağıdaki senaryolar için kullanımı uygun değildir:
  - Model, resimlerinizle ilgili basit sorulara cevap verse de, çok turlu kompleks chat senaryoları için uygun değildir. Geçmiş bilgisi tutulmamaktadır, model daha önce sorduğunuz soruları kontekst olarak kullanmamaktadır. Fakat bu görev için, bir chat şablonu hazırlayıp bu doğrultuda modeli kolayca eğitebilirsiniz.
  - Model çoklu görsel girdi kabul etmemektedir. Örneğin, iki farklı resmi karşılaştıran sorulara cevap vermeye uygun değildir. Bu özelliği kazandırmak için mimariye değişiklikler yapmak gerekmektedir. Bu tarz bir model için [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b) (sadece ingilizce) modeline bakabilirsiniz.
+ - Model, karakter ve yazı tanıma (OCR), segmentasyon ve çoklu obje tespit etme görevleri için eğitilmemiştir. Bu görevlerde kabul edilebilir başarılar alabilmek için [google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224) ve [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large) gibi görsel dil modelleri milyarlarca doküman ve resimle eğitilmiştir.
+---
+## How to Get Started with the Model
+With Transformers, you can load the model and run inference as follows:
+**IMPORTANT NOTE:** The TraVisionLM model is not yet integrated into the Transformers library, so you need to set ```trust_remote_code=True``` when loading the model. This downloads the ```configuration_travisionlm.py```, ```modeling_travisionlm.py``` and ```processing_travisionlm.py``` files from the repo. You can inspect these files under the *Files and Versions* tab and pin a specific version if you have any concerns about malicious code.
+```python
+from transformers import AutoModelForCausalLM, AutoProcessor
+import torch
+import requests
+from PIL import Image
+model = AutoModelForCausalLM.from_pretrained('ucsahin/TraVisionLM-base', trust_remote_code=True, device_map="cuda")
+# you can also load the model in bfloat16 or float16
+# model = AutoModelForCausalLM.from_pretrained('ucsahin/TraVisionLM-base', trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda")
+processor = AutoProcessor.from_pretrained('ucsahin/TraVisionLM-base', trust_remote_code=True)
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
+image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+prompt = "Açıkla"  # short caption
+# prompt = "Detaylı açıkla"  # detailed caption
+# prompt = "Araba ne renktir?" # visual qa
+# prompt = "Resmin odak noktası nedir?" # visual qa
+# prompt = "Araba nerede duruyor?" # visual qa
+inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
+outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.9, top_k=50, repetition_penalty=1.2)
+output_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
+print("Model response: ", output_text)
+```
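The note above mentions pinning a specific version of the remote code. A minimal sketch of that idea uses the `revision` argument of `from_pretrained`. The helper name `pinned_load_kwargs` is hypothetical, and the commit hash is a placeholder to be copied from the *Files and Versions* tab. To my understanding, `revision` also applies to the custom code files when they live in the same repo as the weights.

```python
MODEL_ID = "ucsahin/TraVisionLM-base"


def pinned_load_kwargs(revision: str) -> dict:
    """Keyword arguments that tie remote code and weights to one revision."""
    if not revision:
        raise ValueError("revision must be a commit hash, tag, or branch name")
    return {"trust_remote_code": True, "revision": revision}


# Usage (placeholder hash; copy a real one from the Files and Versions tab):
# kwargs = pinned_load_kwargs("<commit-hash>")
# model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **kwargs)
# processor = AutoProcessor.from_pretrained(MODEL_ID, **kwargs)
```

Pinning to a commit hash, rather than a branch name, means later changes to the repo's Python files are not picked up silently.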
+You can also perform batch inference very easily as follows:
+```python
+from transformers import AutoModelForCausalLM, AutoProcessor
+import torch
+import requests
+from PIL import Image
+model = AutoModelForCausalLM.from_pretrained('ucsahin/TraVisionLM-base', trust_remote_code=True, device_map="cuda")
+# you can also load the model in bfloat16 or float16
+# model = AutoModelForCausalLM.from_pretrained('ucsahin/TraVisionLM-base', trust_remote_code=True, torch_dtype=torch.bfloat16, device_map="cuda")
+processor = AutoProcessor.from_pretrained('ucsahin/TraVisionLM-base', trust_remote_code=True)
+url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
+image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
+prompt_list = [
+  'Açıkla',
+  'Detaylı açıkla',
+  'Araba nerede duruyor?',
+  'Arabanın rengi nedir?',
+]
+inputs = processor(text=prompt_list, images=len(prompt_list)*[image], padding="longest", return_tensors="pt").to("cuda")
+outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.9, top_k=50, repetition_penalty=1.2)
+output_text_list = processor.batch_decode(outputs, skip_special_tokens=True)
+for output_text in output_text_list:
+  print(f"Model response: {output_text}\n\n\n")
+"""
+Model response: Açıkla
+Bir binanın önünde, sokakta park halindeki mavi bir Volkswagen Beetle.
+Model response: Detaylı açıkla
+Bu görüntüde, bir taş döşeli sokakta park edilmiş yeşil ve mavi bir Volkswagen Beetle bulunmaktadır. Arka planda iki sarı bina vardır. Araba kameraya doğru bakmaktadır. Görüntü net odaklanmıştır ve renkler canlıdır. Görsel tarzı gerçekçidir.
+Model response: Araba nerede duruyor?
+Araba, sarı bir binanın yanında sokakta park edilmiş.
+Model response: Arabanın rengi nedir?
+Araba turkuaz veya limon yeşili renktedir.
+"""
+```
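In the sample output above, each decoded response starts with the prompt itself, followed by the answer. If only the answer is needed, a small post-processing step can drop the echoed prompt. The helper `strip_prompt` below is a hypothetical sketch that assumes the decoded text begins with the exact prompt string, as the sample output suggests.

```python
def strip_prompt(decoded: str, prompt: str) -> str:
    """Remove a leading echo of the prompt from decoded model output."""
    text = decoded.strip()
    if text.startswith(prompt):
        text = text[len(prompt):]
    # Drop the newline or spaces that separated the prompt from the answer.
    return text.strip()


example = "Araba nerede duruyor?\nAraba, sarı bir binanın yanında sokakta park edilmiş."
print(strip_prompt(example, "Araba nerede duruyor?"))
```

If the text does not start with the prompt, the function returns it unchanged apart from trimming whitespace.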
+---
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+I plan to release the multimodal Turkish data used to train the model, but it is currently in a very messy format. Until then, to give a sense of the dataset and to contribute to the open-source community, I am releasing the evaluation portion of the dataset at [ucsahin/Turkish-VLM-Mix-Benchmark](https://huggingface.co/datasets/ucsahin/Turkish-VLM-Mix-Benchmark).
+The dataset consists predominantly of well-known English multimodal datasets translated into Turkish. More information on this will be shared in the future.
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+The following training hyperparameters were used in the feature alignment and task-specific training stages, respectively:
+- **Feature Alignment**
+| Data size    | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
+|--------------|-------------------|---------------|--------|------------|--------------|
+| 500K         | 128               | 1e-3          | 1      | 1024       | 0            |
+- **Task Specific Training**
+| Data size    | Global Batch Size | Learning Rate | Epochs | Max Length | Weight Decay |
+|--------------|-------------------|---------------|--------|------------|--------------|
+| 1.1M         | 128               | 2e-5          | 3      | 1024       | 0            |
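As a rough illustration of what the tables above imply, the number of optimizer steps per stage can be estimated from the data size, global batch size, and epoch count. The data sizes are rounded in the tables (500K and 1.1M), so the results are approximate. The helper `optimizer_steps` is hypothetical and assumes the last partial batch of each epoch is kept.

```python
import math


def optimizer_steps(num_samples: int, global_batch_size: int, epochs: int) -> int:
    """Estimate optimizer steps, keeping the last partial batch of each epoch."""
    return math.ceil(num_samples / global_batch_size) * epochs


# Values taken from the two tables above (data sizes are rounded).
alignment_steps = optimizer_steps(500_000, 128, 1)
task_steps = optimizer_steps(1_100_000, 128, 3)
print(alignment_steps, task_steps)
```

Under these assumptions, the estimate is roughly 3.9K steps for feature alignment and 25.8K steps for task-specific training.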
+## Evaluation
+This section will be updated after I get some evaluation results on the [ucsahin/Turkish-VLM-Mix-Benchmark](https://huggingface.co/datasets/ucsahin/Turkish-VLM-Mix-Benchmark).
+### Testing Data, Factors & Metrics
+More on this later...
+#### Testing Data
+During training, I used the [ucsahin/Turkish-VLM-Mix-Benchmark](https://huggingface.co/datasets/ucsahin/Turkish-VLM-Mix-Benchmark) dataset as the evaluation split.
 ### Compute Infrastructure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+The following compute resources were used in the feature alignment and task-specific training stages, respectively:
+- **Feature Alignment**
+1xA100(40GB), took approximately 4 GPU hours.
+- **Task Specific Training**
+1xH100(80GB), took approximately 18 GPU hours.
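Combining these GPU hours with the data sizes and epoch counts from the training tables gives a rough average throughput per stage. This is back-of-the-envelope arithmetic on rounded figures, not a measured benchmark, and the helper `samples_per_second` is hypothetical.

```python
def samples_per_second(num_samples: int, epochs: int, gpu_hours: float) -> float:
    """Average number of training samples processed per second on one GPU."""
    return num_samples * epochs / (gpu_hours * 3600)


# Feature alignment: about 500K samples, 1 epoch, about 4 hours on one A100.
alignment_rate = samples_per_second(500_000, 1, 4)
# Task-specific training: about 1.1M samples, 3 epochs, about 18 hours on one H100.
task_rate = samples_per_second(1_100_000, 3, 18)
print(round(alignment_rate, 1), round(task_rate, 1))
```

The two stages ran on different GPUs and trained with different settings, so the two rates are not directly comparable.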
+## Citation
+I am releasing TraVisionLM under the Apache 2.0 License. To the best of my knowledge after thorough research, this should comply with the licenses of the datasets and the unimodal vision and language models used during development.
+**However, if I receive any feedback indicating otherwise, I will promptly update the licensing information as needed.**
+If you use the TraVisionLM model in your research, work, or personal projects, please acknowledge this repository. 🙏
+Finally, I reserve the right to publish this work in an academic setting if it reaches a mature state. In that case, I will provide the appropriate citations here so that any future work can appropriately cite it.
 ## Model Card Contact