---
license: cc-by-4.0
language:
- en
- tr
tags:
- VLM
- image2text
- lm
---
# TeLVE: Turkish efficient Language Vision Engine 🧿
[![License: CC BY 4.0](https://img.shields.io/badge/License-CC%20BY%204.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)
[![Models: v1.0](https://img.shields.io/badge/Models-v1.0%2c%20v1.0dep-blue)](https://huggingface.co/outsu/TeLVE)
## First Turkish VLM ever!

TeLVE is the first vision-language model (VLM) designed specifically for Turkish language understanding and image description generation. Built on pre-trained Vision Transformer (ViT) and BERT encoders, it bridges the gap in Turkish visual-linguistic processing.
![TeLVE logo](<teLVE_logo.png>)

## Model Description

TeLVE combines:
- 🖼️ Vision Transformer (ViT-base-patch16-224)
- 📝 Turkish BERT (dbmdz/bert-base-turkish-cased)
- 🔄 Cross-attention mechanism for vision-language fusion
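
To illustrate the fusion step listed above, here is a minimal sketch of single-head scaled dot-product cross-attention in plain Python. It is not TeLVE's actual implementation: the function names (`softmax`, `cross_attention`) and the toy vectors are assumptions made for illustration, and the learned query/key/value projections that a real model would apply are omitted.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(text_queries, image_keys, image_values):
    """Single-head scaled dot-product cross-attention (illustrative only).

    Each text-side query vector attends over all image patch vectors and
    receives a weighted average of the patch values.
    """
    dim = len(image_keys[0])
    fused = []
    for query in text_queries:
        # Similarity between this text token and every image patch.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in image_keys
        ]
        weights = softmax(scores)
        # Weighted sum of the patch value vectors.
        fused.append([
            sum(w * value[i] for w, value in zip(weights, image_values))
            for i in range(len(image_values[0]))
        ])
    return fused

# ViT-base-patch16-224 splits a 224x224 image into 16x16 pixel patches.
num_patches = (224 // 16) ** 2

# Toy example: two "patches" and one "text token" in a 2-dimensional space.
patches = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[5.0, 0.0]]
fused = cross_attention(tokens, patches, patches)
print(num_patches, fused)
```

In the toy example the text token is aligned with the first patch, so most of the attention weight lands on that patch.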

### Version Logs
- **TeLVE v1.0**: Trained on Unsplash Lite dataset
- **TeLVE v1.0dep**: Dataset extended with selected images from Pexels; the encoder problem with the letter "ü" was fixed. *(Deprecated: performance decreased because of a dataset addressing problem. Not recommended for use.)*

## Usage

The model can be used in two ways:

### Inference (imagine.py)
```bash
# Generate captions for images
python imagine.py
```
This script:
- Loads a trained TeLVE model
- Reads images from the `images` directory
- Generates a Turkish caption for each image
- Prints the results to the console
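
The loop described above can be sketched roughly as follows. This is not the contents of `imagine.py`: `caption_directory`, `placeholder_caption`, and the list of accepted file extensions are hypothetical names and assumptions, and the placeholder function stands in for the real model call.

```python
from pathlib import Path

# Assumed set of accepted image formats; the real script may differ.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}

def caption_directory(image_dir, caption_fn):
    """Call caption_fn on every image file in image_dir, in sorted order.

    Returns a dict mapping file name to the generated caption.
    """
    results = {}
    for path in sorted(Path(image_dir).iterdir()):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS:
            results[path.name] = caption_fn(path)
    return results

def placeholder_caption(path):
    """Hypothetical stand-in for the model call; returns a dummy string."""
    return f"<caption for {path.name}>"

if __name__ == "__main__":
    # Mirror the README: read from `images` and print to the console.
    if Path("images").is_dir():
        captions = caption_directory("images", placeholder_caption)
        for name, caption in captions.items():
            print(f"{name}: {caption}")
```

Passing the captioning step in as a function keeps the directory handling separate from the model, so the same loop works with any callable that maps an image path to a string.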

### Training (main.py)
Users can train their own models with ViT and BERT encoders.
```bash
# Train a new model
python main.py
```

This script:
- Loads and preprocesses image-caption pairs
- Initializes ViT and BERT encoders
- Trains the combined model
- Saves the model and tokenizer
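
As an illustration of the first step, the sketch below loads image-caption pairs from a tab-separated text stream. The file layout (one `image name<TAB>caption` row per line), the function name `load_pairs`, and the sample rows are assumptions; the format that `main.py` actually expects is not documented here.

```python
import csv
import io

def load_pairs(handle):
    """Read (image_name, caption) rows from a tab-separated text stream.

    Rows that do not have exactly two fields, or whose name or caption is
    empty, are skipped. Repeated whitespace in captions is collapsed.
    """
    pairs = []
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != 2:
            continue
        image_name = row[0].strip()
        caption = " ".join(row[1].split())
        if image_name and caption:
            pairs.append((image_name, caption))
    return pairs

# Hypothetical sample data; the second row has no caption and is dropped.
sample = io.StringIO(
    "kedi.jpg\tKoltukta  uyuyan bir kedi\n"
    "bos.jpg\t\n"
    "gol.jpg\tGün batımında göl\n"
)
pairs = load_pairs(sample)
print(pairs)
```

When reading from disk, opening the file with `open(path, encoding="utf-8", newline="")` keeps Turkish characters such as "ü" and "ö" intact.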


## Performance
The model has not been benchmarked yet; evaluation scores will be added once they are available.
<!--
| Model Version | Dataset | BLEU-4 | METEOR | CIDEr |
|--------------|---------|---------|---------|--------|
| TeLVE v1.0   | Unsplash | *TBD*   | *TBD*   | *TBD*  |
| TeLVE v1.1   | Unsplash+Pexels | *TBD* | *TBD* | *TBD* |-->

## Citation

```bibtex
@software{telve2024,
    author = {Öğüt Su Karagün},
    title = {TeLVE: Turkish efficient Language Vision Engine},
    year = {2024},
    url = {https://huggingface.co/outsu/TeLVE}
}
```

## License
<p xmlns:cc="http://creativecommons.org/ns#" xmlns:dct="http://purl.org/dc/terms/"><a property="dct:title" rel="cc:attributionURL" href="https://huggingface.co/outsu/TeLVE">TeLVE</a> © 2024 by <a rel="cc:attributionURL dct:creator" property="cc:attributionName" href="https://outsu.github.io">Öğüt Su Karagün</a> is licensed under <a href="https://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">Creative Commons Attribution 4.0 International</a></p>