pasindu/vit-swin-base-224-gpt2-image-captioning

Tags: Image-Text-to-Text, vision-encoder-decoder, Generated from Trainer, Inference Endpoints

vit-swin-base-224-gpt2-image-captioning

This model is a fine-tuned version of an unspecified base model, trained on an unknown dataset. It achieves the following results on the evaluation set:

Loss: 0.0001
Rouge1: 99.2148
Rouge2: 99.1824
Rougel: 99.22
Rougelsum: 99.2169
Bleu: 96.4656
Gen Len: 10.4161
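The card does not say which implementation produced the ROUGE scores above; they appear to be reported on a 0-100 scale. As a rough illustration of what Rouge1 measures, here is a simplified sketch of the ROUGE-1 F-measure using lowercase whitespace tokens. The function name `rouge1_f` is hypothetical, and real implementations typically add their own tokenization and optional stemming, so values may differ slightly.

```python
from collections import Counter


def rouge1_f(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1 F-measure on a 0-100 scale (illustrative only)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Clipped unigram overlap: each token counts at most as often as it
    # appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 100.0 * 2 * precision * recall / (precision + recall)
```

A score near 99, as reported here, means the generated captions share almost every unigram with their references.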

Model description

More information needed

Intended uses & limitations

More information needed
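Until usage is documented, the following is a minimal sketch of how a checkpoint tagged vision-encoder-decoder is commonly loaded with Transformers to caption an image. It assumes the repository ships image processor and tokenizer files alongside the weights, which this card does not confirm. The helper name `caption_image` and the `max_new_tokens` default are illustrative choices, not settings taken from the card.

```python
MODEL_ID = "pasindu/vit-swin-base-224-gpt2-image-captioning"


def caption_image(image_path: str, max_new_tokens: int = 32) -> str:
    """Generate one caption for a local image file (illustrative sketch)."""
    # Heavy dependencies are imported lazily, inside the function.
    from PIL import Image
    from transformers import (
        AutoImageProcessor,
        AutoTokenizer,
        VisionEncoderDecoderModel,
    )

    # Assumption: all three components can be loaded from the same repository.
    model = VisionEncoderDecoderModel.from_pretrained(MODEL_ID)
    processor = AutoImageProcessor.from_pretrained(MODEL_ID)
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # max_new_tokens is an arbitrary cap, not a value documented by the card.
    output_ids = model.generate(pixel_values, max_new_tokens=max_new_tokens)
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```

Calling `caption_image("example.jpg")` with a path to a local image (the file name is a placeholder) would download the checkpoint on first use and return a single caption string.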

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 5e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
lr_scheduler_type: linear
num_epochs: 50
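To illustrate what a linear scheduler does with the listed learning rate, here is a small pure-Python sketch. It assumes no warmup (the card lists none) and roughly 8800 total optimizer steps, a figure inferred from the training log (step 2000 at epoch 11.36 suggests about 176 steps per epoch over 50 epochs) rather than stated by the card. The function name `linear_lr` is hypothetical.

```python
def linear_lr(
    step: int,
    base_lr: float = 5e-05,
    total_steps: int = 8800,  # assumption: ~176 steps/epoch * 50 epochs
    warmup_steps: int = 0,    # assumption: the card lists no warmup
) -> float:
    """Learning rate at a given step under linear warmup then linear decay."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr to 0 by total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)
```

Under these assumptions the rate starts at 5e-05, is about 2.5e-05 halfway through training, and reaches 0 at the final step.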

Training results

Training Loss	Epoch	Step	Validation Loss	Rouge1	Rouge2	Rougel	Rougelsum	Bleu	Gen Len
0.622	11.36	2000	0.0330	91.0769	88.8333	90.7025	90.7277	84.8472	10.4161
0.0547	22.73	4000	0.0015	99.0694	98.9636	99.0615	99.0613	96.1312	10.4161
0.0238	34.09	6000	0.0007	99.1681	99.0942	99.167	99.1646	96.3754	10.4161
0.0046	45.45	8000	0.0001	99.2225	99.1781	99.217	99.2171	96.4412	10.4161
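The epoch and step columns above allow a rough estimate of the training run's size, which the card does not state directly. The sketch below derives steps per epoch from the logged rows, then bounds the number of training examples. It assumes a single device, no gradient accumulation, and that the last partial batch of each epoch is kept; none of these is confirmed by the card.

```python
# (epoch, step) pairs copied from the training results table.
LOGGED = [(11.36, 2000), (22.73, 4000), (34.09, 6000), (45.45, 8000)]
NUM_EPOCHS = 50
TRAIN_BATCH_SIZE = 8

# Each row gives an estimate of steps per epoch; average and round them.
estimates = [step / epoch for epoch, step in LOGGED]
steps_per_epoch = round(sum(estimates) / len(estimates))

# Total optimizer steps over the full run.
total_steps = steps_per_epoch * NUM_EPOCHS

# With batch size 8, the final batch of an epoch holds between 1 and 8 examples.
max_examples = steps_per_epoch * TRAIN_BATCH_SIZE
min_examples = (steps_per_epoch - 1) * TRAIN_BATCH_SIZE + 1

print(steps_per_epoch, total_steps, min_examples, max_examples)
```

Under these assumptions the training set holds roughly 1,400 examples. The last logged row (step 8000) precedes the end of training, which may explain why the final evaluation figures differ slightly from that row.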

Framework versions

Transformers 4.35.2
Pytorch 2.1.0+cu121
Datasets 2.16.1
Tokenizers 0.15.0

Model size: 240M params (Safetensors)
Tensor types: I64, F32