yesidcanoc/image-captioning-swin-tiny-distilgpt2

vision-encoder-decoder

image-text-to-text

yesidcanoc committed on Oct 16, 2023

Commit b0f6fc2 (1 parent: 3bc3692)

Update README.md

Files changed (1)

README.md +3 -1

README.md CHANGED

@@ -3,7 +3,9 @@ pipeline_tag: image-to-text
 ---
 # Image captioning model
-End-to-end Transformer based image captioning model, where both the encoder and decoder use standard pre-trained transformer architectures.
+End-to-end Transformer based image captioning model, where both the encoder and decoder use standard pre-trained transformer architectures.
+### Repository: https://github.com/yesidc/image-captioning
 ## Encoder
 The encoder uses the pre-trained Swin transformer (Liu et al., 2021) that is a general-purpose backbone for computer vision. It outperforms ViT, DeiT and ResNe(X)t models at tasks such as image classification, object detection and semantic segmentation. The fact that this model is not pre-trained to be a 'narrow expert'--- a model pre-trained to perform a specific task e.g., image classification --- makes it a good candidate for fine-tuning on a downstream task.
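The README describes an end-to-end captioner built from a pre-trained Swin encoder and, judging by the repository name, a DistilGPT2 decoder. A minimal sketch of how such a pair can be wired together with Hugging Face Transformers' `VisionEncoderDecoderModel` is shown below. It is not the author's code: it assumes `transformers` and `torch` are installed, and it uses tiny, randomly initialised configs (every size here is an arbitrary illustrative choice) so it runs offline on a CPU.

```python
# Sketch only: tiny random configs stand in for the real pre-trained weights.
import torch
from transformers import (
    GPT2Config,
    SwinConfig,
    VisionEncoderDecoderConfig,
    VisionEncoderDecoderModel,
)

# Encoder: a Swin transformer, shrunk so the example runs quickly.
# 32x32 image / patch 4 -> 8x8 patches; window 4 divides both stage grids.
encoder_config = SwinConfig(
    image_size=32,
    patch_size=4,
    embed_dim=16,
    depths=[1, 1],
    num_heads=[2, 2],
    window_size=4,
)

# Decoder: a GPT-2 style causal LM (DistilGPT2 is a smaller GPT-2).
decoder_config = GPT2Config(n_embd=32, n_layer=2, n_head=2, vocab_size=100)

# This helper flags the decoder as a decoder and adds cross-attention layers,
# so each caption token can attend to the encoder's patch features.
config = VisionEncoderDecoderConfig.from_encoder_decoder_configs(
    encoder_config, decoder_config
)
model = VisionEncoderDecoderModel(config=config)
model.eval()

pixel_values = torch.randn(1, 3, 32, 32)       # one RGB image
decoder_input_ids = torch.tensor([[1, 2, 3]])  # caption tokens generated so far

with torch.no_grad():
    out = model(pixel_values=pixel_values, decoder_input_ids=decoder_input_ids)

# One next-token distribution over the vocabulary per caption position.
print(out.logits.shape)  # torch.Size([1, 3, 100])
```

To start from pre-trained weights instead, `VisionEncoderDecoderModel.from_encoder_decoder_pretrained` takes an encoder and a decoder checkpoint ID, for example `microsoft/swin-tiny-patch4-window7-224` and `distilgpt2`; those particular IDs are inferred from the repository name and are not stated in this commit.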