gabrielmotablima committed on
Commit bd54ed3
1 Parent(s): 3cd4590

update readme

Files changed (1)
  1. README.md +46 -11
README.md CHANGED
@@ -9,31 +9,66 @@ metrics:
  - rouge
  - meteor
  - bertscore
- base_model: microsoft/swin-base-patch4-window7-224
  pipeline_tag: text-generation
  ---

- # Model Card for Model ID

- <!-- Provide a quick summary of what the model is/does. -->

- ## Model Description

- <!-- Provide a longer summary of what this model is. -->

- ## How to Get Started with the Model

  Use the code below to get started with the model.

- [More Information Needed]

- ### Results

- [More Information Needed]

- **BibTeX:**

- [More Information Needed]

  - rouge
  - meteor
  - bertscore
+ base_model:
+ - microsoft/swin-base-patch4-window7-224
  pipeline_tag: text-generation
  ---

+ # 🎉 Swin-GPorTuguese

+ Swin-GPorTuguese is an image captioning model trained on [Flickr30K Portuguese](https://huggingface.co/datasets/laicsiifes/flickr30k-pt-br) (a version translated with the Google Translator API)
+ at a resolution of 224x224 and a maximum sequence length of 1024 tokens.

+ ## 🤖 Model Description

+ Swin-GPorTuguese is a Vision Encoder Decoder model that leverages the checkpoints of the [Swin Transformer](https://huggingface.co/microsoft/swin-base-patch4-window7-224)
+ as encoder and the checkpoints of [GPorTuguese](https://huggingface.co/pierreguillou/gpt2-small-portuguese) as decoder.
+ The encoder checkpoints come from the Swin Transformer version pre-trained on ImageNet-1k at resolution 224x224.

+ The code used for training and evaluation is available at https://github.com/laicsiifes/ved-transformer-caption-ptbr. In this work, Swin-GPorTuguese
+ was trained together with its companion model [Swin-DistilBERTimbau](https://huggingface.co/laicsiifes/swin-distilbert-flickr30k-pt-br).

+ The other models evaluated did not achieve performance as high as Swin-DistilBERTimbau and Swin-GPorTuguese, namely: DeiT-BERTimbau,
+ DeiT-DistilBERTimbau, DeiT-GPorTuguese, Swin-BERTimbau, ViT-BERTimbau, ViT-DistilBERTimbau and ViT-GPorTuguese.
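
+ For reference, the same encoder-decoder pairing can be assembled from the two base checkpoints with `VisionEncoderDecoderModel.from_encoder_decoder_pretrained`. The sketch below is only illustrative and is not the exact training setup used in this work; the token-id settings are common choices for GPT-2-based decoders, not necessarily the ones used here.

+ ```python
+ from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel
+
+ # pair the pre-trained Swin encoder with the GPorTuguese (GPT-2) decoder;
+ # the decoder's cross-attention layers are randomly initialized and are
+ # learned during fine-tuning on the captioning dataset
+ model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
+     "microsoft/swin-base-patch4-window7-224",
+     "pierreguillou/gpt2-small-portuguese",
+ )
+
+ image_processor = AutoImageProcessor.from_pretrained("microsoft/swin-base-patch4-window7-224")
+ tokenizer = AutoTokenizer.from_pretrained("pierreguillou/gpt2-small-portuguese")
+
+ # GPT-2 has no pad token, so reuse EOS for padding (a common convention,
+ # assumed here for illustration)
+ tokenizer.pad_token = tokenizer.eos_token
+ model.config.decoder_start_token_id = tokenizer.bos_token_id
+ model.config.eos_token_id = tokenizer.eos_token_id
+ model.config.pad_token_id = tokenizer.pad_token_id
+ ```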

+ ## 🧑‍💻 How to Get Started with the Model

  Use the code below to get started with the model.

+ ```python
+ import requests
+ from PIL import Image
+
+ from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel
+
+ # load the fine-tuned image captioning model and its tokenizer and image processor
+ model = VisionEncoderDecoderModel.from_pretrained("laicsiifes/swin-gpt2-flickr30k-pt-br")
+ tokenizer = AutoTokenizer.from_pretrained("laicsiifes/swin-gpt2-flickr30k-pt-br")
+ image_processor = ViTImageProcessor.from_pretrained("laicsiifes/swin-gpt2-flickr30k-pt-br")
+
+ # download an example image and preprocess it into pixel values
+ url = "http://images.cocodataset.org/val2017/000000039769.jpg"
+ image = Image.open(requests.get(url, stream=True).raw)
+ pixel_values = image_processor(image, return_tensors="pt").pixel_values
+
+ # generate and decode the caption
+ generated_ids = model.generate(pixel_values)
+ generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ print(generated_text)
+ ```
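
+ Generation can be tuned with the usual `generate` arguments. The values below are only illustrative and are not the settings used to produce the reported results.

+ ```python
+ # optional: beam search with an explicit length limit (illustrative values)
+ generated_ids = model.generate(
+     pixel_values,
+     num_beams=4,
+     max_new_tokens=64,
+     early_stopping=True,
+ )
+ print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
+ ```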

+ ## 📈 Results

+ The evaluation metrics CIDEr-D, BLEU@4, ROUGE-L, METEOR and BERTScore are abbreviated as C, B@4, RL, M and BS, respectively.

+ |Model|Training|Evaluation|C|B@4|RL|M|BS|
+ |-----|--------|----------|-----|-----|-----|-----|-----|
+ |Swin-DistilBERTimbau|Flickr30K Portuguese|Flickr30K Portuguese|66.73|24.65|39.98|44.71|72.30|
+ |Swin-GPorTuguese|Flickr30K Portuguese|Flickr30K Portuguese|64.71|23.15|39.39|44.36|71.70|
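
+ For reference, BLEU, ROUGE-L, METEOR and BERTScore can be recomputed with the 🤗 `evaluate` library (CIDEr-D typically requires a COCO-caption evaluation toolkit). The snippet below is a generic sketch with toy data, not the exact evaluation script of this work.

+ ```python
+ import evaluate
+
+ # toy example: one generated caption and its reference caption(s)
+ predictions = ["um cachorro corre na grama"]
+ references = [["um cachorro correndo na grama verde"]]
+
+ bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
+ rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)
+ meteor = evaluate.load("meteor").compute(predictions=predictions, references=references)
+ bertscore = evaluate.load("bertscore").compute(
+     predictions=predictions, references=[r[0] for r in references], lang="pt"
+ )
+
+ print(bleu["bleu"], rouge["rougeL"], meteor["meteor"], bertscore["f1"])
+ ```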

+ ## 📋 BibTeX entry and citation info

+ ```bibtex
+ Coming Soon
+ ```