OFA-Sys/ofa-base-caption-fairseq-version


Fairseq -> Transformers conversion

by mys - opened Sep 10, 2022

base: refs/heads/main ← from: refs/pr/1


Files changed (5) +50119 -7

README.md +62 -4
config.json +52 -0
merges.txt +0 -0
caption_base_best.pt → pytorch_model.bin +2 -2
vocab.json +0 -0

README.md CHANGED

@@ -1,6 +1,64 @@
-# OFA-Base-Caption
-This is the official checkpoint (adaptive to the official code instead of Huggingface Transformers) of OFA-Base finetuned on the MSCOCO Caption dataset for image captioning. Specifically, the model was first trained with cross-entropy loss and then with CIDEr optimization.
-For more information, please refer to the official github ([https://github.com/OFA-Sys/OFA](https://github.com/OFA-Sys/OFA))
-Temporarily, we only provide the finetuned checkpoints based on the official code.

+---
+license: apache-2.0
+---
+# OFA-base-caption
+This is the **base** version of the OFA model finetuned for the image captioning task. OFA is a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (e.g., image generation, visual grounding, image captioning, image classification, text generation) in a simple sequence-to-sequence learning framework.
+The directory includes 4 files: `config.json`, which contains the model configuration; `vocab.json` and `merges.txt` for our OFA tokenizer; and `pytorch_model.bin`, which contains the model weights. There is no need to worry about the mismatch between Fairseq and transformers, since we have already addressed this issue.
+To use it in transformers, please refer to https://github.com/OFA-Sys/OFA/tree/feature/add_transformers. Install transformers from that branch and download the models as shown below.
+```
+git clone --single-branch --branch feature/add_transformers https://github.com/OFA-Sys/OFA.git
+pip install OFA/transformers/
+```
+Afterwards, prepare an image for the test example below. Also ensure that Pillow and torchvision are installed in your environment.
+```
+import re
+import time
+from PIL import Image
+from torchvision import transforms
+from transformers import OFATokenizer, OFAModel
+model_name = "OFA-sys/OFA-base-caption"
+mean, std = [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]
+resolution = 256
+patch_resize_transform = transforms.Compose([
+        lambda image: image.convert("RGB"),
+        transforms.Resize((resolution, resolution), interpolation=Image.BICUBIC),
+        transforms.ToTensor(),
+        transforms.Normalize(mean=mean, std=std)
+    ])
+start = time.time()
+tokenizer = OFATokenizer.from_pretrained(model_name)
+model = OFAModel.from_pretrained(model_name, use_cache=False)
+elapsed = time.time() - start
+print(f"Loaded in {elapsed} secs")
+def caption_image(txt, img):
+    inputs = tokenizer([txt], return_tensors="pt").input_ids
+    patch_img = patch_resize_transform(img).unsqueeze(0)
+    gen = model.generate(inputs, patch_images=patch_img, num_beams=5, no_repeat_ngram_size=3)
+    results = tokenizer.batch_decode(gen, skip_special_tokens=True)
+    result = results[0].strip()
+    result = re.sub(r'[^\w\s]', '', result)
+    return result
+if __name__ == "__main__":
+    txt = "What does the image describe?"
+    img = Image.open('/path/to/input/image.jpg')
+    caption = caption_image(txt, img)
+    print(caption)
+```
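Two small steps in the README example above can be checked without loading the model: `Normalize` with mean 0.5 and std 0.5 maps pixel values from [0, 1] to [-1, 1], and the `re.sub` call strips punctuation from the decoded caption. The following is a minimal sketch in plain Python; the helper names `normalize_pixel` and `strip_punctuation` and the sample caption are hypothetical illustrations, not part of the OFA code.

```python
import re

MEAN, STD = 0.5, 0.5  # same per-channel values as in the README example


def normalize_pixel(value: float, mean: float = MEAN, std: float = STD) -> float:
    """Apply (value - mean) / std, the per-channel formula of transforms.Normalize."""
    return (value - mean) / std


def strip_punctuation(caption: str) -> str:
    """Trim the caption, then drop every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", "", caption.strip())


if __name__ == "__main__":
    # Pixel values after ToTensor lie in [0, 1]; normalization recenters them.
    print([normalize_pixel(v) for v in (0.0, 0.5, 1.0)])
    # Hypothetical decoded caption, used only to show the cleanup step.
    print(strip_punctuation(" a dog sitting on a bench. "))
```

Because the regular expression removes all punctuation, apostrophes and hyphens inside words are dropped as well, which may or may not be desirable for a given use case.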

config.json ADDED

@@ -0,0 +1,52 @@

+{
+  "activation_dropout": 0.0,
+  "activation_function": "gelu",
+  "add_type_embedding": true,
+  "architectures": [
+    "OFAModel"
+  ],
+  "attention_dropout": 0.0,
+  "attn_scale_factor": 2.0,
+  "bos_token_id": 0,
+  "classifier_dropout": 0.0,
+  "code_image_size": 128,
+  "code_layernorm_embedding": true,
+  "d_model": 768,
+  "decoder_attention_heads": 12,
+  "decoder_drop_path_rate": 0.0,
+  "decoder_ffn_dim": 3072,
+  "decoder_layerdrop": 0.0,
+  "decoder_layers": 6,
+  "decoder_normalize_before": true,
+  "decoder_start_token_id": 0,
+  "dropout": 0.1,
+  "encoder_attention_heads": 12,
+  "encoder_drop_path_rate": 0.0,
+  "encoder_ffn_dim": 3072,
+  "encoder_layerdrop": 0.0,
+  "encoder_layers": 6,
+  "encoder_normalize_before": true,
+  "entangle_position_embedding": false,
+  "eos_token_id": 2,
+  "forced_eos_token_id": 2,
+  "image_bucket_size": 42,
+  "init_std": 0.02,
+  "is_encoder_decoder": true,
+  "layernorm_embedding": true,
+  "max_position_embeddings": 1024,
+  "model_type": "ofa",
+  "normformer": true,
+  "num_hidden_layers": 6,
+  "pad_token_id": 1,
+  "patch_layernorm_embedding": true,
+  "resnet_drop_path_rate": 0.0,
+  "resnet_model_path": null,
+  "resnet_type": "resnet101",
+  "scale_embedding": false,
+  "share_decoder_input_output_embed": true,
+  "token_bucket_size": 256,
+  "torch_dtype": "float32",
+  "transformers_version": "4.15.0",
+  "use_cache": false,
+  "vocab_size": 59457
+}
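As a quick sanity check on the configuration above, the sketch below derives the per-head dimension and the feed-forward expansion ratio from a subset of the values. It assumes the usual multi-head attention convention in which `d_model` is split evenly across heads; the helper name `head_dim` is hypothetical and not part of the OFA code.

```python
import json

# Subset of the config.json values shown in the diff above.
CONFIG_JSON = """
{
  "d_model": 768,
  "encoder_attention_heads": 12,
  "decoder_attention_heads": 12,
  "encoder_ffn_dim": 3072,
  "decoder_ffn_dim": 3072,
  "encoder_layers": 6,
  "decoder_layers": 6
}
"""


def head_dim(d_model: int, num_heads: int) -> int:
    """Return the per-head width, assuming d_model is split evenly across heads."""
    if d_model % num_heads != 0:
        raise ValueError(f"d_model={d_model} is not divisible by num_heads={num_heads}")
    return d_model // num_heads


config = json.loads(CONFIG_JSON)
encoder_head_dim = head_dim(config["d_model"], config["encoder_attention_heads"])
decoder_head_dim = head_dim(config["d_model"], config["decoder_attention_heads"])
ffn_ratio = config["encoder_ffn_dim"] // config["d_model"]

if __name__ == "__main__":
    print(f"encoder head dim: {encoder_head_dim}")
    print(f"decoder head dim: {decoder_head_dim}")
    print(f"FFN expansion ratio: {ffn_ratio}")
```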

merges.txt ADDED

The diff for this file is too large to render. See raw diff

caption_base_best.pt → pytorch_model.bin RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:0a243bed55b82bf6596255edae716d6b4262d7a2175d4e24ab6db372a97ed2d1
-size 2254237467
+oid sha256:521abbc85015e110be39ca7158579966b6e41101d012b961a5ea6aff18b3fe66
+size 1161554935
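
The Git LFS pointer above records the SHA-256 digest and byte size of the new `pytorch_model.bin`. A downloaded copy can be checked against those two fields; the sketch below shows one way to do that with the standard library. The function names `parse_lfs_pointer` and `file_matches_pointer` are hypothetical helpers, and the local file path is an assumption.

```python
import hashlib
import os

# New pointer contents for pytorch_model.bin, copied from the diff above.
POINTER_TEXT = """\
version https://git-lfs.github.com/spec/v1
oid sha256:521abbc85015e110be39ca7158579966b6e41101d012b961a5ea6aff18b3fe66
size 1161554935
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.strip().partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


def file_matches_pointer(path: str, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Check a local file's size and hash against a parsed pointer."""
    if os.path.getsize(path) != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with open(path, "rb") as handle:
        # Read in chunks so large checkpoints are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


if __name__ == "__main__":
    pointer = parse_lfs_pointer(POINTER_TEXT)
    local_path = "pytorch_model.bin"  # assumed download location
    if os.path.exists(local_path):
        print(file_matches_pointer(local_path, pointer))
```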

vocab.json ADDED

The diff for this file is too large to render. See raw diff