πŸ–ΌοΈ Image Captioning Model β€” 100k Training Run

EfficientNet-V2-S Encoder + Transformer Decoder

This model takes an image as input and generates an English caption describing the image.


Model Overview

This repository contains a custom PyTorch image captioning model trained on a 100k-sample COCO-style image-caption dataset.

The model uses an encoder-decoder structure:

Input Image
    ↓
EfficientNet-V2-S Image Encoder
    ↓
Visual Feature Tokens
    ↓
Transformer Text Decoder
    ↓
Generated Caption

| Component | Description |
| --- | --- |
| Input | RGB image |
| Encoder | EfficientNet-V2-S pretrained on ImageNet |
| Decoder | Transformer decoder |
| Output | English image caption |
| Training samples | 100,000 |
| Validation samples | 20,000 |
| Vocabulary size | 9,721 tokens |
| Checkpoint | best_phase2.pt |
| Validation loss | 3.4565 |

Architecture Details

Image Encoder

| Setting | Value |
| --- | --- |
| Backbone | EfficientNet-V2-S |
| Pretraining | ImageNet |
| Image size | 224 × 224 |
| Visual tokens | 49 |
| Embedding dimension | 256 |
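
The 49 visual tokens are consistent with the backbone's overall downsampling factor. As a rough check, assuming the encoder takes the final EfficientNet-V2-S feature map (total stride 32) and flattens its spatial grid into a token sequence, the count works out as follows. `num_visual_tokens` is a hypothetical helper for illustration, not a function from the notebook.

```python
import math


def num_visual_tokens(image_size: int, total_stride: int = 32) -> int:
    """Spatial positions in the final feature map of a square input.

    Assumption: the backbone downsamples by `total_stride` overall and the
    resulting grid is flattened into one token per position.
    """
    side = math.ceil(image_size / total_stride)  # 224 / 32 = 7
    return side * side


print(num_visual_tokens(224))  # 7 * 7 = 49
```

Each of those positions would then be projected to the 256-dimensional embedding space shared with the decoder.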

Text Decoder

| Setting | Value |
| --- | --- |
| Decoder type | Transformer Decoder |
| Vocabulary size | 9,721 |
| Embedding dimension | 256 |
| Transformer layers | 6 |
| Attention heads | 8 |
| Feed-forward dimension | 1024 |
| Maximum caption length | 52 |
| Dropout | 0.1 |
| Decoding methods | Greedy search, Beam search |

Repository Files

.
├── best_phase2.pt        # PyTorch checkpoint
├── Traning-100k.ipynb    # Training, loading, inference, and evaluation notebook
└── README.md             # Model card

Important Note About Vocabulary

This model uses a custom word-level vocabulary. The checkpoint stores the model weights, but it does not store the word-to-index and index-to-word mappings.

To reproduce captions correctly, the same vocabulary used during training is required.

Special tokens:

| Token | ID |
| --- | --- |
| <PAD> | 0 |
| <SOS> | 1 |
| <EOS> | 2 |
| <UNK> | 3 |

The recommended vocabulary file is:

vocab.json

Without the correct vocabulary, the model may generate token IDs, but those IDs cannot be reliably converted back into English captions.
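
To make the dependency on the vocabulary concrete, here is a minimal sketch of how a word-level mapping could be loaded and used to convert between captions and token IDs. It assumes `vocab.json` is a flat JSON object of the form `{"word": index}` and that captions are lowercased and split on whitespace. Both are assumptions, and the notebook's `Vocabulary` class may differ. The helper names `load_vocab`, `encode`, and `decode` are hypothetical.

```python
import json

# Special token IDs as listed in the table above.
SPECIAL_TOKENS = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3}


def load_vocab(path):
    """Load a word->index mapping (assumed flat JSON) and build its inverse."""
    with open(path, encoding="utf-8") as f:
        word2idx = json.load(f)
    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word


def encode(caption, word2idx, max_len=52):
    """Map words to IDs, send unseen words to <UNK>, wrap in <SOS>/<EOS>."""
    unk = word2idx["<UNK>"]
    ids = [word2idx.get(word, unk) for word in caption.lower().split()]
    # Reserve two positions for <SOS> and <EOS>.
    return [word2idx["<SOS>"]] + ids[: max_len - 2] + [word2idx["<EOS>"]]


def decode(ids, idx2word):
    """Convert IDs back to text, stopping at <EOS> and skipping <PAD>/<SOS>."""
    words = []
    for i in ids:
        if i == SPECIAL_TOKENS["<EOS>"]:
            break
        if i in (SPECIAL_TOKENS["<PAD>"], SPECIAL_TOKENS["<SOS>"]):
            continue
        words.append(idx2word.get(i, "<UNK>"))
    return " ".join(words)


# Toy vocabulary for illustration only; the real one has 9,721 entries.
toy_word2idx = {**SPECIAL_TOKENS, "a": 4, "bicycle": 5}
toy_idx2word = {idx: word for word, idx in toy_word2idx.items()}

ids = encode("A bicycle zebra", toy_word2idx)
print(ids)                        # [1, 4, 5, 3, 2]
print(decode(ids, toy_idx2word))  # a bicycle <UNK>
```

If the index assignment differs from the one used in training, the same IDs decode to different words, which is why the original mapping is required.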


Training Details

The model was trained in two phases:

| Phase | Encoder Setting | Purpose |
| --- | --- | --- |
| Phase 1 | Frozen EfficientNet encoder | Train decoder and projection layers |
| Phase 2 | Partially unfrozen EfficientNet encoder | Fine-tune visual features |

| Setting | Value |
| --- | --- |
| Dataset format | COCO-style image-caption annotations |
| Training samples | 100,000 |
| Validation samples | 20,000 |
| Total captions used for vocabulary | 414,113 |
| Batch size | 356 |
| Image size | 224 × 224 |
| Maximum caption length | 52 |
| Optimizer | AdamW |
| Loss function | Cross entropy |
| Label smoothing | 0.1 |
| LR schedule | Warmup + cosine decay |
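
The "warmup + cosine decay" schedule can be sketched as a function of the optimizer step. The version below uses linear warmup followed by a half-cosine decay. The base learning rate, warmup length, and number of epochs are placeholders chosen for illustration, since this card does not state the values used in training. Only the steps-per-epoch figure is derived from the table above.

```python
import math


def lr_at_step(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Derived from the table: 100,000 samples at batch size 356.
steps_per_epoch = math.ceil(100_000 / 356)  # 281

# Hypothetical values, NOT taken from the training run.
base_lr = 1e-3
warmup_steps = 500
total_steps = 20 * steps_per_epoch

for step in (0, warmup_steps - 1, total_steps // 2, total_steps):
    print(step, lr_at_step(step, base_lr, warmup_steps, total_steps))
```

In PyTorch, a function of this shape is typically wrapped in `torch.optim.lr_scheduler.LambdaLR` by returning the multiplier `lr / base_lr` instead of the absolute rate.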

Evaluation Results

Evaluation was performed on 2,000 validation samples using beam search with beam size 5.

| Metric | Score |
| --- | --- |
| BLEU-1 | 37.88 |
| BLEU-4 | 9.36 |
| CIDEr | 0.8452 |
| Validation loss | 3.4565 |

Example prediction:

| Type | Caption |
| --- | --- |
| Ground truth | a bicycle replica with a clock as the front wheel |
| Greedy decoding | a bicycle is shown with a clock on it |
| Beam search | a bicycle with a clock on the side of it |
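
For intuition about what BLEU-1 measures, the sketch below scores the two example predictions against the single ground-truth caption using clipped unigram precision and the standard brevity penalty. This is a simplified, sentence-level, single-reference illustration. The scores reported above are corpus-level and were presumably computed with an evaluation library over multiple references, so the numbers are not directly comparable. `bleu1` is a hypothetical helper.

```python
import math
from collections import Counter


def bleu1(hypothesis: str, reference: str) -> float:
    """Single-reference BLEU-1: clipped unigram precision x brevity penalty."""
    hyp = hypothesis.split()
    ref = reference.split()
    if not hyp:
        return 0.0
    ref_counts = Counter(ref)
    # Each hypothesis word is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[word]) for word, count in Counter(hyp).items())
    precision = clipped / len(hyp)
    # Penalize hypotheses shorter than the reference.
    brevity_penalty = 1.0 if len(hyp) >= len(ref) else math.exp(1.0 - len(ref) / len(hyp))
    return brevity_penalty * precision


reference = "a bicycle replica with a clock as the front wheel"
beam = "a bicycle with a clock on the side of it"
greedy = "a bicycle is shown with a clock on it"

print(bleu1(beam, reference))    # 6 of 10 unigrams match, equal length -> 0.6
print(bleu1(greedy, reference))  # 5 of 9 match, scaled by the brevity penalty
```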

How to Use

This is a custom PyTorch model. It is not a standard Hugging Face Transformers model, so it cannot be loaded directly with:

AutoModel.from_pretrained(...)

Instead, use the architecture and loading code provided in:

Traning-100k.ipynb

The notebook includes:

  • Vocabulary class
  • COCOCaptionDataset class
  • EfficientNetEncoder
  • TransformerDecoder
  • ImageCaptioningModel
  • Checkpoint loading
  • Greedy decoding
  • Beam-search decoding
  • Evaluation code

Installation

Install the main dependencies:

pip install torch torchvision pillow numpy matplotlib nltk pycocotools pycocoevalcap einops

Image Preprocessing

Images are resized to 224 × 224 and normalized using ImageNet statistics.

import torchvision.transforms as T

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD  = [0.229, 0.224, 0.225]

transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

Loading the Checkpoint

After defining the model architecture and loading the correct vocabulary, use:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = ImageCaptioningModel(
    vocab_size=9721,
    embed_dim=256,
    num_heads=8,
    num_layers=6,
    ff_dim=1024,
    max_len=52,
    dropout=0.1
).to(device)

checkpoint = torch.load("best_phase2.pt", map_location=device)
model.load_state_dict(checkpoint["model"])
model.eval()

print("Checkpoint loaded")
print("Checkpoint epoch:", checkpoint["epoch"])
print("Validation loss:", checkpoint["val_loss"])

Checkpoint metadata:

checkpoint["epoch"] = 14
checkpoint["val_loss"] = 3.4565230486026866

Caption Generation

The notebook includes greedy decoding and beam-search decoding.

from PIL import Image

image = Image.open("example.jpg").convert("RGB")
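# Assumption: generate_beam expects a batched tensor on the model's device.
# Drop unsqueeze(0) if the notebook's method adds the batch dimension itself.
image_tensor = transform(image).unsqueeze(0).to(device)

with torch.no_grad():
    caption = model.generate_beam(image_tensor, beam_size=5)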
print("Generated caption:", caption)

Example output:

a bicycle with a clock on the side of it
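
The general idea behind beam-search decoding can be sketched independently of the model. The version below keeps the `beam_size` highest-scoring partial captions at each step and sets aside those that end in `<EOS>`. It is not the notebook's `generate_beam`. `next_log_probs` is a hypothetical stand-in for a decoder forward pass conditioned on the image features, and no length normalization is applied.

```python
import math

SOS, EOS = 1, 2  # special token IDs from the vocabulary section


def beam_search(next_log_probs, beam_size=5, max_len=52):
    """Return the best token sequence found, without <SOS>/<EOS>.

    next_log_probs(prefix) -> {token_id: log_prob} for the next token.
    """
    beams = [([SOS], 0.0)]  # (tokens, cumulative log-probability)
    finished = []
    for _ in range(max_len - 1):
        candidates = []
        for tokens, score in beams:
            for token, logp in next_log_probs(tokens).items():
                candidates.append((tokens + [token], score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == EOS:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # unfinished hypotheses if max_len was reached
    best_tokens, _ = max(finished, key=lambda c: c[1])
    return [t for t in best_tokens if t not in (SOS, EOS)]


def toy_model(prefix):
    """Toy next-token distribution (illustration only): 4 = "a", 5 = "bicycle"."""
    table = {
        (SOS,): {4: math.log(0.9), 5: math.log(0.1)},
        (SOS, 4): {5: math.log(0.8), EOS: math.log(0.2)},
    }
    return table.get(tuple(prefix), {EOS: 0.0})


print(beam_search(toy_model, beam_size=2))  # [4, 5]
```

Greedy decoding is the special case `beam_size=1`, where only the single most likely token is kept at each step.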

Limitations

This model is experimental and has some limitations:

  • It uses a custom PyTorch architecture, not a standard Hugging Face Transformers architecture.
  • It requires the original model class definitions to load correctly.
  • It requires the same vocabulary used during training.
  • Caption quality may be limited by the 100k-sample training subset.
  • The model may generate generic captions for complex images.
  • The model may hallucinate objects that are not present in the image.
  • The tokenizer is word-level, so rare or unseen words are mapped to <UNK>.

Intended Use

This model is intended for:

  • Image caption generation
  • Educational deep learning experiments
  • Vision-language model learning
  • Encoder-decoder architecture demonstrations
  • COCO-style image captioning practice

Out-of-Scope Use

This model is not intended for:

  • Safety-critical computer vision systems
  • Medical image interpretation
  • Legal or forensic image analysis
  • Real-time production deployment without further validation

Citation

@misc{image_captioning_100k,
  title = {Image Captioning Model with EfficientNet-V2-S Encoder and Transformer Decoder},
  author = {Ali Sedghiye},
  year = {2026},
  note = {Custom PyTorch image captioning model trained on 100k COCO-style samples}
}

Author

Developed by Ali Sedghiye as a custom PyTorch image captioning model using an EfficientNet-V2-S image encoder and a Transformer text decoder.
