🖼️ Image Captioning Model — 100k Training Run

EfficientNet-V2-S Encoder + Transformer Decoder

This model takes an image as input and generates an English caption describing the image.

Model Overview

This repository contains a custom PyTorch image captioning model trained on a 100k-sample COCO-style image-caption dataset.

The model uses an encoder-decoder structure:

Input Image
    ↓
EfficientNet-V2-S Image Encoder
    ↓
Visual Feature Tokens
    ↓
Transformer Text Decoder
    ↓
Generated Caption

Component	Description
Input	RGB image
Encoder	EfficientNet-V2-S pretrained on ImageNet
Decoder	Transformer decoder
Output	English image caption
Training samples	100,000
Validation samples	20,000
Vocabulary size	9,721 tokens
Checkpoint	`best_phase2.pt`
Validation loss	`3.4565`

Architecture Details

Image Encoder

Setting	Value
Backbone	EfficientNet-V2-S
Pretraining	ImageNet
Image size	224 × 224
Visual tokens	49
Embedding dimension	256
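
The 49 visual tokens match what a backbone with an overall stride of 32 produces, as the final stage of EfficientNet-V2-S does: a 224 × 224 input gives a 7 × 7 feature grid, which flattens to 49 positions. The helper below is a hypothetical illustration of that arithmetic. The name `num_visual_tokens` does not come from the notebook, and the sketch assumes the encoder flattens the last feature map without pooling.

```python
def num_visual_tokens(image_size: int, stride: int = 32) -> int:
    """Number of spatial positions in the final feature map.

    Assumes a square input whose side is a multiple of `stride`
    (32 is the total downsampling factor assumed for the backbone).
    """
    if image_size % stride != 0:
        raise ValueError("image_size is assumed to be a multiple of stride")
    side = image_size // stride  # 224 // 32 = 7
    return side * side           # flattened grid: 7 * 7 = 49


print(num_visual_tokens(224))  # 49
```

Each of these positions would then be projected to the 256-dimensional embedding listed in the table before being passed to the decoder.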

Text Decoder

Setting	Value
Decoder type	Transformer Decoder
Vocabulary size	9,721
Embedding dimension	256
Transformer layers	6
Attention heads	8
Feed-forward dimension	1024
Maximum caption length	52
Dropout	0.1
Decoding methods	Greedy search, Beam search
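
As a quick consistency check on the decoder settings, the snippet below derives two quantities from the table: the per-head dimension and the size of the token embedding table. It is only arithmetic on the listed values. Whether the output projection shares weights with the embedding is not stated here, so the total parameter count is not estimated.

```python
embed_dim, num_heads, vocab_size = 256, 8, 9721  # values from the table above

# Multi-head attention splits the embedding evenly across heads.
assert embed_dim % num_heads == 0
head_dim = embed_dim // num_heads

# The embedding table has one row of size embed_dim per vocabulary entry.
embedding_params = vocab_size * embed_dim

print(head_dim)          # 32
print(embedding_params)  # 2488576
```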

Repository Files

.
├── best_phase2.pt        # PyTorch checkpoint
├── Traning-100k.ipynb    # Training, loading, inference, and evaluation notebook
└── README.md             # Model card

Important Note About Vocabulary

This model uses a custom word-level vocabulary. The checkpoint stores the model weights, but it does not store the word-to-index and index-to-word mappings.

To reproduce captions correctly, the same vocabulary used during training is required.

Special tokens:

Token	ID
`<PAD>`	0
`<SOS>`	1
`<EOS>`	2
`<UNK>`	3

The recommended vocabulary file is:

vocab.json

Without the correct vocabulary, the model may generate token IDs, but those IDs cannot be reliably converted back into English captions.
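
A minimal sketch of how the mappings could be stored in and restored from `vocab.json`, and how captions could be encoded and decoded with the special tokens above. Everything here is an assumption made for illustration: the flat word-to-ID JSON layout, the lowercase whitespace tokenization, the padding to length 52, and the function names. The notebook's `Vocabulary` class may work differently.

```python
import json

PAD, SOS, EOS, UNK = 0, 1, 2, 3  # special-token IDs from the table above


def save_vocab(word2idx: dict, path: str) -> None:
    """Write the word-to-ID mapping as JSON (assumed file layout)."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(word2idx, f, ensure_ascii=False)


def load_vocab(path: str):
    """Read the mapping back and rebuild the inverse ID-to-word mapping."""
    with open(path, encoding="utf-8") as f:
        word2idx = json.load(f)
    idx2word = {idx: word for word, idx in word2idx.items()}
    return word2idx, idx2word


def encode(caption: str, word2idx: dict, max_len: int = 52) -> list:
    """Map words to IDs, add <SOS>/<EOS>, and pad to max_len."""
    words = caption.lower().split()[: max_len - 2]  # leave room for SOS/EOS
    ids = [SOS] + [word2idx.get(w, UNK) for w in words] + [EOS]
    return ids + [PAD] * (max_len - len(ids))


def decode(ids, idx2word: dict) -> str:
    """Map IDs back to words, stopping at <EOS> and skipping <PAD>/<SOS>."""
    words = []
    for i in ids:
        if i == EOS:
            break
        if i not in (PAD, SOS):
            words.append(idx2word.get(i, "<UNK>"))
    return " ".join(words)


# Tiny hypothetical vocabulary, not the real 9,721-token one.
word2idx = {"<PAD>": 0, "<SOS>": 1, "<EOS>": 2, "<UNK>": 3, "a": 4, "bicycle": 5}
idx2word = {idx: word for word, idx in word2idx.items()}
ids = encode("a red bicycle", word2idx, max_len=8)
print(ids)                    # [1, 4, 3, 5, 2, 0, 0, 0]
print(decode(ids, idx2word))  # a <UNK> bicycle
```

The word `red` is missing from the toy vocabulary, so it round-trips as `<UNK>`. A mismatched vocabulary causes the same kind of damage on a larger scale: IDs still decode, but to the wrong words.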

Training Details

The model was trained in two phases:

Phase	Encoder Setting	Purpose
Phase 1	Frozen EfficientNet encoder	Train decoder and projection layers
Phase 2	Partially unfrozen EfficientNet encoder	Fine-tune visual features

Training configuration:

Setting	Value
Dataset format	COCO-style image-caption annotations
Training samples	100,000
Validation samples	20,000
Total captions used for vocabulary	414,113
Batch size	356
Image size	224 × 224
Maximum caption length	52
Optimizer	AdamW
Loss function	Cross entropy
Label smoothing	0.1
LR schedule	Warmup + cosine decay
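
The "Warmup + cosine decay" schedule can be sketched as a function of the training step. The warmup length, total step count, and base learning rate are not given in this card, so the values used below are placeholders, and the function name `lr_at_step` is hypothetical. The notebook may implement the schedule differently, for example per epoch rather than per step.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               base_lr: float, min_lr: float = 0.0) -> float:
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Ramp linearly; the last warmup step reaches base_lr exactly.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder hyperparameters, chosen only for illustration.
for s in (0, 499, 2750, 5000):
    print(s, lr_at_step(s, total_steps=5000, warmup_steps=500, base_lr=3e-4))
```

With PyTorch, a function like this could be plugged into `torch.optim.lr_scheduler.LambdaLR` by returning the ratio to the base rate instead of the absolute value.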

Evaluation Results

Evaluation was performed on 2,000 validation samples using beam search with beam size 5.

Metric	Score
BLEU-1	37.88
BLEU-4	9.36
CIDEr	0.8452
Validation loss	3.4565

Example prediction:

Type	Caption
Ground truth	`a bicycle replica with a clock as the front wheel`
Greedy decoding	`a bicycle is shown with a clock on it`
Beam search	`a bicycle with a clock on the side of it`
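
To make the BLEU-1 metric concrete, the sketch below scores the example captions against the single ground-truth caption using clipped unigram precision and the brevity penalty. It is a simplified, sentence-level illustration with one reference. The scores in the table above appear to be on a 0 to 100 scale and were computed over 2,000 samples by the notebook's evaluation code, so the numbers are not directly comparable. The function name `bleu1` is hypothetical.

```python
import math
from collections import Counter


def bleu1(candidate: str, reference: str) -> float:
    """Sentence-level BLEU-1 against a single reference."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Clipped matches: a candidate word counts at most as often
    # as it occurs in the reference.
    matches = sum(min(n, ref_counts[w]) for w, n in Counter(cand).items())
    precision = matches / len(cand)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1.0 - len(ref) / len(cand))
    return bp * precision


ground_truth = "a bicycle replica with a clock as the front wheel"
greedy = "a bicycle is shown with a clock on it"
beam = "a bicycle with a clock on the side of it"

print(bleu1(beam, ground_truth))    # 0.6
print(bleu1(greedy, ground_truth))  # lower: fewer matches and a brevity penalty
```

The beam-search caption matches 6 of its 10 words (`a`, `bicycle`, `with`, `a`, `clock`, `the`) and has the same length as the reference, which gives 0.6.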

How to Use

This is a custom PyTorch model. It is not a standard Hugging Face Transformers model, so it cannot be loaded directly with:

AutoModel.from_pretrained(...)

Instead, use the architecture and loading code provided in:

Traning-100k.ipynb

The notebook includes:

Vocabulary class
COCOCaptionDataset class
EfficientNetEncoder
TransformerDecoder
ImageCaptioningModel
Checkpoint loading
Greedy decoding
Beam-search decoding
Evaluation code

Installation

Install the main dependencies:

pip install torch torchvision pillow numpy matplotlib nltk pycocotools pycocoevalcap einops

Image Preprocessing

Images are resized to 224 × 224 and normalized using ImageNet statistics.

import torchvision.transforms as T

IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD  = [0.229, 0.224, 0.225]

transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(IMAGENET_MEAN, IMAGENET_STD),
])

Loading the Checkpoint

After defining the model architecture and loading the correct vocabulary, use:

import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = ImageCaptioningModel(
    vocab_size=9721,
    embed_dim=256,
    num_heads=8,
    num_layers=6,
    ff_dim=1024,
    max_len=52,
    dropout=0.1
).to(device)

checkpoint = torch.load("best_phase2.pt", map_location=device)
model.load_state_dict(checkpoint["model"])
model.eval()

print("Checkpoint loaded")
print("Checkpoint epoch:", checkpoint["epoch"])
print("Validation loss:", checkpoint["val_loss"])

Checkpoint metadata:

checkpoint["epoch"] = 14
checkpoint["val_loss"] = 3.4565230486026866

Caption Generation

The notebook includes greedy decoding and beam-search decoding.

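from PIL import Image

# Load the image and apply the preprocessing from the Image Preprocessing section.
image = Image.open("example.jpg").convert("RGB")
# Move the tensor to the model's device (`device` comes from the loading step above).
image_tensor = transform(image).to(device)

# Decode with beam search; `generate_beam` is defined in the notebook.
caption = model.generate_beam(image_tensor, beam_size=5)
print("Generated caption:", caption)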

Example output:

a bicycle with a clock on the side of it
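
The difference between greedy and beam-search decoding can be shown with a toy example that needs no model. In the sketch below, `step_fn` is a hypothetical stand-in for the decoder: given the tokens so far, it returns log-probabilities for the next token. The notebook's `generate_beam` may differ, for instance by applying length normalization, so this illustrates the general technique rather than the actual implementation.

```python
import math

SOS, EOS = 1, 2  # special-token IDs used by this model


def beam_search(step_fn, beam_size=5, max_len=52):
    """Return the highest-scoring token sequence found by beam search."""
    beams = [((SOS,), 0.0)]  # (token sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len - 1):
        # Expand every active beam by every possible next token.
        candidates = []
        for seq, score in beams:
            for tok, logp in step_fn(seq).items():
                candidates.append((seq + (tok,), score + logp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        # Keep the top beam_size; sequences ending in EOS are set aside.
        beams = []
        for seq, score in candidates[:beam_size]:
            if seq[-1] == EOS:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # sequences that reached max_len without EOS
    return max(finished, key=lambda c: c[1])[0]


# Toy next-token distribution that depends only on the last token.
TOY = {
    1: {4: 0.6, 5: 0.4},
    4: {6: 0.55, 2: 0.45},
    5: {2: 0.9, 6: 0.1},
    6: {2: 1.0},
}


def toy_step(seq):
    return {tok: math.log(p) for tok, p in TOY[seq[-1]].items()}


print(beam_search(toy_step, beam_size=1))  # (1, 4, 6, 2): greedy, probability 0.33
print(beam_search(toy_step, beam_size=2))  # (1, 5, 2): probability 0.36
```

Greedy decoding is the special case `beam_size=1`. It commits to token 4 because it is the most likely first step, while a wider beam keeps token 5 alive and finds the sequence with the higher overall probability.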

Limitations

This model is experimental and has some limitations:

It uses a custom PyTorch architecture, not a standard Hugging Face Transformers architecture.
It requires the original model class definitions to load correctly.
It requires the same vocabulary used during training.
Caption quality may be limited by the 100k-sample training subset.
The model may generate generic captions for complex images.
The model may hallucinate objects that are not present in the image.
The tokenizer is word-level, so rare or unseen words are mapped to `<UNK>`.

Intended Use

This model is intended for:

Image caption generation
Educational deep learning experiments
Vision-language model learning
Encoder-decoder architecture demonstrations
COCO-style image captioning practice

Out-of-Scope Use

This model is not intended for:

Safety-critical computer vision systems
Medical image interpretation
Legal or forensic image analysis
Real-time production deployment without further validation

Citation

@misc{image_captioning_100k,
  title = {Image Captioning Model with EfficientNet-V2-S Encoder and Transformer Decoder},
  author = {Ali Sedghiye},
  year = {2026},
  note = {Custom PyTorch image captioning model trained on 100k COCO-style samples}
}

Author

Developed by Ali Sedghiye as a custom PyTorch image captioning model using an EfficientNet-V2-S image encoder and a Transformer text decoder.
