Visual-Grounded Skip-Gram for CIFAR-100 (523 words, 128-dim)
A 128-dimensional Skip-Gram word embedding model trained on the Visual Genome scene-graph corpus, extended to cover all 100 CIFAR-100 class labels via synthetic corpus augmentation and a five-phase BERT-initialised Genetic Algorithm refinement pipeline.
Model Description
Standard Skip-Gram trained on Visual Genome covers only ~455 words, leaving 68 CIFAR-100 class labels without any embedding. This model closes that gap:
- Stage 1 — 408K synthetic sentences (8-tier templates) introduce 68 missing words into the VG embedding space with semantically motivated anchor words.
- Stage 2 — Five-phase evolutionary refinement (BERT-GA, hub repulsion, centroid blending, BERT-guided push/pull, orthogonal diversification) reduces cross-class contamination from 22% to 6%.
Final vocabulary: 523 words | Dimension: 128 | Context window: 5
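As a rough illustration of Stage 1, template-based augmentation can be sketched as below. With 408K sentences for 68 words, each missing word receives about 6,000 synthetic sentences on average. The templates, the anchor words, and the `synth_sentences` helper are hypothetical and chosen only for illustration; the actual 8-tier templates and anchor lists are not reproduced here.

```python
import random

# Hypothetical templates and anchor words, for illustration only.
# They are NOT the 8-tier templates used to train this model.
TEMPLATES = [
    "a {word} next to a {anchor}",
    "the {word} is near the {anchor}",
    "a photo of a {word} and a {anchor}",
]
ANCHORS = {"dolphin": ["whale", "water", "ocean"]}


def synth_sentences(word, anchors, templates, n, seed=0):
    """Return n template sentences that pair `word` with a random anchor."""
    rng = random.Random(seed)
    return [
        rng.choice(templates).format(word=word, anchor=rng.choice(anchors))
        for _ in range(n)
    ]


sentences = synth_sentences("dolphin", ANCHORS["dolphin"], TEMPLATES, n=5)
for sentence in sentences:
    print(sentence)
```

Each generated sentence places the new word within the context window of an anchor, which is what lets Skip-Gram training pull the new vector toward the anchor's neighbourhood.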
Performance
| Metric | Value |
|---|---|
| Mean Reciprocal Rank (MRR) | 0.869 (86.9%) |
| Perfect-clustering words | 93 / 100 |
| Contaminated words | 6 / 100 |
| Excellent superclasses (MRR ≥ 0.8) | 16 / 20 |
| ImageNet transfer accuracy | 77.71% |
Outperforms GloVe 100d (68.6%), GloVe 300d (69.8%), fastText 300d (76.4%), SBERT MiniLM-L6 (69.8%), and SBERT mpnet-base (71.0%) on the CIFAR-100 visual semantic clustering task, despite being a 128-dim model trained on only ~5M tokens (a corpus roughly 120,000× smaller than fastText's).
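The exact evaluation protocol is described in report.md rather than here, so the following is only a minimal sketch of one common way to compute a clustering MRR. It assumes that, for each word, all other words are ranked by cosine similarity and that the reciprocal rank of the first word from the same superclass is averaged. The `mean_reciprocal_rank` function and the toy 2-d vectors are illustrative assumptions, not the model's evaluation code.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_reciprocal_rank(vectors, labels):
    """vectors: word -> list of floats; labels: word -> superclass name."""
    reciprocal_ranks = []
    for word, vec in vectors.items():
        # Rank every other word by similarity to `word`, most similar first.
        others = sorted(
            (w for w in vectors if w != word),
            key=lambda w: cosine(vec, vectors[w]),
            reverse=True,
        )
        # 1-based rank of the first neighbour sharing the superclass.
        rank = next(
            (i for i, w in enumerate(others, start=1) if labels[w] == labels[word]),
            None,
        )
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


# Toy vectors (assumed values) for four CIFAR-100 labels in two superclasses.
toy_vectors = {
    "dolphin": [1.0, 0.0],
    "whale": [0.9, 0.1],
    "rose": [0.0, 1.0],
    "tulip": [0.1, 0.9],
}
toy_labels = {
    "dolphin": "aquatic mammals",
    "whale": "aquatic mammals",
    "rose": "flowers",
    "tulip": "flowers",
}
toy_mrr = mean_reciprocal_rank(toy_vectors, toy_labels)
print(toy_mrr)
```

In this toy setup every word's nearest neighbour shares its superclass, so the score is 1.0; a word whose nearest neighbour comes from another superclass lowers the average.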
How to Use
import torch
import torch.nn as nn
import torch.nn.functional as F
from huggingface_hub import hf_hub_download

# --- Minimal model class ---
class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, dropout=0.3):
        super().__init__()
        self.center_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.dropout = nn.Dropout(dropout)

# --- Download and load checkpoint ---
path = hf_hub_download(repo_id="haripra1112001/visual-skipgram-cifar100",
                       filename="best_skipgram_523words.pth")
checkpoint = torch.load(path, map_location='cpu', weights_only=False)
vocab = checkpoint['word_to_idx']  # dict: word -> int index
idx_to_word = {v: k for k, v in vocab.items()}
vocab_size = len(vocab)  # 523
embedding_dim = 128
model = SkipGramModel(vocab_size, embedding_dim)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()
embeddings = model.center_embeddings.weight.data  # shape (523, 128)

# --- Lookup a word vector ---
def get_vector(word):
    return embeddings[vocab[word]]

# --- Find nearest neighbours by cosine similarity ---
def nearest_neighbours(word, top_k=5):
    vec = get_vector(word).unsqueeze(0)  # (1, 128)
    sims = F.normalize(embeddings, dim=1) @ F.normalize(vec, dim=1).T
    sims = sims.squeeze()
    sims[vocab[word]] = -1  # exclude self
    top = sims.topk(top_k)
    return [(idx_to_word[i.item()], round(s.item(), 3))
            for i, s in zip(top.indices, top.values)]

print(nearest_neighbours('dolphin'))
# e.g. [('whale', 0.94), ('seal', 0.91), ('otter', 0.89), ...]
Checkpoint Contents
checkpoint.keys()
# ['model_state_dict', 'word_to_idx', 'config']
checkpoint['config']
# {
# 'embedding_dim': 128,
# 'context_size': 5,
# 'num_negative': 10,
# 'lr': 0.10,
# 'dropout': 0.35,
# 'label_smoothing': 0.10,
# 'epochs': 50,
# 'batch_size': 2048,
# 'patience': 6,
# 'rare_threshold': 0.00015
# }
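Since the checkpoint stores its own hyperparameters, the model's constructor arguments can be read from `checkpoint['config']` instead of being hard-coded. The sketch below uses a hypothetical helper, `model_kwargs_from_config`, and a hand-written stand-in for the config dict so that it runs without downloading the checkpoint.

```python
def model_kwargs_from_config(config, vocab_size):
    """Build constructor arguments for SkipGramModel from a config dict."""
    return {
        "vocab_size": vocab_size,
        "embedding_dim": config["embedding_dim"],
        # Dropout has no learnable parameters, so a missing value is harmless.
        "dropout": config.get("dropout", 0.0),
    }


# Stand-in for checkpoint['config'], with values copied from the listing above.
example_config = {"embedding_dim": 128, "context_size": 5, "dropout": 0.35}
kwargs = model_kwargs_from_config(example_config, vocab_size=523)
print(kwargs)
```

With the real checkpoint loaded, this would be called as `SkipGramModel(**model_kwargs_from_config(checkpoint['config'], len(checkpoint['word_to_idx'])))`. The dropout value does not affect `load_state_dict`, because `nn.Dropout` holds no weights.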
Files in This Repository
| File | Description |
|---|---|
| best_skipgram_523words.pth | Model weights + vocabulary + config |
| report.md | Full technical report: training details, ablation study, baseline comparisons |
Source Code
Full training code, evaluation scripts, and the 5-phase evolutionary refinement pipeline are available on GitHub:
https://github.com/HARISHKUMAR1112001/cifar100-multimodal-embeddings
Citation
@misc{prajapati2026visual,
title = {Visual-Grounded Skip-Gram for CIFAR-100: Corpus Augmentation and
Evolutionary Refinement Outperform Transformer Sentence Encoders
on Visual Clustering},
author = {Prajapati, Harishkumar Kishorkumar},
year = {2026}
}
Training Details
- Base corpus: Visual Genome region descriptions (~108K sentences)
- Synthetic corpus: 408K sentences (8-tier templates, 68 target words)
- Combined corpus: 516K sentences (~5M tokens)
- Architecture: Skip-Gram with Negative Sampling (SGNS)
- Optimizer: AdamW + ReduceLROnPlateau
- Early stopping: patience = 6 epochs
- Evolutionary refinement: 5 phases (BERT+GA → hub correction → geometric fixes)
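For readers unfamiliar with SGNS, the standard per-pair objective is sketched below in plain Python: the loss rewards a high dot product between the center and true context vectors and a low dot product with each sampled negative. This is the textbook formulation, not necessarily the exact loss used for this model; the `label_smoothing` entry in the config suggests the released training code may modify it. The function names are illustrative.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_loss(center, context, negatives):
    """Standard SGNS loss for one (center, context) pair and its negatives."""
    # Positive term: pull the true context vector toward the center vector.
    loss = -log_sigmoid(dot(center, context))
    # Negative terms: push each sampled negative away from the center vector.
    for negative in negatives:
        loss -= log_sigmoid(-dot(center, negative))
    return loss


# With all-zero vectors every term equals ln 2, so 1 positive + 10 negatives
# (matching num_negative = 10 in the config) gives 11 * ln 2.
zero = [0.0, 0.0]
print(sgns_loss(zero, zero, [zero] * 10))
```

The printed value is 11 · ln 2 ≈ 7.62, the loss of an untrained model that scores every pair at exactly 0.5.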
Limitations
- 6/100 CIFAR-100 words remain contaminated due to inherent geometric trade-offs in a 128-dim space with 20 tightly packed semantic classes.
- Words with strong polysemy in visual contexts (e.g., can, orange, mouse) score below their linguistic baselines.
- Results represent a single training run; multi-seed variance is not reported.