Visual-Grounded Skip-Gram for CIFAR-100 (523 words, 128-dim)

A 128-dimensional Skip-Gram word embedding model trained on the Visual Genome scene-graph corpus, extended to cover all 100 CIFAR-100 class labels via synthetic corpus augmentation and a five-phase BERT-initialised Genetic Algorithm refinement pipeline.

Model Description

A standard Skip-Gram model trained on Visual Genome alone covers only ~455 words, leaving 68 CIFAR-100 class labels without any embedding. This model closes that gap in two stages:

  • Stage 1 — 408K synthetic sentences (8-tier templates) introduce 68 missing words into the VG embedding space with semantically motivated anchor words.
  • Stage 2 — Five-phase evolutionary refinement (BERT-GA, hub repulsion, centroid blending, BERT-guided push/pull, orthogonal diversification) reduces cross-class contamination from 22% to 6%.
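As a rough illustration of Stage 1, template-based augmentation can be sketched as follows. The templates, anchor words, target word, and the `synthesize` helper are all hypothetical; the actual 8-tier templates and anchor choices are not reproduced here.

```python
import itertools

# Hypothetical templates; the real pipeline uses 8 template tiers not shown here.
TEMPLATES = [
    "a {target} near a {anchor}",
    "the {target} and the {anchor}",
    "{anchor} next to a {target}",
]

def synthesize(target, anchors, templates=TEMPLATES):
    """Return one sentence per (template, anchor) pair for a missing word."""
    return [t.format(target=target, anchor=a)
            for t, a in itertools.product(templates, anchors)]

# Illustrative target and anchors only (assumed, not taken from the real corpus).
sentences = synthesize("dolphin", ["whale", "water"])
for s in sentences:
    print(s)
```

Each generated sentence places the target word inside the context window of an anchor that already has a Visual Genome embedding, which is what lets Skip-Gram training pull the new word toward semantically related neighbours.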

Final vocabulary: 523 words | Dimension: 128 | Context window: 5

Performance

Metric | Value
Mean Reciprocal Rank (MRR) | 86.9%
Perfect-clustering words | 93 / 100
Contaminated words | 6 / 100
Excellent superclasses (MRR ≥ 0.8) | 16 / 20
ImageNet transfer accuracy | 77.71%

On the CIFAR-100 visual semantic clustering task, the model outperforms GloVe 100d (68.6%), GloVe 300d (69.8%), fastText 300d (76.4%), SBERT MiniLM-L6 (69.8%), and SBERT mpnet-base (71.0%), despite being a 128-dim model trained on only ~5M tokens, a corpus about 120,000× smaller than fastText's.
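One plausible reading of the MRR metric is sketched below: for each class word, rank all other words by cosine similarity and take the reciprocal rank of the closest word from the same superclass. This is an assumption about the protocol, not the evaluation script itself (see report.md for the exact definition), and the toy vectors and the `mean_reciprocal_rank` helper are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mean_reciprocal_rank(vectors, superclass):
    """vectors: word -> list of floats; superclass: word -> group label."""
    total = 0.0
    for word, vec in vectors.items():
        others = [w for w in vectors if w != word]
        ranked = sorted(others, key=lambda w: cosine(vec, vectors[w]), reverse=True)
        # 1-based rank of the closest word that shares the query's superclass.
        rank = next((i for i, w in enumerate(ranked, 1)
                     if superclass[w] == superclass[word]), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(vectors)

# Toy 2-dim vectors (assumed values) for four CIFAR-100 labels.
toy_vectors = {
    "dolphin": [1.0, 0.1],
    "whale":   [0.9, 0.2],
    "maple":   [0.1, 1.0],
    "oak":     [0.2, 0.9],
}
toy_superclass = {"dolphin": "aquatic_mammals", "whale": "aquatic_mammals",
                  "maple": "trees", "oak": "trees"}
mrr = mean_reciprocal_rank(toy_vectors, toy_superclass)
```

Under this reading, a word counts as perfectly clustered when its nearest neighbour belongs to its own superclass, and any other-superclass word ranking first lowers the score.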

How to Use

import torch
import torch.nn as nn
import torch.nn.functional as F
from huggingface_hub import hf_hub_download

# --- Minimal model class ---
class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, dropout=0.3):
        super().__init__()
        self.center_embeddings  = nn.Embedding(vocab_size, embedding_dim)
        self.context_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.dropout = nn.Dropout(dropout)

# --- Download and load checkpoint ---
path = hf_hub_download(repo_id="haripra1112001/visual-skipgram-cifar100",
                       filename="best_skipgram_523words.pth")
# weights_only=False unpickles arbitrary Python objects; only load checkpoints you trust.
checkpoint = torch.load(path, map_location='cpu', weights_only=False)

vocab        = checkpoint['word_to_idx']          # dict: word -> int index
idx_to_word  = {v: k for k, v in vocab.items()}
vocab_size   = len(vocab)                         # 523
embedding_dim = 128

model = SkipGramModel(vocab_size, embedding_dim)
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

embeddings = model.center_embeddings.weight.data  # shape (523, 128)

# --- Lookup a word vector ---
def get_vector(word):
    return embeddings[vocab[word]]

# --- Find nearest neighbours by cosine similarity ---
def nearest_neighbours(word, top_k=5):
    vec  = get_vector(word).unsqueeze(0)           # (1, 128)
    sims = F.normalize(embeddings, dim=1) @ F.normalize(vec, dim=1).T
    sims = sims.squeeze()
    sims[vocab[word]] = -1                          # exclude self
    top  = sims.topk(top_k)
    return [(idx_to_word[i.item()], round(s.item(), 3))
            for i, s in zip(top.indices, top.values)]

print(nearest_neighbours('dolphin'))
# e.g. [('whale', 0.94), ('seal', 0.91), ('otter', 0.89), ...]
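The `get_vector` helper above raises a `KeyError` for words outside the 523-word vocabulary. A minimal sketch of a pairwise similarity lookup that returns `None` instead is shown below. The `cosine_similarity` helper and the toy data are hypothetical; with the real model, `vectors` could be `embeddings.tolist()` and `vocab` the `word_to_idx` dict.

```python
import math

def cosine_similarity(word_a, word_b, vocab, vectors):
    """Cosine similarity of two words, or None if either is out of vocabulary.

    vocab: word -> row index; vectors: row-indexable sequence of float lists.
    """
    if word_a not in vocab or word_b not in vocab:
        return None
    u, v = vectors[vocab[word_a]], vectors[vocab[word_b]]
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Toy stand-ins for `vocab` and `embeddings.tolist()` (assumed values).
toy_vocab = {"dolphin": 0, "whale": 1}
toy_vectors = [[1.0, 0.0], [1.0, 1.0]]
sim = cosine_similarity("dolphin", "whale", toy_vocab, toy_vectors)
missing = cosine_similarity("dolphin", "unicorn", toy_vocab, toy_vectors)
```

Returning `None` keeps out-of-vocabulary words distinguishable from genuinely dissimilar ones, which matters for a vocabulary this small.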

Checkpoint Contents

checkpoint.keys()
# ['model_state_dict', 'word_to_idx', 'config']

checkpoint['config']
# {
#   'embedding_dim': 128,
#   'context_size': 5,
#   'num_negative': 10,
#   'lr': 0.10,
#   'dropout': 0.35,
#   'label_smoothing': 0.10,
#   'epochs': 50,
#   'batch_size': 2048,
#   'patience': 6,
#   'rare_threshold': 0.00015
# }

Files in This Repository

File | Description
best_skipgram_523words.pth | Model weights + vocabulary + config
report.md | Full technical report: training details, ablation study, baseline comparisons

Source Code

Full training code, evaluation scripts, and the 5-phase evolutionary refinement pipeline are available on GitHub:

https://github.com/HARISHKUMAR1112001/cifar100-multimodal-embeddings

Citation

@misc{prajapati2026visual,
  title   = {Visual-Grounded Skip-Gram for CIFAR-100: Corpus Augmentation and
             Evolutionary Refinement Outperform Transformer Sentence Encoders
             on Visual Clustering},
  author  = {Prajapati, Harishkumar Kishorkumar},
  year    = {2026}
}

Training Details

  • Base corpus: Visual Genome region descriptions (~108K sentences)
  • Synthetic corpus: 408K sentences (8-tier templates, 68 target words)
  • Combined corpus: 516K sentences (5M tokens)
  • Architecture: Skip-Gram with Negative Sampling (SGNS)
  • Optimizer: AdamW + ReduceLROnPlateau
  • Early stopping: patience = 6 epochs
  • Evolutionary refinement: 5 phases (BERT+GA → hub correction → geometric fixes)
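For reference, the standard SGNS objective for a single (center, context) pair is given below. This is the textbook form, not a transcription of the training code. The symbols are assumptions introduced here: $\mathbf{v}_c$ is a row of `center_embeddings`, $\mathbf{u}_o$ a row of `context_embeddings`, and $n_k$ are negatives drawn from a noise distribution $P_n$. The label smoothing and dropout listed in the checkpoint config are not reflected.

```latex
\mathcal{L}(c, o) = -\log \sigma\!\left(\mathbf{u}_o^{\top}\mathbf{v}_c\right)
  - \sum_{k=1}^{K} \log \sigma\!\left(-\mathbf{u}_{n_k}^{\top}\mathbf{v}_c\right),
\qquad n_k \sim P_n
```

With `num_negative` set to 10 in the checkpoint config, $K$ would be 10.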

Limitations

  • 6/100 CIFAR-100 words remain contaminated due to inherent geometric trade-offs in a 128-dim space with 20 tightly packed semantic classes.
  • Words with strong polysemy in visual contexts (e.g., can, orange, mouse) score below their linguistic baselines.
  • Results represent a single training run; multi-seed variance is not reported.