Word2Vec_hindi

Welcome to Word2Vec_hindi

This project is my attempt at implementing the Word2Vec model completely from scratch, specifically for the Hindi language.

The primary goal of this project is learning by building: understanding how word embeddings work internally by implementing the entire pipeline myself instead of relying on high-level NLP libraries.

The project currently includes:

  • Dataset collection and preprocessing
  • Vocabulary generation
  • Skip-gram pair generation
  • Negative sampling
  • Custom PyTorch training pipeline
  • Embedding evaluation and visualization

Feel free to explore the project, experiment with it, and raise issues or suggestions. While I may not implement every suggestion, I genuinely appreciate feedback and ideas.


Project Status

This project has evolved from a small experimental implementation into a large-scale embedding training pipeline.

Current progress includes:

  • Training on a corpus containing over 82M Hindi tokens
  • Generating over 1.5 billion skip-gram training pairs
  • Training multiple embedding models with dimensions from 300 to 400
  • Evaluating embeddings using:
    • cosine similarity
    • nearest-neighbor retrieval
    • analogy testing
    • embedding visualization using PCA and t-SNE

The current best-performing model:

  • Embedding Size: 350
  • Training Loss: ~0.38
  • Validation Loss: ~0.47

The model is now producing meaningful semantic separation between positive and negative word pairs.


Latest Updates

  • Combined 5 large Hindi datasets into a single training corpus
  • Final corpus size reached approximately 82M tokens
  • Vocabulary built from words occurring at least 2 times
  • Final vocabulary size exceeds 500K unique words
  • Context window size increased from 3 → 5
  • Generated approximately:
    • 1.5 billion skip-gram training pairs
    • 40M validation pairs
    • 40M testing pairs
  • Implemented:
    • Skip-gram training
    • Negative sampling
    • BCEWithLogitsLoss training objective
    • Adagrad optimizer
  • Added support for:
    • PCA embedding visualization
    • t-SNE embedding visualization
    • cosine similarity search
    • analogy-based embedding evaluation

Datasets Used

1. Hindi Bible

Source: https://www.kaggle.com/datasets/kapilverma/hindi-bible

2. Hindi-English Corpora

Source: https://www.kaggle.com/datasets/aiswaryaramachandran/hindienglish-corpora

3. English-Hindi Dataset

Source: https://www.kaggle.com/datasets/preetviradiya/english-hindi-dataset

4. IIT Bombay English-Hindi Translation Dataset

Source: https://www.kaggle.com/datasets/vaibhavkumar11/hindi-english-parallel-corpus

5. Hindi Wikipedia Articles - 172k

Source: https://www.kaggle.com/datasets/disisbig/hindi-wikipedia-articles-172k


Dataset Preprocessing

The preprocessing pipeline currently includes:

  • Combining Hindi text from multiple datasets
  • Cleaning punctuation and noisy symbols
  • Tokenizing text into words
  • Building vocabulary mappings
  • Removing extremely rare words
  • Generating skip-gram training pairs
  • Generating negative samples

Vocabulary Pruning

Instead of keeping every unique token, only words appearing at least 2 times are retained.

This helps:

  • Reduce vocabulary size
  • Improve training efficiency
  • Remove noisy and corrupted tokens
  • Improve embedding quality
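
The pruning step can be sketched as follows. This is a minimal illustration using only the standard library, not the project's actual code; the function name `build_vocab`, the `min_count` parameter, and the toy token list are assumptions made for the example.

```python
from collections import Counter

def build_vocab(tokens, min_count=2):
    """Map every word seen at least `min_count` times to an integer id."""
    counts = Counter(tokens)
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    kept = sorted(
        (word for word, count in counts.items() if count >= min_count),
        key=lambda word: (-counts[word], word),
    )
    word_to_id = {word: idx for idx, word in enumerate(kept)}
    id_to_word = {idx: word for word, idx in word_to_id.items()}
    return word_to_id, id_to_word

# Toy corpus (hypothetical): only "घर" and "और" occur twice.
tokens = "राम घर गया और श्याम घर आया और".split()
word_to_id, id_to_word = build_vocab(tokens, min_count=2)
print(word_to_id)
```

Words that fall below the threshold simply get no id, so any later step that looks tokens up in `word_to_id` needs to skip them.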

Context Window

  • Previous context window size: 3
  • Current context window size: 5

With a window size of 5:

  • each center word can generate up to 10 positive pairs
  • broader semantic context can be captured
  • embeddings learn richer relationships

Training Data Generation

For each word:

  • The word is treated as the center word
  • Neighboring words within the context window are treated as its positive context (target) words

Example

Sentence:

आज सुबह मैंने अपने पुराने दोस्त के साथ बाजार में चाय पी

If the center word is:

दोस्त

Generated positive pairs:

[दोस्त, आज]
[दोस्त, सुबह]
[दोस्त, मैंने]
[दोस्त, अपने]
[दोस्त, पुराने]
[दोस्त, के]
[दोस्त, साथ]
[दोस्त, बाजार]
[दोस्त, में]
[दोस्त, चाय]

This process is repeated across the entire corpus to generate training pairs.
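
A rough sketch of pair generation for a single sentence, assuming a symmetric window that takes up to `window` words on each side of the center word. The function name `skipgram_pairs` is hypothetical, and the real pipeline presumably works on vocabulary ids rather than raw strings.

```python
def skipgram_pairs(tokens, window=5):
    """Yield (center, context) pairs for every token in `tokens`."""
    for i, center in enumerate(tokens):
        start = max(0, i - window)                # clip at the sentence start
        stop = min(len(tokens), i + window + 1)   # clip at the sentence end
        for j in range(start, stop):
            if j != i:                            # a word is not its own context
                yield center, tokens[j]

sentence = "आज सुबह मैंने अपने पुराने दोस्त के साथ बाजार में चाय पी".split()
pairs = [pair for pair in skipgram_pairs(sentence, window=5) if pair[0] == "दोस्त"]
for center, context in pairs:
    print(center, context)
```

Words near a sentence boundary produce fewer pairs because the window is clipped, which is why 10 is an upper bound per center word rather than a fixed count.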


Negative Sampling

In addition to positive pairs, negative samples are generated.

Random vocabulary words that do not appear in the context window are paired with the center word.

Example

[दोस्त, कंप्यूटर]
[दोस्त, पहाड़]
[दोस्त, विज्ञान]

These represent unlikely co-occurrences.
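
One possible way to draw such negatives is rejection sampling: pick a random vocabulary word and discard it if it is the center word or lies in the context window. The sketch below assumes uniform sampling over the vocabulary, which may differ from what the project does; `sample_negatives`, `k`, and `rng` are illustrative names.

```python
import random

def sample_negatives(center, context_words, vocab, k=3, rng=None):
    """Pair `center` with k random words that are outside its context window."""
    rng = rng or random.Random()
    excluded = set(context_words) | {center}
    negatives = []
    attempts = 0
    while len(negatives) < k:
        attempts += 1
        if attempts > 100 * k:
            # Guard against a vocabulary that is (almost) entirely excluded.
            raise RuntimeError("could not find enough negative words")
        word = rng.choice(vocab)  # uniform draw; the same word may be drawn twice
        if word not in excluded:
            negatives.append((center, word))
    return negatives

# Toy vocabulary and context (hypothetical).
vocab = ["दोस्त", "के", "साथ", "चाय", "कंप्यूटर", "पहाड़", "विज्ञान", "नदी"]
context = ["के", "साथ", "चाय"]
negatives = sample_negatives("दोस्त", context, vocab, k=3, rng=random.Random(0))
print(negatives)
```

With a vocabulary of several hundred thousand words, almost every draw is accepted, so the rejection loop adds very little overhead.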


Why Negative Sampling?

Negative sampling helps:

  • Learn meaningful semantic separation
  • Distinguish related vs unrelated words
  • Scale training efficiently to very large vocabularies
  • Avoid the computational cost of full softmax

Model Architecture

Current training setup:

  • Architecture: Skip-gram Word2Vec
  • Framework: PyTorch
  • Embedding dimensions tested:
    • 300
    • 350
    • 400
  • Best-performing embedding size so far: 350
  • Optimizer: Adagrad
  • Loss Function: BCEWithLogitsLoss
  • Training uses:
    • positive skip-gram pairs
    • negative sampled pairs
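
To make the objective concrete, here is a plain-Python sketch of the per-pair loss. It assumes the score for a pair is the dot product of the center and context vectors, and it applies the standard binary cross-entropy on that raw score, which is the quantity `BCEWithLogitsLoss` computes. The three-dimensional vectors are made up for illustration and are not taken from the trained model.

```python
import math

def dot(u, v):
    """Raw score (logit) for a pair of vectors."""
    return sum(a * b for a, b in zip(u, v))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bce_with_logits(logit, label):
    """Binary cross-entropy on a raw score, written in a numerically stable form.

    Equivalent to -(y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))).
    """
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# Hypothetical toy vectors.
center = [0.5, -0.2, 0.1]
positive_context = [0.4, -0.1, 0.3]
negative_context = [-0.6, 0.2, -0.1]

pos_logit = dot(center, positive_context)
neg_logit = dot(center, negative_context)

# Positive pairs are labelled 1, negative pairs 0.
loss = bce_with_logits(pos_logit, 1.0) + bce_with_logits(neg_logit, 0.0)
print(sigmoid(pos_logit), sigmoid(neg_logit), loss)
```

Minimizing this loss pushes the sigmoid of the score toward 1 for positive pairs and toward 0 for negative pairs.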

Current Results

The model now learns strong separation between positive and negative pairs.

Observed predicted probabilities (approximate):

  • Positive pairs: ~0.94
  • Negative pairs: ~0.07

The embeddings are beginning to capture:

  • semantic similarity
  • contextual relationships
  • syntactic structure

Embedding Evaluation

Current evaluation methods include:

1. Cosine Similarity

Used to retrieve semantically similar words.

Example goals:

राजा → रानी, सम्राट, शासक
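
A small sketch of nearest-neighbor retrieval by cosine similarity. The embedding table `toy` is hand-written for illustration and does not contain vectors from the trained model; `nearest` and `top_k` are hypothetical names.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest(word, embeddings, top_k=3):
    """Return the `top_k` other words ranked by cosine similarity to `word`."""
    query = embeddings[word]
    scored = [
        (other, cosine(query, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]

# Hand-made toy vectors (not model output).
toy = {
    "राजा": [0.9, 0.8, 0.1],
    "रानी": [0.85, 0.75, 0.3],
    "सम्राट": [0.95, 0.7, 0.05],
    "चाय": [0.05, 0.1, 0.9],
}
for word, score in nearest("राजा", toy, top_k=2):
    print(word, round(score, 3))
```

For a vocabulary of 500K words, the same ranking is usually computed as a single matrix-vector product over normalized embeddings rather than a Python loop.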

2. Analogy Testing

Evaluating vector arithmetic relationships such as:

राजा - पुरुष + महिला ≈ रानी
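
The analogy test can be sketched as: compute vec(a) - vec(b) + vec(c), then return the closest remaining word by cosine similarity. The vectors below are constructed by hand so that the analogy holds exactly; they only demonstrate the mechanics and say nothing about how the trained embeddings behave. The function name `analogy` is an assumption.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [
        x - y + z
        for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])
    ]
    candidates = [word for word in embeddings if word not in {a, b, c}]
    return max(candidates, key=lambda word: cosine(target, embeddings[word]))

# Hand-made axes: (royalty, male, female). Not model output.
toy = {
    "राजा": [1.0, 1.0, 0.0],
    "रानी": [1.0, 0.0, 1.0],
    "पुरुष": [0.0, 1.0, 0.0],
    "महिला": [0.0, 0.0, 1.0],
    "चाय": [0.1, 0.1, 0.1],
}
print(analogy("राजा", "पुरुष", "महिला", toy))
```

Excluding the three query words matters in practice, because the nearest vector to the result is often one of the inputs themselves.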

3. Embedding Visualization

Using:

  • PCA
  • t-SNE

to visualize learned word clusters in 2D space.


Future Improvements

Planned improvements include:

  • Subsampling extremely frequent words
  • Improved negative sampling strategies
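
For reference, one common option for both items comes from the original word2vec paper: keep a word with probability min(1, sqrt(t / f)), where f is its relative frequency and t is a small threshold, and draw negatives in proportion to count^0.75 instead of uniformly. The sketch below illustrates those two formulas only; it is not a statement of what this project will implement, and the function names and toy counts are assumptions.

```python
import math

def keep_probability(count, total, t=1e-5):
    """Probability of keeping one occurrence of a word during subsampling."""
    freq = count / total
    return min(1.0, math.sqrt(t / freq))

def negative_sampling_weights(counts, power=0.75):
    """Normalized sampling weights proportional to count ** power."""
    raised = {word: count ** power for word, count in counts.items()}
    norm = sum(raised.values())
    return {word: value / norm for word, value in raised.items()}

# Hypothetical counts: one very frequent word, one mid-frequency, one rare.
counts = {"के": 81, "चाय": 16, "पहाड़": 1}
weights = negative_sampling_weights(counts)
print(weights)
print(keep_probability(100_000, 1_000_000))  # frequent word: mostly dropped
print(keep_probability(1, 1_000_000))        # rare word: always kept
```

Raising counts to the power 0.75 flattens the distribution, so rare words are drawn as negatives more often than their raw frequency would suggest.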

Contributions

This is primarily a learning and research-oriented project, but suggestions, ideas, and feedback are always welcome.


Author

Abhishek Biswas
Software Developer | Interested in AI, NLP, and Web Development