
Sentence Similarity with Word2Vec

📌 Overview

This repository contains a Word2Vec-based model for measuring semantic similarity between sentences. Word embeddings are learned from a large text corpus, sentence vectors are built by averaging them, and cosine similarity yields the sentence-level similarity score.

🏰 Model Details

  • Model Architecture: Word2Vec
  • Task: Sentence and Word Similarity Measurement
  • Training Data: Custom text corpus or pre-trained embeddings
  • Similarity Metric: Cosine Similarity
  • Embedding Size: 300-dimensional vector representation
  • Framework: Gensim (Python-based NLP library)
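
For the custom-corpus case, training is straightforward in Gensim. A minimal sketch (the two-sentence corpus and the hyperparameters here are placeholders; real training needs a large tokenized corpus):

from gensim.models import Word2Vec

# Placeholder corpus: Gensim expects an iterable of tokenized sentences.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["dogs", "and", "cats", "are", "popular", "pets"],
]

# vector_size=300 matches the embedding size listed above.
model = Word2Vec(corpus, vector_size=300, window=5, min_count=1, workers=4)
model.save("word2vec.model")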

🚀 Usage

Installation

pip install gensim numpy scipy

Loading the Pre-trained Model

from gensim.models import Word2Vec
import numpy as np
from scipy.spatial.distance import cosine

def load_model(model_path):
    # Load a Word2Vec model previously saved with model.save().
    return Word2Vec.load(model_path)
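
If no locally trained model is available, Gensim's downloader can fetch pre-trained vectors instead. Note this is a sketch: api.load returns a KeyedVectors object rather than a full Word2Vec model, so you would index it directly (kv[word]) instead of going through .wv, and the first call downloads a large (roughly 1.6 GB) file.

import gensim.downloader as api

# Fetches and caches the pre-trained Google News vectors (300-dimensional).
# Returns KeyedVectors: index words directly, e.g. kv["king"].
kv = api.load("word2vec-google-news-300")
print(kv["king"].shape)  # (300,)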

Sentence Similarity Calculation

def sentence_to_vector(sentence, model):
    # Average the embeddings of all in-vocabulary words; OOV words are skipped.
    words = [word for word in sentence.split() if word in model.wv]
    if not words:
        return np.zeros(model.vector_size)
    return np.mean([model.wv[word] for word in words], axis=0)

def compute_similarity(sentence1, sentence2, model):
    vec1 = sentence_to_vector(sentence1, model)
    vec2 = sentence_to_vector(sentence2, model)
    # Guard against all-OOV sentences: cosine distance is undefined for zero vectors.
    if not vec1.any() or not vec2.any():
        return 0.0
    return 1 - cosine(vec1, vec2)

👉 Test Example

model = load_model("word2vec.model")  # path to your trained/saved model (placeholder)

sentence1 = "What is your name"
sentence2 = "My name is John"

similarity = compute_similarity(sentence1, sentence2, model)
print(f"Cosine similarity between sentences: {similarity:.4f}")

📊 Evaluation Metric: Cosine Similarity

A higher cosine similarity score (closer to 1) indicates that two sentences have similar meanings. Scores are interpreted using the following bands:

Similarity Score   Interpretation
0.8 - 1.0          Strong semantic similarity
0.6 - 0.8          Moderate similarity
0.4 - 0.6          Weak similarity
Below 0.4          Unrelated sentences
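
If the bands are needed programmatically, a small helper mirroring the table above (the function name is illustrative):

def interpret_score(score):
    # Map a cosine similarity score to the bands in the table above.
    if score >= 0.8:
        return "Strong semantic similarity"
    if score >= 0.6:
        return "Moderate similarity"
    if score >= 0.4:
        return "Weak similarity"
    return "Unrelated sentences"

print(interpret_score(0.85))  # Strong semantic similarity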

⚡ Optimization & Fine-Tuning

  • Pre-trained embeddings can be fine-tuned on domain-specific data.
  • Stopword removal and lemmatization improve sentence representation (see the sketch below).
  • Alternative similarity metrics (e.g., Euclidean distance) can be explored (also sketched below).
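
A minimal sketch of both ideas, assuming NLTK for stopwords and lemmatization (NLTK is not listed as a dependency above, and preprocess/euclidean_similarity are illustrative names):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from scipy.spatial.distance import euclidean

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(sentence):
    # Lowercase, drop stopwords, and lemmatize the remaining tokens.
    tokens = sentence.lower().split()
    return " ".join(lemmatizer.lemmatize(t) for t in tokens if t not in stop_words)

def euclidean_similarity(sentence1, sentence2, model):
    # Map Euclidean distance to (0, 1]: identical vectors score 1, distant ones approach 0.
    vec1 = sentence_to_vector(preprocess(sentence1), model)
    vec2 = sentence_to_vector(preprocess(sentence2), model)
    return 1 / (1 + euclidean(vec1, vec2))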

πŸ› οΈ Limitations

  • Performance depends on the quality of pre-trained embeddings.
  • Out-of-vocabulary (OOV) words may affect accuracy.
  • Works best with well-formed sentences and standard grammar.