Sentence Similarity with Word2Vec
Overview
This repository contains a Word2Vec-based sentence similarity model that measures the semantic similarity between input sentences. Word embeddings are learned from a large text corpus; each sentence is represented as the average of its word vectors, and cosine similarity between those averages gives the sentence-level score.
Model Details
- Model Architecture: Word2Vec
- Task: Sentence and Word Similarity Measurement
- Training Data: Custom text corpus or pre-trained embeddings
- Similarity Metric: Cosine Similarity
- Embedding Size: 300-dimensional vector representation
- Framework: Gensim (Python-based NLP library)
Usage
Installation
pip install gensim numpy scipy
Loading the Pre-trained Model
from gensim.models import Word2Vec
import numpy as np
from scipy.spatial.distance import cosine

def load_model(model_path):
    # Load a model previously saved with Word2Vec.save()
    return Word2Vec.load(model_path)
Sentence Similarity Calculation
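def sentence_to_vector(sentence, model):
    # Keep only in-vocabulary words; out-of-vocabulary words are skipped
    words = [word for word in sentence.split() if word in model.wv]
    if not words:
        return np.zeros(model.vector_size)
    # The sentence vector is the mean of its word vectors
    return np.mean([model.wv[word] for word in words], axis=0)

def compute_similarity(sentence1, sentence2, model):
    vec1 = sentence_to_vector(sentence1, model)
    vec2 = sentence_to_vector(sentence2, model)
    # Cosine is undefined for an all-zero vector (no known words), so return 0.0
    if not np.any(vec1) or not np.any(vec2):
        return 0.0
    return 1 - cosine(vec1, vec2)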
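To see what the averaging and cosine steps above do without loading a trained model, here is a minimal dependency-free sketch. The `toy_vectors` dictionary and its three-dimensional values are invented for illustration and stand in for `model.wv`; real embeddings here would be 300-dimensional. Returning 0.0 when a sentence has no known words is a convention chosen for this sketch, not something the cosine formula defines.

```python
import math

# Hypothetical 3-dimensional embeddings standing in for model.wv.
toy_vectors = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.6, 0.0],
    "car": [0.0, 0.0, 1.0],
}

def mean_vector(sentence, vectors, dim=3):
    """Average the vectors of in-vocabulary words; all zeros if none are known."""
    known = [vectors[w] for w in sentence.split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

# Vectors pointing in similar directions score high, orthogonal ones score 0.
same = cosine_similarity(mean_vector("cat", toy_vectors), mean_vector("dog", toy_vectors))
different = cosine_similarity(mean_vector("cat", toy_vectors), mean_vector("car", toy_vectors))
# "zebra" is out of vocabulary, so its sentence vector is all zeros.
unknown = cosine_similarity(mean_vector("zebra", toy_vectors), mean_vector("cat", toy_vectors))
```

Because the score depends only on the angle between the averaged vectors, word order and sentence length do not affect it directly.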
# Test Example
model = load_model("path/to/word2vec.model")  # placeholder path; point this at your own saved model
sentence1 = "What is your name"
sentence2 = "My name is john"
similarity = compute_similarity(sentence1, sentence2, model)
print(f"Cosine similarity between sentences: {similarity:.4f}")
Evaluation Metric: Cosine Similarity
Cosine similarity ranges from -1 to 1, and a score closer to 1 indicates that two sentences have more similar meanings. Scores can be interpreted using the following bands:
| Similarity Score | Interpretation |
|---|---|
| 0.8 - 1.0 | Strong semantic similarity |
| 0.6 - 0.8 | Moderate similarity |
| 0.4 - 0.6 | Weak similarity |
| Below 0.4 | Unrelated sentences |
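The bands in the table can be applied programmatically with a small helper. This is a sketch: the function name `interpret_similarity` is hypothetical, and because the table's ranges share their endpoints, it assumes each band includes its lower bound (so 0.8 counts as strong and 0.6 as moderate).

```python
def interpret_similarity(score):
    """Map a cosine similarity score to the interpretation bands in the table.

    Assumption: each band includes its lower bound.
    """
    if score >= 0.8:
        return "Strong semantic similarity"
    if score >= 0.6:
        return "Moderate similarity"
    if score >= 0.4:
        return "Weak similarity"
    return "Unrelated sentences"
```

Negative scores, which cosine similarity can produce, fall into the last band.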
Optimization & Fine-Tuning
- Pre-trained embeddings can be fine-tuned on domain-specific data.
- Stopword removal and lemmatization can improve the sentence representation, since frequent function words otherwise dilute the averaged vector.
- Alternative similarity metrics (e.g., Euclidean distance) can be explored.
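Two of the ideas above, stopword removal and Euclidean distance, can be sketched in a few lines. The `STOPWORDS` set is a tiny hypothetical list for illustration only, and lowercasing assumes the model's vocabulary is lowercase, which this document does not state.

```python
import math

# Hypothetical stopword list for illustration; a real pipeline would use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "what", "my", "your"}

def content_tokens(sentence):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in sentence.lower().split() if t not in STOPWORDS]

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors (0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

tokens = content_tokens("What is your name")
distance = euclidean_distance([1.0, 2.0, 2.0], [1.0, 0.0, 0.0])
```

Unlike cosine similarity, Euclidean distance is a distance (lower means more similar) and is sensitive to vector magnitude, so the score bands used for cosine similarity do not carry over.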
Limitations
- Performance depends on the quality of pre-trained embeddings.
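- Out-of-vocabulary (OOV) words are skipped when averaging, so they contribute nothing to the score; a sentence with no known words maps to an all-zero vector.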
- Works best with well-formed sentences and standard grammar.
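One way to gauge how much the OOV limitation affects a given input is to measure how many of its tokens the vocabulary covers. The helper below is a hypothetical sketch: `vocabulary_coverage` and `toy_vocabulary` are illustrative names, and any container supporting `in` could be passed as the vocabulary, including `model.wv`, which the code in this document already tests with `word in model.wv`.

```python
def vocabulary_coverage(sentence, vocabulary):
    """Return (fraction of tokens found in vocabulary, list of missing tokens)."""
    tokens = sentence.split()
    if not tokens:
        return 0.0, []
    missing = [t for t in tokens if t not in vocabulary]
    return (len(tokens) - len(missing)) / len(tokens), missing

# Hypothetical vocabulary standing in for the trained model's word list.
toy_vocabulary = {"what", "is", "your", "name", "my"}
coverage, missing = vocabulary_coverage("my name is john", toy_vocabulary)
```

A low coverage value suggests the similarity score rests on only a few words and should be treated with caution.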