# Sentence Similarity with Word2Vec

## 📌 Overview
This repository contains a Word2Vec-based sentence similarity model designed to measure the semantic similarity between input sentences. The model is trained on a large text corpus to capture word embeddings and uses cosine similarity to compute sentence-level similarity scores.
## 🧠 Model Details
- Model Architecture: Word2Vec
- Task: Sentence and Word Similarity Measurement
- Training Data: Custom text corpus or pre-trained embeddings (a training sketch follows this list)
- Similarity Metric: Cosine Similarity
- Embedding Size: 300-dimensional vector representation
- Framework: Gensim (Python-based NLP library)
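
Since the card lists a custom corpus as a possible training source, here is a minimal, hedged sketch of how such a model could be trained with Gensim. The corpus, the hyperparameters other than `vector_size`, and the save path are illustrative assumptions, not the card's actual setup.

```python
from gensim.models import Word2Vec

# Hypothetical tokenized corpus; the actual training data is not published.
corpus = [
    ["what", "is", "your", "name"],
    ["my", "name", "is", "john"],
    ["how", "are", "you", "today"],
]

# vector_size=300 matches the embedding size stated above; the remaining
# hyperparameters are illustrative defaults, not the card's settings.
model = Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=1, workers=4)
model.save("word2vec.model")  # assumed path, reused by the usage examples below
```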
## 🚀 Usage
### Installation

```bash
pip install gensim numpy scipy
```
### Loading the Pre-trained Model
```python
from gensim.models import Word2Vec
import numpy as np
from scipy.spatial.distance import cosine

def load_model(model_path):
    # Load a Word2Vec model previously saved with model.save().
    return Word2Vec.load(model_path)
```
### Sentence Similarity Calculation
```python
def sentence_to_vector(sentence, model):
    # Average the embeddings of all in-vocabulary words in the sentence.
    words = [word for word in sentence.split() if word in model.wv]
    if not words:
        # No known words: fall back to a zero vector.
        return np.zeros(model.vector_size)
    return np.mean([model.wv[word] for word in words], axis=0)

def compute_similarity(sentence1, sentence2, model):
    # Cosine similarity = 1 - cosine distance; closer to 1 means more similar.
    vec1 = sentence_to_vector(sentence1, model)
    vec2 = sentence_to_vector(sentence2, model)
    return 1 - cosine(vec1, vec2)
```
```python
# 🧪 Test example ("word2vec.model" is an assumed path to a saved model)
model = load_model("word2vec.model")

sentence1 = "What is your name"
sentence2 = "My name is John"

similarity = compute_similarity(sentence1, sentence2, model)
print(f"Cosine similarity between sentences: {similarity:.4f}")
```
## 📊 Evaluation Metric: Cosine Similarity
A higher cosine similarity score (closer to 1) indicates that two sentences have similar meanings. The model is evaluated based on:
| Similarity Score | Interpretation              |
|------------------|-----------------------------|
| 0.8 – 1.0        | Strong semantic similarity  |
| 0.6 – 0.8        | Moderate similarity         |
| 0.4 – 0.6        | Weak similarity             |
| Below 0.4        | Unrelated sentences         |
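
As a small convenience (not part of the original card), the table above could be turned into a helper that maps a score to its band:

```python
def interpret_similarity(score):
    # Thresholds mirror the evaluation table above.
    if score >= 0.8:
        return "Strong semantic similarity"
    if score >= 0.6:
        return "Moderate similarity"
    if score >= 0.4:
        return "Weak similarity"
    return "Unrelated sentences"

print(interpret_similarity(0.73))  # -> Moderate similarity
```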
## ⚡ Optimization & Fine-Tuning
- Pre-trained embeddings can be fine-tuned on domain-specific data.
- Stopword removal and lemmatization improve sentence representation.
- Alternative similarity metrics (e.g., Euclidean distance) can be explored (see the sketch after this list).
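
A minimal sketch of two of these ideas, stopword removal and a Euclidean-distance metric, reusing `sentence_to_vector` from above. The stopword set here is a tiny illustrative one; a real pipeline would likely use NLTK or spaCy, which would also supply lemmatization.

```python
import numpy as np

# Tiny illustrative stopword set; a real pipeline would use a fuller list.
STOPWORDS = {"is", "the", "a", "an", "my", "your", "what"}

def preprocess(sentence):
    # Lowercase and drop stopwords before embedding.
    return " ".join(w for w in sentence.lower().split() if w not in STOPWORDS)

def euclidean_distance(sentence1, sentence2, model):
    # Unlike cosine similarity, a LOWER distance means MORE similar.
    vec1 = sentence_to_vector(preprocess(sentence1), model)
    vec2 = sentence_to_vector(preprocess(sentence2), model)
    return np.linalg.norm(vec1 - vec2)
```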
## 🛠️ Limitations
- Performance depends on the quality of pre-trained embeddings.
- Out-of-vocabulary (OOV) words may affect accuracy (see the guard sketched below).
- Works best with well-formed sentences and standard grammar.
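
In particular, when every word of a sentence is out of vocabulary, `sentence_to_vector` above returns a zero vector, for which cosine distance is mathematically undefined. A hedged guard might look like this:

```python
def safe_similarity(sentence1, sentence2, model):
    vec1 = sentence_to_vector(sentence1, model)
    vec2 = sentence_to_vector(sentence2, model)
    # All-OOV sentences yield zero vectors, for which cosine is undefined;
    # treat that case as "no measurable similarity".
    if not np.any(vec1) or not np.any(vec2):
        return 0.0
    return 1 - cosine(vec1, vec2)
```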