
Sentence Similarity with Word2Vec

📌 Overview

This repository contains a Word2Vec-based sentence similarity model that measures the semantic similarity between input sentences. Word embeddings are learned from a large text corpus; each sentence is represented as the average of its word vectors, and the cosine similarity between two sentence vectors gives the sentence-level similarity score.

🏰 Model Details

  • Model Architecture: Word2Vec
  • Task: Sentence and Word Similarity Measurement
  • Training Data: Custom text corpus or pre-trained embeddings
  • Similarity Metric: Cosine Similarity
  • Embedding Size: 300-dimensional vector representation
  • Framework: Gensim (Python-based NLP library)
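
Read together with the usage code in this card, the scoring can be summarized as below. The notation is introduced here for illustration only and is not from the original card: W(s) is the set of words of sentence s found in the vocabulary, and v_w is the 300-dimensional vector of word w.

```latex
\[
\mathbf{v}_s = \frac{1}{|W(s)|} \sum_{w \in W(s)} \mathbf{v}_w,
\qquad
\operatorname{sim}(s_1, s_2) =
\frac{\mathbf{v}_{s_1} \cdot \mathbf{v}_{s_2}}
     {\lVert \mathbf{v}_{s_1} \rVert \, \lVert \mathbf{v}_{s_2} \rVert}
\]
```

The score is undefined when either sentence vector is all zeros, which happens when a sentence contains no in-vocabulary words.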

🚀 Usage

Installation

pip install gensim numpy scipy

Loading the Pre-trained Model

from gensim.models import Word2Vec
import numpy as np
from scipy.spatial.distance import cosine

def load_model(model_path):
    return Word2Vec.load(model_path)

Sentence Similarity Calculation

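def sentence_to_vector(sentence, model):
    # Keep only whitespace-separated tokens that are in the model vocabulary
    words = [word for word in sentence.split() if word in model.wv]
    if not words:
        # No known words: fall back to a zero vector
        return np.zeros(model.vector_size)
    # Sentence vector = mean of its word vectors
    return np.mean([model.wv[word] for word in words], axis=0)

def compute_similarity(sentence1, sentence2, model):
    vec1 = sentence_to_vector(sentence1, model)
    vec2 = sentence_to_vector(sentence2, model)
    # Cosine similarity is undefined for a zero vector; report 0.0 instead
    if not vec1.any() or not vec2.any():
        return 0.0
    # scipy's cosine() is a distance, so similarity = 1 - distance
    return 1 - cosine(vec1, vec2)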
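
To see the averaging and cosine steps without a trained model, here is a minimal, dependency-free sketch. The 3-dimensional vectors in `TOY_VECTORS` are made up and stand in for `model.wv`; the names `mean_vector` and `cosine_similarity` are hypothetical, and returning 0.0 when a sentence has no known words is an assumed convention rather than something this card specifies.

```python
import math

# Hypothetical 3-dimensional embeddings standing in for model.wv
TOY_VECTORS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.8, 0.6, 0.0],
    "car": [0.0, 0.0, 1.0],
}

def mean_vector(sentence, vectors):
    """Average the vectors of in-vocabulary words; None if there are none."""
    found = [vectors[w] for w in sentence.split() if w in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]

def cosine_similarity(a, b):
    """Cosine similarity; 0.0 when a vector is missing or all zeros (assumed convention)."""
    if a is None or b is None:
        return 0.0
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

# "cat" and "dog" point in similar directions; "cat" and "car" are orthogonal
print(cosine_similarity(mean_vector("cat", TOY_VECTORS), mean_vector("dog", TOY_VECTORS)))
print(cosine_similarity(mean_vector("cat", TOY_VECTORS), mean_vector("car", TOY_VECTORS)))
```

Because word vectors are averaged, word order is ignored: "cat dog" and "dog cat" produce the same sentence vector.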

# 👉 Test Example

sentence1 = "What is your name"
sentence2 = "My name is john"

# word2vec_model is assumed to have been loaded with load_model(...) above
similarity = compute_similarity(sentence1, sentence2, word2vec_model)
print(f"Cosine similarity between sentences: {similarity:.4f}")

📊 Evaluation Metric: Cosine Similarity

Cosine similarity ranges from -1 to 1; a higher score (closer to 1) indicates that two sentences have more similar meanings. As a rough guide, scores can be interpreted as follows:

Similarity Score   Interpretation
0.8 - 1.0          Strong semantic similarity
0.6 - 0.8          Moderate similarity
0.4 - 0.6          Weak similarity
Below 0.4          Unrelated sentences
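
The bands above can be turned into a small helper. This is a minimal sketch: the function name `interpret_similarity` is hypothetical, and treating each lower bound as inclusive (so a score of exactly 0.8 counts as strong) is an assumption, since the table leaves the shared boundaries ambiguous.

```python
def interpret_similarity(score):
    """Map a cosine similarity score to the interpretation bands listed above."""
    if score >= 0.8:
        return "Strong semantic similarity"
    if score >= 0.6:
        return "Moderate similarity"
    if score >= 0.4:
        return "Weak similarity"
    return "Unrelated sentences"

for s in (0.91, 0.65, 0.45, 0.10):
    print(f"{s:.2f} -> {interpret_similarity(s)}")
```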

⚡ Optimization & Fine-Tuning

  • Pre-trained embeddings can be fine-tuned on domain-specific data.
  • Stopword removal and lemmatization can improve sentence representation, since averaged vectors are otherwise dominated by frequent function words.
  • Alternative similarity metrics (e.g., Euclidean distance) can be explored.
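
Two of the ideas above, stopword removal and Euclidean distance as an alternative metric, can be sketched without any extra dependencies. Everything here is illustrative: the `STOPWORDS` set is a tiny hand-picked list (real lists are much longer), lowercasing is an added assumption, and the function names are hypothetical.

```python
import math

# Hypothetical, deliberately tiny stopword list for illustration only
STOPWORDS = {"a", "an", "the", "is", "are", "what", "my", "your"}

def remove_stopwords(sentence, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in sentence.lower().split() if t not in stopwords]

def euclidean_distance(a, b):
    """Straight-line distance between two vectors (requires Python 3.8+)."""
    return math.dist(a, b)

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(remove_stopwords("What is your name"))
print(remove_stopwords("My name is john"))

# Same direction, different length: cosine treats them as identical,
# while Euclidean distance does not.
u, v = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(cosine_similarity(u, v), euclidean_distance(u, v))
```

The last line illustrates the main design difference: cosine similarity ignores vector magnitude, whereas Euclidean distance is sensitive to it, so the two metrics can rank sentence pairs differently.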

🛠️ Limitations

  • Performance depends on the quality of pre-trained embeddings.
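  • Out-of-vocabulary (OOV) words are skipped when averaging, so sentences with many OOV words are represented less accurately; a sentence with no known words maps to a zero vector.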
  • Works best with well-formed sentences and standard grammar.
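
One way to gauge how much OOV words affect a given input is to measure what fraction of its tokens the vocabulary covers. This is a minimal sketch, assuming whitespace tokenization as in `sentence_to_vector`; the function name `vocabulary_coverage` and the toy vocabulary are hypothetical. With a Gensim model, `model.wv` can be passed as the vocabulary, since it supports the `in` operator.

```python
def vocabulary_coverage(sentence, vocabulary):
    """Return (fraction of tokens found in vocabulary, list of missing tokens)."""
    tokens = sentence.split()
    if not tokens:
        return 0.0, []
    missing = [t for t in tokens if t not in vocabulary]
    coverage = (len(tokens) - len(missing)) / len(tokens)
    return coverage, missing

# Toy vocabulary standing in for model.wv
toy_vocab = {"my", "name", "is", "what", "your"}
coverage, missing = vocabulary_coverage("My name is john", toy_vocab)
print(coverage, missing)  # 0.5 ['My', 'john']
```

Note that the lookup is case-sensitive: "My" is reported as missing even though "my" is in the toy vocabulary, so lowercasing the input may help if the embeddings were trained on lowercased text.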