---
title: Marathi Semantic Search
emoji: 📊
colorFrom: pink
colorTo: red
sdk: gradio
sdk_version: 5.1.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Semantic search for Marathi news using fine-tuned SBERT
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Marathi Semantic Search

This project is a semantic search engine for Marathi news articles, leveraging a fine-tuned Marathi SBERT model to retrieve contextually relevant articles based on user queries. The model has been trained using contrastive learning with generated positive pairs to capture the nuances of the Marathi language.

Demo

A working demo of the project is available on Hugging Face Spaces: Sru15/Marathi-Semantic-Search (https://huggingface.co/spaces/Sru15/Marathi-Semantic-Search)

Features

  • Fine-tuned SBERT Model: Uses a fine-tuned version of the Marathi SBERT model (l3cube-pune/marathi-sentence-similarity-sbert) for semantic similarity.
  • Positive Pair Generation: Utilizes contrastive learning with positive pairs to fine-tune the model.
  • UMAP and Agglomerative Clustering: Applies UMAP for dimensionality reduction and Agglomerative Clustering for grouping similar news articles.
  • Gradio Interface: A simple and interactive Gradio-based UI for querying news articles and retrieving semantically similar results.
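
The contrastive objective used for the fine-tuning, MultipleNegativesRankingLoss (named in the training section below), scores each anchor sentence against every positive in the batch: its own paired positive is the target and the other positives act as in-batch negatives, giving a cross-entropy over the similarity matrix. A minimal NumPy sketch of that computation (the toy vectors, helper name, and scale factor of 20 are illustrative, not from the repository):

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch contrastive loss: for each anchor i, positive i is the target
    and the other positives in the batch act as negatives (the idea behind
    sentence-transformers' MultipleNegativesRankingLoss)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (a @ p.T)                      # (batch, batch) cosine similarities
    sims -= sims.max(axis=1, keepdims=True)       # stabilize the softmax
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())    # cross-entropy with targets i == i

# Toy batch: each anchor is closest to its own positive, so the loss is near zero
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
positives = np.array([[0.9, 0.1], [0.1, 0.9]])
print(mnr_loss(anchors, positives))
```

Swapping the positives so each anchor faces the wrong partner makes the loss large, which is exactly the gradient signal that pulls paired sentences together during fine-tuning.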

Installation

To run the project locally, follow the steps below:

  1. Clone the Repository:

    git clone https://huggingface.co/spaces/Sru15/Marathi-Semantic-Search
    cd Marathi-Semantic-Search
    
  2. Install Dependencies: Ensure you have Python 3.8+ installed. Install the required libraries using:

    pip install -r requirements.txt
    
  3. Download Pre-trained Model: Download the fine-tuned Marathi SBERT model from Hugging Face:

    from transformers import AutoModel, AutoTokenizer
    
    # Fine-tuned Marathi SBERT checkpoint hosted on the Hugging Face Hub
    model_name = "Sru15/fine-tuned-marathi-sbert-with-synonyms"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)  # returns token-level hidden states
    
  4. Download Precomputed Embeddings: If you have precomputed news embeddings saved as a .npy file, place the file in the project directory (or update app.py to point at its location). The path used in the original setup was:

    /content/drive/My Drive/Marathi_SBERT_Model/news_embeddings.npy
    

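Note that loading the checkpoint with `AutoModel` (step 3) yields token-level hidden states; SBERT-style sentence vectors are usually obtained by mean-pooling those tokens under the attention mask (the sentence-transformers library's `SentenceTransformer(model_name).encode(texts)` performs this pooling automatically). A minimal NumPy sketch of the pooling step, with toy shapes and an illustrative helper name:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (batch, seq_len, dim) hidden states from the model
    attention_mask:   (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # avoid divide-by-zero
    return summed / counts

# Toy example: batch of 1, two real tokens and one padding token
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(tokens, mask))  # [[2. 3.]] — the padding vector is ignored
```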
Usage

  • Run the Gradio Interface: Start the Gradio interface locally:

    python app.py
    

    This will launch a web interface where you can enter a query in Marathi and retrieve semantically similar news articles.
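
Under the hood, retrieval of this kind typically amounts to encoding the query with the fine-tuned model and ranking the precomputed article embeddings by cosine similarity. A sketch with toy vectors (the function name and array shapes are illustrative; the app's actual logic lives in app.py, with the real vectors coming from news_embeddings.npy):

```python
import numpy as np

def search(query_vec, corpus, top_k=3):
    """Return indices of the top_k corpus rows most cosine-similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity of the query to every article
    return np.argsort(-scores)[:top_k].tolist()

# Toy corpus of 4 "article" embeddings
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
print(search(np.array([1.0, 0.0]), corpus, top_k=2))  # [0, 2]
```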

Model Training and Evaluation

  • Preprocessing: Normalizes text, tokenizes it using the Marathi SBERT tokenizer, and generates embeddings.
  • Training: Fine-tunes the pre-trained SBERT model with generated positive pairs using the MultipleNegativesRankingLoss method.
  • Evaluation: Reduces the embeddings with UMAP, clusters them with Agglomerative Clustering, and uses the Silhouette Score to select the optimal number of clusters.
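
The evaluation step can be sketched with scikit-learn: cluster the (UMAP-reduced) embeddings with Agglomerative Clustering for several candidate cluster counts and keep the count with the best silhouette score. Toy blob data stands in for the real article embeddings here, and the UMAP reduction is noted only as a comment:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in for the UMAP-reduced news embeddings; in the project, something like
# umap.UMAP().fit_transform(embeddings) (from umap-learn) would produce this matrix.
X, _ = make_blobs(n_samples=120, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=42)

best_k, best_score = None, -1.0
for k in range(2, 9):  # candidate cluster counts
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

print(best_k, round(best_score, 3))
```

On the project's real embeddings this procedure selected 7 clusters, per the results below; on this cleanly separated toy data it recovers the 3 generated blobs.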

Results

  • Optimal Clusters: 7 clusters with a Silhouette Score of 0.6726.
  • Performance with UMAP: The UMAP-based clustering outperformed clustering without dimensionality reduction.

Acknowledgements