---
title: Marathi Semantic Search
emoji: π
colorFrom: pink
colorTo: red
sdk: gradio
sdk_version: 5.1.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Semantic search for Marathi news using fine-tuned SBERT
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Marathi Semantic Search
This project is a semantic search engine for Marathi news articles, leveraging a fine-tuned Marathi SBERT model to retrieve contextually relevant articles based on user queries. The model has been trained using contrastive learning with generated positive pairs to capture the nuances of the Marathi language.
## Demo

A working demo of the project is available on Hugging Face Spaces: [Sru15/Marathi-Semantic-Search](https://huggingface.co/spaces/Sru15/Marathi-Semantic-Search)
## Features
- Fine-tuned SBERT Model: Uses a fine-tuned version of the Marathi SBERT model (`l3cube-pune/marathi-sentence-similarity-sbert`) for semantic similarity.
- Positive Pair Generation: Utilizes contrastive learning with generated positive pairs to fine-tune the model.
- UMAP and Agglomerative Clustering: Applies UMAP for dimensionality reduction and Agglomerative Clustering for grouping similar news articles.
- Gradio Interface: A simple and interactive Gradio-based UI for querying news articles and retrieving semantically similar results.
## Installation
To reproduce the code locally, follow the steps below:
Clone the Repository:

```bash
git clone https://huggingface.co/spaces/Sru15/Marathi-Semantic-Search
cd Marathi-Semantic-Search
```
Install Dependencies: Ensure you have Python 3.8+ installed, then install the required libraries:

```bash
pip install -r requirements.txt
```
Download Pre-trained Model: Download the fine-tuned Marathi SBERT model from Hugging Face:

```python
from transformers import AutoModel, AutoTokenizer

model_name = "Sru15/fine-tuned-marathi-sbert-with-synonyms"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```
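Since the checkpoint was trained as an SBERT model, it can also be loaded with the `sentence-transformers` library to obtain sentence embeddings directly. The snippet below is a minimal sketch, assuming the checkpoint is sentence-transformers compatible; the Marathi sentence is only an illustrative query:

```python
from sentence_transformers import SentenceTransformer

# Load the fine-tuned checkpoint (assumed to be sentence-transformers compatible).
model = SentenceTransformer("Sru15/fine-tuned-marathi-sbert-with-synonyms")

# Encode an illustrative Marathi sentence into a dense embedding vector.
embedding = model.encode("मुंबईत आज मुसळधार पाऊस झाला")
print(embedding.shape)  # e.g. (768,)
```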
Download Precomputed Embeddings: If you have precomputed news embeddings saved as a `.npy` file, place them at the path the app expects: `/content/drive/My Drive/Marathi_SBERT_Model/news_embeddings.npy`
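If you do not have the embeddings file, it can be regenerated from the news corpus. The sketch below uses a placeholder `articles` list and output filename; the actual corpus is not part of this README:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Sru15/fine-tuned-marathi-sbert-with-synonyms")

# Placeholder corpus: replace with the actual list of Marathi news articles.
articles = ["पहिली बातमी ...", "दुसरी बातमी ..."]

# Encode the corpus and save the embeddings as a .npy file.
news_embeddings = model.encode(articles, show_progress_bar=True)
np.save("news_embeddings.npy", news_embeddings)
```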
## Usage
Run the Gradio Interface: Start the Gradio interface locally:

```bash
python app.py
```
This will launch a web interface where you can enter a query in Marathi and retrieve semantically similar news articles.
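Under the hood, the app encodes the query with the fine-tuned model and ranks articles by cosine similarity against the precomputed embeddings. The following is a minimal sketch of such an app rather than the exact contents of app.py; the corpus file, paths, and `top_k` value are illustrative assumptions:

```python
import gradio as gr
import numpy as np
from sentence_transformers import SentenceTransformer, util

# Assumed locations of the model, corpus, and precomputed embeddings.
model = SentenceTransformer("Sru15/fine-tuned-marathi-sbert-with-synonyms")
news_embeddings = np.load("news_embeddings.npy")
articles = open("articles.txt", encoding="utf-8").read().splitlines()  # hypothetical corpus file

def search(query, top_k=5):
    # Encode the query and rank articles by cosine similarity.
    query_embedding = model.encode(query)
    scores = util.cos_sim(query_embedding, news_embeddings)[0]
    top_indices = scores.argsort(descending=True)[:top_k].tolist()
    return "\n\n".join(articles[i] for i in top_indices)

demo = gr.Interface(fn=search, inputs="text", outputs="text",
                    title="Marathi Semantic Search")
demo.launch()
```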
## Model Training and Evaluation
- Preprocessing: Normalizes text, tokenizes it using the Marathi SBERT tokenizer, and generates embeddings.
- Training: Fine-tunes the pre-trained SBERT model on the generated positive pairs using the `MultipleNegativesRankingLoss` objective (see the training sketch after this list).
- Evaluation: Applies UMAP for dimensionality reduction followed by Agglomerative Clustering, and uses the Silhouette Score to determine the optimal number of clusters.
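For reference, fine-tuning with `MultipleNegativesRankingLoss` in sentence-transformers typically looks like the sketch below; the positive pairs and hyperparameters shown are placeholders, not the actual training data or settings used here:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Start from the pre-trained Marathi SBERT model.
model = SentenceTransformer("l3cube-pune/marathi-sentence-similarity-sbert")

# Placeholder positive pairs (anchor, positive); other pairs in the batch act as negatives.
train_examples = [
    InputExample(texts=["मुंबईत पाऊस", "मुंबईमध्ये पावसाची नोंद"]),
    InputExample(texts=["निवडणूक निकाल जाहीर", "मतमोजणीचे निकाल समोर"]),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

# Fine-tune and save the model.
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("fine-tuned-marathi-sbert")
```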
## Results
- Optimal Clusters: 7 clusters with a Silhouette Score of 0.6726.
- Performance with UMAP: The UMAP-based clustering outperformed clustering without dimensionality reduction.
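For reference, a silhouette-based search over candidate cluster counts (of the kind used to obtain the numbers above) can be sketched as follows; the UMAP parameters and the candidate range are illustrative choices, not values taken from this project:

```python
import numpy as np
import umap
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# Precomputed article embeddings from the fine-tuned model.
embeddings = np.load("news_embeddings.npy")

# Reduce dimensionality with UMAP before clustering.
reducer = umap.UMAP(n_components=5, n_neighbors=15, random_state=42)
reduced = reducer.fit_transform(embeddings)

# Try several cluster counts and keep the one with the best Silhouette Score.
best_k, best_score = None, -1.0
for k in range(2, 15):
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(reduced)
    score = silhouette_score(reduced, labels)
    if score > best_score:
        best_k, best_score = k, score

print(f"Optimal clusters: {best_k} (silhouette = {best_score:.4f})")
```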
## Acknowledgements

- Model: The pre-trained Marathi SBERT model was sourced from `l3cube-pune/marathi-sentence-similarity-sbert`.
- Frameworks: Built with Gradio for the user interface and UMAP for dimensionality reduction.