---
title: Marathi Semantic Search
emoji: π
colorFrom: pink
colorTo: red
sdk: gradio
sdk_version: 5.1.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Semantic search for Marathi news using fine-tuned SBERT
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
# Marathi Semantic Search
This project is a semantic search engine for Marathi news articles. It uses a fine-tuned Marathi SBERT model to retrieve contextually relevant articles for a user query. The model was fine-tuned with contrastive learning on generated positive pairs to better capture the nuances of the Marathi language.
## Demo
A working demo of the project is available on Hugging Face Spaces: [Sru15/Marathi-Semantic-Search](https://huggingface.co/spaces/Sru15/Marathi-Semantic-Search)
## Features
- **Fine-tuned SBERT Model**: Uses a fine-tuned version of the Marathi SBERT model (`l3cube-pune/marathi-sentence-similarity-sbert`) for semantic similarity.
- **Positive Pair Generation**: Generates positive pairs and uses them to fine-tune the model with contrastive learning.
- **UMAP and Agglomerative Clustering**: Applies UMAP for dimensionality reduction and Agglomerative Clustering for grouping similar news articles.
- **Gradio Interface**: A simple and interactive Gradio-based UI for querying news articles and retrieving semantically similar results.
## Installation
To run the project locally, follow the steps below:
1. **Clone the Repository**:
```bash
git clone https://huggingface.co/spaces/Sru15/Marathi-Semantic-Search
cd Marathi-Semantic-Search
```
2. **Install Dependencies**:
Ensure you have Python 3.8+ installed. Install the required libraries using:
```bash
pip install -r requirements.txt
```
3. **Download the Fine-tuned Model**:
Load the fine-tuned Marathi SBERT model from Hugging Face. The weights and tokenizer are downloaded and cached the first time this runs:
```python
from transformers import AutoModel, AutoTokenizer
model_name = "Sru15/fine-tuned-marathi-sbert-with-synonyms"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
```
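4. **Provide Precomputed Embeddings**:
If you have precomputed news embeddings saved as a `.npy` file, place them where the app can read them. The path below is a Google Drive mount path from Google Colab; when running locally, adjust it to match your setup: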
```
/content/drive/My Drive/Marathi_SBERT_Model/news_embeddings.npy
```
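Once a query embedding and the article embeddings are available, retrieval can be treated as a nearest-neighbour lookup. The following is a minimal sketch, assuming cosine similarity is the ranking measure and that the `.npy` file holds one embedding per row. The function name `top_k_similar`, the 768-dimensional size, and the random stand-in data are hypothetical and are not taken from `app.py`.

```python
import numpy as np

def top_k_similar(query_vec, doc_embeddings, k=5):
    """Return (indices, scores) of the k rows most cosine-similar to query_vec."""
    # Normalize both sides so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    docs = doc_embeddings / np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
    scores = docs @ q
    k = min(k, len(scores))
    top = np.argsort(-scores)[:k]  # indices sorted by descending similarity
    return top, scores[top]

# Stand-in data (assumption). In the real project the matrix would come from
# np.load("news_embeddings.npy") and the query vector from the fine-tuned model.
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(100, 768))
query_vec = doc_embeddings[42] + 0.01 * rng.normal(size=768)  # near article 42

indices, scores = top_k_similar(query_vec, doc_embeddings, k=3)
```

Normalizing once and using a matrix-vector product keeps the lookup simple for a corpus that fits in memory; a larger corpus would likely call for an approximate nearest-neighbour index instead.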
## Usage
- **Run the Gradio Interface**:
Start the Gradio interface locally:
```bash
python app.py
```
This will launch a web interface where you can enter a query in Marathi and retrieve semantically similar news articles.
## Model Training and Evaluation
- **Preprocessing**: Normalizes text, tokenizes it using the Marathi SBERT tokenizer, and generates embeddings.
- **Training**: Fine-tunes the pre-trained SBERT model on the generated positive pairs using `MultipleNegativesRankingLoss`.
- **Evaluation**: Reduces the embedding dimensionality with UMAP, groups articles with Agglomerative Clustering, and uses the Silhouette Score to determine the optimal number of clusters.
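
To illustrate the idea behind the training objective, here is a rough NumPy sketch of a multiple-negatives ranking loss: each anchor should score highest against its own positive, while the other positives in the batch serve as negatives. It assumes cosine similarity and a scale factor of 20; the function name `mnr_loss` and the toy data are hypothetical, and this is not the project's training code, which relies on the library implementation.

```python
import numpy as np

def mnr_loss(anchor_emb, positive_emb, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives."""
    a = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    p = positive_emb / np.linalg.norm(positive_emb, axis=1, keepdims=True)
    logits = scale * (a @ p.T)  # shape (batch, batch)
    # Subtract the row maximum for numerical stability before the softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Row i's correct "class" is column i; every other column is a negative.
    return float(-np.mean(np.diag(log_probs)))

# Toy batch (assumption): positives are slightly perturbed copies of anchors.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 16))
aligned = anchors + 0.05 * rng.normal(size=(8, 16))
shuffled = aligned[::-1]  # every anchor is now paired with a wrong positive

loss_aligned = mnr_loss(anchors, aligned)
loss_shuffled = mnr_loss(anchors, shuffled)
```

Correctly paired batches give a much lower loss than mismatched ones, which is the signal that pulls positive pairs together during fine-tuning.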
## Results
- **Optimal Clusters**: 7 clusters with a Silhouette Score of 0.6726.
- **Performance with UMAP**: The UMAP-based clustering outperformed clustering without dimensionality reduction.
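
For context on how to read the score above, the Silhouette Score compares each point's mean distance to its own cluster (`a`) with its mean distance to the nearest other cluster (`b`), giving `(b - a) / max(a, b)` per point. The sketch below assumes Euclidean distance and uses made-up 2-D points; the function name `silhouette` is hypothetical, and in practice a library routine such as `sklearn.metrics.silhouette_score` would normally be used instead.

```python
import numpy as np

def silhouette(X, labels):
    """Mean silhouette coefficient with Euclidean distance (needs >= 2 clusters)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances, shape (n, n).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of the same cluster
        a = dists[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

# Made-up data (assumption): two tight, well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(20, 2)),
               rng.normal(5.0, 0.1, size=(20, 2))])
good_labels = np.array([0] * 20 + [1] * 20)  # matches the blobs
bad_labels = np.array([0, 1] * 20)           # ignores the blobs

good_score = silhouette(X, good_labels)
bad_score = silhouette(X, bad_labels)
```

Scores near 1 indicate compact, well-separated clusters, while scores near 0 or below indicate overlapping or misassigned ones. Selecting the number of clusters then amounts to computing this score for each candidate count and keeping the highest.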
## Acknowledgements
- **Model**: The pre-trained Marathi SBERT model was sourced from [l3cube-pune/marathi-sentence-similarity-sbert](https://huggingface.co/l3cube-pune/marathi-sentence-similarity-sbert).
- **Frameworks**: Built with [Gradio](https://gradio.app/) for the user interface and [UMAP](https://umap-learn.readthedocs.io/en/latest/) for dimensionality reduction.