---
title: Marathi Semantic Search
emoji: 📊
colorFrom: pink
colorTo: red
sdk: gradio
sdk_version: 5.1.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Semantic search for Marathi news using fine-tuned SBERT
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Marathi Semantic Search

This project is a semantic search engine for Marathi news articles, leveraging a fine-tuned Marathi SBERT model to retrieve contextually relevant articles based on user queries. The model has been trained using contrastive learning with generated positive pairs to capture the nuances of the Marathi language.

## Demo

A working demo of the project is available on Hugging Face Spaces: [Sru15/Marathi-Semantic-Search](https://huggingface.co/spaces/Sru15/Marathi-Semantic-Search)

## Features

- **Fine-tuned SBERT Model**: Uses a fine-tuned version of the Marathi SBERT model (`l3cube-pune/marathi-sentence-similarity-sbert`) for semantic similarity.
- **Positive Pair Generation**: Utilizes contrastive learning with positive pairs to fine-tune the model.
- **UMAP and Agglomerative Clustering**: Applies UMAP for dimensionality reduction and Agglomerative Clustering for grouping similar news articles.
- **Gradio Interface**: A simple and interactive Gradio-based UI for querying news articles and retrieving semantically similar results.

## Installation

To run the project locally, follow the steps below:
   
1. **Clone the Repository**:
   ```bash
   git clone https://huggingface.co/spaces/Sru15/Marathi-Semantic-Search
   cd Marathi-Semantic-Search
   ```

2. **Install Dependencies**:
   Ensure you have Python 3.8+ installed. Install the required libraries using:
   ```bash
   pip install -r requirements.txt
   ```

3. **Load the Fine-tuned Model**:
   Load the fine-tuned Marathi SBERT model from Hugging Face (the weights are downloaded and cached on first use):
   ```python
   from transformers import AutoModel, AutoTokenizer

   model_name = "Sru15/fine-tuned-marathi-sbert-with-synonyms"
   tokenizer = AutoTokenizer.from_pretrained(model_name)
   model = AutoModel.from_pretrained(model_name)
   ```

4. **Provide Precomputed Embeddings**:
   If you have precomputed news embeddings saved as a `.npy` file, place them in the project directory. The path below is a Google Colab Drive mount path, so adjust it when running outside Colab:
   ```
   /content/drive/My Drive/Marathi_SBERT_Model/news_embeddings.npy
   ```
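Loading the model with `AutoModel` as in step 3 yields one vector per token, so a pooling step is needed to obtain a single sentence embedding. The sketch below shows mean pooling over non-padding tokens, a common choice for SBERT-style models. Whether this project uses mean pooling is an assumption, and `mean_pool` is a hypothetical helper written in plain Python on toy values rather than on real model output.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                summed[d] += vector[d]
    if count == 0:
        return summed  # no real tokens: return the zero vector
    return [value / count for value in summed]


# Toy example: two real tokens and one padding token (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_embedding = mean_pool(tokens, mask)
print(sentence_embedding)  # [2.0, 3.0]
```

With real model output, the same averaging would be applied to the last hidden state using the tokenizer's attention mask.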

## Usage

- **Run the Gradio Interface**:
   Start the Gradio interface locally:
   ```bash
   python app.py
   ```

   This will launch a web interface where you can enter a query in Marathi and retrieve semantically similar news articles.

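Behind the interface, retrieval of this kind usually comes down to ranking article embeddings by their similarity to the query embedding. The following is a minimal, dependency-free sketch of that ranking step using cosine similarity. The function names `cosine_similarity` and `top_k` and the toy 3-dimensional vectors are illustrative assumptions, not code taken from `app.py`.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_embedding, article_embeddings, k=3):
    """Return (index, score) pairs for the k most similar articles, best first."""
    scored = [
        (i, cosine_similarity(query_embedding, emb))
        for i, emb in enumerate(article_embeddings)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real SBERT embeddings.
articles = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.1, 0.0]
results = top_k(query, articles, k=2)
print(results)
```

In the real application, the query vector would come from the fine-tuned model and the article vectors from the precomputed `.npy` file.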


## Model Training and Evaluation

- **Preprocessing**: Normalizes text, tokenizes it using the Marathi SBERT tokenizer, and generates embeddings.
- **Training**: Fine-tunes the pre-trained SBERT model on generated positive pairs using `MultipleNegativesRankingLoss`, which treats the other examples in each batch as negatives.
- **Evaluation**: Applies UMAP for dimensionality reduction, groups the reduced embeddings with Agglomerative Clustering, and uses the Silhouette Score to determine the optimal number of clusters.

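To make the training objective concrete, here is a small plain-Python sketch of the idea behind `MultipleNegativesRankingLoss`: each anchor should score highest against its own positive, while the other positives in the batch serve as negatives. This is an illustration of the loss on toy vectors, not the project's training code. The scale of 20 and the use of cosine similarity are assumed to match the library defaults, and the vectors are assumed to be non-zero.

```python
import math


def cosine(a, b):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for
    anchors[i] and every other positives[j] acts as an in-batch negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(value - max_logit) for value in logits)
        )
        total += log_sum_exp - logits[i]
    return total / len(anchors)


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive matches its anchor
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives paired with the wrong anchor
loss_aligned = multiple_negatives_ranking_loss(anchors, aligned)
loss_swapped = multiple_negatives_ranking_loss(anchors, swapped)
print(loss_aligned < loss_swapped)  # True
```

The loss is close to zero when pairs are well aligned and grows when a positive is more similar to another anchor than to its own.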
## Results

- **Optimal Clusters**: 7 clusters with a Silhouette Score of 0.6726.
- **Performance with UMAP**: The UMAP-based clustering outperformed clustering without dimensionality reduction.

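The cluster count above was chosen by comparing Silhouette Scores across candidate values. The sketch below illustrates that selection loop in plain Python on toy 2-D points. It is not the project's pipeline: there is no UMAP step, the naive average-linkage clustering is an assumed stand-in (the linkage actually used is not stated), and the helper names are hypothetical.

```python
import math


def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative_labels(points, n_clusters):
    """Naive average-linkage agglomerative clustering; one label per point."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = sum(dist(points[p], points[q])
                        for p in clusters[i] for q in clusters[j])
                d /= len(clusters[i]) * len(clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]  # merge the closest pair
        del clusters[j]
    labels = [0] * len(points)
    for label, members in enumerate(clusters):
        for p in members:
            labels[p] = label
    return labels


def silhouette_score(points, labels):
    """Mean silhouette coefficient; needs at least two clusters.
    Points in singleton clusters contribute 0."""
    groups = {}
    for idx, label in enumerate(labels):
        groups.setdefault(label, []).append(idx)
    total = 0.0
    for idx, label in enumerate(labels):
        own = [j for j in groups[label] if j != idx]
        if not own:
            continue
        a = sum(dist(points[idx], points[j]) for j in own) / len(own)
        b = min(
            sum(dist(points[idx], points[j]) for j in members) / len(members)
            for other, members in groups.items() if other != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Three well-separated toy groups standing in for reduced embeddings.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
scores = {k: silhouette_score(points, agglomerative_labels(points, k))
          for k in range(2, 5)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 4))
```

On this toy data the highest score is reached at three clusters, which mirrors how a value such as 7 would be picked on the real embeddings.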

## Acknowledgements

- **Model**: The pre-trained Marathi SBERT model was sourced from [l3cube-pune/marathi-sentence-similarity-sbert](https://huggingface.co/l3cube-pune/marathi-sentence-similarity-sbert).
- **Frameworks**: Built with [Gradio](https://gradio.app/) for the user interface and [UMAP](https://umap-learn.readthedocs.io/en/latest/) for dimensionality reduction.