---
title: Marathi Semantic Search
emoji: 📊
colorFrom: pink
colorTo: red
sdk: gradio
sdk_version: 5.1.0
app_file: app.py
pinned: false
license: apache-2.0
short_description: Semantic search for Marathi news using fine-tuned SBERT
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

Marathi Semantic Search

This project is a semantic search engine for Marathi news articles, leveraging a fine-tuned Marathi SBERT model to retrieve contextually relevant articles based on user queries. The model has been trained using contrastive learning with generated positive pairs to capture the nuances of the Marathi language.

Demo

A working demo of the project is available on Hugging Face Spaces: Sru15/Marathi-Semantic-Search (https://huggingface.co/spaces/Sru15/Marathi-Semantic-Search)

Features

  • Fine-tuned SBERT Model: Uses a fine-tuned version of the Marathi SBERT model (l3cube-pune/marathi-sentence-similarity-sbert) for semantic similarity.
  • Positive Pair Generation: Utilizes contrastive learning with positive pairs to fine-tune the model.
  • UMAP and Agglomerative Clustering: Applies UMAP for dimensionality reduction and Agglomerative Clustering for grouping similar news articles.
  • Gradio Interface: A simple and interactive Gradio-based UI for querying news articles and retrieving semantically similar results.
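
The contrastive objective used for the fine-tuning, MultipleNegativesRankingLoss (named in the training section below), scores each anchor sentence against every positive in the batch: its own paired positive is the target and the other positives act as in-batch negatives, giving a cross-entropy over the similarity matrix. A minimal NumPy sketch of that computation (the toy vectors, helper name, and scale factor of 20 are illustrative, not from the repository):

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch contrastive loss: for each anchor i, positive i is the target
    and the other positives in the batch act as negatives (the idea behind
    sentence-transformers' MultipleNegativesRankingLoss)."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (a @ p.T)                      # (batch, batch) cosine similarities
    sims -= sims.max(axis=1, keepdims=True)       # stabilize the softmax
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-log_probs.diagonal().mean())    # cross-entropy with targets i == i

# Toy batch: each anchor is closest to its own positive, so the loss is near zero
anchors = np.array([[1.0, 0.0], [0.0, 1.0]])
positives = np.array([[0.9, 0.1], [0.1, 0.9]])
print(mnr_loss(anchors, positives))
```

Swapping the positives so each anchor faces the wrong partner makes the loss large, which is exactly the gradient signal that pulls paired sentences together during fine-tuning.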

Installation

To run the project locally, follow the steps below:

  1. Clone the Repository:

    git clone https://huggingface.co/spaces/Sru15/Marathi-Semantic-Search
    cd Marathi-Semantic-Search
    
  2. Install Dependencies: Ensure you have Python 3.8+ installed. Install the required libraries using:

    pip install -r requirements.txt
    
  3. Download Pre-trained Model: Download the fine-tuned Marathi SBERT model from Hugging Face:

    from transformers import AutoModel, AutoTokenizer
    
    # Fine-tuned Marathi SBERT checkpoint hosted on the Hugging Face Hub
    model_name = "Sru15/fine-tuned-marathi-sbert-with-synonyms"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)  # returns token-level hidden states
    
  4. Download Precomputed Embeddings: If you have precomputed news embeddings saved as a .npy file, place the file in the project directory (or update app.py to point at its location). The path used in the original setup was:

    /content/drive/My Drive/Marathi_SBERT_Model/news_embeddings.npy
    

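Note that loading the checkpoint with `AutoModel` (step 3) yields token-level hidden states; SBERT-style sentence vectors are usually obtained by mean-pooling those tokens under the attention mask (the sentence-transformers library's `SentenceTransformer(model_name).encode(texts)` performs this pooling automatically). A minimal NumPy sketch of the pooling step, with toy shapes and an illustrative helper name:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings, ignoring padding positions.

    token_embeddings: (batch, seq_len, dim) hidden states from the model
    attention_mask:   (batch, seq_len) with 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                    # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                    # avoid divide-by-zero
    return summed / counts

# Toy example: batch of 1, two real tokens and one padding token
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
print(mean_pool(tokens, mask))  # [[2. 3.]] — the padding vector is ignored
```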
Usage

  • Run the Gradio Interface: Start the Gradio interface locally:

    python app.py
    

    This will launch a web interface where you can enter a query in Marathi and retrieve semantically similar news articles.
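
Under the hood, retrieval of this kind typically amounts to encoding the query with the fine-tuned model and ranking the precomputed article embeddings by cosine similarity. A sketch with toy vectors (the function name and array shapes are illustrative; the app's actual logic lives in app.py, with the real vectors coming from news_embeddings.npy):

```python
import numpy as np

def search(query_vec, corpus, top_k=3):
    """Return indices of the top_k corpus rows most cosine-similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                      # cosine similarity of the query to every article
    return np.argsort(-scores)[:top_k].tolist()

# Toy corpus of 4 "article" embeddings
corpus = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]])
print(search(np.array([1.0, 0.0]), corpus, top_k=2))  # [0, 2]
```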

Model Training and Evaluation

  • Preprocessing: Normalizes text, tokenizes it using the Marathi SBERT tokenizer, and generates embeddings.
  • Training: Fine-tunes the pre-trained SBERT model with generated positive pairs using the MultipleNegativesRankingLoss method.
  • Evaluation: Reduces the embeddings with UMAP, clusters them with Agglomerative Clustering, and uses the Silhouette Score to select the optimal number of clusters.
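
The evaluation step can be sketched with scikit-learn: cluster the (UMAP-reduced) embeddings with Agglomerative Clustering for several candidate cluster counts and keep the count with the best silhouette score. Toy blob data stands in for the real article embeddings here, and the UMAP reduction is noted only as a comment:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in for the UMAP-reduced news embeddings; in the project, something like
# umap.UMAP().fit_transform(embeddings) (from umap-learn) would produce this matrix.
X, _ = make_blobs(n_samples=120, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=42)

best_k, best_score = None, -1.0
for k in range(2, 9):  # candidate cluster counts
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

print(best_k, round(best_score, 3))
```

On the project's real embeddings this procedure selected 7 clusters, per the results below; on this cleanly separated toy data it recovers the 3 generated blobs.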

Results

  • Optimal Clusters: 7 clusters with a Silhouette Score of 0.6726.
  • Performance with UMAP: The UMAP-based clustering outperformed clustering without dimensionality reduction.

Acknowledgements