Semantic Search of Legal Data Using SBERT

This repository contains a proof-of-concept model for semantic search of legal data, based on Sentence-BERT (SBERT) and fine-tuned using triplets. The model is designed to provide efficient and accurate semantic search capabilities for legal documents.

Model Overview

  • Base Model: Jerteh-125
  • Fine-tuning Technique: Triplet loss
  • Purpose: To enable semantic search within legal data

Installation

To use the model, you need to have Python 3.6 or higher installed. Additionally, install the necessary dependencies (scikit-learn is only needed for the cosine-similarity example below):

pip install transformers sentence-transformers scikit-learn

Usage

Here's how you can use the model for semantic search:

  1. Load the Model

    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer('nemanjaPetrovic/legal-jerteh-125-sbert')

  2. Encode Sentences

sentences = [
    "Sankcije se propisuju u granicama zakonom utvrđenog minimuma i maksimuma.",
    "Vrste krivičnih sankcija određuju se samo krivičnim zakonom.",
]

sentence_embeddings = model.encode(sentences)
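
encode returns one embedding vector per input sentence as a NumPy array; a quick shape check confirms this:

print(sentence_embeddings.shape)  # (2, embedding_dim)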

  3. Perform Semantic Search

To perform a semantic search, encode both your query and the documents you want to search through, then use cosine similarity to find the most relevant documents. In practice you should use a vector database for this, but for a quick test you can try the code below.

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

query = "Objasni mi pojam sankcija."

query_embedding = model.encode([query])

cosine_similarities = cosine_similarity(query_embedding, sentence_embeddings)

most_similar_idx = np.argmax(cosine_similarities)
most_similar_document = sentences[most_similar_idx]

print(f"The most similar document to the query is: {most_similar_document}")

Fine-tuning Details

The model was fine-tuned using triplet loss, a common technique for training embedding models to understand semantic similarity. The fine-tuning dataset consisted of triplets (anchor, positive, negative) to teach the model to distinguish between similar and dissimilar legal documents.
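
The training script itself is not part of this repository, but a minimal sketch of triplet fine-tuning with sentence-transformers might look like the following. The base-model identifier and the toy triplet are placeholders for illustration, not the actual training setup.

from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, InputExample, losses

# Wrap the base encoder with mean pooling to get sentence embeddings.
# 'jerteh/Jerteh-125' is an assumed identifier for the base model.
word_embedding_model = models.Transformer('jerteh/Jerteh-125')
pooling = models.Pooling(word_embedding_model.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding_model, pooling])

# Each example is an (anchor, positive, negative) triplet; the loss
# pushes the anchor closer to the positive than to the negative.
train_examples = [
    InputExample(texts=[
        "anchor legal passage",
        "paraphrase of the anchor",
        "unrelated legal passage",
    ]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.TripletLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)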

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

I would like to thank Mihailo Skoric, the author of the Jerteh-125 model, and the creators of Sentence-BERT for their foundational work, which made this project possible.

Contact

For any questions or issues, please contact nemanja.nlp@gmail.com.
