legal-multi-qa-mpnet-base-cos

legal-multi-qa-mpnet-base-cos is a domain-specific text embedding model tailored for legal applications. It converts legal documents, sentences, and queries into dense vector representations that capture nuanced semantic relationships in legal language.

Details

Training Approach: The model was fine-tuned with a multiple negatives ranking loss, which trains it to score relevant (positive) passages above irrelevant (negative) ones. The training set consists of roughly 400k rows of synthetically generated legal data derived from the LEGALBENCH-RAG dataset. The dataset, along with details about its construction, will be available soon. Each document chunk was paired with one positive (golden) chunk and three generated positive and negative passages. The Llama 3.1 8B Instruct model was used to build the synthetic training data.
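To make the ranking loss named above concrete, here is a minimal pure-Python sketch of how a multiple negatives ranking loss is commonly computed: each anchor is scored against every candidate with scaled cosine similarity, and cross-entropy pushes its own positive to the top. The function names, the toy vectors, and the scale of 20 are assumptions for illustration; the actual training configuration is not stated here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(anchors, candidates, scale=20.0):
    """Mean cross-entropy over scaled cosine similarities.

    candidates[i] is the positive for anchors[i]. Every other candidate
    (the other rows' positives plus any hard negatives appended at the
    end of the list) acts as a negative for that anchor.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Log-sum-exp with the max subtracted for numerical stability.
        top = max(logits)
        log_norm = top + math.log(sum(math.exp(x - top) for x in logits))
        total += log_norm - logits[i]
    return total / len(anchors)


# Toy 2-D "embeddings" (hypothetical values, not model output).
queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[-1.0, 0.0]]

aligned = multiple_negatives_ranking_loss(queries, positives + hard_negatives)
swapped = multiple_negatives_ranking_loss(queries, positives[::-1] + hard_negatives)
print(f"aligned: {aligned:.4f}, swapped: {swapped:.4f}")
```

The loss is near zero when each query sits closest to its own positive and grows when the pairing is wrong, which is the signal that teaches the encoder to separate relevant from irrelevant passages.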

  • Developed by: Yuriy Perezhohin
  • Model type: sentence-transformers
  • Language(s) (NLP): English
  • Finetuned from model: sentence-transformers/multi-qa-mpnet-base-cos-v1

Uses

This fine-tuning is intended to produce a model that retrieves the correct chunks in RAG applications. It can also be used for semantic search or sentence similarity.
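Retrieval for RAG works on chunks rather than whole documents, so long contracts are usually split before encoding. The sketch below shows one simple way to do that with overlapping word windows. The helper name `chunk_words`, the window size, and the overlap are hypothetical choices for illustration; the chunking used to build the training data is not described here.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into overlapping windows of whitespace-separated words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


# Tiny demonstration with a hypothetical 10-word "document".
demo_chunks = chunk_words("a b c d e f g h i j", chunk_size=4, overlap=1)
for chunk in demo_chunks:
    print(chunk)
```

Each resulting chunk can then be encoded and ranked against a query in the same way as the documents in the usage example. In practice a token-based splitter matched to the encoder's maximum sequence length may be a better fit than counting words.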

Usage (Sentence-Transformers)

from sentence_transformers import SentenceTransformer, util

query = "How many people live in London?"
docs = [
    "What is the legal status of the parties to this Distributor Agreement, as stated in the introductory paragraph ?",
    "The legal status of parties do not have anything to do with this contract.",
    # The long passage is split across two adjacent string literals, which Python joins into one string.
    "EXHIBIT 10.6\n\n                              DISTRIBUTOR AGREEMENT\n\n         THIS  DISTRIBUTOR  AGREEMENT (the  \"Agreement\")  is made by and between Electric City Corp.,  a Delaware  corporation  (\"Company\")  and Electric City of Illinois LLC (\"Distributor\") this 7th day of September, 1999.\n\n                                    RECITALS\n\n         A. The  Company's  Business.  The Company is  presently  engaged in the business  of selling an energy  efficiency  device,  which is  referred to as an \"Energy  Saver\"  which may be improved  or  otherwise  changed  from its present composition (the \"Products\").  The Company may engage in the business of selling other  products  or  other  devices  other  than  the  Products,  which  will be considered  Products if Distributor  exercises its options pursuant to Section 7 hereof.\n\n         B. Representations.  As an inducement to the Company to enter into this Agreement,  the  Distributor  has  represented  that  it has or  will  have  the facilities,  personnel,  and financial capability to promote the sale and use of Products.  As an  inducement  to  Distributor  to enter into this  Agreement the Company has  represented  that it has the  facilities,  personnel  and financial capability to have the Products  produced and supplied as needed pursuant to the terms hereof.\n\n         C. The Distributor's  Objectives.  The Distributor  desires to become a distributor  for the Company and to develop  demand for and sell and  distribute Products  solely  for the use within the State of  Illinois,  including  but not limited to public  and  private  entities,  institutions,  corporations,  public schools, park districts,  corrections facilities,  airports,  government housing authorities and other government agencies and facilities (the \"Market\").\n\n         D. The Company's  Appointment.  "
    "The Company appoints the Distributor as an  exclusive  distributor  of Products in the Market,  subject to the terms and conditions of this Agreement",
]

# Load the model
model = SentenceTransformer('yuriyvnv/legal-multi-qa-mpnet-base-cos')

# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)

# Compute cosine similarity between query and all document embeddings
scores = util.cos_sim(query_emb, doc_emb)[0].cpu().tolist()

# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))

# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)

Bias, Risks, and Limitations

The model was fine-tuned on synthetic data. Because LLM output is probabilistic, the model may have inherited biases from the generated data, although none are known to the author.

Model size

109M params, stored as F32 tensors in Safetensors format.