legal-multi-qa-mpnet-base-cos
legal-multi-qa-mpnet-base-cos is a domain-specific text embedding model tailored for legal applications. It converts legal documents, sentences, and queries into dense vector representations that capture nuanced semantic relationships in legal language.
Details
Training Approach: The model was fine-tuned with Multiple Negatives Ranking Loss, which teaches the model to rank a relevant (positive) passage above irrelevant (negative) ones. The dataset consists of roughly 400k rows of synthetically generated legal data derived from the LEGALBENCH-RAG dataset. The dataset, along with details about its construction, will be made available soon. Each document chunk was paired with one positive (golden) chunk and three generated positive and negative passages. The Llama 3.1 8B Instruct model was used to build the synthetic training data.
- Developed by: Yuriy Perezhohin
- Model type: sentence-transformers
- Language(s) (NLP): English
- Fine-tuned from model: sentence-transformers/multi-qa-mpnet-base-cos-v1
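The ranking objective named in the training approach above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the author's training code: it assumes cosine similarity with a scale factor of 20 (the default in sentence-transformers' `MultipleNegativesRankingLoss`), and the function names and toy 2-D vectors are made up for the example. For each anchor, the paired positive must outrank every other candidate in the batch, including any hard negatives.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def multiple_negatives_ranking_loss(anchors, positives, hard_negatives=(), scale=20.0):
    """Mean cross-entropy where anchors[i] must rank positives[i] first.

    All other positives in the batch, plus any hard negatives,
    act as negative candidates for anchors[i].
    """
    candidates = list(positives) + list(hard_negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of the correct (i-th) candidate
        total += log_z - logits[i]
    return total / len(anchors)


# Toy 2-D "embeddings" (hypothetical values, for illustration only)
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
hard_negatives = [[0.7, 0.7]]

aligned = multiple_negatives_ranking_loss(anchors, positives, hard_negatives)
swapped = multiple_negatives_ranking_loss(anchors, positives[::-1], hard_negatives)
print(f"aligned pairs: {aligned:.4f}  swapped pairs: {swapped:.4f}")
```

The loss is small when each anchor is closest to its own positive and grows when the pairs are mismatched, which is the signal that pushes relevant chunks above irrelevant ones during fine-tuning.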
Uses
The goal of this fine-tuning is to produce a model that retrieves the correct chunks for RAG applications. It can also be used for semantic search or sentence similarity.
Usage (Sentence-Transformers)
from sentence_transformers import SentenceTransformer, util
query = "How many people live in London?"
docs = ["What is the legal status of the parties to this Distributor Agreement, as stated in the introductory paragraph ?",
        "The legal status of parties do not have anything to do with this contract.",
        "EXHIBIT 10.6\n\n DISTRIBUTOR AGREEMENT\n\n THIS DISTRIBUTOR AGREEMENT (the \"Agreement\") is made by and between Electric City Corp., a Delaware corporation (\"Company\") and Electric City of Illinois LLC (\"Distributor\") this 7th day of September, 1999.\n\n RECITALS\n\n A. The Company's Business. The Company is presently engaged in the business of selling an energy efficiency device, which is referred to as an \"Energy Saver\" which may be improved or otherwise changed from its present composition (the \"Products\"). The Company may engage in the business of selling other products or other devices other than the Products, which will be considered Products if Distributor exercises its options pursuant to Section 7 hereof.\n\n B. Representations. As an inducement to the Company to enter into this Agreement, the Distributor has represented that it has or will have the facilities, personnel, and financial capability to promote the sale and use of Products. As an inducement to Distributor to enter into this Agreement the Company has represented that it has the facilities, personnel and financial capability to have the Products produced and supplied as needed pursuant to the terms hereof.\n\n C. The Distributor's Objectives. The Distributor desires to become a distributor for the Company and to develop demand for and sell and distribute Products solely for the use within the State of Illinois, including but not limited to public and private entities, institutions, corporations, public schools, park districts, corrections facilities, airports, government housing authorities and other government agencies and facilities (the \"Market\").\n\n D. The Company's Appointment. The Company appoints the Distributor as an exclusive distributor of Products in the Market, subject to the terms and conditions of this Agreement"]
# Load the model
model = SentenceTransformer('yuriyvnv/legal-multi-qa-mpnet-base-cos')
# Encode query and documents
query_emb = model.encode(query)
doc_emb = model.encode(docs)
# Compute cosine similarity between query and all document embeddings
scores = util.cos_sim(query_emb, doc_emb)[0].cpu().tolist()
# Combine docs & scores
doc_score_pairs = list(zip(docs, scores))
# Sort by decreasing score
doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
# Output passages & scores
for doc, score in doc_score_pairs:
    print(score, doc)
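For RAG applications, usually only the best few chunks are passed on to the generator rather than the full sorted list. The sketch below shows one way to select the top-k chunks from precomputed embeddings. It is an illustration under assumptions: `top_k_chunks` is a hypothetical helper, and the toy 2-D vectors stand in for what `model.encode(...)` would return in practice.

```python
import heapq
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_chunks(query_emb, chunk_embs, chunks, k=3):
    """Return the k (score, chunk) pairs with the highest cosine similarity."""
    scored = ((cosine(query_emb, emb), chunk) for emb, chunk in zip(chunk_embs, chunks))
    # nlargest returns the pairs sorted from best to worst score
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Toy vectors (hypothetical); real ones would come from model.encode(...)
query_emb = [1.0, 0.0]
chunks = ["chunk A", "chunk B", "chunk C"]
chunk_embs = [[0.2, 0.9], [0.9, 0.1], [0.6, 0.6]]

top = top_k_chunks(query_emb, chunk_embs, chunks, k=2)
for score, chunk in top:
    print(f"{score:.3f}  {chunk}")
```

For larger corpora, a vectorized or indexed search is likely a better fit than this linear scan, but the ranking logic is the same.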
Bias, Risks, and Limitations
The model was fine-tuned on synthetic data. Because LLM outputs are probabilistic, the model may have inherited biases from the generated data, although none are known to the author.