Marsilia-Embeddings-EN-Base 🚀

Introduction 🌟

Marsilia-Embeddings-EN-Base is an English-language embedding model designed for financial-domain tasks. It serves as a proof of concept, demonstrating how much fine-tuning an embedding model for a specific task matters in Retrieval-Augmented Generation (RAG) applications.

By focusing on the financial domain, Marsilia-Embeddings-EN-Base outperforms closed-source models such as OpenAI's embeddings on financial-domain tasks, while offering a more cost-effective solution. This shows how targeted fine-tuning can make open-source models competitive with, or even superior to, proprietary alternatives in specialized domains.

Model Details 📊

Model Type: Sentence Transformer
Language: English 🇬🇧
Base Model: BAAI/bge-base-en-v1.5
Maximum Sequence Length: 512 tokens
Output Dimensionality: 768
Similarity Function: Cosine Similarity

Usage 💻

To use this model with the Sentence Transformers library:

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sujet-ai/Marsilia-Embeddings-EN-Base")

# Run inference
sentences = [
    'What are the key factors affecting the performance of corporate bonds in the current market?',
    'The corporate bond market has been influenced by several factors in recent months. Interest rates set by central banks have a significant impact, as rising rates tend to decrease bond prices and increase yields. Economic indicators such as GDP growth, inflation rates, and employment figures also play a role in shaping investor sentiment and corporate financial health. Industry-specific trends and individual company performance are crucial, with factors like earnings reports, credit ratings, and debt levels affecting bond valuations. Global events, including geopolitical tensions and trade policies, can create market volatility. Liquidity in the bond market and overall investor risk appetite are additional considerations. It's important for investors to monitor these various factors when assessing corporate bond performance.',
    'CORPORATE BOND HOLDINGS (Continued) Principal Amount (000) Coupon Rate Maturity Date Market Value ($000) Vanguard Short-Term Corporate Bond ETF Bank of America Corp. 2,285 5.015% 1/22/24 2,285 JPMorgan Chase & Co. 2,250 3.875% 2/1/24 2,249 Goldman Sachs Group Inc. 2,200 3.750% 2/25/24 2,197 Morgan Stanley 2,190 3.875% 1/27/24 2,189 Citigroup Inc. 2,145 3.875% 3/26/24 2,141 Wells Fargo & Co. 2,100 3.750% 1/24/24 2,099 Bank of America Corp. 2,050 4.000% 4/1/24 2,047 Truist Bank 2,000 3.800% 10/30/23 2,000 PNC Bank NA 1,950 3.800% 7/25/23 1,950 U.S. Bancorp 1,900 3.375% 2/5/24 1,896 Bank of America Corp. 1,850 4.125% 1/22/24 1,850 Morgan Stanley 1,800 3.737% 4/24/24 1,795 Citigroup Inc. 1,750 3.668% 7/24/24 1,740 Goldman Sachs Group Inc. 1,700 3.625% 1/22/23 1,700 Wells Fargo & Co. 1,650 3.550% 8/14/23 1,650 JPMorgan Chase & Co. 1,600 3.875% 9/10/24 1,593'
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])

Intended Use 🎯

This model is designed for generating sentence embeddings for English text, particularly in the financial domain. It can be used for various natural language processing tasks such as semantic search, clustering, and information retrieval.
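As a rough illustration of the semantic search use case, the sketch below ranks passages against a query by cosine similarity, the similarity function listed under Model Details. The helper names (`cosine_similarity`, `rank_passages`) and the 3-dimensional toy vectors are assumptions made for this example; real embeddings from this model have 768 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_passages(query_vec, passage_vecs, top_k=3):
    """Return (passage_index, score) pairs, best match first."""
    scored = [(i, cosine_similarity(query_vec, p)) for i, p in enumerate(passage_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional vectors standing in for the model's 768-dimensional output.
query = [1.0, 0.0, 0.0]
passages = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.9, 0.1, 0.0],  # close to the query
    [0.5, 0.5, 0.0],  # partially aligned with the query
]

ranking = rank_passages(query, passages, top_k=2)
print(ranking)  # passage 1 ranks first, then passage 2
```

In practice the vectors would come from `model.encode(...)`, and `model.similarity(...)` from the Usage section computes the same kind of score for whole batches at once.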

Training Data 📚

The model was fine-tuned on the sujet-ai/Sujet-Financial-RAG-EN-Dataset. This dataset consists of question-context pairs in English, focusing on financial topics.

Training Procedure 🛠️

Training Hyperparameters

Loss Function: MultipleNegativesRankingLoss
- Scale: 20.0
- Similarity Function: Cosine Similarity
Evaluation Strategy: Steps
Per Device Train Batch Size: 200
Per Device Eval Batch Size: 200
Number of Train Epochs: 10
Batch Sampler: no_duplicates
Multi Dataset Batch Sampler: round_robin
Scheduler: Warmup cosine
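
To make the loss settings above more concrete, here is a minimal pure-Python sketch of the idea behind MultipleNegativesRankingLoss with cosine similarity and a scale of 20.0. It is not the library's implementation: the function names and the 2-dimensional toy vectors are assumptions for illustration. For each question, the paired context is treated as the correct class and every other context in the batch as a negative, and the scaled similarities are fed into a cross-entropy loss.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]
    and every other positive in the batch acts as an in-batch negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine_similarity(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over this anchor's row of logits.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two question/context pairs (hypothetical 2-dimensional embeddings).
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned_contexts = [[1.0, 0.0], [0.0, 1.0]]  # each context matches its question
swapped_contexts = [[0.0, 1.0], [1.0, 0.0]]  # each context matches the other question

low = multiple_negatives_ranking_loss(questions, aligned_contexts)
high = multiple_negatives_ranking_loss(questions, swapped_contexts)
print(low, high)  # near zero when pairs are aligned, about 20 when they are swapped
```

With a per-device batch size of 200, each question is contrasted against 199 other contexts in its batch. The no_duplicates batch sampler keeps identical texts out of the same batch, so a true match is not accidentally counted as a negative.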

Framework Versions

Python: 3.10.13
Sentence Transformers: 3.0.1
Transformers: 4.42.3
PyTorch: 2.5.0.dev20240704+cu124
Accelerate: 0.32.1
Datasets: 2.20.0
Tokenizers: 0.19.1

Evaluation 📈

The model was evaluated using the InformationRetrievalEvaluator on the test split of the sujet-ai/Sujet-Financial-RAG-EN-Dataset.
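
The InformationRetrievalEvaluator works from a set of queries, a corpus, and a mapping from each query to its relevant documents, and reports ranking metrics such as accuracy@k and MRR@k. The sketch below shows how two of those metrics can be computed from ranked results. It is a simplified illustration, not the evaluator's code: the function names, document IDs, and rankings are all hypothetical, and no scores for this model are implied.

```python
def accuracy_at_k(ranked, relevant, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = sum(1 for qid, docs in ranked.items() if set(docs[:k]) & relevant[qid])
    return hits / len(ranked)

def mrr_at_k(ranked, relevant, k):
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for qid, docs in ranked.items():
        for rank, doc_id in enumerate(docs[:k], start=1):
            if doc_id in relevant[qid]:
                total += 1.0 / rank
                break
    return total / len(ranked)

# Hypothetical toy data: query ID -> set of relevant document IDs.
relevant_docs = {"q1": {"d1"}, "q2": {"d4"}, "q3": {"d7"}}

# Hypothetical retrieval output: query ID -> document IDs, best match first.
ranked_results = {
    "q1": ["d1", "d2", "d3"],  # relevant document at rank 1
    "q2": ["d5", "d4", "d6"],  # relevant document at rank 2
    "q3": ["d8", "d9", "d2"],  # relevant document not retrieved
}

acc1 = accuracy_at_k(ranked_results, relevant_docs, k=1)
acc3 = accuracy_at_k(ranked_results, relevant_docs, k=3)
mrr3 = mrr_at_k(ranked_results, relevant_docs, k=3)
print(acc1, acc3, mrr3)
```

Here one of three queries has its relevant document at rank 1 and two of three have it within the top 3, so accuracy@1 is 1/3, accuracy@3 is 2/3, and MRR@3 is (1 + 1/2 + 0) / 3 = 0.5.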

Limitations ⚠️

The model is specifically trained on English financial texts and may not perform optimally on other domains or languages. Users should be aware of potential biases present in the training data.

Citation 📄

If you use this model in your research or applications, please cite:

@software{Marsilia-Embeddings-EN-Base,
  author = {{Sujet AI} and Allaa Boutaleb and Hamed Rahimi},
  title = {Marsilia-Embeddings-EN-Base: A fine-tuned English embedding model for financial texts},
  year = {2024},
  url = {https://huggingface.co/sujet-ai/Marsilia-Embeddings-EN-Base}
}

Contact Information 📧

For questions, feedback, or collaborations, please reach out to us on LinkedIn or visit our website https://sujet.ai.
