# RiskEmbed

RiskEmbed is a finetuned Snowflake embedding model (arctic-embed-m) optimized for financial risk management retrieval tasks. It is a sentence-transformers model that maps sentences and paragraphs to a 768-dimensional dense vector space.
## Model

Our finetuned embedding model achieves state-of-the-art performance (88% HR@5, hit rate at 5; a sketch of the metric follows the table below) compared with leading closed-source models. In particular, our model outperforms Google Text-Embedding-004 (84%), Cohere Embed-English-v3.0 (85%), OpenAI Text-Embedding-3-Large (86%), and MistralAI Mistral-Embed (87%), none of which were finetuned on domain-specific data. This result highlights the advantage of finetuning on risk management data: our model surpasses general-purpose embeddings in retrieval effectiveness. Furthermore, despite having the smallest embedding size (768 dimensions, equal to Google's model but significantly smaller than OpenAI's 3072), our model efficiently encodes domain-specific information without requiring a larger vector space. Compared to VoyageAI's Voyage-Finance-2, which is also finetuned, but on general financial data, our model achieves the same HR@5 (88%). Reaching peak performance with a more compact representation (768 vs. 1024 dimensions for VoyageAI) suggests that our model captures risk-related semantics more effectively.
| Model | HR@5 [%] | Relative Improvement [%] | Embedding Size [dim] |
|---|---|---|---|
| Google Text-Embedding-004 | 84 | 5 | 768 |
| Cohere Embed-English-v3.0 | 85 | 4 | 1024 |
| OpenAI Text-Embedding-3-Large | 86 | 2 | 3072 |
| MistralAI Mistral-Embed | 87 | 1 | 1024 |
| VoyageAI Voyage-Finance-2 | 88 | 0 | 1024 |
| Ours | 88 | - | 768 |
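HR@5 is the fraction of queries for which a relevant document appears among the top five retrieved results. The following is a minimal sketch of how such a score can be computed over L2-normalized embeddings; the function and variable names are illustrative, not taken from our evaluation harness.

```python
import numpy as np

def hit_rate_at_k(query_embeddings, document_embeddings, relevant_ids, k=5):
    """Fraction of queries whose relevant document ranks in the top-k results.

    query_embeddings: (num_queries, dim) array, L2-normalized
    document_embeddings: (num_docs, dim) array, L2-normalized
    relevant_ids: for each query, the index of its relevant document
    """
    # With L2-normalized embeddings, the dot product is cosine similarity.
    scores = query_embeddings @ document_embeddings.T
    # Indices of the k highest-scoring documents for each query.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = [rel in row for rel, row in zip(relevant_ids, top_k)]
    return float(np.mean(hits))
```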
## Usage

### Using Sentence Transformers

You can use the sentence-transformers package to run the model, as shown below.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("aminhaeri/RiskEmbed")

queries = ['what is snowflake?', 'Where can I get the best tacos?']
documents = ['The Data Cloud!', 'Mexico City of Course!']

query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

scores = query_embeddings @ document_embeddings.T
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
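For reference, the `prompt_name="query"` argument applies a query prefix defined in the model configuration. A minimal sketch of how to inspect it, assuming a sentence-transformers version recent enough to support prompts (≥ 2.4):

```python
# Show the prompts bundled with the model; the "query" entry is expected to
# match the prefix used in the transformers example below.
print(model.prompts)
# e.g. {'query': 'Represent this sentence for searching relevant passages: '}
```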
### Using Hugging Face Transformers

You can also use the transformers package directly, as shown below. For optimal retrieval quality, use the CLS token as the embedding for each text, and prepend the query prefix below to queries only (not to documents).
```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('aminhaeri/RiskEmbed')
model = AutoModel.from_pretrained('aminhaeri/RiskEmbed', add_pooling_layer=False)
model.eval()

query_prefix = 'Represent this sentence for searching relevant passages: '
queries = ['what is snowflake?', 'Where can I get the best tacos?']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)

documents = ['The Data Cloud!', 'Mexico City of Course!']
document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Compute token embeddings; take the CLS token as the text embedding
with torch.no_grad():
    query_embeddings = model(**query_tokens)[0][:, 0]
    document_embeddings = model(**document_tokens)[0][:, 0]

# Normalize embeddings so the dot product below is cosine similarity
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)

scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
```
## Contact

Feel free to open an issue or pull request if you have any questions or suggestions about this project. You can also email Amin Haeri (me@aminhaeri.com).
## License

RiskEmbed, like the underlying Arctic model, is licensed under the Apache-2.0 license. The released models can be used for commercial purposes free of charge.
## Acknowledgement
The authors would like to acknowledge the valuable contributions of the Risk Management team at TD Bank for their expertise in regulatory frameworks, financial risk assessment, and compliance practices, which were instrumental in the finetuning of RiskEmbed.