RiskEmbed

RiskEmbed is a finetuned Snowflake embedding model (arctic-embed-m) optimized for financial risk management retrieval tasks.

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space.
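
As a quick illustration, encoding a single sentence yields one 768-dimensional vector (a minimal sketch; the example sentence is illustrative, and the full usage pattern is shown in the Usage section below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("aminhaeri/RiskEmbed")
embedding = model.encode("Counterparty credit risk arises from derivative exposures.")
print(embedding.shape)  # (768,) — one dense vector per input text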

Model

Our finetuned embedding model achieves state-of-the-art retrieval performance, with an HR@5 of 88%, matching or exceeding leading closed-source models. In particular, it outperforms Google Text-Embedding-004 (84%), Cohere Embed-English-v3.0 (85%), OpenAI Text-Embedding-3-Large (86%), and MistralAI Mistral-Embed (87%), none of which were finetuned on domain-specific data. This result highlights the advantage of finetuning on risk management data: our model surpasses general-purpose embeddings in retrieval effectiveness. Furthermore, despite having the smallest embedding size (768 dimensions, equal to Google's model but far smaller than OpenAI's 3072), our model encodes domain-specific information efficiently without requiring a larger vector space. Compared to VoyageAI's Voyage-Finance-2, which is also finetuned, but on general financial data, our model achieves the same HR@5 (88%) with a more compact representation (768 vs. 1024 dimensions), suggesting that it captures risk-related semantics more effectively.

| Model | HR@5 [%] | Improvement by Ours [%] | Embedding Size |
| --- | --- | --- | --- |
| Google Text-Embedding-004 | 84 | 5 | 768 |
| Cohere Embed-English-v3.0 | 85 | 4 | 1024 |
| OpenAI Text-Embedding-3-Large | 86 | 2 | 3072 |
| MistralAI Mistral-Embed | 87 | 1 | 1024 |
| VoyageAI Voyage-Finance-2 | 88 | 0 | 1024 |
| Ours | 88 | - | 768 |
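
For reference, HR@5 (hit rate at 5) is the fraction of queries for which a relevant document appears among the top 5 retrieved results. A minimal sketch of the computation, assuming one relevant document per query and L2-normalized embeddings (function and variable names are illustrative):

import numpy as np

def hit_rate_at_k(query_embeddings, document_embeddings, relevant_ids, k=5):
    # Cosine similarity via dot product of L2-normalized embeddings
    scores = query_embeddings @ document_embeddings.T
    # Top-k document indices per query, highest score first
    top_k = np.argsort(-scores, axis=1)[:, :k]
    # A query counts as a hit if its relevant document is ranked in the top k
    hits = [relevant_ids[i] in top_k[i] for i in range(len(relevant_ids))]
    return float(np.mean(hits))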

Usage

Using Sentence Transformers

You can load and query the model with the sentence-transformers package, as shown below.

from sentence_transformers import SentenceTransformer

# Load the model from the Hugging Face Hub
model = SentenceTransformer("aminhaeri/RiskEmbed")

queries = ['what is snowflake?', 'Where can I get the best tacos?']
documents = ['The Data Cloud!', 'Mexico City of Course!']

# Encode queries with the "query" prompt; documents are encoded as-is
query_embeddings = model.encode(queries, prompt_name="query")
document_embeddings = model.encode(documents)

# Score every document against every query via dot product
scores = query_embeddings @ document_embeddings.T
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    # Sort documents by decreasing score
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
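
If you want cosine similarities rather than raw dot products, encode can L2-normalize the embeddings directly (a small variation on the snippet above):

query_embeddings = model.encode(queries, prompt_name="query", normalize_embeddings=True)
document_embeddings = model.encode(documents, normalize_embeddings=True)
scores = query_embeddings @ document_embeddings.T  # cosine similarities in [-1, 1]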

Using Hugging Face Transformers

You can also run the model with the transformers package, as shown below. For optimal retrieval quality, use the CLS token to embed each text portion, and apply the query prefix below to queries only.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('aminhaeri/RiskEmbed')
model = AutoModel.from_pretrained('aminhaeri/RiskEmbed', add_pooling_layer=False)
model.eval()

# Apply the query prefix to queries only (documents are embedded without it)
query_prefix = 'Represent this sentence for searching relevant passages: '
queries = ['what is snowflake?', 'Where can I get the best tacos?']
queries_with_prefix = ["{}{}".format(query_prefix, i) for i in queries]
query_tokens = tokenizer(queries_with_prefix, padding=True, truncation=True, return_tensors='pt', max_length=512)

documents = ['The Data Cloud!', 'Mexico City of Course!']
document_tokens = tokenizer(documents, padding=True, truncation=True, return_tensors='pt', max_length=512)

# Compute text embeddings using the CLS token (first token) of each sequence
with torch.no_grad():
    query_embeddings = model(**query_tokens)[0][:, 0]
    document_embeddings = model(**document_tokens)[0][:, 0]

# Normalize embeddings so dot products equal cosine similarities
query_embeddings = torch.nn.functional.normalize(query_embeddings, p=2, dim=1)
document_embeddings = torch.nn.functional.normalize(document_embeddings, p=2, dim=1)

scores = torch.mm(query_embeddings, document_embeddings.transpose(0, 1))
for query, query_scores in zip(queries, scores):
    doc_score_pairs = list(zip(documents, query_scores))
    doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
    # Output passages & scores
    print("Query:", query)
    for document, score in doc_score_pairs:
        print(score, document)
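
For larger document sets, the same scores can be ranked with torch.topk instead of sorting Python lists; a minimal sketch continuing from the snippet above (the value of k is illustrative):

# Keep only the k best-scoring documents per query
k = min(5, len(documents))
values, indices = torch.topk(scores, k=k, dim=1)
for query, vals, idxs in zip(queries, values, indices):
    print("Query:", query)
    for score, idx in zip(vals.tolist(), idxs.tolist()):
        print(round(score, 4), documents[idx])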

Contact

Feel free to open an issue or pull request if you have any questions or suggestions about this project. You can also email Amin Haeri (me@aminhaeri.com).

License

RiskEmbed, like the underlying Arctic model, is licensed under Apache-2.0. The released model can be used for commercial purposes free of charge.

Acknowledgement

The authors would like to acknowledge the valuable contributions of the Risk Management team at TD Bank for their expertise in regulatory frameworks, financial risk assessment, and compliance practices, which were instrumental in the finetuning of RiskEmbed.
