metadata
language:
- en
- ko
license: mit
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
- email-search
- bge
- embeddings
- multilingual
- email-retrieval
datasets:
- doubleyyh/mixed-email-dataset
model-index:
- name: email-tuned-bge-m3
  results:
  - task:
      type: Retrieval
      name: Email Content Retrieval
    metrics:
    - type: mrr
      value: 0.85
      name: MRR@10
    - type: ndcg
      value: 0.82
      name: NDCG@10
    - type: recall
      value: 0.88
      name: Recall@10
Email-tuned BGE-M3
This is a fine-tuned version of BAAI/bge-m3 optimized for email content retrieval. The model was trained on a mixed-language (English/Korean) email dataset to improve retrieval performance for various email-related queries.
Model Description
- Model Type: Embedding model (encoder-only)
- Base Model: BAAI/bge-m3
- Languages: English, Korean
- Domain: Email content, business communication
- Training Data: Mixed-language email dataset with various types of queries (metadata, long-form, short-form, yes/no questions)
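The card's metadata reports MRR@10, NDCG@10, and Recall@10 but does not describe the evaluation protocol. As a rough sketch of what these metrics measure, assuming binary relevance and one ranked list of document IDs per query (the function names and toy IDs below are illustrative, not the author's evaluation code):

```python
import math

def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant hit within the top k (0 if none)."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def recall_at_k(ranked, relevant, k=10):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in ranked[:k] if doc_id in relevant)
    return hits / len(relevant)

def ndcg_at_k(ranked, relevant, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example (made-up IDs): the only relevant email is ranked second.
ranked = ["email_7", "email_3", "email_9"]
relevant = {"email_3"}
print(mrr_at_k(ranked, relevant))     # 0.5
print(recall_at_k(ranked, relevant))  # 1.0
print(ndcg_at_k(ranked, relevant))    # 1 / log2(3), about 0.63
```

Averaging each function over all evaluation queries gives the corpus-level score.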
Quickstart
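# Recent LangChain releases moved these classes out of the top-level `langchain` package.
# Requires: pip install langchain-huggingface langchain-community faiss-cpu sentence-transformers
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document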
# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="doubleyyh/email-tuned-bge-m3",
    model_kwargs={'device': 'cuda'},  # use 'cpu' if no GPU is available
    encode_kwargs={'normalize_embeddings': True}  # return unit-length vectors
)
# Example emails
emails = [
    {
        "subject": "회의 일정 변경 안내",  # "Notice of meeting schedule change"
        "from": [["김철수", "kim@company.com"]],
        "to": [["이영희", "lee@company.com"]],
        "cc": [["박지영", "park@company.com"]],
        "date": "2024-03-26T10:00:00",
        # "Hello, I would like to move tomorrow's project meeting to 2 PM."
        "text_body": "안녕하세요, 내일 예정된 프로젝트 미팅을 오후 2시로 변경하고자 합니다."
    },
    {
        "subject": "Project Timeline Update",
        "from": [["John Smith", "john@company.com"]],
        "to": [["Team", "team@company.com"]],
        "cc": [],
        "date": "2024-03-26T11:30:00",
        "text_body": "Hi team, I'm writing to update you on the Q2 project milestones."
    }
]
# Format emails into documents
docs = []
for email in emails:
    # Render each field as a "key: value" line
    content = "\n".join([f"{k}: {v}" for k, v in email.items()])
    docs.append(Document(page_content=content))
# Create FAISS index
db = FAISS.from_documents(docs, embeddings)
# Query examples (supports both Korean and English)
queries = [
    "회의 시간이 언제로 변경되었나요?",  # "When was the meeting rescheduled to?"
    "When is the meeting rescheduled?",
    "프로젝트 일정",  # "project schedule"
    "Q2 milestones"
]
# Perform similarity search
for query in queries:
    print(f"\nQuery: {query}")
    results = db.similarity_search(query, k=1)
    print(f"Most relevant email:\n{results[0].page_content[:200]}...")
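The quickstart renders each field with `f"{k}: {v}"`, so address fields are embedded as Python list literals such as `[['John Smith', 'john@company.com']]`. If a cleaner text form is preferred, one possible formatter is sketched below. `format_email` and `format_addresses` are hypothetical helpers, not part of this model or LangChain, and the card does not say which layout was used during fine-tuning, so it is worth comparing retrieval quality before switching formats.

```python
def format_addresses(pairs):
    """Render [[name, address], ...] as 'Name <address>, ...'."""
    return ", ".join(f"{name} <{address}>" for name, address in pairs)

def format_email(email):
    """Flatten an email dict (same keys as the quickstart) into plain text."""
    lines = [f"subject: {email['subject']}"]
    for field in ("from", "to", "cc"):
        pairs = email.get(field) or []
        if pairs:  # skip empty fields such as cc=[]
            lines.append(f"{field}: {format_addresses(pairs)}")
    lines.append(f"date: {email['date']}")
    lines.append(f"text_body: {email['text_body']}")
    return "\n".join(lines)

example = {
    "subject": "Project Timeline Update",
    "from": [["John Smith", "john@company.com"]],
    "to": [["Team", "team@company.com"]],
    "cc": [],
    "date": "2024-03-26T11:30:00",
    "text_body": "Hi team, I'm writing to update you on the Q2 project milestones.",
}
print(format_email(example))
```

The returned string can be passed as `page_content` in place of the joined string used in the quickstart.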
Intended Use & Limitations
Intended Use
- Email content retrieval
- Similar document search in email corpora
- Question answering over email content
- Multi-language email search systems
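For the similar-document use case, note that the quickstart sets `normalize_embeddings` to `True`, so every vector has unit length and cosine similarity reduces to a dot product. A minimal pure-Python sketch of that ranking step follows. The three-dimensional vectors are made-up stand-ins for real embeddings, and `top_k` is a hypothetical helper; a vector store such as FAISS performs this search for you at scale.

```python
import math

def normalize(vec):
    """Scale a vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))

def top_k(query_vec, doc_vecs, k=1):
    """Return (index, score) pairs for the k most similar document vectors."""
    scores = [(i, dot(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Toy stand-in vectors, NOT real model output.
docs = [normalize(v) for v in ([1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0])]
query = normalize([0.5, 0.9, 0.0])

best_index, best_score = top_k(query, docs, k=1)[0]
print(best_index, round(best_score, 3))
```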
Limitations
- Performance may vary for domains outside of email content
- Best suited for business communication context
- Both English and Korean are supported, but retrieval quality may differ between the two languages
Citation
@misc{email-tuned-bge-m3,
  author = {doubleyyh},
  title = {Email-tuned BGE-M3: Fine-tuned Embedding Model for Email Content},
  year = {2024},
  publisher = {HuggingFace}
}
License
This model follows the same license as the base model, BAAI/bge-m3 (MIT).
Contact
For questions or feedback, please open an issue in the GitHub repository or contact the author through Hugging Face.