language:
  - en
  - ko
license: mit
library_name: sentence-transformers
pipeline_tag: sentence-similarity
tags:
  - email-search
  - bge
  - embeddings
  - multilingual
  - email-retrieval
datasets:
  - doubleyyh/mixed-email-dataset
model-index:
  - name: email-tuned-bge-m3
    results:
      - task:
          type: Retrieval
          name: Email Content Retrieval
        metrics:
          - type: mrr
            value: 0.85
            name: MRR@10
          - type: ndcg
            value: 0.82
            name: NDCG@10
          - type: recall
            value: 0.88
            name: Recall@10

Email-tuned BGE-M3

This is a fine-tuned version of BAAI/bge-m3 optimized for email content retrieval. The model was trained on a mixed-language (English/Korean) email dataset to improve retrieval performance for various email-related queries.

Model Description

  • Model Type: Embedding model (encoder-only)
  • Base Model: BAAI/bge-m3
  • Languages: English, Korean
  • Domain: Email content, business communication
  • Training Data: Mixed-language email dataset with various types of queries (metadata, long-form, short-form, yes/no questions)

Quickstart

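# Imports for recent LangChain releases, where these classes live in
# langchain_community / langchain_core (older releases exposed them under langchain.*)
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

# Initialize the embedding model
embeddings = HuggingFaceEmbeddings(
    model_name="doubleyyh/email-tuned-bge-m3",
    model_kwargs={"device": "cuda"},  # use "cpu" if no GPU is available
    encode_kwargs={"normalize_embeddings": True},  # return unit-length vectors
)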

# Example emails
emails = [
    {
        "subject": "회의 일정 변경 안내",  # "Meeting schedule change notice"
        "from": [["김철수", "kim@company.com"]],
        "to": [["이영희", "lee@company.com"]],
        "cc": [["박지원", "park@company.com"]],
        "date": "2024-03-26T10:00:00",
        # "Hello, I would like to move tomorrow's project meeting to 2 PM."
        "text_body": "안녕하세요, 내일 예정된 프로젝트 미팅을 오후 2시로 변경하고자 합니다."
    },
    {
        "subject": "Project Timeline Update",
        "from": [["John Smith", "john@company.com"]],
        "to": [["Team", "team@company.com"]],
        "cc": [],
        "date": "2024-03-26T11:30:00",
        "text_body": "Hi team, I'm writing to update you on the Q2 project milestones."
    }
]

# Format emails into documents
docs = []
for email in emails:
    # Format email content
    content = "\n".join([f"{k}: {v}" for k, v in email.items()])
    docs.append(Document(page_content=content))

# Create FAISS index
db = FAISS.from_documents(docs, embeddings)

# Query examples (supports both Korean and English)
queries = [
    "회의 시간이 언제로 변경되었나요?",  # "What time was the meeting moved to?"
    "When is the meeting rescheduled?",
    "프로젝트 일정",  # "project schedule"
    "Q2 milestones"
]

# Perform similarity search
for query in queries:
    print(f"\nQuery: {query}")
    results = db.similarity_search(query, k=1)
    print(f"Most relevant email:\n{results[0].page_content[:200]}...")
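
The quickstart sets `normalize_embeddings` to `True`, so each embedding has unit length and a plain dot product equals cosine similarity. The sketch below illustrates that property with hand-made toy vectors rather than real model output; `l2_normalize`, `dot`, and `cosine` are hypothetical helper names, not part of LangChain or sentence-transformers.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (L2 normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed from the raw (unnormalized) vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy 3-dimensional stand-ins for a query and a document embedding.
query = [3.0, 4.0, 0.0]
doc = [4.0, 3.0, 0.0]

q_unit, d_unit = l2_normalize(query), l2_normalize(doc)
score_dot = dot(q_unit, d_unit)   # dot product of the unit vectors
score_cos = cosine(query, doc)    # cosine similarity of the raw vectors

# Squared Euclidean distance between unit vectors equals 2 - 2 * cosine.
sq_dist = sum((x - y) ** 2 for x, y in zip(q_unit, d_unit))

print(round(score_dot, 4), round(score_cos, 4), round(sq_dist, 4))
```

Because squared Euclidean distance between unit vectors is `2 - 2 * cosine`, an index that ranks by inner product and one that ranks by Euclidean distance return the same ordering once embeddings are normalized.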

Intended Use & Limitations

Intended Use

  • Email content retrieval
  • Similar document search in email corpora
  • Question answering over email content
  • Multi-language email search systems

Limitations

  • Performance may vary for domains outside of email content
  • Best suited for business communication context
  • The model supports both English and Korean, but retrieval quality may differ between the two languages
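
One way to check how much these limitations matter for a particular mailbox is to compute the metric types reported in the model card metadata (MRR@10, NDCG@10, Recall@10) separately for each language. The following is a minimal sketch using binary relevance and a made-up evaluation set; it is not the author's evaluation script, and the document ids, the `runs` list, and the function names are assumptions for illustration.

```python
import math


def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant result within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the relevant documents found in the top k (ids assumed unique)."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical evaluation set: (language, ranked result ids, relevant ids).
runs = [
    ("en", ["e2", "e1", "e5"], {"e1"}),
    ("en", ["e3", "e4", "e9"], {"e3"}),
    ("ko", ["k7", "k8", "k1"], {"k1"}),
]

# Group per-query MRR@10 by language, then average within each language.
by_lang = {}
for lang, ranked, relevant in runs:
    by_lang.setdefault(lang, []).append(mrr_at_k(ranked, relevant))

mean_mrr = {lang: sum(scores) / len(scores) for lang, scores in by_lang.items()}
print(mean_mrr)
```

In practice the ranked ids would come from a similarity search over your own index, and the relevant ids from labeled query-email pairs; a large gap between the per-language averages suggests the model fits one language better on that corpus.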

Citation

@misc{email-tuned-bge-m3,
  author = {doubleyyh},
  title = {Email-tuned BGE-M3: Fine-tuned Embedding Model for Email Content},
  year = {2024},
  publisher = {HuggingFace}
}

License

This model is released under the MIT license, the same license as the base model (BAAI/bge-m3).

Contact

For questions or feedback, please open an issue in the GitHub repository or get in touch through Hugging Face.