BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
Paper • 2402.03216 • Published • 10
How to use YuxuanGong/lean-RAG with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("YuxuanGong/lean-RAG")
sentences = [
"theorem length_take_of_le_length | (s.take n).length = n",
"Keywords: less than or equal find if and only natural number lemma proposition | Name: Nat le_find_iff | Kind: lemma | Namespace: Nat",
"Keywords: length take le' list theorem proposition | Name: List length_take_le' | Kind: theorem | Namespace: List",
"Keywords: take length list theorem proposition | Name: List take_length | Kind: theorem | Namespace: List"
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]README_en:https://huggingface.co/YuxuanGong/lean-RAG/blob/main/README_en.md
基于 BAAI/bge-m3 微调的 Mathlib4 API 检索模型,专为 Lean 4 形式化证明场景设计。该模型能够根据当前证明目标(goal type)和数学关键词,从 Mathlib4 的 32.6 万条声明中检索最相关的引理/定理/定义。
在我的github主页可访问完整skill:https://github.com/Tangtaizong-BUAA/Lean-formulizer
该模型是 proof-formalizer 技能链路的检索核心。在形式化一个数学证明时:
⊢ 目标类型(goal conclusion)支持中英文混合查询,训练数据包含中文数学描述和 Lean 代码。
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("YuxuanGong/lean-RAG")
# 查询:Lean goal + 关键词
query = "x % y < y nat mod lt"
# 候选:Mathlib4 声明(从 declarations.jsonl 构建的 FAISS 索引)
passages = [
"Keywords: nat mod lt theorem | Name: Nat.mod_lt | Kind: theorem | ...",
"Keywords: nat mod add theorem | Name: Nat.ModEq.add | Kind: theorem | ...",
# ... 32.6 万条
]
query_emb = model.encode(query)
passage_embs = model.encode(passages)
similarities = model.similarity(query_emb, passage_embs)
# Skill 中的 mathlib_lookup.sh (Route 4) 使用本模型:
DENSE_QUERY="${GOAL_TYPE} ${KEYWORDS_TEXT}"
# 然后调用 search.py 用本模型编码查询 -> FAISS 搜索
基于 BGE-M3 (XLM-RoBERTa) 架构:
| 参数 | 值 |
|---|---|
| 基座模型 | BAAI/bge-m3 |
| 隐藏层维度 | 1024 |
| Transformer 层数 | 24 |
| 注意力头数 | 16 |
| 中间层维度 | 4096 |
| 词表大小 | 250,002 |
| 最大位置编码 | 8,194 |
| 输出维度 | 1024 |
| 池化方式 | CLS token |
| 相似度函数 | Cosine Similarity |
SentenceTransformer(
(0): Transformer(XLMRobertaModel, task='feature-extraction')
(1): Pooling(cls pooling, 1024d)
(2): Normalize()
)
从 Mathlib4 的 325,959 条声明中构造三元组训练数据:
| 角色 | 内容 |
|---|---|
| Query | theorem length_take_of_le_length | (s.take n).length = n |
| Positive | `Keywords: le find iff |
| Negative | `Keywords: length take le' list |
| 超参数 | 值 |
|---|---|
| 损失函数 | MultipleNegativesRankingLoss (cosine, scale=20.0) |
| 训练轮数 | 2 epochs |
| 训练步数 | 6,750 |
| Per-device Batch Size | 24 |
| 梯度累积 | 4 (有效 batch size = 96) |
| 学习率 | 1e-5 |
| LR 调度 | cosine |
| Warmup | 5% |
| 优化器 | AdamW (fused) |
| 精度 | FP16 |
| 随机种子 | 42 |
| GPU | RTX PRO 6000 Blackwell 98GB |
| 训练时间 | ~3.7 小时 |
| Epoch | Step | Loss |
|---|---|---|
| 0.61 | 2,050 | 0.146 |
| 0.89 | 3,000 | 0.167 |
| 1.20 | 4,050 | 0.145 |
| 1.48 | 5,000 | 0.131 |
| 1.78 | 6,000 | 0.136 |
| 2.00 | 6,750 | 0.140 |
给定声明名 + docstring,检索对应的 Mathlib 声明:
| 指标 | Baseline (BGE-M3) | Fine-tuned (本模型) |
|---|---|---|
| Recall@1 | - | 29.4% |
| Recall@5 | - | 58.6% |
| Recall@10 | - | 66.2% |
| MRR | - | 41.2% |
模拟真实使用场景 — "我有这个 goal,找到能证它的引理":
| 指标 | 值 |
|---|---|
| 查询数 | 626 |
| Recall@1 | 26.7% |
| Recall@5 | 61.5% |
| Recall@10 | 71.9% |
| Recall@20 | 80.5% |
| MRR | 42.1% |
模拟 "开始证这个定理,需要哪些引理"(更难任务):
| 指标 | 值 |
|---|---|
| 查询数 | 493 |
| Recall@10 | 13.8% |
| Recall@20 | 18.3% |
| MRR | 4.7% |
pip install sentence-transformers>=5.4.1
如果你使用了这个模型,请引用:
@misc{gong2025mathlib-rag,
title={Mathlib-RAG: Fine-tuned BGE-M3 for Lean 4 Mathlib API Retrieval},
author={Gong, Yuxuan},
year={2025},
publisher={Hugging Face},
howpublished={\url{https://huggingface.co/YuxuanGong/lean-RAG}}
}
基座模型引用:
@article{chen2024bge,
title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
author={Chen, Jianlv and Xiao, Shitao and Zhang, Peitian and others},
journal={arXiv preprint arXiv:2402.03216},
year={2024}
}
Apache 2.0
Base model
BAAI/bge-m3