BGE-M3 Mathlib-RAG

README_en:https://huggingface.co/YuxuanGong/lean-RAG/blob/main/README_en.md

基于 BAAI/bge-m3 微调的 Mathlib4 API 检索模型，专为 Lean 4 形式化证明场景设计。该模型能够根据当前证明目标（goal type）和数学关键词，从 Mathlib4 的 32.6 万条声明中检索最相关的引理/定理/定义。

在我的github主页可访问完整skill：https://github.com/Tangtaizong-BUAA/Lean-formulizer

模型用途

该模型是 proof-formalizer 技能链路的检索核心。在形式化一个数学证明时：

从 Lean 环境中提取当前 ⊢ 目标类型（goal conclusion）
结合人类提供的数学关键词
用本模型编码为 1024 维向量
在预构建的 FAISS 索引中检索 top-k 最相关的 Mathlib4 声明

支持中英文混合查询，训练数据包含中文数学描述和 Lean 代码。

使用方式

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("YuxuanGong/lean-RAG")

# 查询：Lean goal + 关键词
query = "x % y < y  nat mod lt"
# 候选：Mathlib4 声明（从 declarations.jsonl 构建的 FAISS 索引）
passages = [
    "Keywords: nat mod lt theorem | Name: Nat.mod_lt | Kind: theorem | ...",
    "Keywords: nat mod add theorem | Name: Nat.ModEq.add | Kind: theorem | ...",
    # ... 32.6 万条
]

query_emb = model.encode(query)
passage_embs = model.encode(passages)
similarities = model.similarity(query_emb, passage_embs)

与 proof-formalizer Skill 集成

# Skill 中的 mathlib_lookup.sh (Route 4) 使用本模型:
DENSE_QUERY="${GOAL_TYPE} ${KEYWORDS_TEXT}"
# 然后调用 search.py 用本模型编码查询 -> FAISS 搜索

模型架构

基于 BGE-M3 (XLM-RoBERTa) 架构：

参数	值
基座模型	BAAI/bge-m3
隐藏层维度	1024
Transformer 层数	24
注意力头数	16
中间层维度	4096
词表大小	250,002
最大位置编码	8,194
输出维度	1024
池化方式	CLS token
相似度函数	Cosine Similarity

SentenceTransformer(
  (0): Transformer(XLMRobertaModel, task='feature-extraction')
  (1): Pooling(cls pooling, 1024d)
  (2): Normalize()
)

训练数据

从 Mathlib4 的 325,959 条声明中构造三元组训练数据：

训练集: 323,977 条 (query, positive, negatives)
验证集: 500 条
格式: BGE 三元组格式 — 每条样本包含 1 个 query、1 个 positive passage、7 个 negative passages
Query: 定理名 + 类型签名（中英混合）
Positive: 对应引理的完整结构化描述（Keywords | Name | Kind | Namespace | Doc | Type）
Negatives: 随机采样 + 困难负样本

数据示例

角色	内容
Query	`theorem length_take_of_le_length \| (s.take n).length = n`
Positive	`Keywords: le find iff
Negative	`Keywords: length take le' list

训练细节

超参数	值
损失函数	MultipleNegativesRankingLoss (cosine, scale=20.0)
训练轮数	2 epochs
训练步数	6,750
Per-device Batch Size	24
梯度累积	4 (有效 batch size = 96)
学习率	1e-5
LR 调度	cosine
Warmup	5%
优化器	AdamW (fused)
精度	FP16
随机种子	42
GPU	RTX PRO 6000 Blackwell 98GB
训练时间	~3.7 小时

训练 Loss 曲线

Epoch	Step	Loss
0.61	2,050	0.146
0.89	3,000	0.167
1.20	4,050	0.145
1.48	5,000	0.131
1.78	6,000	0.136
2.00	6,750	0.140

评测结果

直接检索（Direct Retrieval）

给定声明名 + docstring，检索对应的 Mathlib 声明：

指标	Baseline (BGE-M3)	Fine-tuned (本模型)
Recall@1	-	29.4%
Recall@5	-	58.6%
Recall@10	-	66.2%
MRR	-	41.2%

Skill Mode F：引理检索（Lemma Conclusion → Find Lemma）

模拟真实使用场景 — "我有这个 goal，找到能证它的引理"：

指标	值
查询数	626
Recall@1	26.7%
Recall@5	61.5%
Recall@10	71.9%
Recall@20	80.5%
MRR	42.1%

Skill Mode E：证明初始检索（Initial Goal → Proof Lemmas）

模拟 "开始证这个定理，需要哪些引理"（更难任务）：

指标	值
查询数	493
Recall@10	13.8%
Recall@20	18.3%
MRR	4.7%

已知限制

Mode E 召回率较低：从零开始的证明检索仍具有挑战性，MRR 仅 4.7%
跨版本 Mathlib 兼容性：模型训练于特定 Mathlib4 版本，API 可能随版本变化
最大序列长度：虽然 BGE-M3 支持 8192 tokens，训练时截断至 400 tokens（适配 declaration 长度）
中英文偏好：英文查询效果通常优于中文

环境依赖

pip install sentence-transformers>=5.4.1

框架版本（训练时）

Python: 3.12.3
Sentence Transformers: 5.4.1
Transformers: 5.5.4
PyTorch: 2.8.0+cu128
Accelerate: 1.13.0

引用

如果你使用了这个模型，请引用：

@misc{gong2025mathlib-rag,
  title={Mathlib-RAG: Fine-tuned BGE-M3 for Lean 4 Mathlib API Retrieval},
  author={Gong, Yuxuan},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/YuxuanGong/lean-RAG}}
}

基座模型引用：

@article{chen2024bge,
  title={BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation},
  author={Chen, Jianlv and Xiao, Shitao and Zhang, Peitian and others},
  journal={arXiv preprint arXiv:2402.03216},
  year={2024}
}

许可

Apache 2.0

Downloads last month: 69

Safetensors

Model size

0.6B params

Tensor type

F32

Model tree for YuxuanGong/lean-RAG

Base model

BAAI/bge-m3

Finetuned

(502)

this model

Paper for YuxuanGong/lean-RAG

BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

Paper • 2402.03216 • Published Feb 5, 2024 • 10