ESG-BERT-Chinese

An MLM model based on bert-base-chinese, adapted through Domain-Adaptive Pre-Training (DAPT) on the text of ESG reports published by Chinese A-share listed companies from 2006 to 2023.

Model Overview

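Attribute	Value
Base model	`bert-base-chinese` (110M parameters)
Training data	ESG reports of A-share listed companies, 2006–2023
Training method	MLM (masked language modeling), learning rate 5e-5, batch size 8
Training steps	65,500
Final loss	0.6932
Model size	409 MB (fp32)
Framework	PyTorch + Transformers 5.5.4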
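
To put the table's numbers in context, the following is a rough back-of-the-envelope sketch, not a report of the actual training run. It assumes a single device with no gradient accumulation (so each step consumes exactly one batch of 8 sequences), takes the "about 650,000 segments" figure from the Training Data section, and assumes the reported loss is the mean cross-entropy in nats over masked tokens. None of these assumptions is confirmed by the model card.

```python
import math

# Values taken from the model card
steps = 65_500
batch_size = 8
final_loss = 0.6932
approx_segments = 650_000  # "about 650,000 segments" (approximate)

# Assumption: one device, no gradient accumulation
sequences_seen = steps * batch_size
epochs = sequences_seen / approx_segments

# Assumption: loss is mean cross-entropy in nats, so perplexity = exp(loss)
perplexity = math.exp(final_loss)

print(f"sequences seen: {sequences_seen:,}")
print(f"approx. epochs: {epochs:.2f}")
print(f"masked-token perplexity: {perplexity:.2f}")
```

Under these assumptions the run covers a little under one pass over the corpus, and the final loss corresponds to a masked-token perplexity of roughly 2.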

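Intended Use

Intended as a backbone to fine-tune for downstream Chinese ESG NLP tasks:

✅ Sentiment/tone analysis of ESG reports
✅ ESG-related named entity recognition (NER)
✅ ESG text classification
✅ ESG semantic similarity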
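
As one illustration of the backbone use case, here is a minimal sketch of loading the checkpoint for sequence classification. The three-way tone label set and the helper name `load_for_classification` are hypothetical and not part of the model card. The classification head that `AutoModelForSequenceClassification` adds is randomly initialised, so the model still has to be fine-tuned on labelled data before its predictions mean anything.

```python
# Hypothetical three-way tone labels; the model card does not define a label set.
LABELS = ["negative", "neutral", "positive"]
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def load_for_classification(model_name="ZenZoran/esg-bert-chinese"):
    """Load the DAPT checkpoint with a new, untrained classification head."""
    # Imported lazily so the label mapping can be reused without transformers installed.
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model
```

Calling `tokenizer, model = load_for_classification()` downloads the weights from the Hub; the returned pair can then be passed to a standard fine-tuning loop or to `Trainer`.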

Quick Start

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ZenZoran/esg-bert-chinese")

result = fill_mask("本公司高度[MASK]环境治理，全面推进绿色低碳转型。")
for r in result[:5]:
    print(f'{r["token_str"]}: {r["score"]:.3f}')
# 重: 0.600
# 视: 0.296
# 的: 0.060

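Training Data

Source: annual ESG/social responsibility reports of A-share listed companies
Time span: 2006–2023
Cleaning: English text, whitespace, table-of-contents entries, and page numbers removed; Chinese ESG terminology and numeric context retained
Segmentation: split into 460-character segments, each fitting within the 512-token limit; about 650,000 segments in total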
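
The cleaning and segmentation steps above can be sketched roughly as follows. This is not the author's actual pipeline: the regular expressions for page-number and table-of-contents lines are assumed patterns, and the function names are hypothetical. Since bert-base-chinese generally maps each Chinese character to one token, a 460-character segment plus the `[CLS]` and `[SEP]` tokens should stay within the 512-token limit.

```python
import re

SEGMENT_CHARS = 460  # segment length stated in the model card


def clean_text(raw: str) -> str:
    """Drop English letters, whitespace, page-number lines and TOC-style lines."""
    kept = []
    for line in raw.splitlines():
        line = re.sub(r"[A-Za-z]+", "", line)  # remove English text
        line = re.sub(r"\s+", "", line)        # remove all whitespace
        if not line:
            continue
        # Assumed pattern: a line that is only a page number, e.g. "12" or "-12-"
        if re.fullmatch(r"-?\d{1,4}-?", line):
            continue
        # Assumed pattern: TOC entry ending in dot leaders plus a page number
        if re.search(r"[.·…]{3,}\d+$", line):
            continue
        kept.append(line)  # digits inside running text are preserved
    return "".join(kept)


def segment(text: str, size: int = SEGMENT_CHARS) -> list[str]:
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


sample = "环境治理......3\n12\n本公司 ESG 治理 2023 年减排 15%"
cleaned = clean_text(sample)
chunks = segment("一" * 1000)
```

A fixed-width character split like this can cut a sentence in half; splitting on sentence boundaries first would be a reasonable variation if that matters for the downstream task.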

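Limitations

Uses the bert-base-chinese vocabulary (21,128 tokens), which is built for Chinese and offers only limited English subword coverage
Not yet validated by fine-tuning on downstream tasks; evaluate it on your specific task before relying on it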

License

Apache 2.0

Citation

@misc{sun2026esgbertchinese,
    author       = { Jingzhou Sun },
    title        = { ESG-BERT-Chinese: Domain-Adaptive Pre-training on Chinese ESG Reports },
    year         = {2026},
    url          = { https://huggingface.co/ZenZoran/esg-bert-chinese },
    doi          = { 10.57967/hf/8849 },
    publisher    = { Hugging Face },
    organization = { International School of Business \& Finance, Sun Yat-sen University }
}
