|
--- |
|
license: agpl-3.0 |
|
language: |
|
- zh |
|
base_model: |
|
- hfl/chinese-bert-wwm-ext |
|
pipeline_tag: fill-mask |
|
tags: |
|
- bert |
|
- Chinese |
|
library_name: transformers |
|
--- |
|
|
|
CNMBert |
|
[Github](https://github.com/IgarashiAkatuki/zh-CN-Multi-Mask-Bert) |
|
|
|
# zh-CN-Multi-Mask-Bert (CNMBert) |
|
![image](https://github.com/user-attachments/assets/a888fde7-6766-43f1-a753-810399418bda) |
|
|
|
--- |
|
|
|
A model for translating pinyin abbreviations back into Chinese characters.
|
|
|
This model is built on [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm): its pretraining task was modified to fit the pinyin-abbreviation translation task, and it achieves state-of-the-art results compared with fine-tuned GPT models and GPT-4o.
|
|
|
--- |
|
|
|
## What is a pinyin abbreviation?
|
|
|
Expressions such as:
|
|
|
> "bhys" -> "不好意思" |
|
> |
|
> "ys" -> "原神" |
|
|
|
Abbreviations like these, which replace Chinese characters with the first letters of their pinyin, are what we will loosely call pinyin abbreviations.
|
|
|
If you are interested in pinyin abbreviations, take a look at this ↓
|
|
|
[Why do people dislike abbreviations? - 远方青木's answer - Zhihu](https://www.zhihu.com/question/269016377/answer/2654824753)
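
For illustration, such an abbreviation can be generated mechanically by taking the first letter of each character's pinyin; a minimal sketch, assuming the third-party `pypinyin` package (not part of CNMBert):

```python
from pypinyin import lazy_pinyin, Style

def to_abbreviation(phrase: str) -> str:
    # first letter of each character's pinyin, e.g. 不好意思 -> bhys
    return "".join(lazy_pinyin(phrase, style=Style.FIRST_LETTER))

print(to_abbreviation("不好意思"))  # bhys
print(to_abbreviation("原神"))      # ys
```

CNMBert tackles the inverse, much harder direction: recovering the original characters from the abbreviation.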
|
|
|
### CNMBert |
|
|
|
| Model | Weights | Memory Usage (FP16) | Model Size | QPS | MRR | Acc |
|
| --------------- | ----------------------------------------------------------- | ------------------- | ---------- | ----- | ----- | ----- | |
|
| CNMBert-Default | [Huggingface](https://huggingface.co/Midsummra/CNMBert) | 0.4GB | 131M | 12.56 | 59.70 | 49.74 | |
|
| CNMBert-MoE | [Huggingface](https://huggingface.co/Midsummra/CNMBert-MoE) | 0.8GB | 329M | 3.20 | 61.53 | 51.86 | |
|
|
|
* All models were trained on the same corpus of 2 million Wikipedia and Zhihu sentences.
* QPS is queries per second (performance is currently poor because `predict` has not been rewritten in C...).
* MRR is the mean reciprocal rank (see the sketch after this list).
* Acc is accuracy.
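
A minimal sketch of how MRR is typically computed; the candidate lists in the example call are illustrative, not actual model outputs:

```python
def mean_reciprocal_rank(predictions, answers):
    # predictions: for each query, a ranked list of candidate words
    # answers: the gold expansion for each query
    total = 0.0
    for candidates, gold in zip(predictions, answers):
        rank = next((i + 1 for i, word in enumerate(candidates) if word == gold), None)
        total += 1.0 / rank if rank else 0.0
    return total / len(answers)

# gold answers rank 1st and 2nd, so MRR = (1 + 1/2) / 2 = 0.75
print(mean_reciprocal_rank([["块钱", "块前"], ["病", "吧"]], ["块钱", "吧"]))
```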
|
|
|
### Usage |
|
|
|
```python |
|
from transformers import AutoTokenizer, BertConfig, BertForMaskedLM

# predict and BertWwmMoE come from the CNMBert repository
from CustomBertModel import predict
from MoELayer import BertWwmMoE
|
``` |
|
|
|
Load the model
|
|
|
```python |
|
# use CNMBert with MoE |
|
# To use CNMBert without MoE, replace all "Midsummra/CNMBert-MoE" with "Midsummra/CNMBert" and use BertForMaskedLM instead of using BertWwmMoE |
|
tokenizer = AutoTokenizer.from_pretrained("Midsummra/CNMBert-MoE") |
|
config = BertConfig.from_pretrained('Midsummra/CNMBert-MoE') |
|
model = BertWwmMoE.from_pretrained('Midsummra/CNMBert-MoE', config=config).to('cuda') |
|
|
|
# model = BertForMaskedLM.from_pretrained('Midsummra/CNMBert').to('cuda') |
|
``` |
|
|
|
Predict words
|
|
|
```python |
|
print(predict("我有两千kq", "kq", model, tokenizer)[:5]) |
|
print(predict("快去给魔理沙看b吧", "b", model, tokenizer[:5])) |
|
``` |
|
|
|
> ['块钱', 1.2056937473156175], ['块前', 0.05837443749364857], ['开千', 0.0483869208528063], ['可千', 0.03996622172280445], ['口气', 0.037183335575008414] |
|
|
|
> ['病', 1.6893256306648254], ['吧', 0.1642467901110649], ['呗', 0.026976384222507477], ['包', 0.021441461518406868], ['报', 0.01396679226309061] |
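
Each call returns a list of `[word, score]` candidates ranked by score, so selecting or filtering expansions is straightforward; a short sketch (the 0.05 cutoff is an arbitrary value for illustration):

```python
candidates = predict("我有两千kq", "kq", model, tokenizer)

# candidates are ranked by score, so the first entry is the most likely expansion
best_word, best_score = candidates[0]
print(f"most likely expansion: {best_word} (score {best_score:.3f})")

# keep only reasonably confident candidates (0.05 is an arbitrary illustrative cutoff)
confident = [(word, score) for word, score in candidates if score > 0.05]
```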
|
|
|
--- |
|
|
|
```python |
|
# The default predict function uses beam search
def predict(sentence: str,
            predict_word: str,
            model,
            tokenizer,
            top_k=8,
            beam_size=16,       # beam width
            threshold=0.005,    # probability threshold
            fast_mode=True,     # whether to use fast mode
            strict_mode=True):  # whether to validate the output

# Brute-force search with backtracking and no pruning
def backtrack_predict(sentence: str,
                      predict_word: str,
                      model,
                      tokenizer,
                      top_k=10,
                      fast_mode=True,
                      strict_mode=True):
|
``` |
|
|
|
> Because of BERT's autoencoding nature, the order in which the MASK tokens are predicted changes the result. With `fast_mode` enabled, the input is predicted in both the forward and the reverse direction, which improves accuracy slightly (about 2%) but comes with a larger performance cost.
|
|
|
> `strict_mode` checks the predicted candidates to determine whether they are real Chinese words.
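
As the Q&A below also suggests, the search can be loosened when recall matters more than strict validity; a sketch with illustrative (not recommended) parameter values:

```python
# wider beam, lower threshold, single-direction prediction, no dictionary check
candidates = predict("我有两千kq", "kq", model, tokenizer,
                     top_k=16,
                     beam_size=32,
                     threshold=0.001,
                     fast_mode=False,
                     strict_mode=False)
print(candidates[:5])
```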
|
|
|
### How to fine-tune the model
|
|
|
See [TrainExample.ipynb](https://github.com/IgarashiAkatuki/CNMBert/blob/main/TrainExample.ipynb). As for the dataset format, just make sure the first column of the csv contains the training corpus.
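
For example, a minimal corpus file could be prepared like this (the use of pandas and the sample sentences are assumptions for illustration; only the first CSV column matters):

```python
import pandas as pd

# the first column holds the raw training sentences; additional columns are ignored
corpus = pd.DataFrame({"text": ["不好意思,我来晚了", "快去给魔理沙看病吧"]})
corpus.to_csv("train.csv", index=False)
```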
|
|
|
### Q&A |
|
|
|
Q: The accuracy of this thing feels a bit low.
|
|
|
A: Try setting `fast_mode` and `strict_mode` to `False`. The model was pretrained on a very small dataset (2M sentences), so weak generalization is to be expected. You can fine-tune it on a larger dataset or on a more specific domain; the procedure is largely the same as for [Chinese-BERT-wwm](https://github.com/ymcui/Chinese-BERT-wwm), except that the `DataCollator` is replaced with `DataCollatorForMultiMask` from `CustomBertModel.py`.
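
A hedged sketch of what that swap might look like in a standard `Trainer` setup; the constructor arguments of `DataCollatorForMultiMask` and the training arguments are assumptions, so check `CustomBertModel.py` and TrainExample.ipynb for the actual usage:

```python
from transformers import Trainer, TrainingArguments
from CustomBertModel import DataCollatorForMultiMask

# assumed to wrap the tokenizer like the standard MLM collator; verify against CustomBertModel.py
data_collator = DataCollatorForMultiMask(tokenizer=tokenizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./cnmbert-finetuned", num_train_epochs=1),
    train_dataset=train_dataset,  # your tokenized corpus from the csv above
    data_collator=data_collator,
)
trainer.train()
```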
|
|
|
### Citation
|
If you are interested in the implementation details of CNMBert, please refer to:
|
``` |
|
@misc{feng2024cnmbertmodelhanyupinyin, |
|
title={CNMBert: A Model For Hanyu Pinyin Abbreviation to Character Conversion Task}, |
|
author={Zishuo Feng and Feng Cao}, |
|
year={2024}, |
|
eprint={2411.11770}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2411.11770}, |
|
} |
|
``` |
|
|