|
--- |
|
license: apache-2.0 |
|
language: |
|
- zh |
|
pipeline_tag: sentence-similarity |
|
tags: |
|
- embedding |
|
- text-embedding |
|
--- |
|
|
|
<h1 align="center"> |
|
Long Bert Chinese |
|
<br> |
|
</h1> |
|
|
|
<h4 align="center"> |
|
<p> |
|
<b>简体中文</b> | |
|
<a href="https://github.com/OctopusMind/long-bert-chinese/blob/main/README_EN.md">English</a> |
|
</p> |
|
</h4> |
|
|
|
<p > |
|
<br> |
|
</p> |
|
|
|
**Long Bert**: 长文本相似度模型,支持8192token长度。 |
|
基于bert-base-chinese,将原始BERT位置编码更改成ALiBi位置编码,使BERT可以支持8192的序列长度。 |
|
|
|
### News |
|
* 支持`CoSENT`微调 |
|
* github仓库 [github](https://github.com/OctopusMind/longBert) |
|
|
|
|
|
### 使用 |
|
```python |
|
from numpy.linalg import norm |
|
from transformers import AutoModel |
|
|
|
model_path = "OctopusMind/longbert-embedding-8k-zh" |
|
model = AutoModel.from_pretrained(model_path, trust_remote_code=True) |
|
|
|
sentences = ['我是问蚂蚁借呗为什么不能提前结清欠款', "为什么借呗不能选择提前还款"] |
|
embeddings = model.encode(sentences) |
|
cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b)) |
|
print(cos_sim(embeddings[0], embeddings[1])) |
|
``` |
|
|
|
## 微调 |
|
### 数据格式 |
|
|
|
```json |
|
[ |
|
{ |
|
"sentence1": "一个男人在吹一支大笛子。", |
|
"sentence2": "一个人在吹长笛。", |
|
"label": 3 |
|
}, |
|
{ |
|
"sentence1": "三个人在下棋。", |
|
"sentence2": "两个人在下棋。", |
|
"label": 2 |
|
}, |
|
{ |
|
"sentence1": "一个女人在写作。", |
|
"sentence2": "一个女人在游泳。", |
|
"label": 0 |
|
} |
|
] |
|
``` |
|
|
|
### CoSENT 微调 |
|
|
|
至`train/`路径下 |
|
```bash |
|
cd train/ |
|
``` |
|
进行 CoSENT 微调 |
|
```bash |
|
python cosent_finetune.py \ |
|
--data_dir ../data/train_data.json \ |
|
--output_dir ./outputs/my-model \ |
|
--max_seq_length 1024 \ |
|
--num_epochs 10 \ |
|
--batch_size 64 \ |
|
--learning_rate 2e-5 |
|
``` |
|
|
|
|
|
|
|
## 贡献 |
|
欢迎通过提交拉取请求或在仓库中提出问题来为此模块做出贡献。 |
|
|
|
## License |
|
本项目遵循[Apache-2.0开源协议](./LICENSE) |