Initial commit

- .gitattributes +1 -34
- README.md +147 -1
- README_zh.md +135 -0
- config.json +42 -0
- model.safetensors +3 -0
- special_tokens_map.json +37 -0
- tokenizer.json +0 -0
- tokenizer_config.json +56 -0
- training_args.json +32 -0
- vocab.txt +0 -0
.gitattributes CHANGED
@@ -1,35 +1,2 @@
-*.7z filter=lfs diff=lfs merge=lfs -text
-*.arrow filter=lfs diff=lfs merge=lfs -text
-*.bin filter=lfs diff=lfs merge=lfs -text
-*.bz2 filter=lfs diff=lfs merge=lfs -text
-*.ckpt filter=lfs diff=lfs merge=lfs -text
-*.ftz filter=lfs diff=lfs merge=lfs -text
-*.gz filter=lfs diff=lfs merge=lfs -text
-*.h5 filter=lfs diff=lfs merge=lfs -text
-*.joblib filter=lfs diff=lfs merge=lfs -text
-*.lfs.* filter=lfs diff=lfs merge=lfs -text
-*.mlmodel filter=lfs diff=lfs merge=lfs -text
-*.model filter=lfs diff=lfs merge=lfs -text
-*.msgpack filter=lfs diff=lfs merge=lfs -text
-*.npy filter=lfs diff=lfs merge=lfs -text
-*.npz filter=lfs diff=lfs merge=lfs -text
-*.onnx filter=lfs diff=lfs merge=lfs -text
-*.ot filter=lfs diff=lfs merge=lfs -text
-*.parquet filter=lfs diff=lfs merge=lfs -text
-*.pb filter=lfs diff=lfs merge=lfs -text
-*.pickle filter=lfs diff=lfs merge=lfs -text
-*.pkl filter=lfs diff=lfs merge=lfs -text
-*.pt filter=lfs diff=lfs merge=lfs -text
-*.pth filter=lfs diff=lfs merge=lfs -text
-*.rar filter=lfs diff=lfs merge=lfs -text
 *.safetensors filter=lfs diff=lfs merge=lfs -text
-saved_model/**/* filter=lfs diff=lfs merge=lfs -text
-*.tar.* filter=lfs diff=lfs merge=lfs -text
-*.tar filter=lfs diff=lfs merge=lfs -text
-*.tflite filter=lfs diff=lfs merge=lfs -text
-*.tgz filter=lfs diff=lfs merge=lfs -text
-*.wasm filter=lfs diff=lfs merge=lfs -text
-*.xz filter=lfs diff=lfs merge=lfs -text
-*.zip filter=lfs diff=lfs merge=lfs -text
-*.zst filter=lfs diff=lfs merge=lfs -text
-*tfevents* filter=lfs diff=lfs merge=lfs -text
+*.bin filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -9,4 +9,150 @@ base_model:
pipeline_tag: text-classification
tags:
- agent
---

# vad-macbert

Chinese VAD (valence/arousal/dominance) regression based on `hfl/chinese-macbert-base`.
The model predicts three continuous values aligned to the VAD scale produced by
`RobroKools/vad-bert` (the teacher model).

## Quickstart

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_path = "Pectics/vad-macbert"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

text = "这部电影让我很感动。"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
vad = outputs.logits.squeeze().tolist()  # [valence, arousal, dominance]
print("VAD:", vad)
```
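
For scoring several sentences at once, the same call works batched; a short continuation of the quickstart (the example sentences are illustrative):

```python
# Batched inference, reusing `tokenizer` and `model` from the quickstart above.
texts = ["今天的会议非常顺利。", "他气得说不出话来。"]
batch = tokenizer(texts, return_tensors="pt", padding=True, truncation=True, max_length=128)
with torch.no_grad():
    scores = model(**batch).logits  # shape (len(texts), 3), one V/A/D row per sentence
print(scores.tolist())
```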

## Model Details

- Base model: `hfl/chinese-macbert-base`
- Task: VAD regression (3 outputs: valence, arousal, dominance)
- Head: `AutoModelForSequenceClassification` with `num_labels=3`, `problem_type=regression`
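
That head can be requested directly at load time; a minimal sketch of the setup (the repo's actual training script is not published here):

```python
from transformers import AutoModelForSequenceClassification

# 3-output regression head on top of the MacBERT encoder.
model = AutoModelForSequenceClassification.from_pretrained(
    "hfl/chinese-macbert-base",
    num_labels=3,
    problem_type="regression",
)
```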

## Data Sources & Labeling

### en-zh_cn_vad_clean.csv
- Source: OpenSubtitles EN-ZH parallel corpus.
- Labeling: the English side is fed into `RobroKools/vad-bert` to obtain VAD values,
  which are then assigned to the paired Chinese text.
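
The labeling step amounts to scoring the English half with the teacher and copying the scores to the Chinese half; a sketch assuming `RobroKools/vad-bert` loads as a standard 3-output regression model and that the parallel file has `en`/`zh` columns (both assumptions):

```python
import pandas as pd
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

teacher_id = "RobroKools/vad-bert"
tok = AutoTokenizer.from_pretrained(teacher_id)
teacher = AutoModelForSequenceClassification.from_pretrained(teacher_id).eval()

pairs = pd.read_csv("en-zh_cn.csv")  # hypothetical parallel file with "en" and "zh" columns
enc = tok(pairs["en"].tolist(), return_tensors="pt", padding=True, truncation=True, max_length=128)
with torch.no_grad():
    vad = teacher(**enc).logits  # one (valence, arousal, dominance) row per English sentence
pairs[["valence", "arousal", "dominance"]] = vad.numpy()  # a real run would batch this
pairs[["zh", "valence", "arousal", "dominance"]].to_csv("en-zh_cn_vad_clean.csv", index=False)
```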

### en-zh_cn_vad_long.csv
- Derived from `en-zh_cn_vad_clean.csv` by filtering for longer texts with a
  length threshold (the original threshold was not recorded).
- Inferred from the length statistics: the minimum length is 32 characters, so the
  filter likely kept samples with `len >= 32`.

### en-zh_cn_vad_long_clean.csv
- Cleaned from `en-zh_cn_vad_long.csv` by removing subtitle formatting noise
  (see the sketch after this list):
  - ASS/SSA tag blocks like `{\\fs..\\pos(..)}` (including broken `{` blocks)
  - HTML-like tags (e.g. `<i>...</i>`)
  - Escape codes like `\\N`, `\\n`, `\\h`, `\\t`
  - Extra whitespace normalization
- Non-CJK rows were dropped.
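
A cleaning pass of this shape fits in a few regular expressions; a sketch mirroring the steps above (the patterns are approximations, not the exact ones used):

```python
import re

def clean_subtitle(text: str) -> str:
    text = re.sub(r"\{[^}]*\}?", "", text)         # ASS/SSA override blocks, incl. unclosed "{"
    text = re.sub(r"</?[A-Za-z][^>]*>", "", text)  # HTML-like tags such as <i>...</i>
    text = re.sub(r"\\[Nnht]", " ", text)          # escape codes \N, \n, \h, \t
    return re.sub(r"\s+", " ", text).strip()       # whitespace normalization

def has_cjk(text: str) -> bool:
    # Rows without any CJK character get dropped.
    return re.search(r"[\u4e00-\u9fff]", text) is not None
```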

### en-zh_cn_vad_mix.csv
- Mixed dataset created for replay training:
  - 200k samples from `en-zh_cn_vad_clean.csv`
  - 200k samples from `en-zh_cn_vad_long_clean.csv`
  - Shuffled after sampling
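
The mix construction is a plain sample-and-shuffle; a minimal pandas sketch (the seed and column handling are illustrative):

```python
import pandas as pd

clean = pd.read_csv("en-zh_cn_vad_clean.csv")
long_clean = pd.read_csv("en-zh_cn_vad_long_clean.csv")

mix = pd.concat([
    clean.sample(n=200_000, random_state=42),
    long_clean.sample(n=200_000, random_state=42),
]).sample(frac=1.0, random_state=42)  # shuffle after sampling
mix.to_csv("en-zh_cn_vad_mix.csv", index=False)
```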

## Training Summary

The final model (`vad-macbert-mix/best`) was obtained in three stages:

1. **Base training** on `en-zh_cn_vad_clean.csv`
2. **Long-text adaptation** on `en-zh_cn_vad_long_clean.csv`
3. **Replay mix** on `en-zh_cn_vad_mix.csv` (resumed from stage 2)

### Final-stage Command (Replay Mix)

```
--model_name hfl/chinese-macbert-base
--output_dir train/vad-macbert-mix
--data_path train/en-zh_cn_vad_mix.csv
--epochs 4
--batch_size 32
--grad_accum_steps 4
--learning_rate 0.00001
--weight_decay 0.01
--warmup_ratio 0.1
--warmup_steps 0
--max_length 512
--eval_ratio 0.01
--eval_every 100
--eval_batches 200
--loss huber
--huber_delta 1.0
--shuffle_buffer 4096
--min_chars 2
--save_every 100
--log_every 1
--max_steps 5000
--seed 42
--dtype fp16
--num_rows 400000
--resume_from train/vad-macbert-long/best
--encoding utf-8
```
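
Taken together, these flags give an effective batch of 32 × 4 = 128 samples per optimizer step; `max_steps 5000` therefore covers about 640,000 samples, roughly 1.6 passes over the 400,000-row mix, so training stops on `max_steps` well before the nominal `epochs 4` (assuming `max_steps` counts optimizer updates rather than micro-batches).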

Training environment (conda `llm`):

- Python 3.10.19
- torch 2.9.1+cu130
- transformers 4.57.6

## Evaluation

Benchmark script: `train/vad_benchmark.py`

- Evaluation uses a fixed stride derived from `eval_ratio=0.01` (roughly 1 out of
  every 100 samples); see the sketch after this list.
- Length buckets by character count: 0–20, 20–40, 40–80, 80–120, 120–200,
  200–400, 400+
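
Both choices are easy to state in code; a sketch of the stride-based split and the bucketing, under the assumption that `train/vad_benchmark.py` works roughly this way (the script itself is referenced but not shown here):

```python
import pandas as pd

def eval_split(df: pd.DataFrame, eval_ratio: float = 0.01) -> pd.DataFrame:
    stride = round(1 / eval_ratio)  # eval_ratio=0.01 -> keep every 100th row
    return df.iloc[::stride]

EDGES = [0, 20, 40, 80, 120, 200, 400]

def length_bucket(text: str) -> str:
    n = len(text)  # character count, matching the list above
    for lo, hi in zip(EDGES, EDGES[1:]):
        if lo <= n < hi:
            return f"{lo}-{hi}"
    return "400+"
```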

### Results (vad-macbert-mix/best)

**en-zh_cn_vad_clean.csv**

- mse_mean=0.043734
- mae_mean=0.149322
- pearson_mean=0.7335

**en-zh_cn_vad_long_clean.csv**

- mse_mean=0.031895
- mae_mean=0.131320
- pearson_mean=0.7565

Notes:
- `400+` bucket Pearson is unstable due to small sample size; interpret with care.
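
The aggregate metrics read as means over the three VAD dimensions; a sketch of how such numbers can be computed (an assumption about the benchmark script, which is not part of this repo):

```python
import numpy as np
from scipy.stats import pearsonr

def vad_metrics(pred: np.ndarray, gold: np.ndarray) -> dict:
    """pred, gold: (n_samples, 3) arrays in valence/arousal/dominance order."""
    mse = ((pred - gold) ** 2).mean(axis=0)                      # per-dimension MSE
    mae = np.abs(pred - gold).mean(axis=0)                       # per-dimension MAE
    r = [pearsonr(pred[:, d], gold[:, d])[0] for d in range(3)]  # per-dimension Pearson
    return {
        "mse_mean": float(mse.mean()),
        "mae_mean": float(mae.mean()),
        "pearson_mean": float(np.mean(r)),
    }
```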

## Limitations

- Labels are derived from an English VAD teacher and transferred via parallel
  alignment, so they reflect the teacher's biases and may not match human Chinese
  annotations.
- Subtitle corpora include translation artifacts and formatting noise; the cleaned
  versions mitigate but do not fully remove this.
- Extreme-length sentences are under-represented; performance on 400+ characters
  is not reliable.

## Files in This Repo

- `config.json`
- `model.safetensors`
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.txt`
- `training_args.json`
README_zh.md ADDED
@@ -0,0 +1,135 @@

# vad-macbert

Chinese VAD (valence/arousal/dominance) regression model based on `hfl/chinese-macbert-base`.
It outputs three continuous values aligned to the VAD space of the teacher model `RobroKools/vad-bert`.

## Quickstart

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_path = "Pectics/vad-macbert"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)
model.eval()

text = "这部电影让我很感动。"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
with torch.no_grad():
    outputs = model(**inputs)
vad = outputs.logits.squeeze().tolist()
print("VAD:", vad)
```

## Model Details

- Base model: `hfl/chinese-macbert-base`
- Task: VAD regression (3-dimensional output: valence, arousal, dominance)
- Head: `AutoModelForSequenceClassification`, `num_labels=3`, `problem_type=regression`

## Data Sources & Labeling

### en-zh_cn_vad_clean.csv
- Source: OpenSubtitles English-Chinese parallel corpus.
- Labeling: English sentences are fed into `RobroKools/vad-bert` to obtain VAD values,
  which are then assigned to the corresponding Chinese sentences.

### en-zh_cn_vad_long.csv
- Obtained by filtering `en-zh_cn_vad_clean.csv` for long sentences (the original
  threshold was not recorded).
- Length statistics put the minimum at 32 characters, so the filter was presumably `len >= 32`.

### en-zh_cn_vad_long_clean.csv
- Cleaned from `en-zh_cn_vad_long.csv` by removing subtitle styling noise:
  - ASS/SSA tag blocks (e.g. `{\\fs..\\pos(..)}`, including incomplete `{` blocks)
  - HTML-like tags (e.g. `<i>...</i>`)
  - Escape codes (`\\N`, `\\n`, `\\h`, `\\t`)
  - Normalization of extra whitespace
- Non-CJK content was filtered out.

### en-zh_cn_vad_mix.csv
- Replay-mix data:
  - 200k samples from `en-zh_cn_vad_clean.csv`
  - 200k samples from `en-zh_cn_vad_long_clean.csv`
  - Merged, then shuffled

## Training Procedure

The final model `vad-macbert-mix/best` was trained in three stages:

1. **Base training**: `en-zh_cn_vad_clean.csv`
2. **Long-sentence adaptation**: `en-zh_cn_vad_long_clean.csv`
3. **Replay mix**: `en-zh_cn_vad_mix.csv` (continued from stage 2)

### Final-stage Command (Replay Mix)

```
--model_name hfl/chinese-macbert-base
--output_dir train/vad-macbert-mix
--data_path train/en-zh_cn_vad_mix.csv
--epochs 4
--batch_size 32
--grad_accum_steps 4
--learning_rate 0.00001
--weight_decay 0.01
--warmup_ratio 0.1
--warmup_steps 0
--max_length 512
--eval_ratio 0.01
--eval_every 100
--eval_batches 200
--loss huber
--huber_delta 1.0
--shuffle_buffer 4096
--min_chars 2
--save_every 100
--log_every 1
--max_steps 5000
--seed 42
--dtype fp16
--num_rows 400000
--resume_from train/vad-macbert-long/best
--encoding utf-8
```

Training environment (conda `llm`):

- Python 3.10.19
- torch 2.9.1+cu130
- transformers 4.57.6

## Evaluation

Benchmark script: `train/vad_benchmark.py`

- Uses `eval_ratio=0.01` (roughly 1-in-100 sampling).
- Length buckets (character count): 0–20, 20–40, 40–80, 80–120, 120–200, 200–400, 400+

### Results (vad-macbert-mix/best)

**en-zh_cn_vad_clean.csv**

- mse_mean=0.043734
- mae_mean=0.149322
- pearson_mean=0.7335

**en-zh_cn_vad_long_clean.csv**

- mse_mean=0.031895
- mae_mean=0.131320
- pearson_mean=0.7565

Notes:
- The `400+` bucket has very few samples, so its Pearson is unstable; treat it as indicative only.

## Limitations and Caveats

- The VAD labels come from an English teacher model and were transferred through parallel
  alignment; they may carry the teacher's bias and are not equivalent to human Chinese annotation.
- The subtitle corpus contains translation errors and formatting noise; some may remain after cleaning.
- Very long sentences are scarce, so behavior on `400+` is unstable.

## Repository Files

- `config.json`
- `model.safetensors`
- `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.txt`
- `training_args.json`
config.json ADDED
@@ -0,0 +1,42 @@
{
  "architectures": [
    "BertForSequenceClassification"
  ],
  "attention_probs_dropout_prob": 0.1,
  "classifier_dropout": null,
  "directionality": "bidi",
  "dtype": "float32",
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1",
    "2": "LABEL_2"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1,
    "LABEL_2": 2
  },
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 512,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 0,
  "pooler_fc_size": 768,
  "pooler_num_attention_heads": 12,
  "pooler_num_fc_layers": 3,
  "pooler_size_per_head": 128,
  "pooler_type": "first_token_transform",
  "position_embedding_type": "absolute",
  "problem_type": "regression",
  "transformers_version": "4.57.6",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 21128
}
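
The config keeps the generic `LABEL_0`..`LABEL_2` names; assuming the output order valence, arousal, dominance stated in the README, readable names can be supplied at load time:

```python
from transformers import AutoModelForSequenceClassification

# Override the generic label names (assumed V/A/D order from the README).
model = AutoModelForSequenceClassification.from_pretrained(
    "Pectics/vad-macbert",
    id2label={0: "valence", 1: "arousal", 2: "dominance"},
    label2id={"valence": 0, "arousal": 1, "dominance": 2},
)
```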

model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:da4ec84f28cf329c6779d008fd023df99d08fe24c9340c1bb16229d7fb0fe9a0
size 409103316
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED

The diff for this file is too large to render.
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_lower_case": true,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "model_max_length": 1000000000000000019884624838656,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizer",
  "unk_token": "[UNK]"
}
training_args.json ADDED
@@ -0,0 +1,32 @@
{
  "batch_size": 32,
  "data_path": "train/en-zh_cn_vad_mix.csv",
  "dtype": "fp16",
  "encoding": "utf-8",
  "epochs": 4,
  "errors": "ignore",
  "eval_batch_size": 0,
  "eval_batches": 200,
  "eval_every": 100,
  "eval_ratio": 0.01,
  "grad_accum_steps": 4,
  "huber_delta": 1.0,
  "learning_rate": 1e-05,
  "log_every": 1,
  "loss": "huber",
  "max_length": 512,
  "max_rows": null,
  "max_steps": 5000,
  "min_chars": 2,
  "model_name": "hfl/chinese-macbert-base",
  "num_labels": 3,
  "num_rows": 400000,
  "output_dir": "train/vad-macbert-mix",
  "resume_from": "train/vad-macbert-long/best",
  "save_every": 100,
  "seed": 42,
  "shuffle_buffer": 4096,
  "warmup_ratio": 0.1,
  "warmup_steps": 0,
  "weight_decay": 0.01
}
vocab.txt ADDED

The diff for this file is too large to render.