Pectics committed on
Commit 16cbc1c · 1 parent: 979cfa2

Initial commit

.gitattributes CHANGED
@@ -1,35 +1,2 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
  *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text

README.md CHANGED
@@ -9,4 +9,150 @@ base_model:
  pipeline_tag: text-classification
  tags:
  - agent
- ---
+ ---
+
+ # vad-macbert
+
+ A Chinese VAD (valence/arousal/dominance) regression model based on `hfl/chinese-macbert-base`.
+ It predicts three continuous values aligned to the VAD scale produced by the
+ teacher model `RobroKools/vad-bert`.
+
+ ## Quickstart
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ model_path = "Pectics/vad-macbert"
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ model = AutoModelForSequenceClassification.from_pretrained(model_path)
+ model.eval()
+
+ text = "这部电影让我很感动。"
+ inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
+ with torch.no_grad():
+     outputs = model(**inputs)
+ vad = outputs.logits.squeeze().tolist()  # [valence, arousal, dominance]
+ print("VAD:", vad)
+ ```
+
+ ## Model Details
+
+ - Base model: `hfl/chinese-macbert-base`
+ - Task: VAD regression (3 outputs: valence, arousal, dominance)
+ - Head: `AutoModelForSequenceClassification` with `num_labels=3`, `problem_type=regression` (see the sketch below)
+
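+ For reference, a regression head with this shape can be instantiated as follows.
+ This is a minimal sketch based on the settings above; the actual training script
+ is not included in this repo:
+
+ ```python
+ from transformers import AutoModelForSequenceClassification
+
+ # 3-output regression head (valence, arousal, dominance) on the MacBERT encoder
+ model = AutoModelForSequenceClassification.from_pretrained(
+     "hfl/chinese-macbert-base",
+     num_labels=3,
+     problem_type="regression",
+ )
+ ```
+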
+ ## Data Sources & Labeling
+
+ ### en-zh_cn_vad_clean.csv
+ - Source: OpenSubtitles EN-ZH parallel corpus.
+ - Labeling: the English side is fed into `RobroKools/vad-bert` to obtain VAD values,
+   which are then assigned to the paired Chinese text (see the sketch below).
+
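+ A minimal sketch of this labeling step, assuming the teacher is also a 3-output
+ `transformers` regression model (its exact head and the output column names are
+ assumptions, not documented here):
+
+ ```python
+ import torch
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+
+ teacher_id = "RobroKools/vad-bert"
+ teacher_tok = AutoTokenizer.from_pretrained(teacher_id)
+ teacher = AutoModelForSequenceClassification.from_pretrained(teacher_id).eval()
+
+ def label_pair(en_text: str, zh_text: str) -> dict:
+     """Score the English side with the teacher; attach the VAD to the Chinese side."""
+     inputs = teacher_tok(en_text, return_tensors="pt", truncation=True)
+     with torch.no_grad():
+         vad = teacher(**inputs).logits.squeeze().tolist()
+     # field names are illustrative
+     return {"text": zh_text, "valence": vad[0], "arousal": vad[1], "dominance": vad[2]}
+ ```
+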
+ ### en-zh_cn_vad_long.csv
+ - Derived from `en-zh_cn_vad_clean.csv` by filtering for longer texts with a
+   length threshold (the original threshold was not recorded).
+ - Inferred from the statistics: the minimum length is 32 characters, so the filter
+   likely kept samples with length >= 32 chars (see the one-liner below).
+
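+ The inferred filter, as a hypothetical pandas one-liner (the column name `text`
+ is an assumption):
+
+ ```python
+ import pandas as pd
+
+ df = pd.read_csv("train/en-zh_cn_vad_clean.csv")
+ df_long = df[df["text"].str.len() >= 32]  # threshold inferred, not recorded
+ df_long.to_csv("train/en-zh_cn_vad_long.csv", index=False)
+ ```
+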
+ ### en-zh_cn_vad_long_clean.csv
+ - Cleaned from `en-zh_cn_vad_long.csv` by removing subtitle formatting noise:
+   - ASS/SSA tag blocks like `{\\fs..\\pos(..)}` (including broken `{` blocks)
+   - HTML-like tags (e.g. `<i>...</i>`)
+   - Escape codes like `\\N`, `\\n`, `\\h`, `\\t`
+   - Extra whitespace normalization
+ - Non-CJK rows were dropped.
+
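+ The cleaning rules roughly correspond to regexes like the following. This is a
+ sketch of the described steps, not the exact script that was used:
+
+ ```python
+ import re
+
+ ASS_TAG = re.compile(r"\{[^}]*\}?")          # {\fs..\pos(..)} blocks, incl. unclosed "{"
+ HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")  # <i>...</i> and similar
+ ESCAPES = re.compile(r"\\[Nnht]")            # \N, \n, \h, \t subtitle escapes
+
+ def clean_subtitle(text: str) -> str:
+     text = ASS_TAG.sub("", text)
+     text = HTML_TAG.sub("", text)
+     text = ESCAPES.sub(" ", text)
+     return re.sub(r"\s+", " ", text).strip()  # normalize extra whitespace
+ ```
+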
+ ### en-zh_cn_vad_mix.csv
+ - Mixed dataset created for replay training:
+   - 200k samples from `en-zh_cn_vad_clean.csv`
+   - 200k samples from `en-zh_cn_vad_long_clean.csv`
+ - Shuffled after sampling (see the sketch below)
+
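+ A hypothetical reconstruction of the mixing step (file layout and the use of
+ seed 42 here are assumptions):
+
+ ```python
+ import pandas as pd
+
+ clean = pd.read_csv("train/en-zh_cn_vad_clean.csv").sample(n=200_000, random_state=42)
+ long_clean = pd.read_csv("train/en-zh_cn_vad_long_clean.csv").sample(n=200_000, random_state=42)
+
+ # concatenate, then shuffle the combined 400k rows
+ mix = pd.concat([clean, long_clean]).sample(frac=1, random_state=42).reset_index(drop=True)
+ mix.to_csv("train/en-zh_cn_vad_mix.csv", index=False)
+ ```
+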
+ ## Training Summary
+
+ The final model (`vad-macbert-mix/best`) was obtained in three stages:
+
+ 1. **Base training** on `en-zh_cn_vad_clean.csv`
+ 2. **Long-text adaptation** on `en-zh_cn_vad_long_clean.csv`
+ 3. **Replay mix** on `en-zh_cn_vad_mix.csv` (resumed from stage 2)
+
+ ### Final-stage Command (Replay Mix)
+
+ ```
+ --model_name hfl/chinese-macbert-base
+ --output_dir train/vad-macbert-mix
+ --data_path train/en-zh_cn_vad_mix.csv
+ --epochs 4
+ --batch_size 32
+ --grad_accum_steps 4
+ --learning_rate 0.00001
+ --weight_decay 0.01
+ --warmup_ratio 0.1
+ --warmup_steps 0
+ --max_length 512
+ --eval_ratio 0.01
+ --eval_every 100
+ --eval_batches 200
+ --loss huber
+ --huber_delta 1.0
+ --shuffle_buffer 4096
+ --min_chars 2
+ --save_every 100
+ --log_every 1
+ --max_steps 5000
+ --seed 42
+ --dtype fp16
+ --num_rows 400000
+ --resume_from train/vad-macbert-long/best
+ --encoding utf-8
+ ```
+
+ With `--batch_size 32` and `--grad_accum_steps 4`, the effective batch size is 32 × 4 = 128.
+
+ Training environment (conda env `llm`):
+
+ - Python 3.10.19
+ - torch 2.9.1+cu130
+ - transformers 4.57.6
+
+ ## Evaluation
+
+ Benchmark script: `train/vad_benchmark.py`
+
+ - Evaluation uses a fixed stride derived from `eval_ratio=0.01`
+   (roughly 1 out of every 100 samples).
+ - Length buckets by character count: 0–20, 20–40, 40–80, 80–120, 120–200,
+   200–400, 400+ (see the sketch below)
+
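+ A sketch of the sampling and bucketing logic, assuming the benchmark script works
+ roughly like this (the script itself is not part of this repo, and the exact
+ boundary handling of the buckets is an assumption):
+
+ ```python
+ import bisect
+
+ EVAL_RATIO = 0.01
+ STRIDE = round(1 / EVAL_RATIO)  # every 100th sample
+ BUCKET_EDGES = [20, 40, 80, 120, 200, 400]
+ BUCKET_NAMES = ["0-20", "20-40", "40-80", "80-120", "120-200", "200-400", "400+"]
+
+ def eval_subset(rows: list) -> list:
+     """Fixed-stride subsample: deterministic, roughly EVAL_RATIO of the data."""
+     return rows[::STRIDE]
+
+ def bucket_of(text: str) -> str:
+     """Assign a sample to a length bucket by character count."""
+     return BUCKET_NAMES[bisect.bisect_left(BUCKET_EDGES, len(text))]
+ ```
+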
+ ### Results (vad-macbert-mix/best)
+
+ **en-zh_cn_vad_clean.csv**
+
+ - mse_mean=0.043734
+ - mae_mean=0.149322
+ - pearson_mean=0.7335
+
+ **en-zh_cn_vad_long_clean.csv**
+
+ - mse_mean=0.031895
+ - mae_mean=0.131320
+ - pearson_mean=0.7565
+
+ Notes:
+ - The `400+` bucket's Pearson correlation is unstable due to the small sample size; interpret it with care.
+
+ ## Limitations
+
+ - Labels are derived from an English VAD teacher and transferred via parallel
+   alignment, so they reflect the teacher's bias and may not match human Chinese
+   annotations.
+ - Subtitle corpora include translation artifacts and formatting noise; the cleaned
+   versions mitigate but do not fully remove this.
+ - Extreme-length sentences are under-represented; performance on 400+ characters
+   is not reliable.
+
+ ## Files in This Repo
+
+ - `config.json`
+ - `model.safetensors`
+ - `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.txt`
+ - `training_args.json`
README_zh.md ADDED
@@ -0,0 +1,135 @@
+ # vad-macbert
+
+ A Chinese VAD (valence/arousal/dominance) regression model based on `hfl/chinese-macbert-base`.
+ It outputs three continuous values aligned to the VAD space of the teacher model `RobroKools/vad-bert`.
+
+ ## Quickstart
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForSequenceClassification
+ import torch
+
+ model_path = "Pectics/vad-macbert"
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ model = AutoModelForSequenceClassification.from_pretrained(model_path)
+ model.eval()
+
+ text = "这部电影让我很感动。"
+ inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=128)
+ with torch.no_grad():
+     outputs = model(**inputs)
+ vad = outputs.logits.squeeze().tolist()
+ print("VAD:", vad)
+ ```
+
+ ## Model Details
+
+ - Base model: `hfl/chinese-macbert-base`
+ - Task: VAD regression (3-dimensional output: valence, arousal, dominance)
+ - Head: `AutoModelForSequenceClassification`, `num_labels=3`, `problem_type=regression`
+
+ ## Data Sources & Labeling
+
+ ### en-zh_cn_vad_clean.csv
+ - Source: OpenSubtitles English-Chinese parallel corpus.
+ - Labeling: each English sentence is fed into `RobroKools/vad-bert` to obtain VAD values, which are then assigned to the paired Chinese sentence.
+
+ ### en-zh_cn_vad_long.csv
+ - Obtained by filtering `en-zh_cn_vad_clean.csv` for long sentences (the original threshold was not recorded).
+ - Length statistics put the minimum at 32 characters, so the filter was presumably `len >= 32`.
+
+ ### en-zh_cn_vad_long_clean.csv
+ - Cleaned from `en-zh_cn_vad_long.csv`, removing subtitle styling noise:
+   - ASS/SSA tag blocks (e.g. `{\\fs..\\pos(..)}`, including incomplete `{` blocks)
+   - HTML-like tags (e.g. `<i>...</i>`)
+   - Escape markers (`\\N`, `\\n`, `\\h`, `\\t`)
+   - Extra whitespace normalization
+ - Non-CJK content was filtered out.
+
+ ### en-zh_cn_vad_mix.csv
+ - Mixed data for replay training:
+   - 200k samples from `en-zh_cn_vad_clean.csv`
+   - 200k samples from `en-zh_cn_vad_long_clean.csv`
+ - Merged, then randomly shuffled
+
+ ## Training Process
+
+ The final model `vad-macbert-mix/best` was trained in three stages:
+
+ 1. **Base training**: `en-zh_cn_vad_clean.csv`
+ 2. **Long-sentence adaptation**: `en-zh_cn_vad_long_clean.csv`
+ 3. **Replay mix**: `en-zh_cn_vad_mix.csv` (training resumed from stage 2)
+
+ ### Final-stage Command (Replay Mix)
+
+ ```
+ --model_name hfl/chinese-macbert-base
+ --output_dir train/vad-macbert-mix
+ --data_path train/en-zh_cn_vad_mix.csv
+ --epochs 4
+ --batch_size 32
+ --grad_accum_steps 4
+ --learning_rate 0.00001
+ --weight_decay 0.01
+ --warmup_ratio 0.1
+ --warmup_steps 0
+ --max_length 512
+ --eval_ratio 0.01
+ --eval_every 100
+ --eval_batches 200
+ --loss huber
+ --huber_delta 1.0
+ --shuffle_buffer 4096
+ --min_chars 2
+ --save_every 100
+ --log_every 1
+ --max_steps 5000
+ --seed 42
+ --dtype fp16
+ --num_rows 400000
+ --resume_from train/vad-macbert-long/best
+ --encoding utf-8
+ ```
+
+ Training environment (conda env `llm`):
+
+ - Python 3.10.19
+ - torch 2.9.1+cu130
+ - transformers 4.57.6
+
+ ## Evaluation
+
+ Benchmark script: `train/vad_benchmark.py`
+
+ - Uses `eval_ratio=0.01` (about 1-in-100 sampling).
+ - Length buckets (character count): 0–20, 20–40, 40–80, 80–120, 120–200, 200–400, 400+
+
+ ### Results (vad-macbert-mix/best)
+
+ **en-zh_cn_vad_clean.csv**
+
+ - mse_mean=0.043734
+ - mae_mean=0.149322
+ - pearson_mean=0.7335
+
+ **en-zh_cn_vad_long_clean.csv**
+
+ - mse_mean=0.031895
+ - mae_mean=0.131320
+ - pearson_mean=0.7565
+
+ Notes:
+ - The `400+` bucket has very few samples, so its Pearson correlation is unstable; treat it as indicative only.
+
+ ## Limitations and Caveats
+
+ - The VAD labels come from an English teacher model and were transferred via parallel alignment; they may carry the teacher's bias and are not equivalent to human Chinese annotation.
+ - Subtitle corpora contain translation errors and formatting noise; some may remain after cleaning.
+ - Very long sentences are scarce, so performance on `400+` is unstable.
+
+ ## Repository Files
+
+ - `config.json`
+ - `model.safetensors`
+ - `tokenizer.json`, `tokenizer_config.json`, `special_tokens_map.json`, `vocab.txt`
+ - `training_args.json`
config.json ADDED
@@ -0,0 +1,42 @@
+ {
+   "architectures": [
+     "BertForSequenceClassification"
+   ],
+   "attention_probs_dropout_prob": 0.1,
+   "classifier_dropout": null,
+   "directionality": "bidi",
+   "dtype": "float32",
+   "gradient_checkpointing": false,
+   "hidden_act": "gelu",
+   "hidden_dropout_prob": 0.1,
+   "hidden_size": 768,
+   "id2label": {
+     "0": "LABEL_0",
+     "1": "LABEL_1",
+     "2": "LABEL_2"
+   },
+   "initializer_range": 0.02,
+   "intermediate_size": 3072,
+   "label2id": {
+     "LABEL_0": 0,
+     "LABEL_1": 1,
+     "LABEL_2": 2
+   },
+   "layer_norm_eps": 1e-12,
+   "max_position_embeddings": 512,
+   "model_type": "bert",
+   "num_attention_heads": 12,
+   "num_hidden_layers": 12,
+   "pad_token_id": 0,
+   "pooler_fc_size": 768,
+   "pooler_num_attention_heads": 12,
+   "pooler_num_fc_layers": 3,
+   "pooler_size_per_head": 128,
+   "pooler_type": "first_token_transform",
+   "position_embedding_type": "absolute",
+   "problem_type": "regression",
+   "transformers_version": "4.57.6",
+   "type_vocab_size": 2,
+   "use_cache": true,
+   "vocab_size": 21128
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:da4ec84f28cf329c6779d008fd023df99d08fe24c9340c1bb16229d7fb0fe9a0
+ size 409103316
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,56 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_lower_case": true,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "model_max_length": 1000000000000000019884624838656,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizer",
+   "unk_token": "[UNK]"
+ }
training_args.json ADDED
@@ -0,0 +1,32 @@
+ {
+   "batch_size": 32,
+   "data_path": "train/en-zh_cn_vad_mix.csv",
+   "dtype": "fp16",
+   "encoding": "utf-8",
+   "epochs": 4,
+   "errors": "ignore",
+   "eval_batch_size": 0,
+   "eval_batches": 200,
+   "eval_every": 100,
+   "eval_ratio": 0.01,
+   "grad_accum_steps": 4,
+   "huber_delta": 1.0,
+   "learning_rate": 1e-05,
+   "log_every": 1,
+   "loss": "huber",
+   "max_length": 512,
+   "max_rows": null,
+   "max_steps": 5000,
+   "min_chars": 2,
+   "model_name": "hfl/chinese-macbert-base",
+   "num_labels": 3,
+   "num_rows": 400000,
+   "output_dir": "train/vad-macbert-mix",
+   "resume_from": "train/vad-macbert-long/best",
+   "save_every": 100,
+   "seed": 42,
+   "shuffle_buffer": 4096,
+   "warmup_ratio": 0.1,
+   "warmup_steps": 0,
+   "weight_decay": 0.01
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff