luancy1208 committed
Commit 67d959b · verified · 1 Parent(s): b8ca7a8

v0.2 initial

README_HF.md ADDED
@@ -0,0 +1,57 @@
1
+ ---
2
+ title: CHIP — Chinese High-density Instruction Protocol
3
+ emoji: 🀄
4
+ colorFrom: blue
5
+ colorTo: orange
6
+ sdk: gradio
7
+ sdk_version: 4.44.0
8
+ app_file: app.py
9
+ pinned: true
10
+ license: apache-2.0
11
+ short_description: 数据驱动的中文 prompt 协议化压缩工具
12
+ tags:
13
+ - chinese
14
+ - prompt-engineering
15
+ - llm
16
+ - tokenizer
17
+ - compression
18
+ ---
19
+
20
+ # CHIP · 中文高密度提示协议
21
+
22
+ 把啰嗦的中文 prompt 自动压成结构化高密度形式 — **数据驱动,不是品味**。
23
+
24
+ ## 🎯 核心发现
25
+
26
+ 基于 9 个主流 tokenizer × 200 句 FLORES-200 平行语料的 1800 行实测:
27
+
28
+ - **6 个国产 tokenizer 上中文 prompt token 数 ≤ 等价英文**
29
+ (Baichuan2: 中文省 12.5%,DeepSeek-V3: 省 8.4%,GLM-4: 省 7.6%)
30
+ - **OpenAI cl100k 上中文比英文贵 73%**
31
+ - **`###` 标签在所有 9 个 tokenizer 上都是 1 token**,完爆方括号方案
32
+
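+ 想自己复核 `###` = 1 token 的结论,可以用 tiktoken 快速检查(示意代码,仅覆盖两个 OpenAI tokenizer;国产 tokenizer 需经 transformers 加载后同法计数):
+ 
+ ```python
+ import tiktoken
+ 
+ for name in ("cl100k_base", "o200k_base"):
+     enc = tiktoken.get_encoding(name)
+     print(name, len(enc.encode("###")))  # 按上述实测,预期均为 1
+ ```
+ 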
33
+ ## 🔧 怎么用
34
+
35
+ 在左侧粘贴你的中文 prompt,选择目标模型,点压缩。右侧会展示:
36
+
37
+ 1. **压缩后的 prompt**(可一键复制)
38
+ 2. **Token 统计**(在你选的 tokenizer 上节省了多少)
39
+ 3. **命中的规则**(audit trail,可追溯每条改动)
40
+
41
+ ## 📦 GitHub / pip
42
+
43
+ ```bash
44
+ pip install chip-prompt
45
+ ```
46
+
47
+ ```python
48
+ from chip import compress
49
+ compress("请你帮我对下面这段文字进行一个全面的分析")
50
+ # → '分析下面这段文字'
51
+ ```
52
+
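+ 命令行入口同样可用(用法取自本次提交的 `chip/cli.py`):
+ 
+ ```bash
+ chip "请你帮我总结一下这段文字"
+ echo "..." | chip
+ chip --target qwen2.5 --layers L1 L2 --diff "..."
+ ```
+ 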
53
+ 🔗 [GitHub repo](https://github.com/luancy1208/CHIP) · [SPEC.md](https://github.com/luancy1208/CHIP/blob/main/SPEC.md) · [Datasets](https://github.com/luancy1208/CHIP/tree/main/results)
54
+
55
+ ## ⚖️ License
56
+
57
+ Apache-2.0
app.py ADDED
@@ -0,0 +1,665 @@
1
+ """
2
+ app.py — CHIP HuggingFace Space 入口(美观版)
3
+
4
+ 部署: https://huggingface.co/spaces/<user>/CHIP
5
+ """
6
+ from __future__ import annotations
7
+
8
+ import os
9
+ import sys
10
+ from pathlib import Path
11
+
12
+ sys.path.insert(0, str(Path(__file__).parent))
13
+
14
+ import gradio as gr
15
+ import tiktoken
16
+
17
+ from chip import Compressor
18
+
19
+ # ============================================================
20
+ # Tokenizer 计数器
21
+ # ============================================================
22
+ TOKENIZERS = {}
23
+ try:
24
+ TOKENIZERS["GPT-4 (cl100k)"] = tiktoken.get_encoding("cl100k_base")
25
+ TOKENIZERS["GPT-4o (o200k)"] = tiktoken.get_encoding("o200k_base")
26
+ except Exception as e:
27
+ print(f"[warn] tiktoken load failed: {e}")
28
+
29
+
30
+ def _lazy_load_qwen():
31
+ from transformers import AutoTokenizer
32
+ return AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B", trust_remote_code=True)
33
+
34
+
35
+ def _lazy_load_deepseek():
36
+ from transformers import AutoTokenizer
37
+ return AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)
38
+
39
+
40
+ _LAZY = {
41
+ "Qwen2.5": _lazy_load_qwen,
42
+ "DeepSeek-V2/V3": _lazy_load_deepseek,
43
+ }
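+ # 国产 tokenizer 体积较大,首次用到时才从 HF Hub 下载;加载成功后由 count_tokens 缓存进 TOKENIZERS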
44
+
45
+
46
+ def count_tokens(text: str, tokenizer_name: str) -> int:
47
+ if tokenizer_name in TOKENIZERS:
48
+ enc = TOKENIZERS[tokenizer_name]
49
+ elif tokenizer_name in _LAZY:
50
+ try:
51
+ tok = _LAZY[tokenizer_name]()
52
+ TOKENIZERS[tokenizer_name] = tok
53
+ enc = tok
54
+ except Exception:
55
+ return -1
56
+ else:
57
+ return -1
58
+
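+     # tiktoken 的 Encoding 只有 encode、没有 tokenize;HF tokenizer 两者都有。
+     # 借此区分两类接口,避免把 add_special_tokens 传给不支持它的 tiktoken。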
59
+ if hasattr(enc, "encode") and not hasattr(enc, "tokenize"):
60
+ return len(enc.encode(text))
61
+ return len(enc.encode(text, add_special_tokens=False))
62
+
63
+
64
+ # ============================================================
65
+ # 主压缩函数
66
+ # ============================================================
67
+ def run(text: str, target: str, use_l1: bool, use_l2: bool, use_l3: bool,
68
+ use_l4: bool, tokenizer_name: str, use_jieba: bool):
69
+ if not text.strip():
70
+ empty_card = "<div class='stat-card empty'>👈 请在左侧输入 prompt</div>"
71
+ return "", empty_card, "", ""
72
+
73
+ layers = []
74
+ for flag, name in [(use_l1, "L1"), (use_l2, "L2"), (use_l3, "L3"), (use_l4, "L4")]:
75
+ if flag:
76
+ layers.append(name)
77
+ if not layers:
78
+ layers = ["L1"]
79
+
80
+ if use_jieba:
81
+ os.environ["CHIP_USE_JIEBA"] = "1"
82
+ else:
83
+ os.environ.pop("CHIP_USE_JIEBA", None)
84
+ import chip.compressor as cc
85
+ cc._default_compressor = None
86
+
87
+ compressor = Compressor(target=target, layers=layers)
88
+ result = compressor.compress(text)
89
+
90
+ n_orig = count_tokens(text, tokenizer_name)
91
+ n_comp = count_tokens(result.compressed, tokenizer_name)
92
+
93
+ # ---- 漂亮的统计卡片 ----
94
+ if n_orig > 0 and n_comp > 0:
95
+ saving_pct = (1 - n_comp / n_orig) * 100
96
+ n_diff = n_orig - n_comp
97
+ if saving_pct > 0:
98
+ badge_class = "saving-badge good"
99
+ arrow = "↓"
100
+ verb = "节省"
101
+ elif saving_pct < 0:
102
+ badge_class = "saving-badge bad"
103
+ arrow = "↑"
104
+ verb = "增加"
105
+ else:
106
+ badge_class = "saving-badge neutral"
107
+ arrow = "→"
108
+ verb = "持平"
109
+
110
+ char_orig, char_comp = len(text), len(result.compressed)
111
+ char_pct = (1 - char_comp / max(char_orig, 1)) * 100
112
+
113
+ stats_html = f"""
114
+ <div class="stats-grid">
115
+ <div class="stat-card primary">
116
+ <div class="stat-label">Token 节省</div>
117
+ <div class="stat-value {badge_class}">
118
+ <span class="arrow">{arrow}</span> {abs(saving_pct):.1f}%
119
+ </div>
120
+ <div class="stat-sub">{verb} {abs(n_diff)} token · 在 {tokenizer_name}</div>
121
+ </div>
122
+ <div class="stat-card">
123
+ <div class="stat-label">原文</div>
124
+ <div class="stat-value muted">{n_orig}</div>
125
+ <div class="stat-sub">token · {char_orig} 字符</div>
126
+ </div>
127
+ <div class="stat-card">
128
+ <div class="stat-label">压缩后</div>
129
+ <div class="stat-value emphasis">{n_comp}</div>
130
+ <div class="stat-sub">token · {char_comp} 字符</div>
131
+ </div>
132
+ <div class="stat-card">
133
+ <div class="stat-label">字符压缩</div>
134
+ <div class="stat-value muted">{char_pct:+.0f}%</div>
135
+ <div class="stat-sub">字符数变化(供参考,token 才重要)</div>
136
+ </div>
137
+ </div>"""
138
+ else:
139
+ stats_html = (
140
+ "<div class='stat-card empty'>"
141
+ "⚠️ Token 计数不可用(可能是 tokenizer 加载失败)"
142
+ "</div>"
143
+ )
144
+
145
+ # ---- 漂亮的规则面板 ----
146
+ if result.applied_rules:
147
+ rules_items = []
148
+ for r in result.applied_rules:
149
+ # 解析 rule id 和命中次数
150
+ parts = r.split("×")
151
+ rid = parts[0]
152
+ count = parts[1] if len(parts) > 1 else "1"
153
+ # 按层着色
154
+ layer_color = {"L1": "#3B7DD8", "L2": "#E6883C", "L3": "#7C3AED", "L4": "#10B981"}
155
+ color = "#888"
156
+ for l, c in layer_color.items():
157
+ if l in rid:
158
+ color = c
159
+ break
160
+ badge = (
161
+ f"<span class='rule-pill' style='--rule-color:{color}'>"
162
+ f"<span class='rule-id'>{rid}</span>"
163
+ f"<span class='rule-count'>×{count}</span>"
164
+ f"</span>"
165
+ )
166
+ rules_items.append(badge)
167
+ rules_html = (
168
+ "<div class='rules-header'>🎯 命中规则 "
169
+ f"<span class='rules-count'>{len(result.applied_rules)} 条</span></div>"
170
+ "<div class='rules-pills'>" + " ".join(rules_items) + "</div>"
171
+ )
172
+ else:
173
+ rules_html = (
174
+ "<div class='rules-header empty-rules'>"
175
+ "📭 没有规则匹配 — 这段 prompt 已经够紧凑了"
176
+ "</div>"
177
+ )
178
+
179
+ return result.compressed, stats_html, rules_html, result.compressed
180
+
181
+
182
+ # ============================================================
183
+ # 示例
184
+ # ============================================================
185
+ EXAMPLES = [
186
+ [
187
+ "请你帮我对下面这段文字进行一个全面的分析,如果可以的话麻烦你给出一些改进的建议",
188
+ "qwen2.5", True, True, False, True, "GPT-4 (cl100k)", False,
189
+ ],
190
+ [
191
+ "请你扮演一位资深 Python 工程师,对下面的代码进行 code review,并以 JSON 格式输出结果",
192
+ "qwen2.5", True, True, False, True, "GPT-4 (cl100k)", True,
193
+ ],
194
+ [
195
+ "因为最近在下雨,所以路面变得很滑,因此开车的时候需要特别注意安全",
196
+ "qwen2.5", True, True, False, True, "GPT-4 (cl100k)", False,
197
+ ],
198
+ [
199
+ "Role: 资深产品经理\n任务: 评估这个需求,大家都知道产品决策需要数据支持",
200
+ "deepseek_v3", True, True, True, True, "GPT-4 (cl100k)", False,
201
+ ],
202
+ ]
203
+
204
+
205
+ # ============================================================
206
+ # 自定义 CSS — 让 demo 看起来不像默认的 gradio
207
+ # ============================================================
208
+ CUSTOM_CSS = """
209
+ /* ===== 全局 ===== */
210
+ .gradio-container {
211
+ max-width: 1280px !important;
212
+ margin: 0 auto !important;
213
+ }
214
+
215
+ /* ===== Hero 区 ===== */
216
+ .hero {
217
+ text-align: center;
218
+ padding: 36px 16px 12px;
219
+ background: linear-gradient(135deg, rgba(59,125,216,0.08) 0%, rgba(230,136,60,0.08) 100%);
220
+ border-radius: 16px;
221
+ margin-bottom: 24px;
222
+ border: 1px solid rgba(0,0,0,0.05);
223
+ }
224
+ .hero h1 {
225
+ font-size: 2.4em !important;
226
+ margin: 0 !important;
227
+ background: linear-gradient(135deg, #3B7DD8, #E6883C);
228
+ -webkit-background-clip: text;
229
+ -webkit-text-fill-color: transparent;
230
+ background-clip: text;
231
+ font-weight: 700 !important;
232
+ letter-spacing: -0.02em;
233
+ }
234
+ .hero .tagline {
235
+ font-size: 1.1em;
236
+ color: #555;
237
+ margin-top: 8px;
238
+ }
239
+ .hero .subtitle {
240
+ font-size: 0.95em;
241
+ color: #777;
242
+ max-width: 760px;
243
+ margin: 12px auto 0;
244
+ line-height: 1.6;
245
+ }
246
+ .hero .badges {
247
+ margin-top: 16px;
248
+ display: flex;
249
+ gap: 8px;
250
+ justify-content: center;
251
+ flex-wrap: wrap;
252
+ }
253
+ .hero .badge {
254
+ display: inline-flex;
255
+ align-items: center;
256
+ padding: 4px 10px;
257
+ background: rgba(255,255,255,0.7);
258
+ border: 1px solid rgba(0,0,0,0.08);
259
+ border-radius: 6px;
260
+ font-size: 0.85em;
261
+ color: #444;
262
+ text-decoration: none;
263
+ transition: all 0.2s;
264
+ }
265
+ .hero .badge:hover {
266
+ transform: translateY(-1px);
267
+ box-shadow: 0 2px 8px rgba(0,0,0,0.08);
268
+ }
269
+
270
+ /* ===== Key facts ===== */
271
+ .key-facts {
272
+ display: grid;
273
+ grid-template-columns: repeat(auto-fit, minmax(200px, 1fr));
274
+ gap: 12px;
275
+ margin: 24px 0;
276
+ }
277
+ .fact {
278
+ padding: 14px 18px;
279
+ background: white;
280
+ border-radius: 10px;
281
+ border-left: 3px solid var(--fact-color, #3B7DD8);
282
+ box-shadow: 0 1px 3px rgba(0,0,0,0.05);
283
+ }
284
+ .fact-label {
285
+ font-size: 0.75em;
286
+ text-transform: uppercase;
287
+ letter-spacing: 0.06em;
288
+ color: #888;
289
+ font-weight: 600;
290
+ }
291
+ .fact-value {
292
+ font-size: 1.5em;
293
+ font-weight: 700;
294
+ color: var(--fact-color, #3B7DD8);
295
+ margin-top: 4px;
296
+ }
297
+ .fact-detail {
298
+ font-size: 0.85em;
299
+ color: #666;
300
+ margin-top: 2px;
301
+ }
302
+
303
+ /* ===== 统计卡片 ===== */
304
+ .stats-grid {
305
+ display: grid;
306
+ grid-template-columns: 2fr 1fr 1fr 1fr;
307
+ gap: 12px;
308
+ margin: 8px 0;
309
+ }
310
+ .stat-card {
311
+ background: white;
312
+ border-radius: 10px;
313
+ padding: 14px 16px;
314
+ border: 1px solid rgba(0,0,0,0.06);
315
+ transition: all 0.2s;
316
+ }
317
+ .stat-card:hover {
318
+ box-shadow: 0 2px 12px rgba(0,0,0,0.08);
319
+ transform: translateY(-1px);
320
+ }
321
+ .stat-card.primary {
322
+ background: linear-gradient(135deg, #FAFBFF, #F5F7FB);
323
+ border: 1px solid #DDE5F0;
324
+ }
325
+ .stat-card.empty {
326
+ text-align: center;
327
+ padding: 28px;
328
+ background: #F8F9FB;
329
+ color: #888;
330
+ border: 1px dashed #DDD;
331
+ font-size: 0.95em;
332
+ }
333
+ .stat-label {
334
+ font-size: 0.75em;
335
+ text-transform: uppercase;
336
+ letter-spacing: 0.06em;
337
+ color: #888;
338
+ font-weight: 600;
339
+ }
340
+ .stat-value {
341
+ font-size: 1.6em;
342
+ font-weight: 700;
343
+ margin-top: 4px;
344
+ display: flex;
345
+ align-items: baseline;
346
+ gap: 6px;
347
+ }
348
+ .stat-value.muted { color: #999; font-size: 1.4em; }
349
+ .stat-value.emphasis { color: #10B981; font-size: 1.8em; }
350
+ .saving-badge.good { color: #10B981; }
351
+ .saving-badge.bad { color: #E6883C; }
352
+ .saving-badge.neutral { color: #888; }
353
+ .arrow { font-weight: 700; }
354
+ .stat-sub {
355
+ font-size: 0.78em;
356
+ color: #777;
357
+ margin-top: 2px;
358
+ }
359
+
360
+ /* ===== 规则面板 ===== */
361
+ .rules-header {
362
+ font-size: 0.95em;
363
+ font-weight: 600;
364
+ color: #444;
365
+ margin: 16px 0 8px;
366
+ display: flex;
367
+ align-items: center;
368
+ gap: 8px;
369
+ }
370
+ .rules-count {
371
+ font-size: 0.78em;
372
+ background: #EFF2F7;
373
+ padding: 2px 10px;
374
+ border-radius: 10px;
375
+ color: #555;
376
+ font-weight: 500;
377
+ }
378
+ .rules-pills {
379
+ display: flex;
380
+ gap: 6px;
381
+ flex-wrap: wrap;
382
+ padding: 4px 0;
383
+ }
384
+ .rule-pill {
385
+ display: inline-flex;
386
+ align-items: center;
387
+ gap: 4px;
388
+ padding: 4px 10px;
389
+ background: white;
390
+ border: 1px solid var(--rule-color);
391
+ color: var(--rule-color);
392
+ border-radius: 999px;
393
+ font-size: 0.82em;
394
+ font-family: 'JetBrains Mono', 'Fira Code', monospace;
395
+ font-weight: 500;
396
+ transition: all 0.2s;
397
+ }
398
+ .rule-pill:hover {
399
+ background: var(--rule-color);
400
+ color: white;
401
+ }
402
+ .rule-count {
403
+ font-size: 0.85em;
404
+ opacity: 0.8;
405
+ }
406
+ .empty-rules {
407
+ font-style: italic;
408
+ color: #888;
409
+ background: #F8F9FB;
410
+ padding: 14px;
411
+ border-radius: 8px;
412
+ border: 1px dashed #DDD;
413
+ text-align: center;
414
+ }
415
+
416
+ /* ===== 主操作按钮 ===== */
417
+ button.primary {
418
+ background: linear-gradient(135deg, #3B7DD8 0%, #5B9BE5 100%) !important;
419
+ border: none !important;
420
+ font-size: 1.05em !important;
421
+ letter-spacing: 0.02em !important;
422
+ transition: all 0.2s !important;
423
+ }
424
+ button.primary:hover {
425
+ transform: translateY(-1px);
426
+ box-shadow: 0 4px 12px rgba(59,125,216,0.3) !important;
427
+ }
428
+
429
+ /* ===== Footer ===== */
430
+ .footer {
431
+ margin-top: 36px;
432
+ padding: 20px 0;
433
+ border-top: 1px solid rgba(0,0,0,0.06);
434
+ color: #777;
435
+ font-size: 0.88em;
436
+ text-align: center;
437
+ }
438
+ .footer a { color: #3B7DD8; text-decoration: none; }
439
+ .footer a:hover { text-decoration: underline; }
440
+
441
+ /* ===== 暗色模式适配 ===== */
442
+ .dark .hero {
443
+ background: linear-gradient(135deg, rgba(59,125,216,0.15), rgba(230,136,60,0.15));
444
+ border-color: rgba(255,255,255,0.05);
445
+ }
446
+ .dark .hero .tagline { color: #BBB; }
447
+ .dark .hero .subtitle { color: #999; }
448
+ .dark .hero .badge {
449
+ background: rgba(255,255,255,0.05);
450
+ border-color: rgba(255,255,255,0.1);
451
+ color: #CCC;
452
+ }
453
+ .dark .stat-card { background: rgba(255,255,255,0.03); border-color: rgba(255,255,255,0.08); }
454
+ .dark .stat-card.primary { background: rgba(59,125,216,0.08); }
455
+ .dark .stat-card.empty { background: rgba(255,255,255,0.03); color: #888; border-color: rgba(255,255,255,0.1); }
456
+ .dark .stat-sub { color: #888; }
457
+ .dark .fact { background: rgba(255,255,255,0.03); }
458
+ .dark .fact-detail { color: #888; }
459
+ .dark .rules-count { background: rgba(255,255,255,0.06); color: #BBB; }
460
+ .dark .rule-pill { background: rgba(255,255,255,0.03); }
461
+ .dark .empty-rules { background: rgba(255,255,255,0.03); border-color: rgba(255,255,255,0.1); }
462
+
463
+ /* ===== 响应式 ===== */
464
+ @media (max-width: 768px) {
465
+ .stats-grid { grid-template-columns: 1fr 1fr; }
466
+ .key-facts { grid-template-columns: 1fr; }
467
+ .hero h1 { font-size: 1.8em !important; }
468
+ }
469
+ """
470
+
471
+
472
+ # ============================================================
473
+ # UI
474
+ # ============================================================
475
+ HERO_HTML = """
476
+ <div class="hero">
477
+ <h1>🀄 CHIP</h1>
478
+ <div class="tagline">Chinese High-density Instruction Protocol · 中文高密度提示协议</div>
479
+ <div class="subtitle">
480
+ 把啰嗦的中文 prompt 自动压成结构化高密度形式 — <strong>数据驱动,不是品味</strong>
481
+ </div>
482
+ <div class="badges">
483
+ <a class="badge" href="https://github.com/luancy1208/CHIP" target="_blank">⭐ GitHub</a>
484
+ <a class="badge" href="https://github.com/luancy1208/CHIP/blob/main/SPEC.md" target="_blank">📄 SPEC</a>
485
+ <a class="badge" href="https://github.com/luancy1208/CHIP/tree/main/results" target="_blank">📊 实测数据</a>
486
+ <span class="badge">Apache-2.0</span>
487
+ <span class="badge">v0.2 · 23 tests passing</span>
488
+ </div>
489
+ </div>
490
+
491
+ <div class="key-facts">
492
+ <div class="fact" style="--fact-color:#3B7DD8">
493
+ <div class="fact-label">实测数据</div>
494
+ <div class="fact-value">1800+ 行</div>
495
+ <div class="fact-detail">9 tokenizer × 200 句 FLORES-200</div>
496
+ </div>
497
+ <div class="fact" style="--fact-color:#10B981">
498
+ <div class="fact-label">国产模型上中文</div>
499
+ <div class="fact-value">省 12.5%</div>
500
+ <div class="fact-detail">Baichuan2 token / 等价英文</div>
501
+ </div>
502
+ <div class="fact" style="--fact-color:#E6883C">
503
+ <div class="fact-label">cl100k 上中文</div>
504
+ <div class="fact-value">贵 73%</div>
505
+ <div class="fact-detail">所以 CHIP 的主战场是国产模型</div>
506
+ </div>
507
+ <div class="fact" style="--fact-color:#7C3AED">
508
+ <div class="fact-label">### 标签</div>
509
+ <div class="fact-value">1 token</div>
510
+ <div class="fact-detail">在所有 9 个 tokenizer 上 — 完爆方括号</div>
511
+ </div>
512
+ </div>
513
+ """
514
+
515
+
516
+ FOOTER_HTML = """
517
+ <div class="footer">
518
+ <p>
519
+ 🀄 CHIP v0.2 · Built with care for Chinese prompt engineering ·
520
+ <a href="https://github.com/luancy1208/CHIP" target="_blank">GitHub</a> ·
521
+ <a href="https://github.com/luancy1208/CHIP/issues" target="_blank">反馈 Issue</a>
522
+ </p>
523
+ <p style="margin-top:8px; font-size:0.82em; opacity:0.7">
524
+ 数据来源: FLORES-200 dev (n=200) · 200 个 HSK 5/6 高频成语 · 45 个常见标记符号 · 实测于 9 个 tokenizer
525
+ </p>
526
+ </div>
527
+ """
528
+
529
+
530
+ with gr.Blocks(
531
+ title="CHIP — 中文高密度提示协议",
532
+ theme=gr.themes.Soft(
533
+ primary_hue="blue",
534
+ secondary_hue="orange",
535
+ neutral_hue="slate",
536
+ font=["system-ui", "-apple-system", "Segoe UI", "Helvetica Neue", "sans-serif"],
537
+ ),
538
+ css=CUSTOM_CSS,
539
+ ) as demo:
540
+ gr.HTML(HERO_HTML)
541
+
542
+ with gr.Row(equal_height=True):
543
+ # ------- 输入侧 -------
544
+ with gr.Column(scale=1):
545
+ gr.Markdown("### 📝 输入 prompt")
546
+ inp = gr.Textbox(
547
+ show_label=False,
548
+ lines=10,
549
+ placeholder="把啰嗦的中文 prompt 粘贴进来,看看 CHIP 能压到多紧...\n\n💡 试试:'请你帮我对下面这段文字进行一个全面的分析,如果可以的话麻烦你给出一些建议'",
550
+ container=False,
551
+ )
552
+
553
+ with gr.Accordion("⚙️ 高级设置", open=False):
554
+ with gr.Row():
555
+ target = gr.Dropdown(
556
+ ["qwen2.5", "deepseek_v3", "glm4", "cl100k", "o200k"],
557
+ value="qwen2.5",
558
+ label="🎯 目标模型",
559
+ info="影响 L3 成语压缩等 target-aware 决策",
560
+ )
561
+ tokenizer_name = gr.Dropdown(
562
+ list(TOKENIZERS.keys()) + list(_LAZY.keys()),
563
+ value="GPT-4 (cl100k)",
564
+ label="🔢 计数 tokenizer",
565
+ info="决定 token 数怎么算(国产 tokenizer 首次需下载)",
566
+ )
567
+
568
+ gr.Markdown("**压缩层** — 默认 L1+L2+L4(保险);L3 仅在国产模型上有意义")
569
+ with gr.Row():
570
+ use_l1 = gr.Checkbox(value=True, label="L1 词法",
571
+ info="套话剪枝 (~1.3-1.5×)")
572
+ use_l2 = gr.Checkbox(value=True, label="L2 句法",
573
+ info="模式重排 (~2-3×)")
574
+ use_l3 = gr.Checkbox(value=False, label="L3 成语",
575
+ info="长描述→成语")
576
+ use_l4 = gr.Checkbox(value=True, label="L4 协议",
577
+ info="### 标签归一化")
578
+
579
+ use_jieba = gr.Checkbox(
580
+ value=False,
581
+ label="🚀 jieba NP 增强模式",
582
+ info="复杂角色提取场景效果更好(默认关闭)",
583
+ )
584
+
585
+ btn = gr.Button("🔥 压缩", variant="primary", size="lg",
586
+ elem_classes="primary")
587
+
588
+ # ------- 输出侧 -------
589
+ with gr.Column(scale=1):
590
+ gr.Markdown("### ✨ 压缩结果")
591
+ try:
592
+ out = gr.Textbox(
593
+ show_label=False,
594
+ lines=10,
595
+ interactive=False,
596
+ show_copy_button=True,
597
+ container=False,
598
+ )
599
+ except TypeError:
600
+ out = gr.Textbox(show_label=False, lines=10,
601
+ interactive=False, container=False)
602
+ stats_panel = gr.HTML(
603
+ "<div class='stat-card empty'>👈 在左侧输入并点击压缩按钮</div>"
604
+ )
605
+ rules_panel = gr.HTML()
606
+
607
+ # 隐藏 textbox:作为 btn.click 的第 4 个输出,接收压缩结果副本
608
+ hidden_dup = gr.Textbox(visible=False)
609
+
610
+ gr.Examples(
611
+ examples=EXAMPLES,
612
+ inputs=[inp, target, use_l1, use_l2, use_l3, use_l4, tokenizer_name, use_jieba],
613
+ label="📌 试试这些示例(点击自动填充并运行)",
614
+ examples_per_page=4,
615
+ )
616
+
617
+ with gr.Accordion("📖 关于 CHIP / 协议设计要点", open=False):
618
+ gr.Markdown("""
619
+ **核心发现(基于 1800 行实测数据)**
620
+
621
+ | Tokenizer | ZH/EN ratio | 中文相对英文 |
622
+ |---|---|---|
623
+ | **baichuan2** 🟦 | 0.875 | 中文省 12.5% |
624
+ | **deepseek_v3** 🟦 | 0.916 | 中文省 8.4% |
625
+ | **glm4** 🟦 | 0.924 | 中文省 7.6% |
626
+ | qwen2.5/3 🟦 | 0.988 | 持平 |
627
+ | **o200k** (GPT-4o) 🟧 | 1.163 | 中文贵 16.3% |
628
+ | **cl100k** (GPT-4) 🟧 | 1.731 | 中文贵 73.1% |
629
+
630
+ 🟦 = 国产 tokenizer · 🟧 = OpenAI tokenizer
631
+
632
+ ---
633
+
634
+ **4 层压缩架构**
635
+
636
+ - **L1 词法层** · 啰嗦套话 → 紧凑动宾,纯正则,~1.3-1.5×
637
+ - **L2 句法层** · 模式重排 + 列表化,~2-3×
638
+ - **L3 成语层** · 长描述 → 成语(实测白名单),仅对国产模型有意义,默认关闭
639
+ - **L4 协议层** · 标签归一化为 `### 标题`(实测全 tokenizer 1 token)
640
+
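+ 上述 4 层通过 `layers` 参数自由组合(示意代码,假设已 `pip install chip-prompt`):
+ 
+ ```python
+ from chip import compress
+ 
+ # 启用全部 4 层;L3 仅对国产 tokenizer 有意义
+ compress("因为最近在下雨,所以开车需要注意安全",
+          target="deepseek_v3", layers=["L1", "L2", "L3", "L4"])
+ ```
+ 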
641
+ **不主张的事**(诚实声明)
642
+
643
+ - ❌ 不主张"中文因为信息密度高所以更省 token" — Mythbuster 2026 在 SWE-bench 上证伪
644
+ - ❌ 不主张"中文 prompt 让模型更聪明" — 在英文中心模型上常常相反
645
+ - ❌ 不主张"全程使用文言文压缩" — LLM 对生僻文言虚词理解不稳定
646
+
647
+ CHIP 主张的是:**在符合中文训练分布的国产模型上,通过协议化压缩可在 token 经济性、可读性、可审计性三个维度同时提供工程价值。**
648
+ """)
649
+
650
+ gr.HTML(FOOTER_HTML)
651
+
652
+ # 事件
653
+ btn.click(
654
+ run,
655
+ inputs=[inp, target, use_l1, use_l2, use_l3, use_l4, tokenizer_name, use_jieba],
656
+ outputs=[out, stats_panel, rules_panel, hidden_dup],
657
+ )
658
+
659
+
660
+ if __name__ == "__main__":
661
+ demo.launch(
662
+ server_name=os.getenv("HOST", "0.0.0.0"),
663
+ server_port=int(os.getenv("PORT", 7860)),
664
+ share=os.getenv("SHARE") == "1",
665
+ )
chip/__init__.py ADDED
@@ -0,0 +1,15 @@
1
+ """
2
+ chip — Chinese High-density Instruction Protocol
3
+
4
+ 主要 API:
5
+ >>> from chip import compress
6
+ >>> compress("请你帮我总结一下这段文字的核心观点是什么")
7
+ '总结一下这段文字的核心观点是什么'
8
+
9
+ >>> compress(text, target="qwen2.5", layers=("L1", "L2"))
10
+ """
11
+
12
+ from chip.compressor import compress, Compressor, CompressionResult
13
+
14
+ __version__ = "0.2.0"
15
+ __all__ = ["compress", "Compressor", "CompressionResult", "__version__"]
chip/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (638 Bytes).
 
chip/__pycache__/cli.cpython-311.pyc ADDED
Binary file (2.95 kB).
 
chip/__pycache__/compressor.cpython-311.pyc ADDED
Binary file (14.6 kB).
 
chip/cli.py ADDED
@@ -0,0 +1,54 @@
1
+ """
2
+ chip.cli — 命令行入口
3
+
4
+ 安装后:
5
+ chip "请你帮我总结一下这段文字"
6
+ echo "..." | chip
7
+ chip --target qwen2.5 --layers L1 L2 --diff "..."
8
+ """
9
+ from __future__ import annotations
10
+ import argparse
+ import sys
11
+
12
+ from chip.compressor import Compressor
13
+
14
+
15
+ def main():
16
+ ap = argparse.ArgumentParser(prog="chip",
17
+ description="CHIP — Chinese High-density Instruction Protocol compressor")
18
+ ap.add_argument("text", nargs="?",
19
+ help="prompt text (or read from stdin if omitted)")
20
+ ap.add_argument("--target", default="qwen2.5",
21
+ choices=["qwen2.5", "cl100k", "o200k", "deepseek_v3", "glm4"],
22
+ help="target tokenizer (decides protocol track)")
23
+ ap.add_argument("--layers", nargs="+", default=["L1", "L2", "L4"],
24
+ choices=["L1", "L2", "L3", "L4"],
25
+ help="L1=词法 L2=句法 L3=成语(需国产模型) L4=协议归一化")
26
+ ap.add_argument("--diff", action="store_true",
27
+ help="show original / compressed / rules side-by-side")
28
+ ap.add_argument("--rules", default=None,
29
+ help="custom rules.yaml path")
30
+ args = ap.parse_args()
31
+
32
+ if args.text:
33
+ text = args.text
34
+ else:
35
+ text = sys.stdin.read()
36
+ if not text.strip():
37
+ ap.print_help()
38
+ sys.exit(1)
39
+
40
+ kwargs = {}
41
+ if args.rules:
42
+ kwargs["rules_path"] = args.rules
43
+ compressor = Compressor(target=args.target, layers=args.layers, **kwargs)
44
+ result = compressor.compress(text)
45
+
46
+ if args.diff:
47
+ print(result.diff())
48
+ print(f"\n字符压缩率: {result.char_ratio:.2%} ({len(result.original)} → {len(result.compressed)})")
49
+ else:
50
+ print(result.compressed)
51
+
52
+
53
+ if __name__ == "__main__":
54
+ main()
chip/compressor.py ADDED
@@ -0,0 +1,334 @@
1
+ """
2
+ chip.compressor
3
+ ================
4
+ CHIP 主压缩器。设计原则:
5
+ 1. 协议是文本,不是模型 — 不依赖 LLM 调用,纯规则可跑
6
+ 2. 目标感知 — target tokenizer 影响 L3 成语等决策;标签层 v0.2 起统一为 ### 标题(v0.1 的方括号双轨已废弃)
7
+ 3. 可逆 — 保留命名实体、数字、代码、URL 不动
8
+ 4. 可审计 — 每条改动可追溯到 rules.yaml 的某条规则
9
+
10
+ 当前实现层级:
11
+ L1 (lex) — 词法替换:啰嗦套话 → 紧凑动宾,纯正则,~1.3-1.5x 压缩
12
+ L2 (syn) — 句法重排:虚词替换、列表化,需 jieba 分词,~2-3x
13
+ L3 (idiom) — 成语压缩(基于实测白名单),需 target 是国产 tokenizer
14
+ L4 (proto) — 协议层归一化,统一为 ### 标签
15
+
16
+ NP-aware 角色提取(可选):
17
+ L2-022 默认用正则,在含空格的复合 NP 上偶有截断。
18
+ 设环境变量 CHIP_USE_JIEBA=1 启用 jieba 增强版。
19
+ """
20
+ from __future__ import annotations
21
+
22
+ import os
23
+ import re
24
+ from dataclasses import dataclass, field
25
+ from pathlib import Path
26
+ from typing import Iterable
27
+
28
+ import yaml
29
+
30
+
31
+ # ============ 数据类 ============
32
+ @dataclass
33
+ class Rule:
34
+ """一条 CHIP 转换规则。"""
35
+ id: str
36
+ layer: str # "L1" | "L2" | "L3" | "L4"
37
+ pattern: str # 正则
38
+ replacement: str
39
+ description: str = ""
40
+ saves: int = 0 # 在参考 tokenizer 上预估省多少 token
41
+ risk: str = "low" # low | mid | high
42
+ flags: int = 0
43
+ _compiled: re.Pattern | None = field(default=None, repr=False)
44
+
45
+ def compile(self):
46
+ if self._compiled is None:
47
+ self._compiled = re.compile(self.pattern, self.flags)
48
+ return self._compiled
49
+
50
+
51
+ @dataclass
52
+ class CompressionResult:
53
+ """压缩结果,带 audit trail。"""
54
+ original: str
55
+ compressed: str
56
+ applied_rules: list[str] # 命中的 rule id 列表
57
+ target: str # tokenizer 名
58
+ layers: tuple
59
+
60
+ @property
61
+ def char_ratio(self) -> float:
62
+ return len(self.compressed) / max(len(self.original), 1)
63
+
64
+ def diff(self) -> str:
65
+ """简单的并排展示。"""
66
+ return f"原: {self.original}\n压: {self.compressed}\n规则: {', '.join(self.applied_rules) or '(none)'}"
67
+
68
+
69
+ # ============ 规则加载 ============
70
+ DEFAULT_RULES_PATH = Path(__file__).parent / "rules" / "rules.yaml"
71
+
72
+
73
+ def load_rules(path: Path | str = DEFAULT_RULES_PATH) -> list[Rule]:
74
+ """从 yaml 加载规则。"""
75
+ path = Path(path)
76
+ with open(path, encoding="utf-8") as f:
77
+ data = yaml.safe_load(f)
78
+
79
+ rules = []
80
+ for item in data.get("rules", []):
81
+ flags = 0
82
+ for flag_name in item.get("flags", []):
83
+ flags |= getattr(re, flag_name.upper(), 0)
84
+ rules.append(Rule(
85
+ id=item["id"],
86
+ layer=item["layer"],
87
+ pattern=item["pattern"],
88
+ replacement=item.get("replacement", ""),
89
+ description=item.get("description", ""),
90
+ saves=item.get("saves", 0),
91
+ risk=item.get("risk", "low"),
92
+ flags=flags,
93
+ ))
94
+ return rules
95
+
96
+
97
+ # ============ 保护性 mask ============
98
+ # 这些 pattern 命中的子串会先被替换成占位符,跑完规则后再还原。
99
+ # 防止规则误改专有名词、URL、代码、数字。
100
+ PROTECT_PATTERNS = [
101
+ ("URL", re.compile(r"https?://\S+")),
102
+ ("CODE", re.compile(r"```[\s\S]*?```|`[^`\n]+`")),
103
+ ("NUM", re.compile(r"\d+(?:\.\d+)?(?:%|km|kg|m|s|°C)?")),
104
+ ("EMAIL", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")),
105
+ # 双引号包裹的引文(用户原话)
106
+ ("QUOTE", re.compile(r"[\"\u201c][^\"\u201d]+[\"\u201d]")),
107
+ ]
108
+
109
+
110
+ # 占位符前缀用一个不会出现在自然中文里、且不会被 PROTECT_PATTERNS 命中的 token
111
+ _PH_OPEN = "\u2983" # ⦃
112
+ _PH_CLOSE = "\u2984" # ⦄
113
+ _PH_RE = re.compile(rf"{_PH_OPEN}\d+{_PH_CLOSE}")
114
+
115
+
116
+ def _mask(text: str) -> tuple[str, list[tuple[str, str]]]:
117
+ """把不可压缩片段替换成 ⦃i⦄ 占位符,返回 (masked, mappings)。
118
+
119
+ 关键:若 match 恰为已有占位符本身则跳过;数字等落在占位符内部的命中,靠 _unmask 的逆序还原兜底。
120
+ """
121
+ mappings = []
122
+ masked = text
123
+
124
+ def make_sub():
125
+ def _sub(m):
126
+ # 如果 match 整体落在已有占位符内,跳过
127
+ content = m.group(0)
128
+ if _PH_RE.fullmatch(content):
129
+ return content
130
+ i = len(mappings)
131
+ placeholder = f"{_PH_OPEN}{i}{_PH_CLOSE}"
132
+ mappings.append((placeholder, content))
133
+ return placeholder
134
+ return _sub
135
+
136
+ for tag, pat in PROTECT_PATTERNS:
137
+ masked = pat.sub(make_sub(), masked)
138
+ return masked, mappings
139
+
140
+
141
+ def _unmask(text: str, mappings: list[tuple[str, str]]) -> str:
142
+ # 逆序还原:后生成占位符的原文里可能嵌着先生成的占位符(如 QUOTE 捕获了已 mask 的 URL),倒序替换才能逐层展开
143
+ for placeholder, original in reversed(mappings):
144
+ text = text.replace(placeholder, original)
145
+ return text
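+ # mask → unmask 往返示意(仅演示,非测试断言):
+ #   masked, maps = _mask("参考 https://example.com,涨幅 99%")
+ #   masked 形如 "参考 ⦃0⦄,涨幅 ⦃1⦄";_unmask(masked, maps) 精确还原原文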
146
+
147
+
148
+ # ============ 主类 ============
149
+ class Compressor:
150
+ """可重用的压缩器实例。"""
151
+
152
+ def __init__(self,
153
+ rules_path: Path | str = DEFAULT_RULES_PATH,
154
+ target: str = "qwen2.5",
155
+ layers: Iterable[str] = ("L1", "L2", "L4")):
156
+ """
157
+ Args:
158
+ target: 目标 tokenizer,影响成语压缩等 target-aware 决策
159
+ layers: 启用的压缩层
160
+ - L1: 词法层(套话剪枝),保险,默认开
161
+ - L2: 句法层(模式重排),保险,默认开
162
+ - L3: 成语层(语义压缩),需 target 是国产 tokenizer 才有意义,默认关
163
+ - L4: 协议层归一化(### 标题统一),无害,默认开
164
+ """
165
+ self.rules = load_rules(rules_path)
166
+ self.target = target
167
+ self.layers = tuple(layers)
168
+ # 预编译
169
+ for r in self.rules:
170
+ r.compile()
171
+
172
+ def compress(self, text: str) -> CompressionResult:
173
+ original = text
174
+
175
+ # 可选:jieba 增强角色提取 (pre-process,优先于 L2-022 的纯正则)
176
+ applied_pre = []
177
+ if os.getenv("CHIP_USE_JIEBA") == "1" and "L2" in self.layers:
178
+ text, jieba_applied = _jieba_role_extract(text)
179
+ if jieba_applied:
180
+ applied_pre.append("L2-022J(jieba)")
181
+
182
+ masked, mappings = _mask(text)
183
+ applied = list(applied_pre)
184
+
185
+ for rule in self.rules:
186
+ if rule.layer not in self.layers:
187
+ continue
188
+ new_text, n = rule._compiled.subn(rule.replacement, masked)
189
+ if n > 0:
190
+ applied.append(f"{rule.id}×{n}")
191
+ masked = new_text
192
+
193
+ # 收尾:多余空白、连续标点
194
+ masked = re.sub(r"[ \t]+", " ", masked)
195
+ masked = re.sub(r"\s*\n\s*\n\s*\n+", "\n\n", masked)
196
+
197
+ # 协议层留下的孤立标点清理(L2-022 等会留下 "\n,xxx")
198
+ masked = re.sub(r"\n[,,;;。.\s]+", "\n", masked)
199
+ masked = re.sub(r"^[,,;;]+\s*", "", masked, flags=re.MULTILINE)
200
+
201
+ masked = masked.strip()
202
+ compressed = _unmask(masked, mappings)
203
+
204
+ return CompressionResult(
205
+ original=original,
206
+ compressed=compressed,
207
+ applied_rules=applied,
208
+ target=self.target,
209
+ layers=self.layers,
210
+ )
211
+
212
+
213
+ # ============ 便捷函数 ============
214
+ _default_compressor = None
215
+
216
+
217
+ def compress(text: str,
218
+ target: str = "qwen2.5",
219
+ layers: Iterable[str] = ("L1", "L2", "L4"),
220
+ return_result: bool = False) -> str | CompressionResult:
221
+ """简便入口。
222
+
223
+ >>> compress("请帮我总结一下这段文字")
224
+ '总结一下这段文字'
225
+
226
+ >>> compress("...", layers=["L1","L2","L3","L4"]) # 启用所有层(包括成语)
227
+
228
+ >>> r = compress("...", return_result=True)
229
+ >>> print(r.diff())
230
+ """
231
+ global _default_compressor
232
+ key = (target, tuple(layers))
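+     # 以 (target, layers) 为键缓存单个 Compressor;键变化或被外部置 None(如 app.py 切换 jieba)时重建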
233
+ if _default_compressor is None or _default_compressor[0] != key:
234
+ _default_compressor = (key, Compressor(target=target, layers=layers))
235
+ result = _default_compressor[1].compress(text)
236
+ return result if return_result else result.compressed
237
+
238
+
239
+ # ============ jieba NP 提取(可选增强) ============
240
+ _jieba_loaded = False
241
+
242
+
243
+ def _ensure_jieba():
244
+ """懒加载 jieba。"""
245
+ global _jieba_loaded
246
+ if _jieba_loaded:
247
+ return True
248
+ try:
249
+ import jieba.posseg as pseg # noqa: F401
250
+ _jieba_loaded = True
251
+ return True
252
+ except ImportError:
253
+ return False
254
+
255
+
256
+ # 角色扮演的触发短语 — jieba 用它定位
257
+ _ROLE_PREFIX_RE = re.compile(
258
+ r"请\s*(?:你)?\s*扮演\s*(?:一(?:个|位))?\s*"
259
+ )
260
+
261
+
262
+ def _jieba_role_extract(text: str) -> tuple[str, bool]:
263
+ """用 jieba 词性标注提取最长名词短语作为角色描述。
264
+
265
+ 替换 L2-022 的纯正则 lookahead 实现 — 后者在以下场景失败:
266
+ - 角色描述非常长且无标点结尾
267
+ - 角色描述被句中的连词意外截断("...然后..." 这种)
268
+
269
+ 策略:
270
+ 1. 找到 "请你扮演[一位]" 触发短语
271
+ 2. 从触发短语后开始,jieba.posseg 切分
272
+ 3. 贪婪收集 NP token,直到遇到 hard-stop:
273
+ - 连词 c (然后/接着/以及)
274
+ - 介词 p (对/把/为)
275
+ - 动词 v (但 vn 动名词允许)
276
+ - 句末标点 w (。;,等)
277
+ 4. 助词 'uj/u/ul'(的/地/得)、空格、英文都允许进入 NP
278
+ """
279
+ if not _ensure_jieba():
280
+ return text, False
281
+
282
+ import jieba.posseg as pseg
283
+
284
+ m = _ROLE_PREFIX_RE.search(text)
285
+ if not m:
286
+ return text, False
287
+
288
+ head = text[:m.start()]
289
+ body = text[m.end():]
290
+ if not body:
291
+ return text, False
292
+
293
+ words = list(pseg.cut(body))
294
+
295
+ # NP 定义:最长前缀,直到遇到硬终止
296
+ # HARD_STOP:动词(非 vn)、连词、介词、标点
297
+ # ALLOW_IN_NP:名词、形容词、英文、数字、量词、助词(的/地/得)、空格
298
+ np_chars = []
299
+ cumlen = 0
300
+ rest_start = 0
301
+ found_np_core = False # 是否已经收到名词或形容词(NP 核心)
302
+
303
+ for w, flag in words:
304
+ # hard stop 条件
305
+ is_hard_stop = (
306
+ flag == "w" # 标点
307
+ or w in {",", ",", "。", ".", ";", ";", ":", ":", "、", "\n"}
308
+ or flag == "c" # 连词
309
+ or flag == "p" # 介词
310
+ or (flag.startswith("v") and flag != "vn") # 真动词(非动名词)
311
+ )
312
+ if is_hard_stop and found_np_core:
313
+ rest_start = cumlen
314
+ break
315
+
316
+ # 在 NP 内
317
+ np_chars.append(w)
318
+ cumlen += len(w)
319
+ if flag.startswith("n") or flag.startswith("a") or flag == "eng":
320
+ found_np_core = True
321
+ else:
322
+ # 遍历完了,整个 body 都是 NP
323
+ rest_start = cumlen
324
+
325
+ np_str = "".join(np_chars).strip()
326
+ if not np_str or len(np_str) < 2 or not found_np_core:
327
+ return text, False
328
+
329
+ rest = body[rest_start:]
330
+ new_text = f"{head}\n### 角色\n{np_str}\n{rest}"
331
+ # 清理紧跟在角色块后的孤立标点
332
+ new_text = re.sub(r"\n[,,;;。.]+", "\n", new_text)
333
+ new_text = new_text.strip()
334
+ return new_text, True
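+ # 提取效果示意(具体切分依赖 jieba 词性标注,输出仅供参考):
+ #   _jieba_role_extract("请你扮演一位资深 Python 工程师,审查下面的代码")
+ #   ≈ ("### 角色\n资深 Python 工程师\n审查下面的代码", True)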
chip/rules/rules.yaml ADDED
@@ -0,0 +1,418 @@
1
+ # CHIP Compression Rules v0.2 (2025-05-01)
2
+ # ==========================================
3
+ # v0.2 vs v0.1 主要变化:
4
+ # - 标签层从 [角:X] / 【任】 改为 ### 角色 (实测全 tokenizer 1 token,完爆方括号)
5
+ # - 新增 L3 成语层(基于 idiom_whitelist.json 实测)
6
+ # - 新增 L4 协议层(归一化用户已有的标签)
7
+
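+ # 每条规则字段:id / layer / pattern(Python 正则)/ replacement / saves(参考 tokenizer 上的预估节省)/ risk(low|mid|high)
+ 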
8
+ rules:
9
+ # ============================================================
10
+ # L1: 词法层 — 啰嗦套话剪枝
11
+ # ============================================================
12
+ - id: L1-001
13
+ layer: L1
14
+ pattern: "请你?帮我?"
15
+ replacement: ""
16
+ saves: 2
17
+ risk: low
18
+ description: "客套语 '请你帮我' / '请帮我' → 空"
19
+
20
+ - id: L1-002
21
+ layer: L1
22
+ pattern: "麻烦你?"
23
+ replacement: ""
24
+ saves: 2
25
+ risk: low
26
+
27
+ - id: L1-003
28
+ layer: L1
29
+ pattern: "如果可以的话[,,]?"
30
+ replacement: ""
31
+ saves: 3
32
+ risk: low
33
+
34
+ - id: L1-004
35
+ layer: L1
36
+ pattern: "(?:能不能|可不可以|可以|能)(?=帮|告诉|解释|总结|分析)"
37
+ replacement: ""
38
+ saves: 2
39
+ risk: low
40
+
41
+ - id: L1-005
42
+ layer: L1
43
+ pattern: "辛苦你?"
44
+ replacement: ""
45
+ saves: 2
46
+ risk: low
47
+
48
+ - id: L1-006
49
+ layer: L1
50
+ pattern: "(?:谢谢|感谢)(?:你|了)?[!!.。]?"
51
+ replacement: ""
52
+ saves: 2
53
+ risk: low
54
+
55
+ # ---- 进行/做 + 动词性名词 → 单字动词 ----
56
+ - id: L1-010
57
+ layer: L1
58
+ pattern: "进行(?:一?(?:个|下|次)?)?分析"
59
+ replacement: "分析"
60
+ saves: 2
61
+ risk: low
62
+
63
+ - id: L1-011
64
+ layer: L1
65
+ pattern: "进行(?:一?(?:个|下|次)?)?总结"
66
+ replacement: "总结"
67
+ saves: 2
68
+ risk: low
69
+
70
+ - id: L1-012
71
+ layer: L1
72
+ pattern: "进行(?:一?(?:个|下|次)?)?处理"
73
+ replacement: "处理"
74
+ saves: 2
75
+ risk: low
76
+
77
+ - id: L1-013
78
+ layer: L1
79
+ pattern: "进行(?:一?(?:个|下|次)?)?解释"
80
+ replacement: "解释"
81
+ saves: 2
82
+ risk: low
83
+
84
+ - id: L1-014
85
+ layer: L1
86
+ pattern: "做(?:一?(?:个|下|次)?)?判断"
87
+ replacement: "判定"
88
+ saves: 3
89
+ risk: low
90
+
91
+ - id: L1-015
92
+ layer: L1
93
+ pattern: "做(?:一?(?:个|下|次)?)?解释"
94
+ replacement: "解释"
95
+ saves: 3
96
+ risk: low
97
+
98
+ - id: L1-016
99
+ layer: L1
100
+ pattern: "给(?:出|我)(?:一些|几个)?建议"
101
+ replacement: "建议"
102
+ saves: 2
103
+ risk: low
104
+
105
+ - id: L1-017
106
+ layer: L1
107
+ pattern: "提供(?:一些|相关|相对)?帮助"
108
+ replacement: "助"
109
+ saves: 2
110
+ risk: mid
111
+
112
+ - id: L1-018
113
+ layer: L1
114
+ pattern: "进行(?:一?(?:个|下|次)?)?检查"
115
+ replacement: "检查"
116
+ saves: 2
117
+ risk: low
118
+
119
+ - id: L1-019
120
+ layer: L1
121
+ pattern: "进行(?:一?(?:个|下|次)?)?优化"
122
+ replacement: "优化"
123
+ saves: 2
124
+ risk: low
125
+
126
+ # ---- 连接词 ----
127
+ - id: L1-020
128
+ layer: L1
129
+ pattern: "也就是说[,,]?"
130
+ replacement: "即"
131
+ saves: 3
132
+ risk: low
133
+
134
+ - id: L1-021
135
+ layer: L1
136
+ pattern: "换句话说[,,]?"
137
+ replacement: "即"
138
+ saves: 3
139
+ risk: low
140
+
141
+ - id: L1-022
142
+ layer: L1
143
+ pattern: "与此同时[,,]?"
144
+ replacement: "同时,"
145
+ saves: 2
146
+ risk: low
147
+
148
+ - id: L1-023
149
+ layer: L1
150
+ pattern: "在这种情况下[,,]?"
151
+ replacement: "此时,"
152
+ saves: 3
153
+ risk: low
154
+
155
+ - id: L1-024
156
+ layer: L1
157
+ pattern: "由此可见[,,]?"
158
+ replacement: "故"
159
+ saves: 3
160
+ risk: low
161
+
162
+ - id: L1-025
163
+ layer: L1
164
+ pattern: "因此(?:[,,]|说)?"
165
+ replacement: "故"
166
+ saves: 1
167
+ risk: low
168
+
169
+ - id: L1-026
170
+ layer: L1
171
+ pattern: "如果没有"
172
+ replacement: "若无"
173
+ saves: 2
174
+ risk: low
175
+
176
+ - id: L1-027
177
+ layer: L1
178
+ pattern: "通过(.+?)的方式"
179
+ replacement: "用\\1"
180
+ saves: 2
181
+ risk: mid
182
+
183
+ - id: L1-028
184
+ layer: L1
185
+ pattern: "(?:如上所述|前面提到的|刚才说的)"
186
+ replacement: "前述"
187
+ saves: 3
188
+ risk: low
189
+
190
+ # ---- 修饰副词 ----
191
+ - id: L1-030
192
+ layer: L1
193
+ pattern: "比较(?:简洁|清晰|详细)地?"
194
+ replacement: ""
195
+ saves: 3
196
+ risk: low
197
+
198
+ - id: L1-031
199
+ layer: L1
200
+ pattern: "相对(?:简洁|详细|完整)地?"
201
+ replacement: ""
202
+ saves: 3
203
+ risk: low
204
+
205
+ - id: L1-032
206
+ layer: L1
207
+ pattern: "尽可能(?:地)?"
208
+ replacement: "尽量"
209
+ saves: 1
210
+ risk: low
211
+
212
+ - id: L1-033
213
+ layer: L1
214
+ pattern: "非常(?:详细|详尽|全面)地?"
215
+ replacement: "详细"
216
+ saves: 2
217
+ risk: low
218
+
219
+ # ============================================================
220
+ # L2: 句法层
221
+ # ============================================================
222
+ - id: L2-001
223
+ layer: L2
224
+ pattern: "对(.+?)进行(?:一?(?:个|下|次)?(?:全面|详细|简要|认真|深入)?的?)?([\\u4e00-\\u9fff]{1,4})"
225
+ replacement: "\\2\\1"
226
+ saves: 2
227
+ risk: mid
228
+ description: "'对 X 进行 Y' → 'Y X'"
229
+
230
+ - id: L2-002
231
+ layer: L2
232
+ pattern: "把(.+?)作为(.+?)(?=[,,。.\\s])"
233
+ replacement: "视\\1为\\2"
234
+ saves: 2
235
+ risk: mid
236
+
237
+ - id: L2-003
238
+ layer: L2
239
+ pattern: "由于(.+?)所以"
240
+ replacement: "\\1故"
241
+ saves: 3
242
+ risk: low
243
+
244
+ - id: L2-004
245
+ layer: L2
246
+ pattern: "虽然(.+?)但是"
247
+ replacement: "\\1然"
248
+ saves: 3
249
+ risk: mid
250
+
251
+ - id: L2-005
252
+ layer: L2
253
+ pattern: "不仅(.+?)而且"
254
+ replacement: "\\1且"
255
+ saves: 3
256
+ risk: low
257
+
258
+ - id: L2-006
259
+ layer: L2
260
+ pattern: "因为(.+?)所以"
261
+ replacement: "\\1故"
262
+ saves: 3
263
+ risk: low
264
+
265
+ - id: L2-007
266
+ layer: L2
267
+ pattern: "如果(.+?)那么"
268
+ replacement: "若\\1则"
269
+ saves: 2
270
+ risk: low
271
+
272
+ # ---- 列表化 ----
273
+ - id: L2-010
274
+ layer: L2
275
+ pattern: "第一[,,]"
276
+ replacement: "1. "
277
+ saves: 1
278
+ risk: low
279
+
280
+ - id: L2-011
281
+ layer: L2
282
+ pattern: "第二[,,]"
283
+ replacement: "2. "
284
+ saves: 1
285
+ risk: low
286
+
287
+ - id: L2-012
288
+ layer: L2
289
+ pattern: "第三[,,]"
290
+ replacement: "3. "
291
+ saves: 1
292
+ risk: low
293
+
294
+ - id: L2-013
295
+ layer: L2
296
+ pattern: "第四[,,]"
297
+ replacement: "4. "
298
+ saves: 1
299
+ risk: low
300
+
301
+ - id: L2-014
302
+ layer: L2
303
+ pattern: "首先[,,]"
304
+ replacement: "1. "
305
+ saves: 1
306
+ risk: low
307
+
308
+ - id: L2-015
309
+ layer: L2
310
+ pattern: "其次[,,]"
311
+ replacement: "2. "
312
+ saves: 1
313
+ risk: low
314
+
315
+ # ============================================================
316
+ # L2 协议化重写 (v0.2 修订)
317
+ # 实测:### 在所有 9 个 tokenizer 上都是 1 token
318
+ # ============================================================
319
+ - id: L2-020
320
+ layer: L2
321
+ pattern: "请\\s*(?:用|以)?\\s*(?:JSON|json|Json)\\s*格式\\s*(?:输出|返回|回答)"
322
+ replacement: "\n### 输出\nJSON"
323
+ saves: 4
324
+ risk: low
325
+
326
+ - id: L2-021
327
+ layer: L2
328
+ pattern: "请\\s*(?:用|以)?\\s*中文\\s*(?:回答|回复|输出)"
329
+ replacement: "\n### 输出\n中文"
330
+ saves: 3
331
+ risk: low
332
+
333
+ - id: L2-022
334
+ layer: L2
335
+ pattern: "请\\s*(?:你)?\\s*扮演\\s*(?:一(?:个|位))?\\s*(.+?)(?=[,,。.\\n]|的角色|$)"
336
+ replacement: "\n### 角色\n\\1\n"
337
+ saves: 4
338
+ risk: high
339
+ description: |
340
+ '请你扮演一位 X' → '### 角色\nX'
341
+ 已知问题:含空格的复合 NP 可能被截断;设环境变量 CHIP_USE_JIEBA=1 可改用 jieba 增强版规避
342
+
343
+ # ============================================================
344
+ # L3: 成语层(仅收录 universal 档核心成语,需显式启用 layer=L3)
345
+ # 在 ≥3 国产 tokenizer 上 1 token,基于 idiom_whitelist.json 实测
346
+ # ============================================================
347
+ - id: L3-001
348
+ layer: L3
349
+ pattern: "(?:大家都知道|每个人都知道|众人皆知)"
350
+ replacement: "众所周知"
351
+ saves: 2
352
+ risk: mid
353
+
354
+ - id: L3-002
355
+ layer: L3
356
+ pattern: "投入(?:全部|所有)?(?:精力|力量)(?:去做|做)?"
357
+ replacement: "全力以赴"
358
+ saves: 2
359
+ risk: mid
360
+
361
+ - id: L3-003
362
+ layer: L3
363
+ pattern: "(?:根据|结合|按照)(?:当地|实际)情况"
364
+ replacement: "因地制宜"
365
+ saves: 2
366
+ risk: mid
367
+
368
+ - id: L3-004
369
+ layer: L3
370
+ pattern: "(?:一步一步|一步步)(?:地)?(?:推进|进行)"
371
+ replacement: "循序渐进"
372
+ saves: 3
373
+ risk: mid
374
+
375
+ - id: L3-005
376
+ layer: L3
377
+ pattern: "(?:不断|持续|一直)(?:坚持|努力做)"
378
+ replacement: "持之以恒"
379
+ saves: 2
380
+ risk: mid
381
+
382
+ - id: L3-006
383
+ layer: L3
384
+ pattern: "认真(?:仔细)?(?:地)?对待"
385
+ replacement: "脚踏实地"
386
+ saves: 1
387
+ risk: mid
388
+
389
+ # ============================================================
390
+ # L4: 协议层归一化
391
+ # ============================================================
392
+ - id: L4-001
393
+ layer: L4
394
+ pattern: "(?:#+\\s*)?(?:任务|目标|Task|TASK)\\s*[::]\\s*"
395
+ replacement: "### 任务\n"
396
+ saves: 0
397
+ risk: low
398
+
399
+ - id: L4-002
400
+ layer: L4
401
+ pattern: "(?:#+\\s*)?(?:角色|身份|Role|ROLE)\\s*[::]\\s*"
402
+ replacement: "### 角色\n"
403
+ saves: 0
404
+ risk: low
405
+
406
+ - id: L4-003
407
+ layer: L4
408
+ pattern: "(?:#+\\s*)?(?:输出|返回|输出格式|Output|OUTPUT)\\s*[::]\\s*"
409
+ replacement: "### 输出\n"
410
+ saves: 0
411
+ risk: low
412
+
413
+ - id: L4-004
414
+ layer: L4
415
+ pattern: "(?:#+\\s*)?(?:约束|限制|要求|规则|Constraints|CONSTRAINTS)\\s*[::]\\s*"
416
+ replacement: "### 约束\n"
417
+ saves: 0
418
+ risk: low
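+ 
+ # 归一化效果示意:"Role: 资深产品经理" 经 L4-002 变为 "### 角色\n资深产品经理"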