Update README.md
Browse files
README.md
CHANGED
@@ -1,16 +1,57 @@
|
|
1 |
-
|
2 |
-
|
3 |
-
|
4 |
-
|
5 |
-
|
6 |
-
|
7 |
-
```
|
8 |
-
|
9 |
-
|
10 |
-
|
11 |
-
|
12 |
-
|
13 |
-
|
14 |
-
|
15 |
-
|
16 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# This is a state for rwkv6_7b_v2.1 that generates entity_types given the domain, the expert role in this domain, and the specific tasks that this expert can do, as produced by the persona_domain agent.
|
2 |
+
|
3 |
+
* The input is solely the context that you want this model to analyze.
|
4 |
+
* The output is the entity types relevant to the given domain, expert role, and tasks, in JSONL format.
|
5 |
+
|
6 |
+
# Please refer to the following demo as test code:
|
7 |
+
```python
|
8 |
+
from rwkv.model import RWKV
|
9 |
+
from rwkv.utils import PIPELINE, PIPELINE_ARGS
|
10 |
+
import torch
|
11 |
+
|
12 |
+
# download models: https://huggingface.co/BlinkDL
|
13 |
+
model = RWKV(model='/home/rwkv/Peter/model/base/RWKV-x060-World-7B-v2.1-20240507-ctx4096.pth', strategy='cuda fp16')
|
14 |
+
print(model.args)
|
15 |
+
pipeline = PIPELINE(model, "rwkv_vocab_v20230424") # 20B_tokenizer.json is in https://github.com/BlinkDL/ChatRWKV
|
16 |
+
# use pipeline = PIPELINE(model, "rwkv_vocab_v20230424") for rwkv "world" models
|
17 |
+
states_file = '/home/rwkv/Peter/rwkv_graphrag/agents/persona_domain_states/RWKV-x060-World-7B-v2.1-20240507-ctx4096.pth.pth'
|
18 |
+
states = torch.load(states_file)
|
19 |
+
states_value = []
|
20 |
+
device = 'cuda'
|
21 |
+
n_head = model.args.n_head
|
22 |
+
head_size = model.args.n_embd//model.args.n_head
|
23 |
+
for i in range(model.args.n_layer):
|
24 |
+
key = f'blocks.{i}.att.time_state'
|
25 |
+
value = states[key]
|
26 |
+
prev_x = torch.zeros(model.args.n_embd,device=device,dtype=torch.float16)
|
27 |
+
prev_states = value.clone().detach().to(device=device,dtype=torch.float16).transpose(1,2)
|
28 |
+
prev_ffn = torch.zeros(model.args.n_embd,device=device,dtype=torch.float16)
|
29 |
+
states_value.append(prev_x)
|
30 |
+
states_value.append(prev_states)
|
31 |
+
states_value.append(prev_ffn)
|
32 |
+
|
33 |
+
cat_char = '🐱'
|
34 |
+
bot_char = '🤖'
|
35 |
+
instruction ='根据input中的领域和任务,协助用户识别input文本中存在的实体类型。 实体类型必须与用户任务相关。 避免使用诸如“其他”或“未知”的通用实体类型。 非常重要的是:不要生成冗余或重叠的实体类型。用JSON格式输出。'
|
36 |
+
input_text = '{"领域": "文学与神话", "专家": "文学史学者/神话学家", "任务": ["分析《石头记》的历史背景和影响", "研究《红楼梦》与《金陵十二钗》之间的关系", "探讨东鲁孔梅溪对《石头记》的改编过程", "解析吴玉峰在《红楼梦》中的角色和贡献", "评估曹雪芹在《悼红轩中披阅十五间》中的写作技巧"]}'
|
37 |
+
ctx = f'{cat_char}:{instruction}\n{input_text}\n{bot_char}:'
|
38 |
+
print(ctx)
|
39 |
+
|
40 |
+
def my_print(s):
|
41 |
+
print(s, end='', flush=True)
|
42 |
+
|
43 |
+
|
44 |
+
|
45 |
+
args = PIPELINE_ARGS(temperature = 1, top_p = 0.2, top_k = 0, # top_k = 0 then ignore
|
46 |
+
alpha_frequency = 0.5,
|
47 |
+
alpha_presence = 0.5,
|
48 |
+
alpha_decay = 0.998, # gradually decay the penalty
|
49 |
+
token_ban = [0], # ban the generation of some tokens
|
50 |
+
token_stop = [0,1], # stop generation whenever you see any token here
|
51 |
+
chunk_len = 256) # split input into chunks to save VRAM (shorter -> slower)
|
52 |
+
|
53 |
+
pipeline.generate(ctx, token_count=1000, args=args, callback=my_print,state=states_value)
|
54 |
+
print('\n')
|
55 |
+
```
|
56 |
+
# The final printed input and output:
|
57 |
+
![](./entity_type_demo.png)
|