---
license: mit
base_model: roberta-large
datasets:
- KomeijiForce/MetaIE-Pretrain
language:
- en
metrics:
- f1
pipeline_tag: token-classification
---
# MetaIE

MetaIE is a meta-model for information extraction, distilled from ChatGPT-4. This checkpoint is an intermediate one, meant to be transferred to downstream information extraction tasks of all kinds. It can also be tested directly through label-to-span matching: prefix the input with a label and the model tags the spans that match it, as in the following example:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

# Use the GPU when one is available, otherwise fall back to CPU.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
path = "KomeijiForce/roberta-large-metaie-gpt4"
tokenizer = AutoTokenizer.from_pretrained(path)
tagger = AutoModelForTokenClassification.from_pretrained(path).to(device)

def find_sequences(lst):
    # Collect (start, end) token spans, end exclusive:
    # label 0 opens a span and each following label 1 extends it.
    sequences = []
    i = 0
    while i < len(lst):
        if lst[i] == 0:
            start = i
            end = i
            i += 1
            while i < len(lst) and lst[i] == 1:
                end = i
                i += 1
            sequences.append((start, end+1))
        else:
            i += 1
    return sequences

def is_sublst(lst1, lst2):
    # True if lst2 occurs as a contiguous sub-list of lst1.
    for idx in range(len(lst1)-len(lst2)+1):
        if lst1[idx:idx+len(lst2)] == lst2:
            return True
    return False

words = ["John", "Smith", "loves", "his", "hometown", ",", "Los", "Angeles", "."]

for prefix in ["Person", "Location", "John Smith births in", "Positive opinion"]:

    # The query has the form "<label> : <sentence>".
    sentence = " ".join([prefix, ":"]+words)

    inputs = tokenizer(sentence, return_tensors="pt").to(device)
    with torch.no_grad():
        tag_predictions = tagger(**inputs).logits[0].argmax(-1).tolist()

    # Decode each tagged span, then keep only spans that align with whole words of the input.
    predictions = [tokenizer.decode(inputs.input_ids[0, seq[0]:seq[1]]).strip() for seq in find_sequences(tag_predictions)]
    predictions = [prediction for prediction in predictions if is_sublst(words, prediction.split())]

    print(prefix, predictions)
```

The output will be:

```python
Person ['John Smith']
Location ['Los Angeles']
John Smith births in ['Los Angeles']
Positive opinion ['loves his hometown']
```
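
The span decoding step can be tried without downloading the model. The following is a minimal sketch that runs a helper equivalent to `find_sequences` on a hand-written label sequence. The labels and the word-level tokens are hypothetical and chosen only for illustration: the real model labels subword tokens, and using `2` for "outside any span" is an assumption (the helper only requires that it is neither `0` nor `1`).

```python
def find_sequences(labels):
    """Return (start, end) index pairs, end exclusive, for each tagged span."""
    spans = []
    i = 0
    while i < len(labels):
        if labels[i] == 0:  # label 0 opens a span
            start = i
            i += 1
            while i < len(labels) and labels[i] == 1:  # label 1 extends it
                i += 1
            spans.append((start, i))
        else:
            i += 1
    return spans

# Hypothetical word-level tokens and labels for the query "Location : <sentence>".
# 2 stands in for "outside any span" (an assumption made for this sketch).
tokens = ["Location", ":", "John", "Smith", "loves", "his", "hometown", ",", "Los", "Angeles", "."]
labels = [2, 2, 2, 2, 2, 2, 2, 2, 0, 1, 2]

spans = find_sequences(labels)
extracted = [" ".join(tokens[start:end]) for start, end in spans]
print(spans)      # [(8, 10)]
print(extracted)  # ['Los Angeles']
```

A label `1` that does not follow a `0` is ignored, so a continuation tag with no opening tag yields no span.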