---
license: mit
base_model: xlm-roberta-large
tags:
- generated_from_trainer
model-index:
- name: xlm-roberta-large-metaie
  results: []
---
# MetaIE

This is a multilingual meta-model distilled from GPT-3.5-turbo for information extraction. It is an intermediate checkpoint that transfers well to a wide range of downstream information extraction tasks (a minimal fine-tuning sketch is given at the end of this card). The model can also be queried directly with different label-to-span matching: each input pairs a label with a text as `label: text`, and the model marks the spans in the text that match the label, as in the example below.

Ten languages are supported:

- English
- French
- Spanish
- Italian
- German
- Polish
- Russian
- Chinese
- Japanese
- Korean
```python
from transformers import AutoModelForTokenClassification, AutoTokenizer
import torch

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
path = "KomeijiForce/xlm-roberta-large-metaie"
tokenizer = AutoTokenizer.from_pretrained(path)
tagger = AutoModelForTokenClassification.from_pretrained(path).to(device)

def find_sequences(lst):
    # Decode the predicted tags into spans: tag 0 marks the first token of
    # an extracted span and tag 1 marks a continuation token. Returns
    # (start, end) token-index pairs with `end` exclusive.
    sequences = []
    i = 0
    while i < len(lst):
        if lst[i] == 0:
            start = i
            end = i
            i += 1
            while i < len(lst) and lst[i] == 1:
                end = i
                i += 1
            sequences.append((start, end + 1))
        else:
            i += 1
    return sequences

# Each input follows the "label: text" format; the model tags the spans in
# the text that match the label.
examples = [
    "Fire volleys at the command happens: The soldiers were expected to fire volleys at the command of officers, but in practice this happened only in the first minutes of the battle .",
    "Historische Ereignisse: Siebenjährigen Krieg von 1756 bis 1763, war Preußen als fünfte Großmacht neben Frankreich, Großbritannien, Österreich und Russland in der europäischen Pentarchie anerkannt .",
    "高度: 东方明珠自落成后便为上海天际线的组成部分之一,总高468米。",
    "倒れた場所: カフカは高松の私立図書館に通うようになるが、ある日目覚めると、自分が森の中で血だらけで倒れていた。",
]

for example in examples:
    inputs = tokenizer(example, return_tensors="pt").to(device)
    with torch.no_grad():
        tag_predictions = tagger(**inputs).logits[0].argmax(-1)

    predictions = [
        tokenizer.decode(inputs.input_ids[0, seq[0]:seq[1]]).strip()
        for seq in find_sequences(tag_predictions)
    ]

    print(example)
    print(predictions)
```
The output will be:

```
Fire volleys at the command happens: The soldiers were expected to fire volleys at the command of officers, but in practice this happened only in the first minutes of the battle .
['first minutes of the battle']
Historische Ereignisse: Siebenjährigen Krieg von 1756 bis 1763, war Preußen als fünfte Großmacht neben Frankreich, Großbritannien, Österreich und Russland in der europäischen Pentarchie anerkannt .
['Siebenjährigen Krieg']
高度: 东方明珠自落成后便为上海天际线的组成部分之一,总高468米。
['468米']
倒れた場所: カフカは高松の私立図書館に通うようになるが、ある日目覚めると、自分が森の中で血だらけで倒れていた。
['森']
```
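Since this is an intermediate checkpoint, it is meant to be fine-tuned further on downstream IE data. The sketch below shows one possible way to continue training on a single toy example; the sentence, the target span, the label indices (0 = span start, 1 = inside, 2 = outside, mirroring the decoding above), and the hyperparameters are illustrative assumptions rather than a prescribed recipe. Check `tagger.config.num_labels` before reusing the label indices.

```python
import torch
from torch.optim import AdamW
from transformers import AutoModelForTokenClassification, AutoTokenizer

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
path = "KomeijiForce/xlm-roberta-large-metaie"
tokenizer = AutoTokenizer.from_pretrained(path)
tagger = AutoModelForTokenClassification.from_pretrained(path).to(device)
optimizer = AdamW(tagger.parameters(), lr=1e-5)

# One invented training pair in the same "label: text" format as above,
# with the character span of the answer ("Paris") inside the text.
text = "Capital city: Paris is the capital and largest city of France."
answer_start = text.index("Paris")
answer_end = answer_start + len("Paris")

enc = tokenizer(text, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0]
enc = {k: v.to(device) for k, v in enc.items()}

# Token-level tags mirroring the decoding logic above:
# 0 = first token of the target span, 1 = inside, 2 = outside (assumed).
labels = torch.full((offsets.size(0),), 2)
first = True
for i, (s, e) in enumerate(offsets.tolist()):
    if s < answer_end and e > answer_start and e > s:  # token overlaps the span
        labels[i] = 0 if first else 1
        first = False
labels = labels.unsqueeze(0).to(device)

tagger.train()
for _ in range(3):  # a few toy gradient steps
    loss = tagger(**enc, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(float(loss))
```

In practice you would replace the single toy pair with a full downstream dataset, batching the `label: text` inputs and aligning span annotations to token offsets exactly as above.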