Translation and Fusion Improves Zero-shot Cross-lingual Information Extraction

Summary

We propose TransFusion, a framework in which models are fine-tuned to use English translations of low-resource language data, enabling more precise predictions through annotation fusion. Based on TransFusion, we introduce GoLLIE-TF, a cross-lingual instruction-tuned LLM for information extraction (IE) tasks, designed to close the performance gap between high- and low-resource languages.
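As a rough illustration of what the model sees, the sketch below assembles a TransFusion-style prompt with the same layout as the inference example in this README: the task definitions, the original text, its English translation, and an open `result = [` list that the model completes. The helper name `build_transfusion_prompt` is hypothetical (it is not part of GoLLIE), and the English translation is assumed to be supplied by the caller, for example from a separate machine translation system.

```python
# Hypothetical helper (not part of GoLLIE): builds a prompt with the same layout
# as the README's inference example. json.dumps(..., ensure_ascii=False) yields a
# double-quoted literal that keeps non-ASCII characters (e.g. Japanese) readable.
import json


def build_transfusion_prompt(task_definitions: str, text: str, eng_text: str) -> str:
    """Return a prompt asking the model to annotate eng_text first, then text."""
    return (
        "# The following lines describe the task definition\n"
        f"{task_definitions.strip()}\n\n"
        "# This is the text to analyze\n"
        f"text = {json.dumps(text, ensure_ascii=False)}\n\n"
        "# This is the English translation of the text\n"
        f"eng_text = {json.dumps(eng_text, ensure_ascii=False)}\n\n"
        "# Using translation and fusion\n"
        "# (1) generate annotation for eng_text\n"
        "# (2) generate annotation for text\n\n"
        "# The annotation instances that take place in the eng_text above are listed here\n"
        "result = [\n"
    )


# Usage with a trimmed-down version of the README's task definition
defs = '''
@dataclass
class LLM(Entity):
    """Large language model names or model names."""

    span: str  # Such as: "GPT-3.5", "ChatGPT"
'''
prompt = build_transfusion_prompt(
    defs,
    "GoLLIE-7B-TFが本日リリースされました!",
    "GoLLIE-7B-TF is released today!",
)
print(prompt)
```

Keeping the prompt construction in one function makes it easy to batch many (text, translation) pairs while guaranteeing that every prompt ends with the open list the model is expected to continue.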

Important (adapted from the GoLLIE README): our flash attention implementation has small numerical differences compared to the attention implementation in Hugging Face Transformers. You must pass trust_remote_code=True when loading the model, or you will get inferior results. Flash attention requires an available CUDA GPU; running GoLLIE pre-trained models on a CPU is not supported. We plan to address this in future releases. First, install flash attention 2:

pip install flash-attn --no-build-isolation
pip install git+https://github.com/HazyResearch/flash-attention.git#subdirectory=csrc/rotary
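Because the model depends on flash attention and a CUDA GPU, a quick pre-flight check can save a failed model load. The sketch below is a hypothetical helper, not part of GoLLIE: it only checks that `torch`, `transformers`, and `flash_attn` are importable and that PyTorch can see a CUDA device. It does not verify the rotary extension or flash attention version compatibility.

```python
# Hypothetical pre-flight check (not part of GoLLIE): confirms the required
# packages are importable and that a CUDA GPU is visible, since the flash
# attention code path does not run on CPU.
import importlib.util


def check_environment() -> list[str]:
    """Return a list of human-readable problems; an empty list means ready."""
    problems = []
    for module in ("torch", "transformers", "flash_attn"):
        # find_spec locates a top-level module without importing it
        if importlib.util.find_spec(module) is None:
            problems.append(f"missing Python package: {module}")
    if importlib.util.find_spec("torch") is not None:
        import torch  # imported lazily so the check itself works without torch

        if not torch.cuda.is_available():
            problems.append("no CUDA GPU visible to PyTorch")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("ready" if not issues else "\n".join(issues))
```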

Then you can load the model and run inference as follows:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ychenNLP/GoLLIE-7B-TF")
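# Load the model from the same repository as the tokenizer; trust_remote_code=True enables the flash attention implementation
model = AutoModelForCausalLM.from_pretrained("ychenNLP/GoLLIE-7B-TF", trust_remote_code=True, torch_dtype=torch.bfloat16)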
model.to("cuda")

test_input = r'''# The following lines describe the task definition
@dataclass
class LLM(Entity):
    """Large language model names or model names. This is used for deep learning and NLP tasks."""

    span: str  # Such as: "GPT-3.5", "LLaMA-7B", "ChatGPT"

@dataclass
class Hyperparams(Entity):
    """Hyperparameters used for training large language models, including learning rate, scheduler, and architecture."""

    span: str  # Such as: "layernorm", "cosine scheduler"

# This is the text to analyze
text = "GoLLIE-7B-TFが本日リリースされました! 1つのNVIDIA A100 GPUで推論が可能なサイズです 学習率は1e-4です 訓練にはLoRAが使用されています"

# This is the English translation of the text
eng_text = "GoLLIE-7B-TF is released today! * Sized for inference on 1 NVIDIA A100 GPU * learning rate 1e-4 * LoRA is used for training"

# Using translation and fusion
# (1) generate annotation for eng_text
# (2) generate annotation for text

# The annotation instances that take place in the eng_text above are listed here
result = [
'''

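model_input = tokenizer(test_input, return_tensors="pt")

# Inspect the token ids (optional)
print(model_input["input_ids"])

# Remove the trailing end-of-sequence token appended by the tokenizer, so the model
# continues the open `result = [` list instead of treating the prompt as finished
model_input["input_ids"] = model_input["input_ids"][:, :-1]
model_input["attention_mask"] = model_input["attention_mask"][:, :-1]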

# Greedy decoding: a single beam, no sampling
model_output = model.generate(
    **model_input.to(model.device),
    max_new_tokens=128,
    do_sample=False,
    min_new_tokens=0,
    num_beams=1,
    num_return_sequences=1,
)
print(tokenizer.batch_decode(model_output))
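The prompt asks the model to (1) annotate `eng_text` and (2) annotate `text`, so the decoded output is expected to contain annotation lists after the prompt. The sketch below is a hypothetical post-processing step, not GoLLIE's own evaluation code: it assumes each list is introduced by `result = [` and contains instances written like `LLM(span="...")`, and it checks whether each span predicted for the original text appears verbatim in that text. The `example` string is made up to illustrate the assumed format; it is not real model output.

```python
# Hypothetical post-processing (not GoLLIE's evaluation code). Assumes the
# decoded output contains one or more `result = [` lists with instances written
# as Type(span="..."); the first list after the prompt covers eng_text.
import re

INSTANCE = re.compile(r'(\w+)\(span="((?:[^"\\]|\\.)*)"\)')


def parse_result_lists(decoded: str) -> list[list[tuple[str, str]]]:
    """Split at each `result = [` and collect (type, span) pairs from each chunk."""
    chunks = decoded.split("result = [")[1:]  # drop everything before the first list
    return [[(m.group(1), m.group(2)) for m in INSTANCE.finditer(c)] for c in chunks]


def check_spans(text: str, instances: list[tuple[str, str]]) -> list[tuple[str, str, bool]]:
    """Flag whether each predicted span occurs verbatim in the source text."""
    return [(label, span, span in text) for label, span in instances]


# Made-up decoded output illustrating the assumed format (not real model output)
example = (
    "# The annotation instances that take place in the eng_text above are listed here\n"
    'result = [\n    LLM(span="GoLLIE-7B-TF"),\n    Hyperparams(span="learning rate"),\n]\n'
    "# The annotation instances that take place in the text above are listed here\n"
    'result = [\n    LLM(span="GoLLIE-7B-TF"),\n    Hyperparams(span="学習率"),\n]\n'
)
text = "GoLLIE-7B-TFが本日リリースされました! 1つのNVIDIA A100 GPUで推論が可能なサイズです 学習率は1e-4です 訓練にはLoRAが使用されています"

english, original = parse_result_lists(example)
print(english)
print(check_spans(text, original))
```

With real output, the decoded string could come from `tokenizer.batch_decode(model_output, skip_special_tokens=True)[0]`. A verbatim-match check is a simple sanity filter; the actual model output format should be confirmed before relying on this parser.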
Model size: 6.61B params (Safetensors weights, tensor type F32)
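The checkpoint weights are stored in F32, while the loading example passes `torch_dtype=torch.bfloat16`. A back-of-the-envelope calculation shows what that choice saves. This is a weights-only estimate assuming 4 bytes per F32 parameter and 2 bytes per bfloat16 parameter; activations, the KV cache, and CUDA runtime overhead come on top and are not modeled here.

```python
# Weights-only memory estimate for a 6.61B-parameter model.
# Activations, KV cache, and CUDA overhead are NOT included.
PARAMS = 6.61e9
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weight_gib(params: float, dtype: str) -> float:
    """Size of the weights in GiB (2**30 bytes) for the given dtype."""
    return params * BYTES_PER_PARAM[dtype] / 2**30


for dtype in ("float32", "bfloat16"):
    print(f"{dtype}: {weight_gib(PARAMS, dtype):.1f} GiB")
# float32: 24.6 GiB
# bfloat16: 12.3 GiB
```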