---
license: cc-by-nc-sa-4.0
datasets:
- wi_locness
- matejklemen/falko_merlin
- paws
- paws-x
- asset
language:
- en
- de
- es
- ar
- ja
- ko
- zh
metrics:
- bleu
- rouge
- sari
- accuracy
library_name: transformers
---

# Model Card for mEdIT-xl

The `medit-xl` model was obtained by fine-tuning the `MBZUAI/bactrian-x-llama-7b-lora` model on the mEdIT dataset.

**Paper:** mEdIT: Multilingual Text Editing via Instruction Tuning

**Authors:** Vipul Raheja, Dimitris Alikaniotis, Vivek Kulkarni, Bashar Alhafni, Dhruv Kumar

## Model Details

### Model Description

- **Language(s) (NLP)**: Arabic, Chinese, English, German, Japanese, Korean, Spanish
- **Finetuned from model:** `MBZUAI/bactrian-x-llama-7b-lora`

### Model Sources

- **Repository:** https://github.com/vipulraheja/medit
- **Paper:** https://arxiv.org/abs/2402.16472v1

## How to use

Given an edit instruction and an original text, our model generates the edited version of the text.

![task_specs](https://cdn-uploads.huggingface.co/production/uploads/60985a0547dc3dbf8a976607/816ZY2t0XPCpMMd6Z072K.png)

Specifically, our models support both multilingual and cross-lingual text revision. The input and output texts are always in the same language. Whether a request is monolingual or cross-lingual is determined by comparing the language of the edit instruction with the language of the input text: the same language is monolingual, a different language is cross-lingual.

### Instruction format

Follow the instruction format below; deviating from it may degrade the quality of the model's output.

```
instruction_tokens = [
    "Instruction",
    "Anweisung",
    ...
]

input_tokens = [
    "Input",
    "Aporte",
    ...
]

output_tokens = [
    "Output",
    "Produzione",
    ...
]

task_descriptions = [
    "Fix grammatical errors in this sentence",  # <-- GEC task
    "Umschreiben Sie den Satz",  # <-- Paraphrasing
    ...
]
```

**The entire list of possible instructions, input/output tokens, and task descriptions can be found in the Appendix of our paper.**

```
prompt_template = """### <instruction_token>:\n<task_description>\n### <input_token>:\n<input>\n### <output_token>:\n\n"""
```

Note that the tokens and the task description need not be in the language of the input (in the case of cross-lingual revision).

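
The template can be filled in programmatically. The following is a minimal sketch, not part of the released code: `build_prompt` is a hypothetical helper name, and the arguments are taken from the example token and task lists shown in this section.

```python
def build_prompt(instruction_token, task_description, input_token, text, output_token):
    """Fill the prompt template with the chosen tokens, task description, and input text.

    Hypothetical helper for illustration; it mirrors the template string
    '### <instruction_token>:\\n<task_description>\\n### <input_token>:\\n<input>\\n### <output_token>:\\n\\n'.
    """
    return (
        f"### {instruction_token}:\n{task_description}\n"
        f"### {input_token}:\n{text}\n"
        f"### {output_token}:\n\n"
    )


# Monolingual English GEC: instruction and input are both in English.
prompt = build_prompt(
    "Instruction",
    "Fix grammatical errors in this sentence",
    "Input",
    "I has small cat ,",
    "Output",
)
print(repr(prompt))
```

For a cross-lingual request, pass tokens and a task description in a different language from the input text; the input itself is left unchanged.
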
### Run the model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "grammarly/medit-xl"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# English GEC using Japanese instructions
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nI has small cat ,\n### 出力:\n\n'

# Tokenize the prompt into PyTorch tensors
inputs = tokenizer(prompt, return_tensors='pt')

# Generate at most 20 new tokens after the prompt
outputs = model.generate(**inputs, max_new_tokens=20)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
# --> I have a small cat ,

# German GEC using Japanese instructions
prompt = '### 命令:\n文章を文法的にする\n### 入力:\nIch haben eines kleines Katze ,\n### 出力:\n\n'

# ...
# --> Ich habe eine kleine Katze ,
```
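
For causal language models, `generate` returns the prompt tokens followed by the newly generated tokens, so the decoded string typically starts with the prompt itself. The sketch below shows one way to keep only the edited text by splitting on the last output header. `extract_edit` is a hypothetical helper, and the `decoded` string is a hand-written stand-in for real model output, not an actual generation.

```python
def extract_edit(decoded: str, output_token: str) -> str:
    """Return the text that follows the last '### <output_token>:' header.

    Hypothetical post-processing helper; if the header is not present,
    the whole string is returned with surrounding whitespace removed.
    """
    marker = f"### {output_token}:"
    # rpartition splits on the last occurrence of the marker
    _, found, tail = decoded.rpartition(marker)
    if not found:
        return decoded.strip()
    return tail.strip()


# Illustrative stand-in for tokenizer.decode(outputs[0], skip_special_tokens=True)
decoded = (
    "### 命令:\n文章を文法的にする\n### 入力:\nI has small cat ,\n### 出力:\n\n"
    "I have a small cat ,"
)
print(extract_edit(decoded, "出力"))
```

Splitting on the header rather than slicing off `len(prompt)` characters avoids relying on the tokenizer reproducing the prompt character for character after decoding.
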

#### Software

https://github.com/vipulraheja/medit

## Citation

**BibTeX:**

```
@article{raheja2023medit,
      title={mEdIT: Multilingual Text Editing via Instruction Tuning},
      author={Vipul Raheja and Dimitris Alikaniotis and Vivek Kulkarni and Bashar Alhafni and Dhruv Kumar},
      year={2024},
      eprint={2402.16472v1},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

**APA:**

Raheja, V., Alikaniotis, D., Kulkarni, V., Alhafni, B., & Kumar, D. (2024). mEdIT: Multilingual Text Editing via Instruction Tuning. arXiv. https://arxiv.org/abs/2402.16472