## **Model Overview**
This is the model presented in the AACL-IJCNLP 2023 paper ["Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification"](https://aclanthology.org/2022.acl-long.469/).
The model is [`mBART-large-50`](https://huggingface.co/facebook/mbart-large-50) fine-tuned on the parallel detoxification datasets [ParaDetox](https://huggingface.co/datasets/s-nlp/paradetox) and [RuDetox](https://github.com/s-nlp/russe_detox_2022), so it performs detoxification for both **Russian** and **English**. More details are presented in the paper.
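To get a feel for the parallel training data, the English corpus can be inspected directly with the 🤗 `datasets` library. A minimal sketch, assuming the dataset is publicly readable from the Hub under the name linked above:

```python
from datasets import load_dataset

# Load the English parallel detoxification corpus from the Hub.
paradetox = load_dataset("s-nlp/paradetox")
print(paradetox)              # available splits and columns
print(paradetox["train"][0])  # one toxic/neutral sentence pair
```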
## **How to use**
1. Load the model checkpoint.
```python
from transformers import MBartForConditionalGeneration, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("s-nlp/mBART_EN_RU")
model = MBartForConditionalGeneration.from_pretrained("s-nlp/mBART_EN_RU")
```
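The helper in step 2 reads `tokenizer.tgt_lang` to pick the forced target-language token. If the checkpoint's tokenizer config does not already define the language codes, set them before generating; a sketch using the standard mBART-50 codes `en_XX` and `ru_RU` (whether the checkpoint pre-sets these is an assumption to verify):

```python
# mBART-50 language codes. Detoxification preserves the input language,
# so source and target codes match; use "ru_RU" for Russian input.
# (Assumption: the checkpoint's tokenizer config does not set these already.)
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "en_XX"
```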
2. Define a helper function.
```python
def paraphrase(text, model, tokenizer, n=None, max_length="auto", beams=5):
    # Accept a single string or a list of strings.
    texts = [text] if isinstance(text, str) else text
    inputs = tokenizer(texts, return_tensors="pt", padding=True)["input_ids"].to(
        model.device
    )
    # Allow the output to be slightly longer than the input.
    if max_length == "auto":
        max_length = inputs.shape[1] + 10
    result = model.generate(
        inputs,
        num_return_sequences=n or 1,
        do_sample=False,
        temperature=1.0,
        repetition_penalty=10.0,
        max_length=max_length,
        min_length=int(0.5 * max_length),
        num_beams=beams,
        # Force generation to start with the target-language token.
        forced_bos_token_id=tokenizer.lang_code_to_id[tokenizer.tgt_lang],
    )
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in result]
    # Return a single string for single-string input, otherwise a list.
    if not n and isinstance(text, str):
        return texts[0]
    return texts
```
3. Generate.
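A minimal usage example (the input sentence is a hypothetical placeholder, not taken from the paper or the datasets):

```python
# Hypothetical toxic input; any English (or Russian, with the
# corresponding language codes) sentence works here.
print(paraphrase("He is such a stupid loser!", model, tokenizer))
```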
## **Citation**
```
TBD
```