## **Model Overview**
This is the model presented in the AACL-IJCNLP 2023 paper ["Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification"](https://aclanthology.org/2022.acl-long.469/).
The model is [`mBART-large-50`](https://huggingface.co/facebook/mbart-large-50) fine-tuned on the parallel detoxification datasets [ParaDetox](https://huggingface.co/datasets/s-nlp/paradetox) and [RuDetox](https://github.com/s-nlp/russe_detox_2022), so it performs detoxification for both **Russian** and **English**. More details are presented in the paper.
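To get a feel for the parallel training data, the English corpus can be inspected directly with the 🤗 `datasets` library. A minimal sketch, assuming the dataset is publicly readable from the Hub under the name linked above:

```python
from datasets import load_dataset

# Load the English parallel detoxification corpus from the Hub.
paradetox = load_dataset("s-nlp/paradetox")
print(paradetox)              # available splits and columns
print(paradetox["train"][0])  # one toxic/neutral sentence pair
```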
## **How to use**
1. Load the model checkpoint.
```python
from transformers import MBartForConditionalGeneration, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("s-nlp/mBART_EN_RU")
model = MBartForConditionalGeneration.from_pretrained("s-nlp/mBART_EN_RU")
```
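The helper in step 2 reads `tokenizer.tgt_lang` to pick the forced target-language token. If the checkpoint's tokenizer config does not already define the language codes, set them before generating; a sketch using the standard mBART-50 codes `en_XX` and `ru_RU` (whether the checkpoint pre-sets these is an assumption to verify):

```python
# mBART-50 language codes. Detoxification preserves the input language,
# so source and target codes match; use "ru_RU" for Russian input.
# (Assumption: the checkpoint's tokenizer config does not set these already.)
tokenizer.src_lang = "en_XX"
tokenizer.tgt_lang = "en_XX"
```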
2. Define a helper function.
```python
def paraphrase(text, model, tokenizer, n=None, max_length="auto", beams=5):
    # Accept a single string or a list of strings.
    texts = [text] if isinstance(text, str) else text
    inputs = tokenizer(texts, return_tensors="pt", padding=True)["input_ids"].to(
        model.device
    )
    # Allow the output to be slightly longer than the input.
    if max_length == "auto":
        max_length = inputs.shape[1] + 10
    result = model.generate(
        inputs,
        num_return_sequences=n or 1,
        do_sample=False,
        temperature=1.0,
        repetition_penalty=10.0,
        max_length=max_length,
        min_length=int(0.5 * max_length),
        num_beams=beams,
        # Force generation to start with the target-language token.
        forced_bos_token_id=tokenizer.lang_code_to_id[tokenizer.tgt_lang],
    )
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in result]
    # Return a single string for single-string input, otherwise a list.
    if not n and isinstance(text, str):
        return texts[0]
    return texts
```
3. Generate.
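A minimal usage example (the input sentence is a hypothetical placeholder, not taken from the paper or the datasets):

```python
# Hypothetical toxic input; any English (or Russian, with the
# corresponding language codes) sentence works here.
print(paraphrase("He is such a stupid loser!", model, tokenizer))
```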
## **Citation**
```
TBD
```