---
datasets:
- s-nlp/paradetox
- s-nlp/ru_paradetox
language:
- ru
- en
library_name: transformers
pipeline_tag: text2text-generation
license: openrail++
---

## Model Description

This is the model presented in the paper "Exploring Methods for Cross-lingual Text Style Transfer: The Case of Text Detoxification". The model is based on [mBART-large-50](https://huggingface.co/facebook/mbart-large-50) and fine-tuned on two parallel detoxification corpora: [ParaDetox](https://huggingface.co/datasets/s-nlp/paradetox) (English) and [RuDetox](https://github.com/s-nlp/russe_detox_2022/tree/main/data) (Russian). More details about the model can be found in the paper.

## Usage

1. Model loading. The tokenizer must know the source and target languages: mBART-50 uses the codes `en_XX` for English and `ru_RU` for Russian, and since detoxification keeps the language of the input, both are set to the same code.

```python
from transformers import MBartForConditionalGeneration, AutoTokenizer

model = MBartForConditionalGeneration.from_pretrained("s-nlp/mbart-detox-en-ru").cuda()
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/mbart-large-50", src_lang="en_XX", tgt_lang="en_XX"
)  # use src_lang="ru_RU", tgt_lang="ru_RU" for Russian inputs
```

2. Detoxification utility (a complete end-to-end example is given at the end of this card).

```python
def paraphrase(text, model, tokenizer, n=None, max_length="auto", beams=3):
    """Detoxify a string or a list of strings."""
    texts = [text] if isinstance(text, str) else text
    inputs = tokenizer(texts, return_tensors="pt", padding=True)["input_ids"].to(
        model.device
    )
    # By default, allow the output to be slightly longer than the input.
    if max_length == "auto":
        max_length = inputs.shape[1] + 10

    result = model.generate(
        inputs,
        num_return_sequences=n or 1,
        do_sample=True,
        temperature=1.0,
        repetition_penalty=10.0,
        max_length=max_length,
        min_length=int(0.5 * max_length),
        num_beams=beams,
        # Force the decoder to start generating in the target language.
        forced_bos_token_id=tokenizer.lang_code_to_id[tokenizer.tgt_lang],
    )
    texts = [tokenizer.decode(r, skip_special_tokens=True) for r in result]
    # For a single input string without n, return a single string.
    if not n and isinstance(text, str):
        return texts[0]
    return texts
```

## Citation

TBD
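
## Example

A minimal end-to-end sketch combining the two steps above. The input sentence and variable names are illustrative placeholders; `paraphrase` is the utility defined in step 2, and a CUDA-capable GPU is assumed (drop `.cuda()` to run on CPU).

```python
from transformers import MBartForConditionalGeneration, AutoTokenizer

# Load the detoxification model and the mBART-50 tokenizer (step 1).
model = MBartForConditionalGeneration.from_pretrained("s-nlp/mbart-detox-en-ru").cuda()
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/mbart-large-50", src_lang="en_XX", tgt_lang="en_XX"
)

# `paraphrase` is the utility from step 2; the input below is a placeholder.
text = "Your toxic sentence goes here."
print(paraphrase(text, model, tokenizer))         # one detoxified string
print(paraphrase([text], model, tokenizer, n=3))  # three candidate rewrites
```

Passing `n` returns multiple candidates per input (it must not exceed `beams`); without it, a single string input yields a single detoxified string.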