---
license: cc-by-sa-4.0
language:
- pl
library_name: transformers
---

The presented model can be used for text de-noising. It is intended for text that contains noise after loading, for example text extracted from PDF files. The model was trained on Polish texts; the training dataset was noised automatically (an illustrative sketch of this kind of noising is given at the end of this card). [allegro/plt5-base](https://huggingface.co/allegro/plt5-base) was used as the base model.

**Model input**

Model input must be preceded by the tag `denoise:`. For example, if you have the text:

```
As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k.
```

then the model input must be constructed as follows:

```
denoise: As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k.
```

**Sample model usage**

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer


def do_inference(text, model, tokenizer):
    # Prepend the task prefix expected by the model.
    input_text = f"denoise: {text}"
    inputs = tokenizer.encode(
        input_text,
        return_tensors="pt",
        max_length=256,
        padding="max_length",
        truncation=True,
    )
    corrected_ids = model.generate(
        inputs,
        max_length=256,
        num_beams=5,
        early_stopping=True,
    )
    corrected_sentence = tokenizer.decode(corrected_ids[0], skip_special_tokens=True)
    return corrected_sentence


model = T5ForConditionalGeneration.from_pretrained("radlab/polish-denoiser-t5-base")
tokenizer = T5Tokenizer.from_pretrained("radlab/polish-denoiser-t5-base")

text_str = "As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k."

print(do_inference(text_str, model, tokenizer))
```

The model response for the **input**:

```
denoise: As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u k.
```

is:

```
Astronomia jest jedną z najstarszych nauk.
```

**Evaluation**

Eval loss:

![image/png](https://cdn-uploads.huggingface.co/production/uploads/644addfe9279988e0cbc296b/HIJI2a1nojM6lbDyYe0-A.png)

More information (in Polish) is available on our [blog](https://radlab.dev/2024/04/20/odszumiacz-tekstow/).
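
**Illustrative noising sketch**

The exact procedure used to noise the training data is not described on this card. The snippet below is a minimal, purely illustrative sketch of how clean Polish sentences could be corrupted automatically (random junk characters, case flips, and extra spaces) to build (noisy, clean) training pairs; the function name, probabilities, and character set are assumptions, not the authors' actual pipeline.

```python
import random

# Junk characters similar to those in the example input above (assumption).
NOISE_CHARS = "|@#&*!^-"


def add_noise(text, p_insert=0.15, p_case=0.1, p_space=0.1):
    """Corrupt a clean sentence with random junk characters, case flips and extra spaces.

    Purely illustrative -- not the actual procedure used to build the dataset.
    """
    out = []
    for ch in text:
        # Randomly flip letter case.
        if ch.isalpha() and random.random() < p_case:
            ch = ch.swapcase()
        out.append(ch)
        # Randomly insert an extra space or a junk character.
        if random.random() < p_space:
            out.append(" ")
        if random.random() < p_insert:
            out.append(random.choice(NOISE_CHARS))
    return "".join(out)


clean = "Astronomia jest jedną z najstarszych nauk."
noisy = add_noise(clean)
# A (noisy, clean) pair like this could serve as one training example:
print(f"denoise: {noisy}")
print(clean)
```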