radlab/polish-denoiser-t5-base

This model removes noise from text. It is intended for text that picks up artifacts during loading, for example text extracted from PDF files.

The model was trained on Polish texts to which noise was added automatically. The base model is allegro/plt5-base.

Model input

The model input must start with the tag denoise: followed by a space. For example, if you have the text:

As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u   k.

then the input to the model must be constructed as follows:

denoise: As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u   k.

Sample model usage

from transformers import T5ForConditionalGeneration, T5Tokenizer


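def do_inference(text, model, tokenizer):
    # Prepend the task tag the model expects.
    input_text = f"denoise: {text}"
    # Tokenize to a PyTorch tensor; inputs longer than 256 tokens are truncated.
    inputs = tokenizer.encode(
        input_text,
        return_tensors="pt",
        max_length=256,
        padding="max_length",
        truncation=True,
    )

    # Generate the cleaned text with beam search (5 beams).
    corrected_ids = model.generate(
        inputs,
        max_length=256,
        num_beams=5,
        early_stopping=True,
    )

    # Decode the first returned sequence, dropping special tokens such as padding.
    corrected_sentence = tokenizer.decode(corrected_ids[0], skip_special_tokens=True)
    return corrected_sentence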


model = T5ForConditionalGeneration.from_pretrained("radlab/polish-denoiser-t5-base")
tokenizer = T5Tokenizer.from_pretrained("radlab/polish-denoiser-t5-base")

text_str = "As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u   k."
print(do_inference(text_str, model, tokenizer))

Model response for the input:

denoise: As | -Tron^# om ia je@st je!d &*ną z na -J s | AA ta rsZy ch n a u   k.

is:

Astronomia jest jedną z najstarszych nauk.
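The sample above truncates its input at 256 tokens, so longer passages (for example a whole page extracted from a PDF) would lose their tail. One possible workaround, shown here only as a sketch, is to split the text into smaller pieces and denoise each piece separately. The helper names `split_into_chunks` and `denoise_long_text`, and the `max_words=60` default, are illustrative assumptions and not part of the model or its library. Word count is only a rough proxy for token count, and because noisy text often has words broken into fragments, a chunk boundary may fall inside a word.

```python
from typing import Callable, List


def split_into_chunks(text: str, max_words: int = 60) -> List[str]:
    """Split text into chunks of at most max_words whitespace-separated words.

    max_words is an assumed, untuned value; adjust it so that each chunk
    stays below the 256-token limit used in the sample above.
    """
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


def denoise_long_text(
    text: str,
    denoise_fn: Callable[[str], str],
    max_words: int = 60,
) -> str:
    """Apply denoise_fn to each chunk and join the results with spaces."""
    chunks = split_into_chunks(text, max_words)
    return " ".join(denoise_fn(chunk) for chunk in chunks)
```

With the model and tokenizer loaded as in the sample, this could be called as `denoise_long_text(long_text, lambda chunk: do_inference(chunk, model, tokenizer))`. Passing the inference function in as an argument keeps the chunking logic independent of the model.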

Evaluation

Eval loss:

More information (in Polish) on our blog
