cointegrated/rubert-tiny2 · About preprocessing before feeding texts to model

by alexbalandi - opened Sep 8, 2022

Right now I'm feeding raw texts to the model. I'm wondering whether lowercasing everything and/or stripping punctuation would be worth it; it probably depends on how the model was trained.

Would be cool if the author gave his thoughts :)

cointegrated

Owner Sep 8, 2022

@alexbalandi The model was trained primarily on sentences from https://wortschatz.uni-leipzig.de/en/download/Russian and on longer texts from https://huggingface.co/datasets/bertin-project/mc4-sampling, without any preprocessing. Therefore, I do not expect model performance to benefit from preprocessing the inputs.
But, of course, I am not 100% sure, and it is always better to test such things experimentally.
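
For anyone who wants to run that experiment, here is a minimal sketch of how the comparison could be set up. It is not the author's method, just one possible harness. The names `strip_punctuation`, `VARIANTS`, and `compare_variants` are hypothetical, and so is the `evaluate` callback: it is assumed to take a list of (already preprocessed) texts, embed them with the model, and return whatever metric you care about on your own validation data.

```python
import string
from typing import Callable, Dict, List

# Characters to strip: ASCII punctuation plus a few marks common in Russian
# text (guillemets, em dash, en dash, ellipsis). This set is an assumption;
# adjust it to your data.
PUNCT = string.punctuation + "\u00ab\u00bb\u2014\u2013\u2026"
_STRIP_TABLE = str.maketrans("", "", PUNCT)


def strip_punctuation(text: str) -> str:
    """Remove punctuation, then collapse the whitespace left behind."""
    return " ".join(text.translate(_STRIP_TABLE).split())


# Each variant maps a raw text to the string that would be fed to the model.
VARIANTS: Dict[str, Callable[[str], str]] = {
    "raw": lambda t: t,
    "lower": str.lower,
    "no_punct": strip_punctuation,
    "lower_no_punct": lambda t: strip_punctuation(t.lower()),
}


def compare_variants(
    texts: List[str],
    evaluate: Callable[[List[str]], float],
) -> Dict[str, float]:
    """Score every preprocessing variant with the caller-supplied metric."""
    return {
        name: evaluate([fn(t) for t in texts])
        for name, fn in VARIANTS.items()
    }
```

In practice `evaluate` would wrap your own pipeline, for example embedding the texts with the model and measuring downstream accuracy on a held-out set. If the `raw` score is on par with or better than the others, that supports feeding the texts in unchanged.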

cointegrated changed discussion status to closed Mar 30, 2023
