About preprocessing before feeding texts to the model
#1 by alexbalandi - opened
Right now I'm feeding raw texts to the model. I'm wondering whether lowercasing everything and/or stripping punctuation would be worth it; it probably depends on how the model was trained.
Would be cool if the author gave his thoughts :)
@alexbalandi
The model has been trained primarily on sentences from https://wortschatz.uni-leipzig.de/en/download/Russian and on longer texts from https://huggingface.co/datasets/bertin-project/mc4-sampling, in both cases without any preprocessing. Therefore, I do not expect the model's performance to benefit from preprocessing the inputs.
But, of course, I am not 100% sure, and it is always better to test such things experimentally.
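One way to run such a test is sketched below. It is only an illustration under assumptions: `predict` is a hypothetical callable that wraps the model and returns one label per text, and accuracy on a small labeled set stands in for whatever metric matters for the actual task. The `preprocess` helper implements the lowercasing and punctuation stripping asked about above.

```python
import unicodedata
from typing import Callable, Hashable, Sequence, Tuple


def preprocess(text: str) -> str:
    """Lowercase the text and drop all Unicode punctuation characters."""
    kept = (
        ch for ch in text.lower()
        if not unicodedata.category(ch).startswith("P")
    )
    # Collapse any extra whitespace left behind by removed punctuation.
    return " ".join("".join(kept).split())


def accuracy(
    predict: Callable[[str], Hashable],
    texts: Sequence[str],
    labels: Sequence[Hashable],
) -> float:
    """Share of texts for which predict(text) equals the gold label."""
    correct = sum(predict(t) == y for t, y in zip(texts, labels))
    return correct / len(texts)


def compare_preprocessing(
    predict: Callable[[str], Hashable],
    texts: Sequence[str],
    labels: Sequence[Hashable],
) -> Tuple[float, float]:
    """Return (accuracy on raw texts, accuracy on preprocessed texts)."""
    raw = accuracy(predict, texts, labels)
    cleaned = accuracy(predict, [preprocess(t) for t in texts], labels)
    return raw, cleaned
```

Calling `compare_preprocessing(predict, texts, labels)` on a held-out sample gives the two numbers side by side. If the model is used for embeddings rather than classification, the same idea applies with the downstream metric swapped in for `accuracy`.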
cointegrated changed discussion status to closed