About preprocessing texts before feeding them to the model

#1
by alexbalandi - opened

Right now I'm feeding raw texts to the model. I'm wondering whether lowercasing everything and/or stripping punctuation would be worth it; it probably depends on how the model was trained.

Would be cool if the author gave his thoughts :)

@alexbalandi The model has been trained primarily on sentences from https://wortschatz.uni-leipzig.de/en/download/Russian and on longer texts from https://huggingface.co/datasets/bertin-project/mc4-sampling without any preprocessing. Therefore, I do not expect that the model performance will benefit from preprocessing the inputs.
Of course, I am not 100% sure, and it is always better to test such things experimentally.
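
One way to run such an experiment is sketched below. It is only a rough harness under stated assumptions: `predict` is a hypothetical callable that wraps whatever model is being used and returns one label per input text, `texts` and `labels` are assumed to be a small labeled validation set, and plain accuracy is an assumed metric that should be swapped for whatever fits the task.

```python
import unicodedata
from typing import Callable, Dict, List, Sequence


def strip_punctuation(text: str) -> str:
    # Drop every character whose Unicode category is punctuation (P*),
    # which also covers non-ASCII marks such as the Russian quotes « ».
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


# Preprocessing variants to compare; "raw" is the no-preprocessing baseline.
VARIANTS: Dict[str, Callable[[str], str]] = {
    "raw": lambda t: t,
    "lower": str.lower,
    "no_punct": strip_punctuation,
    "lower_no_punct": lambda t: strip_punctuation(t.lower()),
}


def compare_variants(
    texts: Sequence[str],
    labels: Sequence[int],
    predict: Callable[[List[str]], List[int]],  # hypothetical model wrapper
) -> Dict[str, float]:
    """Return accuracy per preprocessing variant on the same labeled data."""
    results: Dict[str, float] = {}
    for name, transform in VARIANTS.items():
        preds = predict([transform(t) for t in texts])
        correct = sum(p == y for p, y in zip(preds, labels))
        results[name] = correct / len(labels)
    return results
```

If the accuracies for the four variants are within noise of each other, feeding raw text is probably the simplest choice, which would be consistent with the model having been trained without preprocessing.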

cointegrated changed discussion status to closed
