flax-community/arabic-t5-small

salti committed on Jul 29, 2021

Commit 887b7a5 · 1 Parent(s): e4555b1

Fix typos in README

Files changed (1)

README.md +2 -2

@@ -24,9 +24,9 @@ The model could only be trained for about `10%` of the whole dataset due to time
 ## Preprocessing and the tokenizer
-We tried to keep the preprocessing to the bare minimum. We ony replaced URLs, emails and social media user mentions with fixed tokens.
-Contrary to other pretrained Arabic LMs, we decided to not strip the Arabic diacritics and to keep them in the vocabulary.
+We tried to keep the preprocessing to a bare minimum. We only replaced URLs, emails and social media user mentions with fixed tokens.
+Contrary to other pretrained Arabic LMs, we decided to not strip the Arabic diacritics and to keep them part of the vocabulary.
 The tokenizer was trained on `5%` of the training set, with a vocabulary size of `64'000`.
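
For readers of this change, the preprocessing described in the README could look roughly like the sketch below. This is not the authors' code: the placeholder tokens (`[URL]`, `[EMAIL]`, `[USER]`), the regular expressions, and the `preprocess` function name are all assumptions made for illustration, since the README does not name the actual tokens or patterns. The only behaviour taken from the README is that URLs, emails and user mentions are replaced with fixed tokens and that Arabic diacritics are left untouched.

```python
import re

# Hypothetical placeholder tokens; the README does not say which ones were used.
URL_TOKEN = "[URL]"
EMAIL_TOKEN = "[EMAIL]"
USER_TOKEN = "[USER]"

# Deliberately simple patterns, assumed for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def preprocess(text: str) -> str:
    """Replace emails, URLs and user mentions with fixed tokens.

    Nothing else is changed, so Arabic diacritics stay in the text.
    """
    # Emails go first so the mention pattern never sees their "@domain" part.
    text = EMAIL_RE.sub(EMAIL_TOKEN, text)
    text = URL_RE.sub(URL_TOKEN, text)
    text = MENTION_RE.sub(USER_TOKEN, text)
    return text


if __name__ == "__main__":
    sample = "mail a.b@example.com or @salti at https://example.com/x?y=1"
    print(preprocess(sample))  # mail [EMAIL] or [USER] at [URL]
```

Because the function only substitutes the three matched patterns, a fully diacritized word passes through unchanged, which matches the README's stated choice to keep diacritics in the vocabulary.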