Fairseq
Catalan
English
Commit a66338c by aleixsant
1 Parent(s): 5d192f8

Update README.md

Files changed (1)
  1. README.md +1 -2
README.md CHANGED
@@ -62,8 +62,7 @@ and other sources.
  All datasets are deduplicated and filtered to remove any sentence pairs with a cosine similarity of less than 0.75.
  This is done using sentence embeddings calculated using [LaBSE](https://huggingface.co/sentence-transformers/LaBSE).
  The filtered datasets are then concatenated to form a final corpus of 30.023.034 parallel sentences and before training
- the punctuation is normalized using a modified version of the join-single-file.py script from
- [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
+ the punctuation is normalized using a modified version of the join-single-file.py script from [SoftCatalà](https://github.com/Softcatala/nmt-models/blob/master/data-processing-tools/join-single-file.py).
 
 
  #### Tokenization
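
Reviewer note on the filtering step in the hunk above (deduplicate, then drop pairs whose cosine similarity is below 0.75): the README does not show how this is implemented, so the following is only a minimal sketch under stated assumptions. It assumes the source and target embeddings have already been computed (for example with LaBSE) and are passed in as plain lists of floats. The names `cosine_similarity` and `filter_pairs` are hypothetical, and treating "deduplicated" as exact-match removal of repeated pairs is also an assumption.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pairs(
    pairs: Sequence[Pair],
    src_vecs: Sequence[Vector],
    tgt_vecs: Sequence[Vector],
    threshold: float = 0.75,  # value stated in the README
) -> List[Pair]:
    """Drop exact duplicate pairs, then keep pairs with similarity >= threshold."""
    seen = set()
    kept: List[Pair] = []
    for pair, src_vec, tgt_vec in zip(pairs, src_vecs, tgt_vecs):
        if pair in seen:  # assumption: deduplication means exact-match removal
            continue
        seen.add(pair)
        if cosine_similarity(src_vec, tgt_vec) >= threshold:
            kept.append(pair)
    return kept


# Toy 2-dimensional vectors stand in for real sentence embeddings.
example_pairs = [
    ("Bon dia", "Good morning"),
    ("Bon dia", "Good morning"),  # exact duplicate, removed
    ("El gat dorm", "Stock prices fell"),  # unrelated, filtered out
]
example_src = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
example_tgt = [[0.9, 0.1], [0.9, 0.1], [0.0, 1.0]]
filtered = filter_pairs(example_pairs, example_src, example_tgt)
```

With real data the vectors would come from an embedding model rather than being written by hand; the threshold comparison itself stays the same.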