Update README.md
README.md
CHANGED
@@ -74,7 +74,7 @@ The Galician-Catalan data collected from the web was a combination of the follow
 |Memories Projectes Lliures | 794.631 |
 | **Total** | **4.92.275** |
 
-The datasets were
+The datasets were concatenated before filtering to avoid intra-dataset duplicates and the final size was 4.267.995.
 The 5.750.000 sentence pairs of synthetic parallel data were created from a random sampling of the [Projecte Aina ES-CA corpus](https://huggingface.co/projecte-aina/mt-aina-ca-es)
 
 ### Training procedure
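
The added README sentence says the datasets were concatenated before filtering so that duplicates could be removed. The README does not state the actual filtering criteria or tooling, so the following is only a minimal sketch of that idea. It assumes that a duplicate means an exact match on the whitespace-stripped source and target sentences. The function name `concatenate_and_deduplicate` and the sample sentence pairs are hypothetical and do not come from the project.

```python
from typing import Iterable, List, Set, Tuple

# A sentence pair: (Galician source, Catalan target).
Pair = Tuple[str, str]


def concatenate_and_deduplicate(datasets: Iterable[Iterable[Pair]]) -> List[Pair]:
    """Concatenate several parallel corpora and drop exact duplicate pairs.

    Assumption: two pairs are duplicates when both sides match exactly
    after stripping surrounding whitespace. The real pipeline may use
    different (for example fuzzy or normalised) criteria.
    """
    seen: Set[Pair] = set()
    unique: List[Pair] = []
    for dataset in datasets:
        for source, target in dataset:
            key = (source.strip(), target.strip())
            if key in seen:
                continue  # already kept an identical pair from an earlier corpus
            seen.add(key)
            unique.append(key)
    return unique


# Hypothetical toy corpora; the second one repeats a pair from the first.
corpus_a = [("Ola", "Hola"), ("Grazas", "Gràcies")]
corpus_b = [("Grazas ", "Gràcies"), ("Bo día", "Bon dia")]

merged = concatenate_and_deduplicate([corpus_a, corpus_b])
```

Deduplicating after concatenation, rather than within each corpus separately, is what lets a pair that appears in two different sources be counted once. That is consistent with the final size (4.267.995) being smaller than the table total.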