Add dataset metadata to model card
README.md
CHANGED
```diff
@@ -11,6 +11,9 @@ tags:
 - turkish
 - english
 - bilingual
+datasets:
+- wikimedia/wikipedia
+- Helsinki-NLP/opus-100
 ---
 
 # Multrenizer
@@ -255,6 +258,14 @@ The released artifact is trained with the default file-based interleave in `trai
 
 Corpus collection is Turkish-forward, and code-switching examples are generated from OPUS parallel pairs during data preparation.
 
+Exact source configs used during corpus preparation:
+
+- `wikimedia/wikipedia` with `20231101.tr`
+- `wikimedia/wikipedia` with `20231101.en`
+- `Helsinki-NLP/opus-100` with `en-tr`
+
+The synthetic code-switching stream is generated locally from OPUS-100 parallel pairs, so it does not appear as a separate Hugging Face dataset entry.
+
 ### Vocabulary Budget
 
 Multrenizer is designed around a `26,000` target vocabulary, with a fixed budget reserved for always-preserved tokens:
```