morten-j committed on
Commit 7d16188 • 1 Parent(s): fac3fd5

Update README.md

Files changed (1)
  1. README.md +19 -1
README.md CHANGED
@@ -1,2 +1,20 @@
  BPE based tokenizer used for the MEHDIE project and the training of a bilingual BERT model.
- Vocab size of 52000.
+
+ Vocabulary size: 52000
+ Trained on:
+ - Arabic dataset: https://huggingface.co/datasets/bigscience-data/roots_ar_openiti_proc
+ - Hebrew/English dataset: https://huggingface.co/datasets/mehdie/sefaria
+
+ Examples:
+ Hebrew:
+ - "זה הספר מחובר מדברים שספר איש אחד מארץ נבארה ששמו רבי בנימין בר יונה מטודילה. וילך הלוך ויבא בארצות רבות ורחוקות כאשר יתפרש בדבריו אלו ובכל מקום שבא בו כתב כל הדברים שראה או ששמע מפי אנשי אמת אשר נשמעו בארץ ספרד: וכך הוא זוכר מקצת הגדולים והנשיאים שבמקצת מקומות וכשבא הביא דבריו אלה עמו לארץ קשטיליא בשנת תתקל"
+ - {'input_ids': [1060, 15784, 20958, 31767, 476, 4398, 3294, 1812, 19949, 42648, 455, 38010, 2069, 23008, 978, 11894, 3509, 8222, 973, 26, 23816, 8043, 461, 19170, 2998, 6517, 4245, 960, 5536, 928, 4122, 1008, 2643, 16456, 2702, 10350, 1796, 3044, 1333, 1488, 1019, 5501, 15530, 1109, 26822, 8473, 11437, 5419, 1919, 467, 13163, 6566, 4398, 454, 38, 7922, 1203, 41248, 9907, 21722, 1001, 16464, 931, 1123, 9907, 9647, 1053, 3044, 4553, 3573, 2851, 4088, 9330, 3492, 18352, 1057, 23994, 32635, 463], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
+
+ Arabic:
+ - "سلسلة الأجزاء والكتب الحديثية الفوائد والأخبار والحكايات عن الشافعي وحاتم الأصم ومعروف الكرخي وغيرهم للمحدث الفقيه أبي علي الحسن بن الحسين بن حمكان الهمذاني الشافعي دراسة وتحقيق وتعليق الطبعة الأولى الجزء الأول من الفوائد والأخبار والحكايات عن الشافعي وحاتم الأصم ومعروف الكرخي وغيرهم رضي الله عنهم رواية"
+ - {'input_ids': [27193, 15595, 34780, 1361, 949, 13852, 21459, 2169, 30440, 896, 2040, 41252, 9723, 50442, 16317, 3057, 1675, 1216, 3320, 958, 910, 1260, 888, 1532, 888, 912, 935, 13333, 2040, 36093, 22637, 49937, 16554, 2254, 4572, 1576, 890, 13852, 21459, 2169, 30440, 896, 2040, 41252, 9723, 50442, 16317, 3057, 1432, 904, 2710, 1933], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
+
+ English:
+ - "The medieval Arabic name of the northernmost of the three provinces of the Jazira, the other two being Diyar Mudar and Diyar Rabi'a"
+ - {'input_ids': [2034, 16522, 4490, 1270, 22040, 1837, 2340, 7960, 1183, 989, 10048, 2068, 90, 13377, 1183, 989, 8235, 14261, 1021, 7322, 1183, 989, 54, 18017, 17311, 24, 989, 3249, 5269, 8500, 48, 17821, 1294, 57, 3307, 1294, 1261, 48, 17821, 1294, 26438, 85, 19, 77], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
+
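
The README calls the tokenizer "BPE based" and shows encodings with `input_ids`, `token_type_ids` and `attention_mask`, but it does not show how such output comes about. The following is a minimal pure-Python sketch of byte-pair-encoding training and encoding on a toy English corpus. It is an illustration only, not the code used for the MEHDIE tokenizer. The function names (`train_bpe`, `merge_pair`, `tokenize`, `build_vocab`, `encode`), the toy corpus and the lexicographic tie-breaking rule are all assumptions made for this example, and the resulting ids bear no relation to the 52000-entry vocabulary above.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of whitespace-separated strings."""
    # Every word starts out as a tuple of single characters.
    words = Counter(tuple(w) for line in corpus for w in line.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Pick the most frequent pair; ties are broken lexicographically
        # (an arbitrary choice made here so the result is deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[merge_pair(symbols, best)] += freq
        words = merged_words
    return merges


def tokenize(word, merges):
    """Split one word into subword tokens by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


def build_vocab(corpus, merges):
    """Assign ids to all corpus characters first, then to each merged symbol."""
    vocab = {}
    for ch in sorted({c for line in corpus for c in line if not c.isspace()}):
        vocab.setdefault(ch, len(vocab))
    for a, b in merges:
        vocab.setdefault(a + b, len(vocab))
    return vocab


def encode(text, merges, vocab):
    """Return a dict shaped like the encodings in the README (single, unpadded sequence)."""
    tokens = [t for w in text.split() for t in tokenize(w, merges)]
    ids = [vocab[t] for t in tokens]  # raises KeyError for characters unseen in training
    return {
        "input_ids": ids,
        "token_type_ids": [0] * len(ids),  # one segment only, so all zeros
        "attention_mask": [1] * len(ids),  # no padding, so all ones
    }


corpus = ["low low low lower lowest", "new newer newest"]
merges = train_bpe(corpus, num_merges=5)
vocab = build_vocab(corpus, merges)
print(merges)
print(encode("lowest newer", merges, vocab))
```

In this sketch the all-zero `token_type_ids` and all-one `attention_mask` mirror the examples above, which each encode a single sequence without padding. A production tokenizer would additionally handle unknown characters, special tokens and pre-tokenization, which are omitted here.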