BPE-based tokenizer used for the MEHDIE project and for training a bilingual BERT model.

Vocabulary size: 52,000

Trained on:
- Arabic dataset: https://huggingface.co/datasets/bigscience-data/roots_ar_openiti_proc
- Hebrew/English dataset: https://huggingface.co/datasets/mehdie/sefaria
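
For readers unfamiliar with byte-pair encoding (BPE), the following is a minimal, character-level sketch of how a BPE vocabulary is learned: the most frequent adjacent symbol pair is merged repeatedly. It is a toy illustration only, not the training code used for this tokenizer; the function names (`train_bpe`, `apply_bpe`), the whitespace pre-tokenization, and the tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    counts = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            counts[(a, b)] += freq
    return counts


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    # Assumption: words are split on whitespace and start as single characters.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        counts = get_pair_counts(words)
        if not counts:
            break
        # Assumption: ties are broken by comparing the pairs themselves,
        # purely to keep this toy example deterministic.
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        words = merge_pair(words, best)
    return merges


def apply_bpe(word, merges):
    """Segment a single word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = next(iter(merge_pair({symbols: 1}, pair)))
    return list(symbols)


merges = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(apply_bpe("slow", merges))
```

A production tokenizer with a 52,000-entry vocabulary runs the same kind of loop for tens of thousands of merges over a far larger corpus, usually with a dedicated library rather than plain Python.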
Examples:

Hebrew:
- "ืื ืืกืคืจ ืืืืืจ ืืืืจืื ืฉืกืคืจ ืืืฉ ืืื ืืืจืฅ ื ืืืจื ืฉืฉืื ืจืื ืื ืืืื ืืจ ืืื ื ืืืืืืื. ืืืื ืืืื ืืืื ืืืจืฆืืช ืจืืืช ืืจืืืงืืช ืืืฉืจ ืืชืคืจืฉ ืืืืจืื ืืื ืืืื ืืงืื ืฉืื ืื ืืชื ืื ืืืืจืื ืฉืจืื ืื ืฉืฉืืข ืืคื ืื ืฉื ืืืช ืืฉืจ ื ืฉืืขื ืืืจืฅ ืกืคืจื: ืืื ืืื ืืืืจ ืืงืฆืช ืืืืืืื ืืื ืฉืืืื ืฉืืืงืฆืช ืืงืืืืช ืืืฉืื ืืืื ืืืจืื ืืื ืขืื ืืืจืฅ ืงืฉืืืืื ืืฉื ืช ืชืชืงื"
- {'input_ids': [1060, 15784, 20958, 31767, 476, 4398, 3294, 1812, 19949, 42648, 455, 38010, 2069, 23008, 978, 11894, 3509, 8222, 973, 26, 23816, 8043, 461, 19170, 2998, 6517, 4245, 960, 5536, 928, 4122, 1008, 2643, 16456, 2702, 10350, 1796, 3044, 1333, 1488, 1019, 5501, 15530, 1109, 26822, 8473, 11437, 5419, 1919, 467, 13163, 6566, 4398, 454, 38, 7922, 1203, 41248, 9907, 21722, 1001, 16464, 931, 1123, 9907, 9647, 1053, 3044, 4553, 3573, 2851, 4088, 9330, 3492, 18352, 1057, 23994, 32635, 463], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Arabic:
- "ุณูุณูุฉ ุงูุฃุฌุฒุงุก ูุงููุชุจ ุงูุญุฏูุซูุฉ ุงูููุงุฆุฏ ูุงูุฃุฎุจุงุฑ ูุงูุญูุงูุงุช ุนู ุงูุดุงูุนู ูุญุงุชู ุงูุฃุตู ูู ุนุฑูู ุงููุฑุฎู ูุบูุฑูู ููู ุญุฏุซ ุงููููู ุฃุจู ุนูู ุงูุญุณู ุจู ุงูุญุณูู ุจู ุญู ูุงู ุงููู ุฐุงูู ุงูุดุงูุนู ุฏุฑุงุณุฉ ูุชุญููู ูุชุนููู ุงูุทุจุนุฉ ุงูุฃููู ุงูุฌุฒุก ุงูุฃูู ู ู ุงูููุงุฆุฏ ูุงูุฃุฎุจุงุฑ ูุงูุญูุงูุงุช ุนู ุงูุดุงูุนู ูุญุงุชู ุงูุฃุตู ูู ุนุฑูู ุงููุฑุฎู ูุบูุฑูู ุฑุถู ุงููู ุนููู ุฑูุงูุฉ"
- {'input_ids': [27193, 15595, 34780, 1361, 949, 13852, 21459, 2169, 30440, 896, 2040, 41252, 9723, 50442, 16317, 3057, 1675, 1216, 3320, 958, 910, 1260, 888, 1532, 888, 912, 935, 13333, 2040, 36093, 22637, 49937, 16554, 2254, 4572, 1576, 890, 13852, 21459, 2169, 30440, 896, 2040, 41252, 9723, 50442, 16317, 3057, 1432, 904, 2710, 1933], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
English:
- "The medieval Arabic name of the northernmost of the three provinces of the Jazira, the other two being Diyar Mudar and Diyar Rabi'a"
- {'input_ids': [2034, 16522, 4490, 1270, 22040, 1837, 2340, 7960, 1183, 989, 10048, 2068, 90, 13377, 1183, 989, 8235, 14261, 1021, 7322, 1183, 989, 54, 18017, 17311, 24, 989, 3249, 5269, 8500, 48, 17821, 1294, 57, 3307, 1294, 1261, 48, 17821, 1294, 26438, 85, 19, 77], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
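
In each example above, the three fields have the same length: `token_type_ids` is all zeros because the input is a single segment, and `attention_mask` is all ones because nothing is padded. The sketch below shows how these fields relate to one another and how they would change when sequences of different lengths are padded into a batch. The helper names (`build_encoding`, `pad_batch`) and the padding id `0` are hypothetical and chosen for illustration; this tokenizer's actual padding token id is not stated here.

```python
def build_encoding(input_ids):
    """Build the three fields for a single, unpadded, single-segment input."""
    n = len(input_ids)
    return {
        "input_ids": list(input_ids),
        "token_type_ids": [0] * n,  # one segment only, so every position is 0
        "attention_mask": [1] * n,  # every position is a real token
    }


def pad_batch(encodings, pad_id=0):
    """Right-pad encodings to a common length.

    Assumption: pad_id=0 is a placeholder, not this tokenizer's real pad id.
    """
    max_len = max(len(e["input_ids"]) for e in encodings)
    batch = {"input_ids": [], "token_type_ids": [], "attention_mask": []}
    for e in encodings:
        n_pad = max_len - len(e["input_ids"])
        batch["input_ids"].append(e["input_ids"] + [pad_id] * n_pad)
        batch["token_type_ids"].append(e["token_type_ids"] + [0] * n_pad)
        # Padded positions get 0 so that the model ignores them.
        batch["attention_mask"].append(e["attention_mask"] + [0] * n_pad)
    return batch


# The first few ids of the English and Arabic examples, truncated for brevity.
english = build_encoding([2034, 16522, 4490, 1270])
arabic = build_encoding([27193, 15595])
print(pad_batch([english, arabic]))
```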