BPE-tokenizer / README.md

A BPE-based tokenizer used in the MEHDIE project and for training a bilingual BERT model.
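For readers new to byte-pair encoding, the idea can be sketched in a few lines of plain Python. This is a minimal illustration of generic BPE, not the training code used for this tokenizer: the function names `train_bpe`, `merge_pair`, and `encode` are hypothetical, and the sketch works on characters of whole words without the pre-tokenization, special tokens, or integer IDs that a production tokenizer adds.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (illustrative only)."""
    # Start with every word split into single characters.
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge
        # Pick the most frequent pair; ties go to the pair seen first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word in the corpus.
        updated = Counter()
        for symbols, freq in words.items():
            updated[merge_pair(symbols, best)] += freq
        words = updated
    return merges


def encode(word, merges):
    """Split `word` into subword symbols by replaying the merges in learned order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


merges = train_bpe(["ab", "ab", "ab", "abc"], num_merges=2)
print(merges)                   # [('a', 'b'), ('ab', 'c')]
print(encode("abcab", merges))  # ['abc', 'ab']
```

In a real BPE tokenizer, merging continues until the vocabulary reaches its target size (52000 for this tokenizer), and each resulting symbol is mapped to an integer ID.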

Vocabulary size: 52000

Trained on:

Examples:

Hebrew:

  • "זה הספר מחובר מדברים שספר איש אחד מארץ נבארה ששמו רבי בנימין בר יונה מטודילה. וילך הלוך ויבא בארצות רבות ורחוקות כאשר יתפרש בדבריו אלו ובכל מקום שבא בו כתב כל הדברים שראה או ששמע מפי אנשי אמת אשר נשמעו בארץ ספרד: וכך הוא זוכר מקצת הגדולים והנשיאים שבמקצת מקומות וכשבא הביא דבריו אלה עמו לארץ קשטיליא בשנת תתקל"
  • {'input_ids': [1060, 15784, 20958, 31767, 476, 4398, 3294, 1812, 19949, 42648, 455, 38010, 2069, 23008, 978, 11894, 3509, 8222, 973, 26, 23816, 8043, 461, 19170, 2998, 6517, 4245, 960, 5536, 928, 4122, 1008, 2643, 16456, 2702, 10350, 1796, 3044, 1333, 1488, 1019, 5501, 15530, 1109, 26822, 8473, 11437, 5419, 1919, 467, 13163, 6566, 4398, 454, 38, 7922, 1203, 41248, 9907, 21722, 1001, 16464, 931, 1123, 9907, 9647, 1053, 3044, 4553, 3573, 2851, 4088, 9330, 3492, 18352, 1057, 23994, 32635, 463], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

Arabic:

  • "سلسلة الأجزاء والكتب الحديثية الفوائد والأخبار والحكايات عن الشافعي وحاتم الأصم ومعروف الكرخي وغيرهم للمحدث الفقيه أبي علي الحسن بن الحسين بن حمكان الهمذاني الشافعي دراسة وتحقيق وتعليق الطبعة الأولى الجزء الأول من الفوائد والأخبار والحكايات عن الشافعي وحاتم الأصم ومعروف الكرخي وغيرهم رضي الله عنهم رواية"
  • {'input_ids': [27193, 15595, 34780, 1361, 949, 13852, 21459, 2169, 30440, 896, 2040, 41252, 9723, 50442, 16317, 3057, 1675, 1216, 3320, 958, 910, 1260, 888, 1532, 888, 912, 935, 13333, 2040, 36093, 22637, 49937, 16554, 2254, 4572, 1576, 890, 13852, 21459, 2169, 30440, 896, 2040, 41252, 9723, 50442, 16317, 3057, 1432, 904, 2710, 1933], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

English:

  • "The medieval Arabic name of the northernmost of the three provinces of the Jazira, the other two being Diyar Mudar and Diyar Rabi'a"
  • {'input_ids': [2034, 16522, 4490, 1270, 22040, 1837, 2340, 7960, 1183, 989, 10048, 2068, 90, 13377, 1183, 989, 8235, 14261, 1021, 7322, 1183, 989, 54, 18017, 17311, 24, 989, 3249, 5269, 8500, 48, 17821, 1294, 57, 3307, 1294, 1261, 48, 17821, 1294, 26438, 85, 19, 77], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
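The three lists in each output are parallel, with one entry per token. Assuming the usual BERT-style conventions (an assumption, since the README does not spell them out), `token_type_ids` marks which segment a token belongs to and `attention_mask` is 1 for real tokens and 0 for padding. The helper below, `summarize_encoding`, is a hypothetical name used only for this sketch; the sample is truncated to the first four IDs of the English example for brevity.

```python
def summarize_encoding(enc):
    """Check that the per-token lists line up and report basic facts about them."""
    ids = enc["input_ids"]
    types = enc["token_type_ids"]
    mask = enc["attention_mask"]
    if not (len(ids) == len(types) == len(mask)):
        raise ValueError("all three lists must have one entry per token")
    return {
        "num_tokens": len(ids),
        # All zeros means the input was a single segment (assumed convention).
        "single_segment": all(t == 0 for t in types),
        # Any zero in the mask would indicate padding (assumed convention).
        "has_padding": any(m == 0 for m in mask),
    }


# First four IDs of the English example, truncated for brevity.
sample = {
    "input_ids": [2034, 16522, 4490, 1270],
    "token_type_ids": [0, 0, 0, 0],
    "attention_mask": [1, 1, 1, 1],
}
print(summarize_encoding(sample))
# {'num_tokens': 4, 'single_segment': True, 'has_padding': False}
```

Under these assumptions, all three examples are single, unpadded sequences: every `token_type_ids` entry is 0 and every `attention_mask` entry is 1.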