Fast tokenizers' special powers

In this section we will take a closer look at the capabilities of the tokenizers in 🤗 Transformers. Up to now we have only used them to tokenize inputs or decode IDs back into text, but tokenizers, especially those backed by the 🤗 Tokenizers library, can do a lot more. To illustrate these additional features, we will explore how to reproduce the results of the token-classification (which we called ner) and question-answering pipelines that we first encountered in Chapter 1.

In the following discussion, we will often make the distinction between “slow” and “fast” tokenizers. Slow tokenizers are those written in Python inside the 🤗 Transformers library, while the fast versions are the ones provided by 🤗 Tokenizers, which are written in Rust. If you remember the table from Chapter 5 that reported how long it took a fast and a slow tokenizer to tokenize the Drug Review Dataset, you should have an idea of why we call them fast and slow:

	Fast tokenizer	Slow tokenizer
`batched=True`	10.8s	4min41s
`batched=False`	59.2s	5min3s

⚠️ When tokenizing a single sentence, you won't always see a difference in speed between the slow and fast versions of the same tokenizer. In fact, the fast version might actually be slower! It's only when tokenizing lots of texts in parallel at the same time that you will be able to clearly see the difference.

Batch Encoding

The output of a tokenizer isn't a simple Python dictionary; what we get is actually a special BatchEncoding object. It's a subclass of a dictionary (which is why we were able to index into that result without any problem before), but with additional methods that are mostly used by fast tokenizers.

Besides their parallelization capabilities, the key functionality of fast tokenizers is that they always keep track of the original span of text the final tokens come from, a feature we call offset mapping. This in turn unlocks features like mapping each word to the tokens it generated, or mapping each character of the original text to the token it's inside, and vice versa.

Let's take a look at an example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
encoding = tokenizer(example)
print(type(encoding))

As mentioned previously, we get a BatchEncoding object in the tokenizer's output:

<class 'transformers.tokenization_utils_base.BatchEncoding'>

Since the AutoTokenizer class picks a fast tokenizer by default, we can use the additional methods this BatchEncoding object provides. We have two ways to check if our tokenizer is a fast or a slow one. We can either check the is_fast attribute of the tokenizer:

tokenizer.is_fast

True

or check the same attribute of our encoding:

encoding.is_fast

True

Let's see what a fast tokenizer enables us to do. First, we can access the tokens without having to convert the IDs back into tokens:

encoding.tokens()

['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I', 'work', 'at', 'Hu', '##gging', 'Face', 'in',
 'Brooklyn', '.', '[SEP]']

In this case the token at index 5 is ##yl, which is part of the word “Sylvain” in the original sentence. We can also use the word_ids() method to get the index of the word each token comes from:

encoding.word_ids()

[None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]

We can see that the tokenizer's special tokens [CLS] and [SEP] are mapped to None, and then each token is mapped to the word it originates from. This is especially useful to determine if a token is at the start of a word or if two tokens are in the same word. We could rely on the ## prefix for that, but it only works for BERT-like tokenizers; this method works for any type of tokenizer as long as it is a fast one. In the next chapter, we'll see how we can use this capability to apply the labels we have for each word properly to the tokens in tasks like named entity recognition (NER) and part-of-speech (POS) tagging. We can also use it to mask all the tokens coming from the same word in masked language modeling (a technique called whole word masking).
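As a minimal sketch of how word IDs can be used, the snippet below hardcodes the tokens and word IDs printed above (so it runs without loading a tokenizer) and derives which tokens start a word and which tokens belong to each word, which is the grouping that whole word masking needs. The helper names are illustrative and are not part of 🤗 Transformers.

```python
from collections import defaultdict

# Tokens and word IDs copied from the outputs shown above.
tokens = ['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I',
          'work', 'at', 'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']
word_ids = [None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]


def starts_word(word_ids, idx):
    """Return True if the token at idx is the first token of a word."""
    current = word_ids[idx]
    if current is None:  # special tokens belong to no word
        return False
    return idx == 0 or word_ids[idx - 1] != current


def tokens_by_word(word_ids):
    """Map each word index to the list of token indices it produced."""
    groups = defaultdict(list)
    for idx, word_id in enumerate(word_ids):
        if word_id is not None:
            groups[word_id].append(idx)
    return dict(groups)


word_starts = [tokens[i] for i in range(len(tokens)) if starts_word(word_ids, i)]
groups = tokens_by_word(word_ids)

print(word_starts)
print([tokens[i] for i in groups[3]])  # every token of the word at index 3
```

Because the logic only looks at word IDs, it does not depend on the ## convention and would work unchanged with the word IDs of any fast tokenizer.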

The notion of what a word is is complicated. For instance, does “I’ll” (a contraction of “I will”) count as one or two words? It actually depends on the tokenizer and the pre-tokenization operation it applies. Some tokenizers just split on spaces, so they will consider this as one word. Others use punctuation on top of spaces, so they will consider it two words.

✏️ Try it out! Create a tokenizer from the bert-base-cased and roberta-base checkpoints and tokenize “81s” with them. What do you observe? What are the word IDs?

Similarly, there is a sequence_ids() method that we can use to map a token to the sentence it came from (though in this case, the token_type_ids returned by the tokenizer can give us the same information).

Lastly, we can map any word or token to characters in the original text, and vice versa, via the word_to_chars() or token_to_chars() and char_to_word() or char_to_token() methods. For instance, the word_ids() method told us that ##yl is part of the word at index 3, but which word is it in the sentence? We can find out like this:

start, end = encoding.word_to_chars(3)
example[start:end]

Sylvain

As we mentioned previously, this is all powered by the fact that the fast tokenizer keeps track of the span of text each token comes from in a list of offsets. To illustrate their use, next we'll show you how to replicate the results of the token-classification pipeline manually.
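To make the role of the offsets concrete, here is a sketch that re-derives two of these lookups from a plain list of offsets. The word IDs and the (start, end) offsets are hardcoded for the example sentence, and the two functions are simplified stand-ins written for illustration, not the library's implementation.

```python
example = "My name is Sylvain and I work at Hugging Face in Brooklyn."

# (start, end) character span of each token; (0, 0) marks special tokens.
offsets = [(0, 0), (0, 2), (3, 7), (8, 10), (11, 12), (12, 14), (14, 16),
           (16, 18), (19, 22), (23, 24), (25, 29), (30, 32), (33, 35),
           (35, 40), (41, 45), (46, 48), (49, 57), (57, 58), (0, 0)]
word_ids = [None, 0, 1, 2, 3, 3, 3, 3, 4, 5, 6, 7, 8, 8, 9, 10, 11, 12, None]


def word_to_chars(word_index):
    """Span from the start of a word's first token to the end of its last."""
    spans = [offsets[i] for i, w in enumerate(word_ids) if w == word_index]
    return spans[0][0], spans[-1][1]


def char_to_token(char_index):
    """Index of the token covering a character, or None (e.g. for a space)."""
    for idx, (start, end) in enumerate(offsets):
        if start <= char_index < end:
            return idx
    return None


start, end = word_to_chars(3)
print(example[start:end])  # the word at index 3
print(char_to_token(12))   # token covering the character "y" of "Sylvain"
```

The special tokens never match in char_to_token() because their (0, 0) span is empty.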

✏️ Try it out! Create your own example text and see if you can understand which tokens are associated with each word ID, and also how to extract the character spans for a single word. For bonus points, try using two sentences as input and see if the sequence IDs make sense to you.

Inside the token-classification pipeline

In Chapter 1 we got our first taste of applying NER (where the task is to identify which parts of the text correspond to entities like persons, locations, or organizations) with the 🤗 Transformers pipeline() function. Then, in Chapter 2, we saw how a pipeline groups together the three stages necessary to get the predictions from a raw text: tokenization, passing the inputs through the model, and post-processing. The first two steps in the token-classification pipeline are the same as in any other pipeline, but the post-processing is a little more complex. Let's see how!

Getting the base results with the pipeline

First, let's grab a token classification pipeline so we can get some results to compare manually. The model used by default is dbmdz/bert-large-cased-finetuned-conll03-english; it performs NER on sentences:

from transformers import pipeline

token_classifier = pipeline("token-classification")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

[{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S', 'start': 11, 'end': 12},
 {'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl', 'start': 12, 'end': 14},
 {'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va', 'start': 14, 'end': 16},
 {'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in', 'start': 16, 'end': 18},
 {'entity': 'I-ORG', 'score': 0.97389334, 'index': 12, 'word': 'Hu', 'start': 33, 'end': 35},
 {'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging', 'start': 35, 'end': 40},
 {'entity': 'I-ORG', 'score': 0.98879766, 'index': 14, 'word': 'Face', 'start': 41, 'end': 45},
 {'entity': 'I-LOC', 'score': 0.99321055, 'index': 16, 'word': 'Brooklyn', 'start': 49, 'end': 57}]

The model properly identified each token generated by “Sylvain” as a person, each token generated by “Hugging Face” as an organization, and the token “Brooklyn” as a location. We can also ask the pipeline to group together the tokens that correspond to the same entity:

from transformers import pipeline

token_classifier = pipeline("token-classification", aggregation_strategy="simple")
token_classifier("My name is Sylvain and I work at Hugging Face in Brooklyn.")

[{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18},
 {'entity_group': 'ORG', 'score': 0.97960204, 'word': 'Hugging Face', 'start': 33, 'end': 45},
 {'entity_group': 'LOC', 'score': 0.99321055, 'word': 'Brooklyn', 'start': 49, 'end': 57}]

The aggregation_strategy picked will change the scores computed for each grouped entity. With "simple" the score is just the mean of the scores of each token in the given entity: for instance, the score of “Sylvain” is the mean of the scores we saw in the previous example for the tokens S, ##yl, ##va, and ##in. Other strategies available are:

"first", where the score of each entity is the score of the first token of that entity (so for “Sylvain” it would be 0.9993828, the score of the token S)
"max", where the score of each entity is the maximum score of the tokens in that entity (so for “Hugging Face” it would be 0.98879766, the score of “Face”)
"average", where the score of each entity is the average of the scores of the words composing that entity (so for “Sylvain” there would be no difference from the "simple" strategy, but “Hugging Face” would have a score of 0.9819, the average of the scores for “Hugging”, 0.975, and “Face”, 0.98879)

Now let's see how to obtain these results without using the pipeline() function!
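Before moving on, the arithmetic behind the aggregation strategies described above can be checked by hand. The sketch below recomputes the scores for “Hugging Face” from the token scores printed by the ungrouped pipeline. It follows the descriptions in this section and illustrates the arithmetic only; it is not how the pipeline is implemented internally.

```python
from statistics import mean

# Token scores from the ungrouped pipeline output, grouped by word.
hugging = [0.97389334, 0.976115]  # "Hu", "##gging"
face = [0.98879766]               # "Face"
all_tokens = hugging + face

scores = {
    "simple": mean(all_tokens),                    # mean over all tokens
    "first": all_tokens[0],                        # score of the first token
    "max": max(all_tokens),                        # highest token score
    "average": mean([mean(hugging), mean(face)]),  # mean of per-word means
}

for strategy, score in scores.items():
    print(f"{strategy}: {score:.4f}")
```

The "simple" and "average" values differ here because “Hugging” is split into two tokens, so it weighs twice as much as “Face” in the per-token mean but only once in the per-word mean.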

From inputs to predictions

First we need to tokenize our input and pass it through the model. This is done exactly as in Chapter 2; we instantiate the tokenizer and the model using the AutoXxx classes and then use them on our example:

from transformers import AutoTokenizer, AutoModelForTokenClassification

model_checkpoint = "dbmdz/bert-large-cased-finetuned-conll03-english"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
model = AutoModelForTokenClassification.from_pretrained(model_checkpoint)

example = "My name is Sylvain and I work at Hugging Face in Brooklyn."
inputs = tokenizer(example, return_tensors="pt")
outputs = model(**inputs)

Since we're using AutoModelForTokenClassification here, we get one set of logits for each token in the input sequence:

print(inputs["input_ids"].shape)
print(outputs.logits.shape)

torch.Size([1, 19])
torch.Size([1, 19, 9])

We have a batch with 1 sequence of 19 tokens and the model has 9 different labels, so the output of the model has a shape of 1 x 19 x 9. Like for the text classification pipeline, we use a softmax function to convert those logits to probabilities, and we take the argmax to get predictions (note that we can take the argmax on the logits because the softmax does not change the order):
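The claim that softmax preserves the ordering of the logits can be verified on a small example. The logits below are made-up values for a single token (not real model outputs), and softmax is written in plain Python rather than with torch so the snippet runs on its own.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits for one token over the 9 labels.
logits = [-1.2, 0.3, -0.5, 0.1, 6.8, -0.9, 0.4, -2.0, 0.7]
probabilities = softmax(logits)

print(argmax(logits), argmax(probabilities))
print(round(sum(probabilities), 6))
```

Since the exponential is strictly increasing and every value is divided by the same total, the largest logit always yields the largest probability.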

import torch

probabilities = torch.nn.functional.softmax(outputs.logits, dim=-1)[0].tolist()
predictions = outputs.logits.argmax(dim=-1)[0].tolist()
print(predictions)

[0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]

The model.config.id2label attribute contains the mapping of indexes to labels that we can use to make sense of the predictions:

model.config.id2label

{0: 'O',
 1: 'B-MISC',
 2: 'I-MISC',
 3: 'B-PER',
 4: 'I-PER',
 5: 'B-ORG',
 6: 'I-ORG',
 7: 'B-LOC',
 8: 'I-LOC'}

As we saw earlier, there are 9 labels: O is the label for the tokens that are not in any named entity (it stands for “outside”), and we then have two labels for each type of entity (miscellaneous, person, organization, and location). The label B-XXX indicates the token is at the beginning of an entity XXX and the label I-XXX indicates the token is inside the entity XXX. For instance, in the current example we would expect our model to classify the token S as B-PER (beginning of a person entity) and the tokens ##yl, ##va and ##in as I-PER (inside a person entity).
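Applying this mapping to the predictions printed earlier makes them readable. The sketch below hardcodes the id2label dictionary, the predictions, and the tokens from the outputs above, so it runs without loading the model.

```python
# Values copied from the outputs shown earlier in this section.
id2label = {0: 'O', 1: 'B-MISC', 2: 'I-MISC', 3: 'B-PER', 4: 'I-PER',
            5: 'B-ORG', 6: 'I-ORG', 7: 'B-LOC', 8: 'I-LOC'}
predictions = [0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 6, 6, 6, 0, 8, 0, 0]
tokens = ['[CLS]', 'My', 'name', 'is', 'S', '##yl', '##va', '##in', 'and', 'I',
          'work', 'at', 'Hu', '##gging', 'Face', 'in', 'Brooklyn', '.', '[SEP]']

# Translate each predicted index into its label.
labels = [id2label[p] for p in predictions]

# Show only the tokens that belong to an entity.
for token, label in zip(tokens, labels):
    if label != "O":
        print(f"{token:10} {label}")
```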

You might think the model was wrong in this case as it gave the label I-PER to all four of these tokens, but that's not entirely true. There are actually two formats for those B- and I- labels: IOB1 and IOB2. The IOB2 format is the one we introduced above, whereas in the IOB1 format the labels beginning with B- are only ever used to separate two adjacent entities of the same type. The model we are using was fine-tuned on a dataset that uses the IOB1 format, which is why it assigns the label I-PER to the S token.
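The difference between the two formats can be made concrete with a small converter. The sketch below, a hypothetical helper rather than anything provided by 🤗 Transformers, rewrites IOB1 tags as IOB2 by turning an I- tag into a B- tag whenever it opens a new entity.

```python
def iob1_to_iob2(tags):
    """Convert a sequence of IOB1 tags to IOB2."""
    converted = []
    previous = "O"
    for tag in tags:
        if tag.startswith("I-"):
            # An I- tag opens a new entity if the previous tag was O
            # or belonged to an entity of a different type.
            if previous == "O" or previous[2:] != tag[2:]:
                converted.append("B-" + tag[2:])
            else:
                converted.append(tag)
        else:
            converted.append(tag)  # O and B- tags are unchanged
        previous = tag
    return converted


# IOB1 tags for the tokens S, ##yl, ##va, ##in, and, I.
iob1 = ["I-PER", "I-PER", "I-PER", "I-PER", "O", "O"]
print(iob1_to_iob2(iob1))
```

In IOB2 the first token of “Sylvain” becomes B-PER, which is the labeling we expected from the model; the IOB1 output carries the same information because no two entities of the same type are adjacent in this sentence.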

With this map, we are ready to reproduce (almost entirely) the results of the first pipeline: we can just grab the score and label of each token that was not classified as O:

results = []
tokens = inputs.tokens()

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        results.append(
            {
                "entity": label,
                "score": probabilities[idx][pred],
                "index": idx,
                "word": tokens[idx],
            }
        )

print(results)

[{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S'},
 {'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl'},
 {'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va'},
 {'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in'},
 {'entity': 'I-ORG', 'score': 0.97389334, 'index': 12, 'word': 'Hu'},
 {'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging'},
 {'entity': 'I-ORG', 'score': 0.98879766, 'index': 14, 'word': 'Face'},
 {'entity': 'I-LOC', 'score': 0.99321055, 'index': 16, 'word': 'Brooklyn'}]

This is very similar to what we had before, with one exception: the pipeline also gave us information about the start and end of each entity in the original sentence. This is where our offset mapping will come into play. To get the offsets, we just have to set return_offsets_mapping=True when we apply the tokenizer to our inputs:

inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
inputs_with_offsets["offset_mapping"]

[(0, 0), (0, 2), (3, 7), (8, 10), (11, 12), (12, 14), (14, 16), (16, 18), (19, 22), (23, 24), (25, 29), (30, 32),
 (33, 35), (35, 40), (41, 45), (46, 48), (49, 57), (57, 58), (0, 0)]

Each tuple is the span of text corresponding to each token, where (0, 0) is reserved for the special tokens. We saw before that the token at index 5 is ##yl, which has (12, 14) as offsets here. If we grab the corresponding slice in our example:

example[12:14]

we get the proper span of text without the ##:

yl

Using this, we can now complete the previous results:

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

for idx, pred in enumerate(predictions):
    label = model.config.id2label[pred]
    if label != "O":
        start, end = offsets[idx]
        results.append(
            {
                "entity": label,
                "score": probabilities[idx][pred],
                "index": idx,
                "word": tokens[idx],
                "start": start,
                "end": end,
            }
        )

print(results)

[{'entity': 'I-PER', 'score': 0.9993828, 'index': 4, 'word': 'S', 'start': 11, 'end': 12},
 {'entity': 'I-PER', 'score': 0.99815476, 'index': 5, 'word': '##yl', 'start': 12, 'end': 14},
 {'entity': 'I-PER', 'score': 0.99590725, 'index': 6, 'word': '##va', 'start': 14, 'end': 16},
 {'entity': 'I-PER', 'score': 0.9992327, 'index': 7, 'word': '##in', 'start': 16, 'end': 18},
 {'entity': 'I-ORG', 'score': 0.97389334, 'index': 12, 'word': 'Hu', 'start': 33, 'end': 35},
 {'entity': 'I-ORG', 'score': 0.976115, 'index': 13, 'word': '##gging', 'start': 35, 'end': 40},
 {'entity': 'I-ORG', 'score': 0.98879766, 'index': 14, 'word': 'Face', 'start': 41, 'end': 45},
 {'entity': 'I-LOC', 'score': 0.99321055, 'index': 16, 'word': 'Brooklyn', 'start': 49, 'end': 57}]

This is the same as what we got from the first pipeline!

Grouping entities

Using the offsets to determine the start and end keys for each entity is handy, but that information isn't strictly necessary. When we want to group the entities together, however, the offsets will save us a lot of messy code. For example, if we wanted to group together the tokens Hu, ##gging, and Face, we could make special rules that say the first two should be attached while removing the ##, and that Face should be added with a space since it does not begin with ##, but that would only work for this particular type of tokenizer. We would have to write another set of rules for a SentencePiece or a Byte-Pair-Encoding tokenizer (discussed later in this chapter).
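For comparison, this is roughly what such hand-written rules look like for a WordPiece tokenizer. The helper is hypothetical and deliberately naive: it only knows about the ## convention, so it would give wrong results for tokenizers that mark word boundaries differently.

```python
def join_wordpiece(tokens):
    """Rebuild text from WordPiece tokens using the ## continuation rule."""
    text = ""
    for token in tokens:
        if token.startswith("##"):
            text += token[2:]    # continuation: attach without a space
        elif text:
            text += " " + token  # new word: separate with a space
        else:
            text = token         # first token of the group
    return text


print(join_wordpiece(["Hu", "##gging", "Face"]))
```

Note that this also assumes words are separated by exactly one space, which the tokens alone cannot confirm.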

With the offsets, all that custom code goes away: we can just take the span in the original text that begins with the first token and ends with the last token. So, in the case of the tokens Hu, ##gging, and Face, we should start at character 33 (the beginning of Hu) and end before character 45 (the end of Face):

example[33:45]

Hugging Face

To write the code that post-processes the predictions while grouping entities, we will group together entities that are consecutive and labeled with I-XXX, except for the first one, which can be labeled as B-XXX or I-XXX (so, we stop grouping an entity when we get an O, a new type of entity, or a B-XXX that tells us an entity of the same type is starting):

import numpy as np

results = []
inputs_with_offsets = tokenizer(example, return_offsets_mapping=True)
tokens = inputs_with_offsets.tokens()
offsets = inputs_with_offsets["offset_mapping"]

idx = 0
while idx < len(predictions):
    pred = predictions[idx]
    label = model.config.id2label[pred]
    if label == "O":
        idx += 1
        continue

    # Remove the B- or I- prefix to get the entity type.
    label = label[2:]
    # The first token of a group may be labeled B-XXX or I-XXX.
    start, end = offsets[idx]
    all_scores = [probabilities[idx][pred]]
    idx += 1

    # Grab all the following tokens labeled with I-label.
    while (
        idx < len(predictions)
        and model.config.id2label[predictions[idx]] == f"I-{label}"
    ):
        all_scores.append(probabilities[idx][predictions[idx]])
        _, end = offsets[idx]
        idx += 1

    # The score is the mean of the scores of all the tokens in the grouped entity.
    score = np.mean(all_scores).item()
    word = example[start:end]
    results.append(
        {
            "entity_group": label,
            "score": score,
            "word": word,
            "start": start,
            "end": end,
        }
    )

print(results)

And we get the same results as with our second pipeline!

[{'entity_group': 'PER', 'score': 0.9981694, 'word': 'Sylvain', 'start': 11, 'end': 18},
 {'entity_group': 'ORG', 'score': 0.97960204, 'word': 'Hugging Face', 'start': 33, 'end': 45},
 {'entity_group': 'LOC', 'score': 0.99321055, 'word': 'Brooklyn', 'start': 49, 'end': 57}]

Another example of a task where these offsets are extremely useful is question answering. Diving into that pipeline, which we'll do in the next section, will also enable us to take a look at one last feature of the tokenizers in the 🤗 Transformers library: dealing with overflowing tokens when we truncate an input to a given length.

Glossary

Tokenizer: A tool or process that splits text (or other data) into tokens so that AI models can process it.
🤗 Transformers: A library from Hugging Face for building and using state-of-the-art Transformer models in areas such as Natural Language Processing (NLP), computer vision, and audio processing.
🤗 Tokenizers Library: A Hugging Face library written in Rust that performs fast, efficient tokenization.
Token-classification: Assigning a class to each token in a text sequence (as in Named Entity Recognition, NER).
NER (Named Entity Recognition): Finding named entities in text, such as the names of people, places, and organizations.
Question Answering: Finding the answer to a question in a text document.
Pipeline: A high-level API in the 🤗 Transformers library for using a model; it combines tokenization, model inference, and post-processing.
Slow Tokenizers: Tokenizers implemented in Python.
Fast Tokenizers: Tokenizers implemented in Rust, which are much faster than the Python-based “slow” tokenizers when processing many texts in a batch.
Rust: A systems programming language used to build high-performance applications.
Batch Encoding: The special object a tokenizer returns as output; it holds the encoded inputs along with other useful methods and data (e.g., word_ids(), offset_mapping).
BatchEncoding Object: The type of object a tokenizer returns after encoding.
Subclass: A new class that inherits the features of an existing class.
Offset Mapping: A mapping that gives, for each token, the start and end character indices of the span it covers in the original text.
AutoTokenizer: A class in the Hugging Face Transformers library that automatically loads the appropriate tokenizer from a model name.
bert-base-cased: The checkpoint identifier for the base version of the BERT model (cased version).
is_fast Attribute: An attribute that indicates whether a tokenizer is a fast tokenizer.
tokens() Method: A method of the BatchEncoding object that returns the list of tokens.
##yl: A subword token; in BERT-like tokenizers, the ## prefix marks a token that continues a word rather than starting one.
word_ids() Method: A method of the BatchEncoding object that returns, for each token, the index of the word in the original text it comes from.
Special Tokens: Tokens that have a specific meaning for the tokenizer or model (e.g., [CLS], [SEP], [PAD]).
[CLS] Token: A special token that marks the start of a sequence in BERT models.
[SEP] Token: A special token used in BERT models to mark the end of a sentence or to separate two sentences.
BERT-like Tokenizers: Tokenizers that use the same kind of subword tokenization as the BERT model.
Part-of-Speech (POS) Tagging: Labeling each word in a sentence with its grammatical role (e.g., noun, verb, adjective).
Masked Language Modeling (MLM): A task in which some words in a sentence are hidden and the model is trained to predict them.
Whole Word Masking: In Masked Language Modeling (MLM), masking all the tokens that make up an original word instead of masking a single token.
Contraction: Combining two or more words into a single shortened form (e.g., I will -> I’ll).
Pre-tokenization Operation: A preliminary step that splits the text before the tokenization model is applied (e.g., splitting on spaces or punctuation).
roberta-base: The checkpoint identifier for the base version of the RoBERTa model.
sequence_ids() Method: A method of the BatchEncoding object that returns, for each token, which sentence (input sequence) it comes from.
token_type_ids: In sentence-pair tasks, IDs indicating which sentence (first or second) each token of the input sequence belongs to.
word_to_chars() Method: A method of the BatchEncoding object that maps a word index to its start and end character indices in the original text.
token_to_chars() Method: A method of the BatchEncoding object that maps a token index to its start and end character indices in the original text.
char_to_word() Method: A method of the BatchEncoding object that maps a character index to the index of the word containing it.
char_to_token() Method: A method of the BatchEncoding object that maps a character index to the index of the token containing it.
Token-classification Pipeline: A pipeline, created with the pipeline() function, that performs the token classification task.
Post-processing: The process of preparing a model's outputs for final use.
dbmdz/bert-large-cased-finetuned-conll03-english: The Hugging Face Hub ID of a BERT Large cased model fine-tuned on the CoNLL-2003 dataset.
aggregation_strategy="simple": A strategy for grouping entities in the token-classification pipeline; the score of a grouped entity is the mean of the scores of the tokens it contains.
mean (Average): The average value.
first Strategy: A strategy that takes the score of the first token as the score of the grouped entity.
max Strategy: A strategy that takes the highest token score as the score of the grouped entity.
average Strategy: A strategy that averages the scores of the words making up the grouped entity (because of subword tokenization, the result can differ from "simple").
AutoModelForTokenClassification: A class in the Hugging Face Transformers library that automatically loads the right model class for the token classification task.
model_checkpoint: The ID of a pretrained model (e.g., “dbmdz/bert-large-cased-finetuned-conll03-english”).
return_tensors="pt" / "tf": Tells the tokenizer to return its output tensors in PyTorch ("pt") or TensorFlow ("tf") format.
outputs.logits: The raw, unnormalized scores output by the model.
torch.Size / (1, 19) / (1, 19, 9): Describes the shape of tensors.
Softmax Function: A mathematical function that converts a set of numbers (logits) into a probability distribution (values that sum to 1).
Argmax: A function that returns the index of the largest value in an array.
dim=-1 / axis=-1: Specifies that an operation is applied along the last dimension (axis) of a tensor.
model.config.id2label: A dictionary in the model's configuration object that maps IDs to labels.
O Label: In Named Entity Recognition (NER), the label for tokens that do not belong to any entity (“Outside” an entity).
B-XXX Label: In Named Entity Recognition (NER), the label for the first token of an entity of type XXX (“Beginning” of an entity).
I-XXX Label: In Named Entity Recognition (NER), the label for a token inside an entity of type XXX (“Inside” an entity).
Miscellaneous (MISC): Entities that do not fall under the other types.
Person (PER): An entity representing the name of a person.
Organization (ORG): An entity representing the name of an organization.
Location (LOC): An entity representing the name of a place.
IOB1 Format: A labeling scheme used in Named Entity Recognition (NER) in which B- labels are used only when two entities of the same type are adjacent.
IOB2 Format: A labeling scheme used in Named Entity Recognition (NER) in which a B- label is always used for the first token of an entity.
return_offsets_mapping=True: Tells the tokenizer to include the offset mapping in its output.
Tuple: An immutable collection of elements in Python, such as a (start, end) pair.
numpy: A Python library for scientific computing.
np.mean(): A NumPy function that computes the mean of the elements of an array.
item() Method: A method that converts a single-element PyTorch tensor or NumPy value into a standard Python type.
Consecutive: Occurring one right after another.
Truncate: Cutting a text sequence down to a given length.
Overflowing Tokens: Tokens that are cut off by truncation because the input exceeds the maximum length.
