course documentation

Models

course

Join the Hugging Face community

and get access to the augmented documentation experience

Collaborate on models, datasets and Spaces

Faster examples with accelerated inference

Switch between documentation themes

to get started

Pytorch TensorFlow

Models

ဒီအပိုင်းမှာတော့ model တွေကို ဘယ်လိုဖန်တီးရမလဲ၊ အသုံးပြုရမလဲဆိုတာကို ပိုမိုနက်နဲစွာ လေ့လာသွားပါမယ်။ checkpoint တစ်ခုကနေ မည်သည့် model ကိုမဆို instantiate လုပ်ချင်တဲ့အခါ အသုံးဝင်တဲ့ AutoModel class ကို ကျွန်တော်တို့ အသုံးပြုပါမယ်။

Transformer တစ်ခုကို ဖန်တီးခြင်း

AutoModel တစ်ခုကို instantiate လုပ်တဲ့အခါ ဘာတွေဖြစ်ပျက်လဲဆိုတာကို ကြည့်ခြင်းဖြင့် စတင်လိုက်ရအောင်။

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-cased")

tokenizer နဲ့ ဆင်တူစွာ၊ from_pretrained() method က Hugging Face Hub ကနေ model data တွေကို download လုပ်ပြီး cache လုပ်ပါလိမ့်မယ်။ ယခင်က ဖော်ပြခဲ့သလိုပဲ၊ checkpoint name က သီးခြား model architecture နဲ့ weights တွေနဲ့ ကိုက်ညီပါတယ်။ ဒီဥပမာမှာတော့ basic architecture (12 layers, 768 hidden size, 12 attention heads) နဲ့ cased inputs (ဆိုလိုသည်မှာ စာလုံးအကြီးအသေး ခွဲခြားမှုက အရေးကြီးသည်) ပါဝင်တဲ့ BERT model တစ်ခု ဖြစ်ပါတယ်။ Hub မှာ ရရှိနိုင်တဲ့ checkpoints များစွာ ရှိပါတယ် - ဒီနေရာမှာ ရှာဖွေနိုင်ပါတယ်။

AutoModel class နဲ့ ၎င်းရဲ့ ဆက်စပ် classes တွေဟာ တကယ်တော့ ပေးထားတဲ့ checkpoint အတွက် သင့်လျော်တဲ့ model architecture ကို ရယူဖို့ ဒီဇိုင်းထုတ်ထားတဲ့ ရိုးရှင်းတဲ့ wrappers တွေပါ။ ဒါက “auto” class တစ်ခုဖြစ်ပြီး သင့်အတွက် သင့်လျော်တဲ့ model architecture ကို ခန့်မှန်းပြီး မှန်ကန်တဲ့ model class ကို instantiate လုပ်ပေးပါလိမ့်မယ်။ သို့သော်၊ သင်အသုံးပြုချင်တဲ့ model အမျိုးအစားကို သိရှိထားတယ်ဆိုရင်၊ ၎င်းရဲ့ architecture ကို တိုက်ရိုက်သတ်မှတ်ပေးတဲ့ class ကို အသုံးပြုနိုင်ပါတယ်။

from transformers import BertModel

model = BertModel.from_pretrained("bert-base-cased")

Loading နှင့် Saving

Model တစ်ခုကို save လုပ်တာက tokenizer တစ်ခုကို save လုပ်တာလိုပဲ ရိုးရှင်းပါတယ်။ တကယ်တော့၊ model တွေမှာ model ရဲ့ weights တွေနဲ့ architecture configuration တွေကို save လုပ်ပေးတဲ့ save_pretrained() method တူတူကို ပိုင်ဆိုင်ထားပါတယ်။

model.save_pretrained("directory_on_my_computer")

ဒါက သင့် disk ထဲမှာ ဖိုင်နှစ်ခုကို save လုပ်ပါလိမ့်မယ်။

ls directory_on_my_computer

config.json model.safetensors

config.json ဖိုင်ထဲကို ကြည့်လိုက်ရင်၊ model architecture ကို တည်ဆောက်ဖို့ လိုအပ်တဲ့ attributes အားလုံးကို တွေ့ရပါလိမ့်မယ်။ ဒီဖိုင်ထဲမှာ checkpoint ဘယ်ကနေ စတင်ခဲ့သလဲ၊ နောက်ဆုံး checkpoint ကို save လုပ်ခဲ့တုန်းက သင်အသုံးပြုခဲ့တဲ့ 🤗 Transformers version စတဲ့ metadata အချို့လည်း ပါဝင်ပါတယ်။

pytorch_model.safetensors ဖိုင်ကို state dictionary လို့ခေါ်ပါတယ်။ ၎င်းထဲမှာ သင့် model ရဲ့ weights အားလုံး ပါဝင်ပါတယ်။ ဖိုင်နှစ်ခုစလုံး အတူတူ အလုပ်လုပ်ပါတယ်- configuration file က model architecture အကြောင်း သိရှိဖို့ လိုအပ်ပြီး၊ model weights တွေကတော့ model ရဲ့ parameters တွေ ဖြစ်ပါတယ်။

save လုပ်ထားတဲ့ model တစ်ခုကို ပြန်လည်အသုံးပြုဖို့အတွက် from_pretrained() method ကို ထပ်မံအသုံးပြုပါ။

from transformers import AutoModel

model = AutoModel.from_pretrained("directory_on_my_computer")

🤗 Transformers library ရဲ့ အံ့ဖွယ်ကောင်းတဲ့ အင်္ဂါရပ်တစ်ခုကတော့ model တွေနဲ့ tokenizers တွေကို community နဲ့ အလွယ်တကူ မျှဝေနိုင်စွမ်းပါပဲ။ ဒါကို လုပ်ဖို့ Hugging Face မှာ account ရှိဖို့ သေချာပါစေ။ သင် notebook ကို အသုံးပြုနေတယ်ဆိုရင်၊ ဒါနဲ့ အလွယ်တကူ log in လုပ်ဆောင်နိုင်ပါတယ်။

from huggingface_hub import notebook_login

notebook_login()

မဟုတ်ရင်တော့ သင့် terminal မှာ အောက်ပါအတိုင်း run ပါ။

huggingface-cli login

အဲဒီနောက် push_to_hub() method နဲ့ model ကို Hub ကို push လုပ်နိုင်ပါတယ်။

model.push_to_hub("my-awesome-model")

ဒါက model files တွေကို Hub ကို upload လုပ်ပါလိမ့်မယ်။ သင့် namespace အောက်မှာ my-awesome-model လို့ နာမည်ပေးထားတဲ့ repository ထဲမှာပါ။ အဲဒီနောက်၊ မည်သူမဆို သင့် model ကို from_pretrained() method နဲ့ load လုပ်နိုင်ပါပြီ။

from transformers import AutoModel

model = AutoModel.from_pretrained("your-username/my-awesome-model")

Hub API နဲ့ ပိုပြီး လုပ်ဆောင်နိုင်တာတွေ အများကြီး ရှိပါတယ်-

local repository ကနေ model တစ်ခုကို push လုပ်ခြင်း
အားလုံးကို ပြန်လည် upload မလုပ်ဘဲ သီးခြား files များကို update လုပ်ခြင်း
model ရဲ့ စွမ်းဆောင်ရည်၊ ကန့်သတ်ချက်များ၊ သိထားတဲ့ bias စသည်တို့ကို မှတ်တမ်းတင်ဖို့ model cards တွေ ထည့်သွင်းခြင်း

ဒီအကြောင်းအရာတွေအတွက် ပြည့်စုံတဲ့ tutorial ကို documentation မှာ ကြည့်ရှုနိုင်ပါတယ်၊ ဒါမှမဟုတ် အဆင့်မြင့် Chapter 4 ကို လေ့လာနိုင်ပါတယ်။

Text များကို Encoding လုပ်ခြင်း

Transformer မော်ဒယ်တွေက inputs တွေကို ဂဏန်းတွေအဖြစ် ပြောင်းလဲခြင်းဖြင့် text တွေကို ကိုင်တွယ်ပါတယ်။ ဒီနေရာမှာ သင်ရဲ့ text ကို tokenizer က ဘယ်လိုလုပ်ဆောင်တယ်ဆိုတာကို အတိအကျ ကြည့်ပါမယ်။ Chapter 1 မှာ tokenizer တွေက text ကို tokens တွေအဖြစ် ပိုင်းခြားပြီး အဲဒီ tokens တွေကို ဂဏန်းတွေအဖြစ် ပြောင်းလဲတယ်ဆိုတာကို ကျွန်တော်တို့ တွေ့ခဲ့ရပါပြီ။ ဒီ conversion ကို ရိုးရှင်းတဲ့ tokenizer တစ်ခုနဲ့ ကြည့်နိုင်ပါတယ်။

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

encoded_input = tokenizer("Hello, I'm a single sentence!")
print(encoded_input)

{'input_ids': [101, 8667, 117, 1000, 1045, 1005, 1049, 2235, 17662, 12172, 1012, 102], 
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

ကျွန်တော်တို့ဟာ အောက်ပါ fields တွေပါဝင်တဲ့ dictionary တစ်ခုကို ရရှိပါတယ် -

input_ids: သင့် tokens တွေရဲ့ ဂဏန်းဆိုင်ရာ ကိုယ်စားပြုမှုများ
token_type_ids: ဒါတွေက model ကို input ရဲ့ ဘယ်အပိုင်းက sentence A ဖြစ်ပြီး ဘယ်အပိုင်းက sentence B ဖြစ်တယ်ဆိုတာ ပြောပြပါတယ် (နောက်အပိုင်းမှာ ပိုမိုဆွေးနွေးပါမယ်)
attention_mask: ဒါက မည်သည့် tokens များကို အာရုံစိုက်သင့်ပြီး မည်သည့် tokens များကို အာရုံစိုက်ရန် မလိုအပ်ကြောင်း ဖော်ပြပါတယ် (နောက်မှ ပိုမိုဆွေးနွေးပါမယ်)

မူရင်း text ကို ပြန်ရဖို့အတွက် input IDs တွေကို decode လုပ်နိုင်ပါတယ်။

tokenizer.decode(encoded_input["input_ids"])

"[CLS] Hello, I'm a single sentence! [SEP]"

tokenizer က model လိုအပ်တဲ့ special tokens တွေဖြစ်တဲ့ [CLS] နဲ့ [SEP] တွေကို ထည့်သွင်းပေးထားတာကို သင်သတိထားမိပါလိမ့်မယ်။ မော်ဒယ်အားလုံးက special tokens တွေ လိုအပ်တာ မဟုတ်ပါဘူး။ ၎င်းတို့ကို မော်ဒယ်ကို pretrained လုပ်ခဲ့တုန်းက အသုံးပြုခဲ့ရင် အသုံးဝင်ပါတယ်။ အဲဒီအခါမှာတော့ tokenizer က ဒီ tokens တွေကို model က မျှော်လင့်ထားတဲ့အတိုင်း ထည့်ပေးဖို့ လိုအပ်ပါတယ်။

စာကြောင်းများစွာကို တစ်ကြိမ်တည်း encode လုပ်နိုင်ပါတယ်၊ ဒါကို batch လုပ်ခြင်းဖြင့် (ဒီအကြောင်းကို မကြာမီ ဆွေးနွေးပါမယ်) ဒါမှမဟုတ် list တစ်ခု ပေးပို့ခြင်းဖြင့် လုပ်ဆောင်နိုင်ပါတယ်။

encoded_input = tokenizer("How are you?", "I'm fine, thank you!")
print(encoded_input)

{'input_ids': [[101, 1731, 1132, 1128, 136, 102], [101, 1045, 1005, 1049, 2503, 117, 5763, 1128, 136, 102]], 
 'token_type_ids': [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], 
 'attention_mask': [[1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]}

စာကြောင်းများစွာကို ပေးပို့တဲ့အခါ၊ tokenizer က dictionary value တစ်ခုစီအတွက် စာကြောင်းတစ်ခုစီအတွက် list တစ်ခု ပြန်ပေးတာကို သတိပြုပါ။ tokenizer ကို PyTorch ကနေ tensors တွေကို တိုက်ရိုက်ပြန်ပေးဖို့လည်း တောင်းဆိုနိုင်ပါတယ်။

encoded_input = tokenizer("How are you?", "I'm fine, thank you!", return_tensors="pt")
print(encoded_input)

{'input_ids': tensor([[  101,  1731,  1132,  1128,   136,   102],
         [  101,  1045,  1005,  1049,  2503,   117,  5763,  1128,   136,   102]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

ဒါပေမယ့် ပြဿနာတစ်ခု ရှိပါတယ်- list နှစ်ခုရဲ့ အရှည်က မတူပါဘူး။ Arrays နဲ့ tensors တွေဟာ ထောင့်မှန်ပုံစံ (rectangular shapes) ဖြစ်ဖို့ လိုအပ်ပါတယ်။ ဒါကြောင့် ဒီ list တွေကို PyTorch tensor (သို့မဟုတ် NumPy array) အဖြစ် ရိုးရှင်းစွာ ပြောင်းလဲလို့ မရပါဘူး။ tokenizer က ဒါအတွက် ရွေးချယ်စရာတစ်ခု ပေးထားပါတယ်၊ padding ပါ။

Inputs တွေကို Padding လုပ်ခြင်း

ကျွန်တော်တို့ inputs တွေကို pad လုပ်ဖို့ tokenizer ကို တောင်းဆိုရင်၊ ၎င်းက အရှည်ဆုံးစာကြောင်းထက် တိုနေတဲ့ စာကြောင်းတွေမှာ special padding token တွေ ထည့်သွင်းခြင်းဖြင့် စာကြောင်းအားလုံးကို အရှည်တူညီအောင် ပြုလုပ်ပေးပါလိမ့်မယ်။

encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"], padding=True, return_tensors="pt"
)
print(encoded_input)

{'input_ids': tensor([[  101,  1731,  1132,  1128,   136,   102,     0,     0,     0,     0],
         [  101,  1045,  1005,  1049,  2503,   117,  5763,  1128,   136,   102]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
         [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

အခု ကျွန်တော်တို့မှာ ထောင့်မှန်ပုံစံ tensors တွေ ရပါပြီ။ padding tokens တွေကို ID 0 နဲ့ input IDs တွေအဖြစ် encode လုပ်ထားပြီး၊ ၎င်းတို့မှာ attention mask value ကလည်း 0 ဖြစ်တာကို သတိပြုပါ။ ဒါက ဘာလို့လဲဆိုတော့ အဲဒီ padding tokens တွေကို model က analyze လုပ်ဖို့ မလိုအပ်ပါဘူး- ၎င်းတို့က တကယ့်စာကြောင်းရဲ့ အစိတ်အပိုင်းတွေ မဟုတ်ပါဘူး။

Inputs တွေကို Truncating လုပ်ခြင်း

Tensors တွေက model က လုပ်ဆောင်ဖို့အတွက် အရမ်းကြီးလာနိုင်ပါတယ်။ ဥပမာအားဖြင့်၊ BERT ကို အများဆုံး tokens 512 ခုအထိပဲ sequences တွေနဲ့ pretrain လုပ်ထားတာကြောင့် ပိုရှည်တဲ့ sequences တွေကို လုပ်ဆောင်လို့ မရပါဘူး။ သင့်မှာ model က ကိုင်တွယ်နိုင်တာထက် ပိုရှည်တဲ့ sequences တွေရှိရင်၊ truncation parameter နဲ့ ၎င်းတို့ကို ဖြတ်တောက်ဖို့ လိုအပ်ပါလိမ့်မယ်-

encoded_input = tokenizer(
    "This is a very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very very long sentence.",
    truncation=True,
)
print(encoded_input["input_ids"])

[101, 1188, 1110, 170, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1505, 1179, 5650, 119, 102]

padding နဲ့ truncation arguments တွေကို ပေါင်းစပ်ခြင်းဖြင့်၊ သင်လိုအပ်တဲ့ တိကျတဲ့ size ရှိတဲ့ tensors တွေကို ရရှိကြောင်း သေချာစေနိုင်ပါတယ်။

encoded_input = tokenizer(
    ["How are you?", "I'm fine, thank you!"],
    padding=True,
    truncation=True,
    max_length=5,
    return_tensors="pt",
)
print(encoded_input)

{'input_ids': tensor([[  101,  1731,  1132,  1128,   102],
         [  101,  1045,  1005,  1049,   102]]), 
 'token_type_ids': tensor([[0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0]]), 
 'attention_mask': tensor([[1, 1, 1, 1, 1],
         [1, 1, 1, 1, 1]])}

Special Tokens တွေ ထည့်သွင်းခြင်း

Special tokens တွေ (သို့မဟုတ် ၎င်းတို့ရဲ့ သဘောတရား) ဟာ BERT နဲ့ ဆင်းသက်လာတဲ့ မော်ဒယ်တွေအတွက် အထူးအရေးကြီးပါတယ်။ ဒီ tokens တွေကို စာကြောင်းရဲ့ အစ ( [CLS] ) ဒါမှမဟုတ် စာကြောင်းတွေကြားက ပိုင်းခြားတဲ့နေရာ ( [SEP] ) လိုမျိုး စာကြောင်းရဲ့ နယ်နိမိတ်တွေကို ပိုမိုကောင်းမွန်စွာ ကိုယ်စားပြုနိုင်ဖို့ ထည့်သွင်းထားတာပါ။ ရိုးရှင်းတဲ့ ဥပမာတစ်ခုကို ကြည့်ကြည့်ရအောင်။

encoded_input = tokenizer("How are you?")
print(encoded_input["input_ids"])
tokenizer.decode(encoded_input["input_ids"])

[101, 1731, 1132, 1128, 136, 102]
'[CLS] How are you? [SEP]'

ဒီ special tokens တွေကို tokenizer က အလိုအလျောက် ထည့်သွင်းပေးပါတယ်။ မော်ဒယ်အားလုံးက special tokens တွေ လိုအပ်တာ မဟုတ်ပါဘူး။ ၎င်းတို့ကို မော်ဒယ်ကို pretrained လုပ်ခဲ့တုန်းက အသုံးပြုခဲ့ရင် အဓိကအားဖြင့် အသုံးပြုပါတယ်။ အဲဒီအခါမှာတော့ tokenizer က ဒီ tokens တွေကို model က မျှော်လင့်ထားတဲ့အတိုင်း ထည့်ပေးပါလိမ့်မယ်။

ဒါတွေအားလုံး ဘာကြောင့် လိုအပ်တာလဲ။

ဒီနေရာမှာ တိကျတဲ့ ဥပမာတစ်ခု ရှိပါတယ်- encode လုပ်ထားတဲ့ sequences တွေကို စဉ်းစားကြည့်ပါ။

sequences = [
    "I've been waiting for a HuggingFace course my whole life.",
    "I hate this so much!",
]

tokenized လုပ်ပြီးတာနဲ့ ကျွန်တော်တို့မှာ အောက်ပါအတိုင်း ရှိပါတယ်။

encoded_sequences = [
    [
        101,
        1045,
        1005,
        2310,
        2042,
        3403,
        2005,
        1037,
        17662,
        12172,
        2607,
        2026,
        2878,
        2166,
        1012,
        102,
    ],
    [101, 1045, 5223, 2023, 2061, 2172, 999, 102],
]

ဒါက encode လုပ်ထားတဲ့ sequences တွေရဲ့ list တစ်ခုပါ- list of lists တစ်ခုပေါ့။ Tensors တွေက ထောင့်မှန်ပုံစံ (rectangular shapes) တွေကိုသာ လက်ခံပါတယ် (matrices တွေကို တွေးကြည့်ပါ)။ ဒီ “array” ဟာ ထောင့်မှန်ပုံစံ ဖြစ်နေပြီဆိုတော့ ဒါကို tensor အဖြစ် ပြောင်းလဲဖို့က လွယ်ကူပါတယ်။

import torch

model_inputs = torch.tensor(encoded_sequences)

Tensors များကို Model ၏ Inputs များအဖြစ် အသုံးပြုခြင်း

tensors တွေကို model နဲ့ အသုံးပြုတာက အလွန်ရိုးရှင်းပါတယ်- inputs တွေနဲ့ model ကို ခေါ်လိုက်ရုံပါပဲ။

output = model(model_inputs)

model က မတူညီတဲ့ arguments များစွာကို လက်ခံပေမယ့်၊ input IDs တွေကသာ လိုအပ်တဲ့ arguments တွေပါ။ အခြား arguments တွေက ဘာလုပ်တယ်၊ ဘယ်အချိန်မှာ လိုအပ်တယ်ဆိုတာကို နောက်မှ ရှင်းပြပါမယ်။ ဒါပေမယ့် ပထမဆုံး Transformer model က နားလည်နိုင်တဲ့ inputs တွေကို တည်ဆောက်တဲ့ tokenizers တွေအကြောင်းကို ပိုပြီး နက်နဲစွာ လေ့လာဖို့ လိုအပ်ပါတယ်။

ဝေါဟာရ ရှင်းလင်းချက် (Glossary)

Models: Artificial Intelligence (AI) နယ်ပယ်တွင် အသုံးပြုသော သင်္ချာဆိုင်ရာ ပုံစံများ သို့မဟုတ် algorithms များ။
AutoModel Class: Hugging Face Transformers library မှာ ပါဝင်တဲ့ class တစ်ခုဖြစ်ပြီး မော်ဒယ်အမည်ကို အသုံးပြုပြီး Transformer model ကို အလိုအလျောက် load လုပ်ပေးသည်။
Checkpoint: မော်ဒယ်တစ်ခု၏ လေ့ကျင့်မှုအခြေအနေ (weights, architecture configuration) ကို သတ်မှတ်ထားသော အချိန်တစ်ခုတွင် မှတ်တမ်းတင်ထားခြင်း။
from_pretrained() Method: Pre-trained model သို့မဟုတ် tokenizer ကို load လုပ်ရန် အသုံးပြုသော method။
bert-base-cased: BERT မော်ဒယ်၏ အမည်။ base သည် မော်ဒယ်၏ အရွယ်အစားကို ဖော်ပြပြီး cased သည် စာလုံးအကြီးအသေး ခွဲခြားမှုကို ထည့်သွင်းစဉ်းစားပြီး လေ့ကျင့်ထားကြောင်း ဖော်ပြသည်။
Hugging Face Hub: AI မော်ဒယ်တွေ၊ datasets တွေနဲ့ demo တွေကို အခြားသူတွေနဲ့ မျှဝေဖို့၊ ရှာဖွေဖို့နဲ့ ပြန်လည်အသုံးပြုဖို့အတွက် အွန်လိုင်း platform တစ်ခု ဖြစ်ပါတယ်။
Model Architecture: မော်ဒယ်တစ်ခု၏ ဖွဲ့စည်းပုံ၊ layer အမျိုးအစားများ၊ ၎င်းတို့ ချိတ်ဆက်ပုံ စသည်တို့ကို ဖော်ပြသော ဒီဇိုင်း။
Weights: Machine Learning မော်ဒယ်တစ်ခု၏ သင်ယူနိုင်သော အစိတ်အပိုင်းများ။ ၎င်းတို့သည် လေ့ကျင့်နေစဉ်အတွင်း ဒေတာများမှ ပုံစံများကို သင်ယူကာ ချိန်ညှိပေးသည်။
Layers: Neural Network တစ်ခု၏ အဆင့်များ။
Hidden Size: Hidden states vector တစ်ခု၏ dimension အရွယ်အစား။
Attention Heads: Transformer model ၏ attention mechanism တွင် ပါဝင်သော အပြိုင်လုပ်ဆောင်နိုင်သည့် အစိတ်အပိုင်းများ။
Cased Inputs: စာလုံးအကြီးအသေး ကွာခြားမှုကို ထည့်သွင်းစဉ်းစားသည့် inputs များ။
Wrappers: အခြား code များကို ပိုမိုလွယ်ကူစွာ အသုံးပြုနိုင်စေရန် ပတ်ခြုံပေးထားသော code များ။
BertModel Class: BERT model architecture ကို တိုက်ရိုက်သတ်မှတ်ပေးသော class။
save_pretrained() Method: Model သို့မဟုတ် tokenizer ၏ weights များနှင့် architecture configuration ကို save လုပ်ရန် အသုံးပြုသော method။
config.json: Model architecture တည်ဆောက်ရန် လိုအပ်သော attributes များနှင့် metadata များ ပါဝင်သော JSON ဖိုင်။
model.safetensors / pytorch_model.safetensors: Model ၏ weights များ ပါဝင်သော ဖိုင်။
State Dictionary: မော်ဒယ်တစ်ခု၏ သင်ယူထားသော parameters (weights) များကို သိုလှောင်ထားသော dictionary။
Metadata: ဒေတာအကြောင်းအရာနှင့်ပတ်သက်သော အချက်အလက်များ (ဥပမာ - ဘယ်ကနေ စတင်ခဲ့သည်၊ ဘယ် version ဖြင့် save လုပ်ခဲ့သည်)။
Hugging Face: AI နှင့် machine learning အတွက် tools များနှင့် platform များ ထောက်ပံ့ပေးသော ကုမ္ပဏီ။
huggingface_hub: Hugging Face Hub နှင့် ချိတ်ဆက်ရန်အတွက် Python library။
notebook_login(): Jupyter/Colab notebook များတွင် Hugging Face Hub ကို log in လုပ်ရန် အသုံးပြုသော function။
huggingface-cli login: terminal တွင် Hugging Face Hub ကို log in လုပ်ရန် အသုံးပြုသော command line tool။
push_to_hub() Method: model ကို Hugging Face Hub သို့ upload လုပ်ရန် အသုံးပြုသော method။
Repository: Hugging Face Hub ပေါ်ရှိ model files များ သို့မဟုတ် datasets များကို သိုလှောင်ထားသော နေရာ။
Namespace: Hugging Face Hub တွင် သုံးစွဲသူအကောင့် သို့မဟုတ် အဖွဲ့အစည်းအမည်။
Hub API: Hugging Face Hub နှင့် ပရိုဂရမ်ဖြင့် အပြန်အလှန် ချိတ်ဆက်ရန်အတွက် API။
Local Repository: သင့်ကွန်ပျူတာပေါ်ရှိ model files များ သို့မဟုတ် datasets များကို သိုလှောင်ထားသော နေရာ။
Model Cards: Hugging Face Hub ပေါ်ရှိ မော်ဒယ်တစ်ခု၏ အချက်အလက်များ၊ အသုံးပြုပုံနှင့် စွမ်းဆောင်ရည်များကို အကျဉ်းချုပ်ဖော်ပြထားသော စာမျက်နှာ။
AutoTokenizer Class: Hugging Face Transformers library မှာ ပါဝင်တဲ့ class တစ်ခုဖြစ်ပြီး မော်ဒယ်အမည်ကို အသုံးပြုပြီး သက်ဆိုင်ရာ tokenizer ကို အလိုအလျောက် load လုပ်ပေးသည်။
Tokens: စာသားကို ခွဲခြမ်းစိတ်ဖြာရာတွင် အသုံးပြုသော အသေးငယ်ဆုံးယူနစ်များ (ဥပမာ- စကားလုံးများ၊ subwords များ သို့မဟုတ် ပုဒ်ဖြတ်သံများ)။
encoded_input: Tokenizer ကနေ ထွက်လာတဲ့ encode လုပ်ထားတဲ့ input data။
input_ids: Tokenizer မှ ထုတ်ပေးသော tokens တစ်ခုစီ၏ ထူးခြားသော ဂဏန်းဆိုင်ရာ ID များ။
token_type_ids: Multi-sentence inputs တွေမှာ မည်သည့် token က မည်သည့် sentence (A သို့မဟုတ် B) မှ လာသည်ကို model ကို ပြောပြပေးသော IDs များ။
attention_mask: မော်ဒယ်ကို အာရုံစိုက်သင့်သည့် tokens များနှင့် လျစ်လျူရှုသင့်သည့် (padding) tokens များကို ခွဲခြားပေးသည့် binary mask။
tokenizer.decode() Method: Token IDs များကို မူရင်းစာသားသို့ ပြန်ပြောင်းလဲပေးသော method။
Special Tokens: Transformer model များက စာကြောင်းနယ်နိမိတ်များ သို့မဟုတ် အခြားအချက်အလက်များကို ကိုယ်စားပြုရန် အသုံးပြုသော အထူး tokens များ (ဥပမာ - [CLS], [SEP], [PAD])။
[CLS]: BERT မော်ဒယ်တွင် classification task အတွက် အသုံးပြုသော special token (စာကြောင်း၏ အစတွင် ပေါ်လာသည်)။
[SEP]: BERT မော်ဒယ်တွင် စာကြောင်းများကြား ပိုင်းခြားရန် အသုံးပြုသော special token။
Batching: မတူညီသော input များစွာကို တစ်ပြိုင်နက်တည်း လုပ်ဆောင်နိုင်ရန် အုပ်စုဖွဲ့ခြင်း။
Tensors: Machine Learning frameworks (PyTorch, TensorFlow) များတွင် ဒေတာများကို ကိုယ်စားပြုသော multi-dimensional array များ။
Rectangular Shapes: ညီညာသော အတန်းများနှင့် ကော်လံများပါဝင်သည့် ပုံစံ (matrices ကဲ့သို့)။
Arguments: function သို့မဟုတ် method တစ်ခုသို့ ပေးပို့သော တန်ဖိုးများ။

Update on GitHub

←Pipeline နောက်ကွယ်မှ အကြောင်းအရာများ Tokenizers→