Summarization

In this section we'll take a look at how Transformer models can be used to condense long documents into summaries, a task known as text summarization. This is one of the most challenging NLP tasks because it requires a range of abilities, such as understanding long passages and generating coherent text that captures the main topics in a document. However, when done well, text summarization is a powerful tool that can speed up various business processes by relieving domain experts of the burden of reading long documents in detail.

Although various fine-tuned models for summarization already exist on the Hugging Face Hub, almost all of them are only suitable for English documents. So, to add a twist in this section, we'll train a bilingual model for English and Spanish. By the end of this section, you'll have a model that can summarize customer reviews.

As we'll see, these summaries are concise because they're learned from the titles that customers provide in their product reviews. Let's start by putting together a suitable bilingual corpus for this task.

Preparing a multilingual corpus

We'll use the Multilingual Amazon Reviews Corpus to create our bilingual summarizer. This corpus consists of Amazon product reviews in six languages and is typically used to benchmark multilingual classifiers. However, since each review is accompanied by a short title, we can use the titles as the target summaries for our model to learn from! To get started, let's download the English and Spanish subsets from the Hugging Face Hub:

from datasets import load_dataset

spanish_dataset = load_dataset("amazon_reviews_multi", "es")
english_dataset = load_dataset("amazon_reviews_multi", "en")
english_dataset

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
})

As you can see, for each language there are 200,000 reviews in the train split and 5,000 reviews in each of the validation and test splits. The review information we are interested in is contained in the review_body and review_title columns. Let's take a look at a few examples by creating a simple function that takes a random sample from the training set, using the techniques we learned in Chapter 5:

def show_samples(dataset, num_samples=3, seed=42):
    sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"\n'>> Title: {example['review_title']}'")
        print(f"'>> Review: {example['review_body']}'")


show_samples(english_dataset)

'>> Title: Worked in front position, not rear'
'>> Review: 3 stars because these are not rear brakes as stated in the item description. At least the mount adapter only worked on the front fork of the bike that I got it for.'

'>> Title: meh'
'>> Review: Does it’s job and it’s gorgeous but mine is falling apart, I had to basically put it together again with hot glue'

'>> Title: Can\'t beat these for the money'
'>> Review: Bought this for handling miscellaneous aircraft parts and hanger "stuff" that I needed to organize; it really fit the bill. The unit arrived quickly, was well packaged and arrived intact (always a good sign). There are five wall mounts-- three on the top and two on the bottom. I wanted to mount it on the wall, so all I had to do was to remove the top two layers of plastic drawers, as well as the bottom corner drawers, place it when I wanted and mark it; I then used some of the new plastic screw in wall anchors (the 50 pound variety) and it easily mounted to the wall. Some have remarked that they wanted dividers for the drawers, and that they made those. Good idea. My application was that I needed something that I can see the contents at about eye level, so I wanted the fuller-sized drawers. I also like that these are the new plastic that doesn\'t get brittle and split like my older plastic drawers did. I like the all-plastic construction. It\'s heavy duty enough to hold metal parts, but being made of plastic it\'s not as heavy as a metal frame, so you can easily mount it to the wall and still load it up with heavy stuff, or light stuff. No problem there. For the money, you can\'t beat it. Best one of these I\'ve bought to date-- and I\'ve been using some version of these for over forty years.'

✏️ Try it out! Change the random seed in the Dataset.shuffle() command to explore other reviews in the corpus. If you're a Spanish speaker, take a look at some of the reviews in spanish_dataset to see if the titles also seem like reasonable summaries.

This sample shows the diversity of reviews one typically finds online, ranging from positive to negative (and everything in between!). Although the example with the “meh” title is not very informative, the other titles look like decent summaries of the reviews themselves. Training a summarization model on all 400,000 reviews would take far too long on a single GPU, so instead we'll focus on generating summaries for a single domain of products. To get a feel for which domains we can choose from, let's convert english_dataset to a pandas.DataFrame and compute the number of reviews per product category:

english_dataset.set_format("pandas")
english_df = english_dataset["train"][:]
# Show counts for the top 20 product categories
english_df["product_category"].value_counts()[:20]

home                      17679
apparel                   15951
wireless                  15717
other                     13418
beauty                    12091
drugstore                 11730
kitchen                   10382
toy                        8745
sports                     8277
automotive                 7506
lawn_and_garden            7327
home_improvement           7136
pet_products               7082
digital_ebook_purchase     6749
pc                         6401
electronics                6186
office_product             5521
shoes                      5197
grocery                    4730
book                       3756
Name: product_category, dtype: int64
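The same per-category count can be sketched without pandas, which may help clarify what value_counts() is doing. The snippet below is a minimal illustration; toy_rows is a hypothetical stand-in for the real training split, and only the product_category column is modeled.

```python
from collections import Counter

# Hypothetical rows standing in for english_dataset["train"];
# only the "product_category" column matters for this sketch.
toy_rows = [
    {"product_category": "home"},
    {"product_category": "book"},
    {"product_category": "home"},
    {"product_category": "digital_ebook_purchase"},
    {"product_category": "home"},
    {"product_category": "book"},
]

# Count how many rows fall into each category.
category_counts = Counter(row["product_category"] for row in toy_rows)

# most_common() orders categories by descending count,
# which is also the order value_counts() uses by default.
for category, count in category_counts.most_common(20):
    print(f"{category:<25}{count}")
```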

The most popular products in the English dataset are household items, clothing, and wireless electronics. To stick with the Amazon theme, though, let's focus on summarizing book reviews; after all, this is what the company was founded on! We can see two product categories that fit the bill (book and digital_ebook_purchase), so let's filter the datasets in both languages for just these products. As we saw in Chapter 5, the Dataset.filter() function allows us to slice a dataset very efficiently, so we can define a simple function to do this:

def filter_books(example):
    return (
        example["product_category"] == "book"
        or example["product_category"] == "digital_ebook_purchase"
    )

Now when we apply this function to english_dataset and spanish_dataset, the result will contain just those rows involving the book categories. Before applying the filter, let's switch the format of english_dataset from "pandas" back to "arrow":

english_dataset.reset_format()

Then we can apply the filter function, and as a sanity check let's inspect a sample of reviews to see if they are indeed about books:

spanish_books = spanish_dataset.filter(filter_books)
english_books = english_dataset.filter(filter_books)
show_samples(english_books)

'>> Title: I\'m dissapointed.'
'>> Review: I guess I had higher expectations for this book from the reviews. I really thought I\'d at least like it. The plot idea was great. I loved Ash but, it just didnt go anywhere. Most of the book was about their radio show and talking to callers. I wanted the author to dig deeper so we could really get to know the characters. All we know about Grace is that she is attractive looking, Latino and is kind of a brat. I\'m dissapointed.'

'>> Title: Good art, good price, poor design'
'>> Review: I had gotten the DC Vintage calendar the past two years, but it was on backorder forever this year and I saw they had shrunk the dimensions for no good reason. This one has good art choices but the design has the fold going through the picture, so it\'s less aesthetically pleasing, especially if you want to keep a picture to hang. For the price, a good calendar'

'>> Title: Helpful'
'>> Review: Nearly all the tips useful and. I consider myself an intermediate to advanced user of OneNote. I would highly recommend.'

Okay, we can see that the reviews are not strictly about books and might refer to things like calendars and electronic applications such as OneNote. Nevertheless, the domain seems about right to train a summarization model on. Before we look at the models that are suitable for this task, we have one last bit of data preparation to do: combining the English and Spanish reviews into a single DatasetDict object. 🤗 Datasets provides a handy concatenate_datasets() function that (as the name suggests) stacks two Dataset objects on top of each other. So, to create our bilingual dataset, we'll loop over each split, concatenate the datasets for that split, and shuffle the result so that our model doesn't overfit to a single language:

from datasets import concatenate_datasets, DatasetDict

books_dataset = DatasetDict()

for split in english_books.keys():
    books_dataset[split] = concatenate_datasets(
        [english_books[split], spanish_books[split]]
    )
    books_dataset[split] = books_dataset[split].shuffle(seed=42)

# Peek at a few examples
show_samples(books_dataset)

'>> Title: Easy to follow!!!!'
'>> Review: I loved The dash diet weight loss Solution. Never hungry. I would recommend this diet. Also the menus are well rounded. Try it. Has lots of the information need thanks.'

'>> Title: PARCIALMENTE DAÑADO'
'>> Review: Me llegó el día que tocaba, junto a otros libros que pedí, pero la caja llegó en mal estado lo cual dañó las esquinas de los libros porque venían sin protección (forro).'

'>> Title: no lo he podido descargar'
'>> Review: igual que el anterior'

This certainly looks like a mix of English and Spanish reviews! Now that we have a training corpus, one final thing to check is the distribution of words in the reviews and their titles. This is especially important for summarization tasks, where short reference summaries in the data can bias the model to output only one or two words in the generated summaries. The plots below show the word distributions, and we can see that the titles are heavily skewed toward just 1-2 words:

Word count distributions for the review titles and texts.

To deal with this, we'll filter out the examples with very short titles so that our model can produce more interesting summaries. Since we're dealing with English and Spanish texts, we can use a rough heuristic that splits the titles on whitespace, and then apply our trusty Dataset.filter() method as follows:

books_dataset = books_dataset.filter(lambda x: len(x["review_title"].split()) > 2)
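To make the effect of this heuristic concrete, here is a small sketch that applies the same whitespace rule to a handful of titles taken from the samples above. The helper name has_long_title and its min_words parameter are illustrative assumptions; the rule itself (keep titles with more than two whitespace-separated words) is the one used in the filter.

```python
def has_long_title(example, min_words=3):
    # Splitting on whitespace is only a rough word count, but it is
    # good enough for languages like English and Spanish.
    return len(example["review_title"].split()) >= min_words


# Titles copied from the review samples shown earlier.
toy_examples = [
    {"review_title": "meh"},
    {"review_title": "Helpful"},
    {"review_title": "PARCIALMENTE DAÑADO"},
    {"review_title": "no lo he podido descargar"},
    {"review_title": "Good art, good price, poor design"},
]

# Keep only the titles that pass the length check.
kept = [ex["review_title"] for ex in toy_examples if has_long_title(ex)]
print(kept)
```

Titles with one or two words are dropped, so the remaining targets give the model more to learn from than a single adjective.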

Now that we've prepared our corpus, let's take a look at a few Transformer models that are suitable for this task.

Models for text summarization

If you think about it, text summarization is a similar sort of task to machine translation: we have a body of text, like a review, that we'd like to “translate” into a shorter version that captures the salient features of the input. Accordingly, most Transformer models for summarization adopt the encoder-decoder architecture that we first encountered in Chapter 1, although there are some exceptions, like the GPT family of models, which can be used for summarization in few-shot settings. The following table lists some popular pretrained models that can be fine-tuned for summarization.

Transformer model	Description	Multilingual?
GPT-2	Although trained as an auto-regressive language model, GPT-2 can be made to generate summaries by appending “TL;DR” at the end of the input text.	❌
PEGASUS	Uses a pretraining objective to predict masked sentences in multi-sentence texts. This pretraining objective is closer to summarization than vanilla language modeling and scores highly on popular benchmarks.	❌
T5	A universal Transformer architecture that formulates all NLP tasks in a text-to-text framework; e.g., the input format for the model to summarize a document is `summarize: ARTICLE`.	❌
mT5	A multilingual version of T5, pretrained on the multilingual Common Crawl corpus (mC4), covering 101 languages.	✅
BART	A novel Transformer architecture with both an encoder and a decoder stack, trained to reconstruct corrupted input, that combines the pretraining schemes of BERT and GPT-2.	❌
mBART-50	A multilingual version of BART, pretrained on 50 languages.	✅

As you can see from this table, the majority of Transformer models for summarization (and indeed for most NLP tasks) are monolingual. This is great if your task is in a “high-resource” language like English or German, but less so for the thousands of other languages in use across the world. Fortunately, there is a class of multilingual Transformer models, like mT5 and mBART, that come to the rescue. These models are pretrained using language modeling, but with a twist: instead of training on a corpus of one language, they are trained jointly on texts in over 50 languages at once!

We'll focus on mT5, an interesting architecture based on T5 that was pretrained in a text-to-text framework. In T5, every NLP task is formulated in terms of a prompt prefix like summarize: which conditions the model to adapt the generated text to the prompt. As shown in the figure below, this makes T5 extremely versatile, as you can solve many tasks with a single model!

Different tasks performed by the T5 architecture.

mT5 doesn't use prefixes, but it shares much of the versatility of T5 and has the advantage of being multilingual. Now that we've picked a model, let's take a look at preparing our data for training.

✏️ Try it out! Once you've worked through this section, see how well mT5 compares to mBART by fine-tuning the latter with the same techniques. For bonus points, you can also try fine-tuning T5 on just the English reviews. Since T5 has a special prefix prompt, you'll need to prepend summarize: to the input examples in the preprocessing steps below.
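For the T5 variant of the exercise, prepending the prefix could look roughly like the sketch below. The function name add_summarize_prefix and the toy batch are illustrative assumptions; the batch layout (a dictionary mapping column names to lists) is the one that Dataset.map() passes to a function when batched=True.

```python
PREFIX = "summarize: "


def add_summarize_prefix(examples):
    # `examples` is a batch: each column name maps to a list of values.
    # Returning a dict with the same key overwrites that column.
    return {"review_body": [PREFIX + text for text in examples["review_body"]]}


# A toy batch built from review texts shown earlier in this section.
toy_batch = {
    "review_body": [
        "I loved reading the Hunger Games!",
        "igual que el anterior",
    ]
}

prefixed = add_summarize_prefix(toy_batch)
print(prefixed["review_body"][0])
```

The prefix has to be added before tokenization, so a function like this would run ahead of (or be folded into) the preprocessing function defined in the next part.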

Preprocessing the data

Our next task is to tokenize and encode our reviews and their titles. As usual, we begin by loading the tokenizer associated with the pretrained model checkpoint. We'll use mt5-small as our checkpoint so that we can fine-tune the model in a reasonable amount of time:

from transformers import AutoTokenizer

model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

💡 In the early stages of your NLP projects, a good practice is to train a class of “small” models on a small sample of data. This allows you to debug and iterate faster toward an end-to-end workflow. Once you are confident in the results, you can always scale up the model by simply changing the model checkpoint!

Let's test out the mT5 tokenizer on a small example:

inputs = tokenizer("I loved reading the Hunger Games!")
inputs

{'input_ids': [336, 259, 28387, 11807, 287, 62893, 295, 12507, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

Here we can see the familiar input_ids and attention_mask that we encountered in our first fine-tuning experiments back in Chapter 3. Let's decode these input IDs with the tokenizer's convert_ids_to_tokens() function to see what kind of tokenizer we're dealing with:

tokenizer.convert_ids_to_tokens(inputs.input_ids)

['▁I', '▁', 'loved', '▁reading', '▁the', '▁Hung', 'er', '▁Games', '</s>']

The special Unicode character ▁ and the end-of-sequence token </s> indicate that we're dealing with the SentencePiece tokenizer, which is based on the Unigram segmentation algorithm discussed in Chapter 6. Unigram is especially useful for multilingual corpora since it allows SentencePiece to be agnostic about accents, punctuation, and the fact that many languages, like Japanese, do not have whitespace characters.
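To see what the word-boundary marker does, here is a minimal sketch that turns SentencePiece-style pieces back into text by hand. It assumes the pieces use ▁ (U+2581) to mark the start of a word; the helper pieces_to_text is hypothetical, and in practice the tokenizer's own decode() method handles this for you.

```python
WORD_BOUNDARY = "\u2581"  # the "▁" character SentencePiece uses to mark word starts


def pieces_to_text(pieces, special_tokens=("</s>",)):
    # Drop special tokens, glue the pieces together, then turn each
    # word-boundary marker back into an ordinary space.
    kept = [piece for piece in pieces if piece not in special_tokens]
    return "".join(kept).replace(WORD_BOUNDARY, " ").strip()


pieces = ["▁I", "▁", "loved", "▁reading", "▁the", "▁Hung", "er", "▁Games", "</s>"]
text = pieces_to_text(pieces)
print(text)
```

Because whitespace is encoded inside the pieces themselves, the original spacing can be recovered without any language-specific rules, which is part of what makes this scheme convenient for multilingual corpora.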

To tokenize our corpus, we have to deal with a subtlety associated with summarization: because our labels are also text, it is possible that they exceed the model's maximum context size. This means we need to apply truncation to both the reviews and their titles to ensure we don't pass excessively long inputs to our model. The tokenizers in 🤗 Transformers provide a nifty text_target argument that allows you to tokenize the labels in parallel to the inputs. Here is an example of how the inputs and targets are processed for mT5:

max_input_length = 512
max_target_length = 30


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["review_body"],
        max_length=max_input_length,
        truncation=True,
    )
    # Tokenize the titles as targets via the text_target argument
    labels = tokenizer(
        text_target=examples["review_title"],
        max_length=max_target_length,
        truncation=True,
    )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

Let's walk through this code to understand what's happening. The first thing we've done is define values for max_input_length and max_target_length, which set the upper limits for how long our reviews and titles can be. Since the review body is typically much longer than the title, we've scaled these values accordingly.
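The effect of the two length limits can be illustrated with a toy example. The sketch below does not use the real mT5 tokenizer: toy_tokenize is a hypothetical stand-in that treats each whitespace-separated word as one token and appends an end-of-sequence marker, and the small max_length values are chosen only so the truncation is visible.

```python
def toy_tokenize(text, max_length):
    # Hypothetical stand-in for a real tokenizer: one token per word.
    tokens = text.split()
    # Leave room for the end-of-sequence marker, so the total
    # length never exceeds max_length.
    tokens = tokens[: max_length - 1]
    return tokens + ["</s>"]


review = "I had gotten the DC Vintage calendar the past two years"
title = "Good art, good price, poor design"

# Different limits for inputs and targets, as in preprocess_function().
inputs = toy_tokenize(review, max_length=8)
labels = toy_tokenize(title, max_length=4)

print(inputs)
print(labels)
```

Anything beyond the limit is simply cut off, which is why the limits should be generous enough that most reviews and titles fit without losing important content.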

With preprocess_function(), it is then a simple matter to tokenize the whole corpus using the handy Dataset.map() function we've used extensively throughout this course:

tokenized_datasets = books_dataset.map(preprocess_function, batched=True)

Now that the corpus has been preprocessed, let's take a look at some metrics that are commonly used for summarization. As we'll see, there is no silver bullet when it comes to measuring the quality of machine-generated text.

💡 You may have noticed that we used batched=True in our Dataset.map() function above. This encodes the examples in batches of 1,000 (the default) and allows you to make use of the multithreading capabilities of the fast tokenizers in 🤗 Transformers. Where possible, try using batched=True to get the most out of your preprocessing!
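Conceptually, batched processing hands the function slices of each column rather than one row at a time. The following sketch mimics that slicing in plain Python; iter_batches and the toy columns are illustrative assumptions, not part of 🤗 Datasets.

```python
def iter_batches(columns, batch_size=1000):
    # `columns` maps column names to equally long lists of values.
    num_rows = len(next(iter(columns.values())))
    for start in range(0, num_rows, batch_size):
        # Each batch keeps the same keys but holds a slice of every column.
        yield {
            name: values[start : start + batch_size]
            for name, values in columns.items()
        }


# Hypothetical dataset with 2,500 rows.
toy_columns = {
    "review_title": [f"title {i}" for i in range(2500)],
    "review_body": [f"body {i}" for i in range(2500)],
}

batch_sizes = [len(batch["review_title"]) for batch in iter_batches(toy_columns)]
print(batch_sizes)
```

With a batch size of 1,000, the function is called three times here instead of 2,500 times, and each call gives the tokenizer a whole list of texts to process at once.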

Metrics for text summarization

In comparison to the other tasks we've covered in this course, measuring the performance of text generation tasks like summarization or translation is not as straightforward. For example, given a review like “I loved reading the Hunger Games”, there are multiple valid summaries, like “I loved the Hunger Games” or “Hunger Games is a great read”. Clearly, applying some sort of exact match between the generated summary and the label is not a good solution; even humans would fare poorly under such a metric, because we all have our own writing style.

One of the most commonly used metrics for summarization is the ROUGE score (short for Recall-Oriented Understudy for Gisting Evaluation). The basic idea behind this metric is to compare a generated summary against a set of reference summaries that are typically created by humans. To make this more precise, suppose we want to compare the following two summaries:

generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"

One way to compare them could be to count the number of overlapping words, which in this case would be 6. However, this is a bit crude, so instead ROUGE is based on computing the precision and recall scores for the overlap.

🙋 Don't worry if this is the first time you've heard of precision and recall; we'll go through some explicit examples together to make it all clear. These metrics are usually encountered in classification tasks, so if you want to understand how precision and recall are defined in that context, we recommend checking out the scikit-learn guides.

For ROUGE, recall measures how much of the reference summary is captured by the generated one. If we are just comparing words, recall can be calculated according to the following formula: $\mathrm{Recall} = \frac{\mathrm{Number\,of\,overlapping\, words}}{\mathrm{Total\, number\, of\, words\, in\, reference\, summary}}$

For our simple example above, this formula gives a perfect recall of 6/6 = 1; i.e., all the words in the reference summary have been produced by the model. This may sound great, but imagine if our generated summary had been “I really really loved reading the Hunger Games all night”. This would also have perfect recall, but it is arguably a worse summary since it is verbose. To deal with these scenarios we also compute the precision, which in the ROUGE context measures how much of the generated summary was relevant: $\mathrm{Precision} = \frac{\mathrm{Number\,of\,overlapping\, words}}{\mathrm{Total\, number\, of\, words\, in\, generated\, summary}}$

Applying this to our verbose summary gives a precision of 6/10 = 0.6, which is considerably worse than the precision of 6/7 = 0.86 obtained by our shorter one. In practice, both precision and recall are usually computed, and then the F1-score (the harmonic mean of precision and recall) is reported. We can do this easily in 🤗 Evaluate by first installing the rouge_score package:

!pip install rouge_score

and then loading the ROUGE metric as follows:

import evaluate

rouge_score = evaluate.load("rouge")

Then we can use the rouge_score.compute() function to calculate all the metrics at once:

scores = rouge_score.compute(
    predictions=[generated_summary], references=[reference_summary]
)
scores

{'rouge1': AggregateScore(low=Score(precision=0.86, recall=1.0, fmeasure=0.92), mid=Score(precision=0.86, recall=1.0, fmeasure=0.92), high=Score(precision=0.86, recall=1.0, fmeasure=0.92)),
 'rouge2': AggregateScore(low=Score(precision=0.67, recall=0.8, fmeasure=0.73), mid=Score(precision=0.67, recall=0.8, fmeasure=0.73), high=Score(precision=0.67, recall=0.8, fmeasure=0.73)),
 'rougeL': AggregateScore(low=Score(precision=0.86, recall=1.0, fmeasure=0.92), mid=Score(precision=0.86, recall=1.0, fmeasure=0.92), high=Score(precision=0.86, recall=1.0, fmeasure=0.92)),
 'rougeLsum': AggregateScore(low=Score(precision=0.86, recall=1.0, fmeasure=0.92), mid=Score(precision=0.86, recall=1.0, fmeasure=0.92), high=Score(precision=0.86, recall=1.0, fmeasure=0.92))}

Whoa, there's a lot of information in that output; what does it all mean? First, 🤗 Evaluate actually computes confidence intervals for precision, recall, and F1-score; these are the low, mid, and high attributes you can see here. Moreover, it calculates a variety of ROUGE scores that are based on different types of text granularity when comparing the generated and reference summaries. The rouge1 variant is the overlap of unigrams; this is just a fancy way of saying the overlap of words, and it is exactly the metric we've discussed above. To verify this, let's pull out the mid value of our scores:

scores["rouge1"].mid

Score(precision=0.86, recall=1.0, fmeasure=0.92)
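As a cross-check, the word-overlap precision, recall, and F1-score can also be computed by hand from the formulas above. This is a minimal sketch under simplifying assumptions: the helper names ngrams and rouge_n are hypothetical, and the text is only lowercased and split on whitespace, which is cruder than the tokenization the rouge_score package applies.

```python
from collections import Counter


def ngrams(tokens, n):
    # Count every run of n consecutive tokens.
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(generated, reference, n=1):
    gen = ngrams(generated.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Counter intersection keeps the minimum count of each shared n-gram,
    # so repeated words are not counted more often than they appear in both.
    overlap = sum((gen & ref).values())
    precision = overlap / max(sum(gen.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    # F1 is the harmonic mean of precision and recall.
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"

precision, recall, f1 = rouge_n(generated_summary, reference_summary, n=1)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

With n=1 this reproduces the word-level numbers discussed above; passing a larger n applies the same counting to longer word sequences.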

Great, the precision and recall numbers match up! Now what about those other ROUGE scores? rouge2 measures the overlap between bigrams (think the overlap of pairs of words), while rougeL and rougeLsum measure the longest matching sequences of words by looking for the longest common subsequence in the generated and reference summaries. The “sum” in rougeLsum refers to the fact that this metric is computed over a whole summary, while rougeL is computed as the average over individual sentences.
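The idea behind the longest-match variants can be sketched with a standard dynamic-programming routine that finds the longest sequence of words appearing in the same order in both summaries. The function lcs_length is a hypothetical helper, and the simple whitespace split is an assumption; the rouge_score package applies its own tokenization.

```python
def lcs_length(a, b):
    # prev[j] holds the longest common subsequence length between the
    # tokens of `a` processed so far and the first j tokens of `b`.
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the best match that ended just before both tokens.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of skipping x or skipping y.
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


generated = "I absolutely loved reading the Hunger Games".split()
reference = "I loved reading the Hunger Games".split()

lcs = lcs_length(generated, reference)
lcs_precision = lcs / len(generated)
lcs_recall = lcs / len(reference)
print(lcs, round(lcs_precision, 2), round(lcs_recall, 2))
```

Unlike bigram overlap, the matched words do not need to be adjacent, only in the same order, so the extra word “absolutely” lowers precision but does not break the match.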

✏️ Try it out! Create your own example of a generated and reference summary and see if the resulting ROUGE scores agree with a manual calculation based on the formulas for precision and recall. For bonus points, split the text into bigrams and compare the precision and recall for the rouge2 metric.

We'll use these ROUGE scores to track the performance of our model, but before doing that let's do something every good NLP practitioner should do: create a strong, yet simple baseline!

Creating a strong baseline

A common baseline for text summarization is to simply take the first three sentences of an article, often called the lead-3 baseline. We could use full stops to track the sentence boundaries, but this will fail on acronyms like “U.S.” or “U.N.”, so instead we'll use the nltk library, which includes a better algorithm for handling these cases. You can install the package using pip as follows:

!pip install nltk

and then download the punctuation rules:

import nltk

nltk.download("punkt")
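To see why splitting on full stops alone is fragile, consider the following sketch. The helper naive_sentence_split and the example text are made up for illustration; the point is only that abbreviations such as “U.S.” produce spurious sentence boundaries.

```python
def naive_sentence_split(text):
    # Treat every full stop followed by a space as a sentence boundary.
    return [part.strip() for part in text.split(". ") if part.strip()]


# Hypothetical text containing three sentences and two abbreviations.
text = "The U.S. economy grew last year. Analysts at the U.N. agreed. Prices rose."

naive = naive_sentence_split(text)
for part in naive:
    print(repr(part))
```

The text contains three sentences, but the naive rule yields five fragments because it also cuts after “U.S.” and “U.N.”. A trained sentence tokenizer is designed to cope better with cases like these.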

Next, we import the sentence tokenizer from nltk and create a simple function to extract the first three sentences in a review. The convention in text summarization is to separate the sentences of each summary with a newline, so let's also include this and test it on a training example:

from nltk.tokenize import sent_tokenize


def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])


print(three_sentence_summary(books_dataset["train"][1]["review_body"]))

'I grew up reading Koontz, and years ago, I stopped,convinced i had "outgrown" him.'
'Still,when a friend was looking for something suspenseful too read, I suggested Koontz.'
'She found Strangers.'

That seems to work, so let's now implement a function that extracts these “summaries” from a dataset and computes the ROUGE scores for the baseline:

def evaluate_baseline(dataset, metric):
    summaries = [three_sentence_summary(text) for text in dataset["review_body"]]
    return metric.compute(predictions=summaries, references=dataset["review_title"])

We can then use this function to compute the ROUGE scores over the validation set and prettify them a bit using Pandas:

import pandas as pd

score = evaluate_baseline(books_dataset["validation"], rouge_score)
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_dict = dict((rn, round(score[rn].mid.fmeasure * 100, 2)) for rn in rouge_names)
rouge_dict

{'rouge1': 16.74, 'rouge2': 8.83, 'rougeL': 15.6, 'rougeLsum': 15.96}

We can see that the rouge2 score is significantly lower than the rest; this likely reflects the fact that review titles are typically concise, so the lead-3 baseline is too verbose. Now that we have a good baseline to work from, let's turn our attention toward fine-tuning mT5!

Fine-tuning mT5 with the Trainer API

Fine-tuning a model for summarization is very similar to the other tasks we've covered in this chapter. The first thing we need to do is load the pretrained model from the mt5-small checkpoint. Since summarization is a sequence-to-sequence task, we can load the model with the AutoModelForSeq2SeqLM class, which will automatically download and cache the weights:

from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

💡 If you're wondering why you don't see any warnings about fine-tuning the model on a downstream task, that's because for sequence-to-sequence tasks we keep all the weights of the network. Compare this to our text classification model in Chapter 3, where the head of the pretrained model was replaced with a randomly initialized network.

The next thing we need to do is log in to the Hugging Face Hub. If you're running this code in a notebook, you can do so with the following utility function:

from huggingface_hub import notebook_login

notebook_login()

which will display a widget where you can enter your credentials. Alternatively, you can run this command in your terminal and log in there:

huggingface-cli login

We'll need to generate summaries in order to compute ROUGE scores during training. Fortunately, 🤗 Transformers provides dedicated Seq2SeqTrainingArguments and Seq2SeqTrainer classes that can do this for us automatically! To see how this works, let's first define the hyperparameters and other arguments for our experiments:

from transformers import Seq2SeqTrainingArguments

batch_size = 8
num_train_epochs = 8
# Show the training loss with every epoch
logging_steps = len(tokenized_datasets["train"]) // batch_size
model_name = model_checkpoint.split("/")[-1]

args = Seq2SeqTrainingArguments(
    output_dir=f"{model_name}-finetuned-amazon-en-es",
    evaluation_strategy="epoch",
    learning_rate=5.6e-5,
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=batch_size,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=num_train_epochs,
    predict_with_generate=True,
    logging_steps=logging_steps,
    push_to_hub=True,
)

Here, the predict_with_generate argument has been set to indicate that we should generate summaries during evaluation, so that we can compute ROUGE scores for each epoch. As discussed in Chapter 1, the decoder performs inference by predicting tokens one by one, and this is implemented by the model's generate() method. Setting predict_with_generate=True tells the Seq2SeqTrainer to use that method for evaluation. We've also adjusted some of the default hyperparameters, like the learning rate, number of epochs, and weight decay, and we've set the save_total_limit option to save at most 3 checkpoints during training. This is because even the "small" version of mT5 uses around a GB of hard drive space, so limiting the number of copies we save frees up a bit of room.
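To make the "one token at a time" idea concrete, here is a minimal sketch of greedy decoding, the simplest generation strategy (generate() also supports others, such as beam search). The `TOY_TABLE` lookup and the `toy_scores` function are hypothetical stand-ins for a real decoder, and the token IDs are only illustrative; the sketch assumes, as mT5 does, that ID 0 starts the sequence and ID 1 is the end-of-sequence token.

```python
from typing import Callable, Dict, List

PAD_ID = 0  # assumed decoder start token (mT5 starts decoding from the pad ID)
EOS_ID = 1  # assumed end-of-sequence ID (</s> in mT5)


def greedy_generate(
    next_token_scores: Callable[[List[int]], Dict[int, float]],
    max_new_tokens: int = 10,
) -> List[int]:
    """Generate token IDs one at a time, always keeping the highest-scoring one."""
    generated = [PAD_ID]
    for _ in range(max_new_tokens):
        scores = next_token_scores(generated)
        best = max(scores, key=scores.get)
        generated.append(best)
        if best == EOS_ID:  # stop as soon as the model emits end-of-sequence
            break
    return generated[1:]  # drop the start token


# Hypothetical stand-in for a decoder: scores depend only on the last token.
TOY_TABLE = {
    0: {7483: 0.9, 259: 0.1},
    7483: {259: 0.8, 1: 0.2},
    259: {2364: 0.7, 1: 0.3},
    2364: {1: 0.6, 259: 0.4},
}


def toy_scores(prefix: List[int]) -> Dict[int, float]:
    return TOY_TABLE[prefix[-1]]


summary_ids = greedy_generate(toy_scores)
print(summary_ids)
```

A real model scores every token in its vocabulary at each step and conditions on the encoded review as well as on the tokens generated so far; the loop structure, however, is the same.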

The push_to_hub=True argument will allow us to push the model to the Hub after training; you'll find the repository under your user profile in the location defined by output_dir. Note that you can specify the name of the repository you want to push to with the hub_model_id argument (in particular, you will have to use this argument to push to an organization). For instance, when we pushed the model to the huggingface-course organization, we added hub_model_id="huggingface-course/mt5-finetuned-amazon-en-es" to Seq2SeqTrainingArguments.

The next thing we need to do is provide the trainer with a compute_metrics() function so that we can evaluate our model during training. For summarization this is a bit more involved than simply calling rouge_score.compute() on the model's predictions, since we need to decode the outputs and labels into text before we can compute the ROUGE scores. The following function does exactly that, and also uses the sent_tokenize() function from nltk to separate the summary sentences with newlines:

import numpy as np
from nltk.tokenize import sent_tokenize


def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    # Decode the generated summaries into text
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    # Replace -100 in the labels, since we can't decode them
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    # Decode the reference summaries into text
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ROUGE expects a newline after each sentence
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]
    # Compute the ROUGE scores
    result = rouge_score.compute(
        predictions=decoded_preds, references=decoded_labels, use_stemmer=True
    )
    # Extract the median scores
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    return {k: round(v, 4) for k, v in result.items()}
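To give a feel for what the numbers returned by compute_metrics() measure, here is a simplified sketch of a ROUGE-1 F-measure. It is not the implementation used by the rouge_score package: it assumes plain whitespace tokenization and lowercasing, with no stemming, and the function name `rouge1_f` and the example sentences are made up for illustration.

```python
from collections import Counter


def rouge1_f(prediction: str, reference: str) -> float:
    """Simplified ROUGE-1: F-measure over overlapping unigrams."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)  # share of the prediction that is in the reference
    recall = overlap / len(ref_tokens)      # share of the reference that was recovered
    return 2 * precision * recall / (precision + recall)


score = rouge1_f(
    "I loved reading the Hunger Games",
    "I absolutely loved reading the Hunger Games",
)
print(round(score * 100, 4))
```

Here all 6 predicted words appear in the 7-word reference, so precision is 1, recall is 6/7, and the F-measure is 12/13. The multiplication by 100 mirrors the scaling applied in compute_metrics().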

Next, we need to define a data collator for our sequence-to-sequence task. Since mT5 is an encoder-decoder Transformer model, one subtlety with preparing our batches is that during decoding we need to shift the labels to the right by one. This ensures that the decoder only sees the previous ground truth labels and not the current or future ones, which would be easy for the model to memorize. This is similar to how masked self-attention is applied to the inputs in a task like causal language modeling.
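The shift itself is a small operation. The sketch below shows the idea on a plain Python list; `shift_right` is a hypothetical helper written for illustration (the library performs the equivalent step on tensors), and it assumes the mT5 convention that the padding ID 0 doubles as the decoder start token.

```python
from typing import List

PAD_ID = 0        # assumed pad ID, also used as the decoder start token
IGNORE_ID = -100  # label value that the loss function skips


def shift_right(labels: List[int]) -> List[int]:
    """Build decoder inputs: prepend the start token and drop the last label."""
    shifted = [PAD_ID] + labels[:-1]
    # -100 is only meaningful for the loss, so use the pad ID in the inputs
    return [PAD_ID if token == IGNORE_ID else token for token in shifted]


labels = [7483, 259, 2364, 15695, 1, -100]
decoder_input_ids = shift_right(labels)
print(decoder_input_ids)
```

At position i the decoder therefore receives label i-1 as input and is asked to predict label i, so it never sees the token it is supposed to produce.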

Luckily, 🤗 Transformers provides a DataCollatorForSeq2Seq collator that will dynamically pad the inputs and the labels for us. To instantiate this collator, we simply need to provide the tokenizer and model:

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

Let's see what this collator produces when fed a small batch of examples. First, we need to remove the columns with strings, because the collator won't know how to pad these elements:

tokenized_datasets = tokenized_datasets.remove_columns(
    books_dataset["train"].column_names
)

Since the collator expects a list of dicts, where each dict represents a single example in the dataset, we also need to wrangle the data into the expected format before passing it to the data collator:

features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator(features)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]), 'input_ids': tensor([[  1494,    259,   8622,    390,    259,    262,   2316,   3435,    955,
            772,    281,    772,   1617,    263,    305,  14701,    260,   1385,
           3031,    259,  24146,    332,   1037,    259,  43906,    305,    336,
            260,      1,      0,      0,      0,      0,      0,      0],
        [   259,  27531,  13483,    259,   7505,    260, 112240,  15192,    305,
          53198,    276,    259,  74060,    263,    260,    459,  25640,    776,
           2119,    336,    259,   2220,    259,  18896,    288,   4906,    288,
           1037,   3931,    260,   7083, 101476,   1143,    260,      1]]), 'labels': tensor([[ 7483,   259,  2364, 15695,     1,  -100],
        [  259, 27531, 13483,   259,  7505,     1]]), 'decoder_input_ids': tensor([[    0,  7483,   259,  2364, 15695,     1],
        [    0,   259, 27531, 13483,   259,  7505]])}

The main thing to notice here is that the first example is shorter than the second one, so the input_ids and attention_mask of the first example have been padded on the right with the padding token (whose ID is 0). Similarly, we can see that the labels have been padded with -100s, to make sure the padding tokens are ignored by the loss function. And finally, we can see a new decoder_input_ids entry, which has shifted the labels to the right by inserting a padding token in the first position.
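The padding behaviour described above can be sketched in a few lines of plain Python. This is an illustration only, not the DataCollatorForSeq2Seq implementation: `pad_batch` is a hypothetical helper, the examples are shortened, and it assumes right-padding with ID 0 for inputs and -100 for labels.

```python
from typing import Dict, List


def pad_batch(
    features: List[Dict[str, List[int]]], pad_id: int = 0, label_pad_id: int = -100
) -> Dict[str, List[List[int]]]:
    """Right-pad inputs and labels to the longest example in this batch."""
    max_input = max(len(f["input_ids"]) for f in features)
    max_label = max(len(f["labels"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        n_pad = max_input - len(f["input_ids"])
        batch["input_ids"].append(f["input_ids"] + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should not attend to
        batch["attention_mask"].append([1] * len(f["input_ids"]) + [0] * n_pad)
        # -100 marks label positions the loss should skip
        batch["labels"].append(
            f["labels"] + [label_pad_id] * (max_label - len(f["labels"]))
        )
    return batch


features = [
    {"input_ids": [1494, 259, 1], "labels": [7483, 1]},
    {"input_ids": [259, 27531, 13483, 260, 1], "labels": [259, 7505, 1]},
]
batch = pad_batch(features)
print(batch)
```

Because the target length is computed per batch rather than fixed in advance, a batch of short reviews wastes far less computation on padding than padding everything to the maximum length would.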

We finally have all the ingredients we need to train with! We now simply need to instantiate the trainer with the standard arguments:

from transformers import Seq2SeqTrainer

trainer = Seq2SeqTrainer(
    model,
    args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

Then launch the training run:

trainer.train()

During training, you should see the training loss decrease and the ROUGE scores increase with each epoch. Once training is complete, you can see the final ROUGE scores by running Trainer.evaluate():

trainer.evaluate()

{'eval_loss': 3.028524398803711,
 'eval_rouge1': 16.9728,
 'eval_rouge2': 8.2969,
 'eval_rougeL': 16.8366,
 'eval_rougeLsum': 16.851,
 'eval_gen_len': 10.1597,
 'eval_runtime': 6.1054,
 'eval_samples_per_second': 38.982,
 'eval_steps_per_second': 4.914}

From the scores we can see that our model has handily outperformed our lead-3 baseline. Nice! The final thing to do is push the model weights to the Hub, as follows:

trainer.push_to_hub(commit_message="Training complete", tags="summarization")

'https://huggingface.co/huggingface-course/mt5-small-finetuned-amazon-en-es/commit/aa0536b829b28e73e1e4b94b8a5aacec420d40e0'

This will save the checkpoint and configuration files to output_dir and then upload all the files to the Hub. By specifying the tags argument, we also ensure that the widget on the Hub will be one for a summarization pipeline instead of the default text generation one associated with the mT5 architecture (for more information about model tags, see the 🤗 Hub documentation). The output from trainer.push_to_hub() is a URL to the Git commit hash, so you can easily see the changes that were made to the model repository!

To wrap up this section, let's take a look at how we can also fine-tune mT5 using the low-level features provided by 🤗 Accelerate.

Fine-tuning mT5 with 🤗 Accelerate

Fine-tuning our model with 🤗 Accelerate is very similar to the text classification example we encountered in Chapter 3. The main differences are that we need to explicitly generate our summaries during training and define how we compute the ROUGE scores (recall that the Seq2SeqTrainer took care of the generation for us). Let's take a look at how we can implement these two requirements within 🤗 Accelerate!

Preparing everything for training

The first thing we need to do is create a DataLoader for each of our splits. Since the PyTorch dataloaders expect batches of tensors, we need to set the format to "torch" in our datasets:

tokenized_datasets.set_format("torch")

Now that we've got datasets consisting of just tensors, the next thing to do is instantiate the DataCollatorForSeq2Seq again. For this we need to provide a fresh version of the model, so let's load it again from our cache:

model = AutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

We can then instantiate the data collator and use it to define our dataloaders:

from torch.utils.data import DataLoader

# Re-create the collator so that it is tied to the freshly loaded model
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

batch_size = 8
train_dataloader = DataLoader(
    tokenized_datasets["train"],
    shuffle=True,
    collate_fn=data_collator,
    batch_size=batch_size,
)
eval_dataloader = DataLoader(
    tokenized_datasets["validation"], collate_fn=data_collator, batch_size=batch_size
)

The next thing to do is define the optimizer we want to use. As in our other examples, we'll use AdamW, which works well for most problems:

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=2e-5)

Finally, we feed our model, optimizer, and dataloaders to the accelerator.prepare() method:

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_dataloader, eval_dataloader = accelerator.prepare(
    model, optimizer, train_dataloader, eval_dataloader
)

🚨 If you're training on a TPU, you'll need to move all the code above into a dedicated training function. See Chapter 3 for more details.

Now that we've prepared our objects, there are three remaining things to do:

Define the learning rate schedule.
Implement a function to post-process the summaries for evaluation.
Create a repository on the Hub that we can push our model to.

For the learning rate schedule, we'll use the standard linear one from previous sections:

from transformers import get_scheduler

num_train_epochs = 10
num_update_steps_per_epoch = len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

lr_scheduler = get_scheduler(
    "linear",
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps,
)
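As a rough illustration of what a linear schedule does to the learning rate, the sketch below computes the value at a given step by hand. `linear_lr` is a hypothetical helper, not a library function, and the figure of 1,000 batches per epoch is an assumption chosen to keep the arithmetic simple; the real value is len(train_dataloader).

```python
def linear_lr(
    step: int, base_lr: float, num_training_steps: int, num_warmup_steps: int = 0
) -> float:
    """Learning rate after `step` updates under linear warmup then linear decay."""
    if step < num_warmup_steps:
        # Ramp up from 0 to base_lr during warmup
        return base_lr * step / max(1, num_warmup_steps)
    # Decay linearly from base_lr down to 0 at the last step
    remaining = num_training_steps - step
    return base_lr * max(0.0, remaining / max(1, num_training_steps - num_warmup_steps))


num_train_epochs = 10
num_update_steps_per_epoch = 1000  # assumed; stands in for len(train_dataloader)
num_training_steps = num_train_epochs * num_update_steps_per_epoch

for step in (0, 5000, 10000):
    print(step, linear_lr(step, base_lr=2e-5, num_training_steps=num_training_steps))
```

With no warmup, the learning rate starts at the optimizer's 2e-5, has halved by the midpoint of training, and reaches zero on the final step.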

For post-processing, we need a function that splits the generated summaries into sentences that are separated by newlines. This is the format the ROUGE metric expects, and we can achieve this with the following snippet of code:

import nltk


def postprocess_text(preds, labels):
    preds = [pred.strip() for pred in preds]
    labels = [label.strip() for label in labels]

    # ROUGE expects a newline after each sentence
    preds = ["\n".join(nltk.sent_tokenize(pred)) for pred in preds]
    labels = ["\n".join(nltk.sent_tokenize(label)) for label in labels]

    return preds, labels

This should look familiar if you recall how we defined the compute_metrics() function of the Seq2SeqTrainer.
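To see what the newline-separated format looks like without downloading any nltk data, here is a small sketch that uses a regular expression as a rough stand-in for nltk.sent_tokenize(). The names `naive_sent_tokenize` and `to_rouge_format` are hypothetical, and the regex is much cruder than nltk's tokenizer (it will mis-split abbreviations such as "e.g."), so treat it as an illustration of the format only.

```python
import re
from typing import List


def naive_sent_tokenize(text: str) -> List[str]:
    """Rough stand-in for nltk.sent_tokenize: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def to_rouge_format(text: str) -> str:
    """Put each sentence on its own line."""
    return "\n".join(naive_sent_tokenize(text))


formatted = to_rouge_format("  Great book. Would buy again!  ")
print(formatted)
```

Most review titles consist of a single sentence, in which case the text passes through unchanged apart from the stripped whitespace.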

Finally, we need to create a model repository on the Hugging Face Hub. For this, we can use the appropriately titled 🤗 Hub library. We just need to define a name for our repository, and the library has a utility function to combine the repository ID with the user profile:

from huggingface_hub import get_full_repo_name

model_name = "mt5-finetuned-amazon-en-es-accelerate"
repo_name = get_full_repo_name(model_name)
repo_name

'lewtun/mt5-finetuned-amazon-en-es-accelerate'

Now we can use this repository name to clone a local version into our results directory, which will store the training artifacts:

from huggingface_hub import Repository

output_dir = "results-mt5-finetuned-squad-accelerate"
repo = Repository(output_dir, clone_from=repo_name)

This will allow us to push the artifacts back to the Hub by calling the repo.push_to_hub() method during training! Let's now wrap up our analysis by writing out the training loop.

Training loop

The training loop for summarization is quite similar to the other 🤗 Accelerate examples that we've encountered and is roughly split into four main steps:

1. Train the model by iterating over all the examples in train_dataloader for each epoch.
2. Generate model summaries at the end of each epoch, by first generating the tokens and then decoding them (and the reference summaries) into text.
3. Compute the ROUGE scores using the same techniques we saw earlier.
4. Save the checkpoints and push everything to the Hub. Here we rely on the nifty blocking=False argument of the Repository object so that we can push the checkpoints for each epoch asynchronously. This allows us to continue training without having to wait for the somewhat slow upload associated with a GB-sized model.

These steps can be seen in the following block of code:

from tqdm.auto import tqdm
import torch
import numpy as np

progress_bar = tqdm(range(num_training_steps))

for epoch in range(num_train_epochs):
    # Training
    model.train()
    for step, batch in enumerate(train_dataloader):
        outputs = model(**batch)
        loss = outputs.loss
        accelerator.backward(loss)

        optimizer.step()
        lr_scheduler.step()
        optimizer.zero_grad()
        progress_bar.update(1)

    # Evaluation
    model.eval()
    for step, batch in enumerate(eval_dataloader):
        with torch.no_grad():
            generated_tokens = accelerator.unwrap_model(model).generate(
                batch["input_ids"],
                attention_mask=batch["attention_mask"],
            )

            generated_tokens = accelerator.pad_across_processes(
                generated_tokens, dim=1, pad_index=tokenizer.pad_token_id
            )
            labels = batch["labels"]

            # If we did not pad to max length, we need to pad the labels too
            labels = accelerator.pad_across_processes(
                batch["labels"], dim=1, pad_index=tokenizer.pad_token_id
            )

            generated_tokens = accelerator.gather(generated_tokens).cpu().numpy()
            labels = accelerator.gather(labels).cpu().numpy()

            # Replace -100 in the labels, since we can't decode them
            labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
            if isinstance(generated_tokens, tuple):
                generated_tokens = generated_tokens[0]
            decoded_preds = tokenizer.batch_decode(
                generated_tokens, skip_special_tokens=True
            )
            decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)

            decoded_preds, decoded_labels = postprocess_text(
                decoded_preds, decoded_labels
            )

            rouge_score.add_batch(predictions=decoded_preds, references=decoded_labels)

    # Compute the metrics
    result = rouge_score.compute()
    # Extract the median ROUGE scores
    result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
    result = {k: round(v, 4) for k, v in result.items()}
    print(f"Epoch {epoch}:", result)

    # Save and upload
    accelerator.wait_for_everyone()
    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(output_dir, save_function=accelerator.save)
    if accelerator.is_main_process:
        tokenizer.save_pretrained(output_dir)
        repo.push_to_hub(
            commit_message=f"Training in progress epoch {epoch}", blocking=False
        )

Epoch 0: {'rouge1': 5.6351, 'rouge2': 1.1625, 'rougeL': 5.4866, 'rougeLsum': 5.5005}
Epoch 1: {'rouge1': 9.8646, 'rouge2': 3.4106, 'rougeL': 9.9439, 'rougeLsum': 9.9306}
Epoch 2: {'rouge1': 11.0872, 'rouge2': 3.3273, 'rougeL': 11.0508, 'rougeLsum': 10.9468}
Epoch 3: {'rouge1': 11.8587, 'rouge2': 4.8167, 'rougeL': 11.7986, 'rougeLsum': 11.7518}
Epoch 4: {'rouge1': 12.9842, 'rouge2': 5.5887, 'rougeL': 12.7546, 'rougeLsum': 12.7029}
Epoch 5: {'rouge1': 13.4628, 'rouge2': 6.4598, 'rougeL': 13.312, 'rougeLsum': 13.2913}
Epoch 6: {'rouge1': 12.9131, 'rouge2': 5.8914, 'rougeL': 12.6896, 'rougeLsum': 12.5701}
Epoch 7: {'rouge1': 13.3079, 'rouge2': 6.2994, 'rougeL': 13.1536, 'rougeLsum': 13.1194}
Epoch 8: {'rouge1': 13.96, 'rouge2': 6.5998, 'rougeL': 13.9123, 'rougeLsum': 13.7744}
Epoch 9: {'rouge1': 14.1192, 'rouge2': 7.0059, 'rougeL': 14.1172, 'rougeLsum': 13.9509}

And that's it! Once you run this, you'll have a model and results that are pretty similar to the ones we obtained with the Trainer.
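One step of the evaluation loop that is easy to overlook is the padding done before gathering. Generated summaries can have different lengths on different processes, and results can only be stacked into one array when every row has the same length. The sketch below mimics that idea with plain lists; `pad_to_common_length` and `gather_all` are hypothetical helpers written for illustration, not 🤗 Accelerate functions, and the two "processes" are simulated.

```python
from typing import List

Batch = List[List[int]]


def pad_to_common_length(per_process: List[Batch], pad_index: int) -> List[Batch]:
    """Right-pad every sequence to the longest length found on any process."""
    max_len = max(len(seq) for batch in per_process for seq in batch)
    return [
        [seq + [pad_index] * (max_len - len(seq)) for seq in batch]
        for batch in per_process
    ]


def gather_all(per_process: List[Batch]) -> Batch:
    """Concatenate the per-process batches into one rectangular batch."""
    return [seq for batch in per_process for seq in batch]


# Two simulated processes whose generated summaries differ in length
process_outputs = [
    [[0, 7483, 259, 1]],
    [[0, 259, 27531, 13483, 7505, 1]],
]
padded = pad_to_common_length(process_outputs, pad_index=0)
gathered = gather_all(padded)
print(gathered)
```

The extra padding tokens are harmless for the metric, because batch_decode() is called with skip_special_tokens=True and drops them again.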

Using your fine-tuned model

Once you've pushed the model to the Hub, you can play with it either via the inference widget or with a pipeline object, as follows:

from transformers import pipeline

hub_model_id = "huggingface-course/mt5-small-finetuned-amazon-en-es"
summarizer = pipeline("summarization", model=hub_model_id)

We can feed some examples from the test set (which the model has not seen) to our pipeline to get a feel for the quality of the summaries. First let's implement a simple function to show the review, title, and generated summary together:

def print_summary(idx):
    review = books_dataset["test"][idx]["review_body"]
    title = books_dataset["test"][idx]["review_title"]
    summary = summarizer(books_dataset["test"][idx]["review_body"])[0]["summary_text"]
    print(f"'>>> Review: {review}'")
    print(f"\n'>>> Title: {title}'")
    print(f"\n'>>> Summary: {summary}'")

Let's take a look at one of the English examples we get:

print_summary(100)

'>>> Review: Nothing special at all about this product... the book is too small and stiff and hard to write in. The huge sticker on the back doesn’t come off and looks super tacky. I would not purchase this again. I could have just bought a journal from the dollar store and it would be basically the same thing. It’s also really expensive for what it is.'

'>>> Title: Not impressed at all... buy something else'

'>>> Summary: Nothing special at all about this product'

This is not too bad! The summary captures the gist of the review, although in this example it is taken verbatim from the opening of the review rather than being rephrased with new words. The coolest aspect of our model is that it is bilingual, so we can also generate summaries of Spanish reviews:

print_summary(0)

'>>> Review: Es una trilogia que se hace muy facil de leer. Me ha gustado, no me esperaba el final para nada'

'>>> Title: Buena literatura para adolescentes'

'>>> Summary: Muy facil de leer'

The summary translates into "Very easy to read" in English, and we can see that this summary was extracted directly from the review. Nevertheless, this shows the versatility of the mT5 model and has given you a taste of what it's like to deal with a multilingual corpus!

Next, we'll turn our attention to a slightly more complex task: training a language model from scratch.

Glossary

Transformer Models: A type of deep learning architecture that has been highly successful in Natural Language Processing (NLP).
Text Summarization: Condensing a long text into a shorter version that contains its key information.
NLP Tasks: Tasks that enable computers to understand, interpret, and generate human language.
Coherent Text: Text that is meaningful and well organized.
Domain Experts: People with expertise in a specific field.
Hugging Face Hub: An online platform for sharing, discovering, and reusing AI models, datasets, and demos.
Pipeline Tag: A tag used on the Hugging Face Hub to filter models by pipeline type.
Bilingual Model: A model that works in two languages (here, English and Spanish).
Customer Reviews: Users' opinions about products or services.
Corpus: A large collection of text (or other data).
Multilingual Amazon Reviews Corpus: A multilingual dataset of Amazon product reviews.
Multilingual Classifiers: Models that can classify text in multiple languages.
Benchmark: Standardized datasets and metrics used to measure model performance.
Target Summaries: The summaries the model is trained to produce (here, the review titles).
Subsets: Portions selected from a larger dataset.
load_dataset() Function: A function from the Hugging Face Datasets library used to download and cache datasets.
DatasetDict Object: An object that stores several datasets, such as the training, validation, and test sets, in dictionary form.
train Split: The part of the dataset used to train the model.
validation Split: The part of the dataset used to evaluate the model's performance during training.
test Split: The part of the dataset used to measure the model's final performance.
review_body Column: The column containing the main text of the review.
review_title Column: The column containing the title of the review.
Random Sample: Elements selected at random from a dataset.
Dataset.shuffle() Method: A method used to randomly shuffle the elements of a dataset.
Dataset.select() Method: A method used to pick specific elements of a dataset by index.
Random Seed: An initial value used to control random number generation.
Overfit: When a model has learned the training data well but performs poorly on unseen data.
pandas.DataFrame: A data structure from the Pandas library used to store tabular data.
value_counts() Method: A Pandas method that counts the occurrences of each unique value in a DataFrame column.
Product Category: The category a product belongs to.
Dataset.filter() Function: A method in the 🤗 Datasets library used to keep only the data that matches the specified conditions.
Heuristic: A practical or rule-of-thumb approach to solving a problem.
Whitespace: Blank characters in text (space, tab, newline).
split() Method: A Python string method that splits a string on a given delimiter (for example, whitespace) and returns a list.
concatenate_datasets() Function: A function from the 🤗 Datasets library used to combine two or more Dataset objects.
Skewed: When the distribution of the data leans to one side.
Encoder-Decoder Architecture: A Transformer architecture with two parts: an encoder that encodes the input sequence and a decoder that decodes the output sequence.
GPT Family of Models: The Generative Pretrained Transformer (GPT) models developed by OpenAI. They are auto-regressive language models.
Few-shot Settings: Situations where only a few training examples are available.
GPT-2: An auto-regressive language model.
Auto-regressive Language Model: A language model that predicts the next token based on the previous tokens.
TL;DR (Too Long; Didn’t Read): An internet abbreviation used to introduce a summary of a long text.
PEGASUS: A summarization model pretrained by predicting masked sentences.
T5 (Text-to-Text Transfer Transformer): A universal Transformer architecture that handles all NLP tasks in a text-to-text format.
summarize: ARTICLE: The prompt prefix format used for the summarization task in the T5 model.
mT5: The multilingual version of the T5 model.
Multilingual Common Crawl Corpus (mC4): A large-scale corpus collected from the internet in many languages.
BART: A Transformer model with an encoder-decoder architecture, trained to reconstruct corrupted inputs.
mBART-50: A multilingual version of the BART model.
Monolingual: Working in a single language only.
High-resource Language: A language for which abundant data and tools are available.
Jointly Trained: Training a model on multiple data types or multiple languages at the same time.
Prefix: A word or phrase placed at the beginning of a text.
Condition: To control or steer a model's output.
Versatile: Usable in many different situations.
Tokenize: The process of splitting text (or other data) into tokens so that AI models can process it.
Encode: Converting data into a numerical representation.
AutoTokenizer: A class in the Hugging Face Transformers library that automatically loads the matching tokenizer from a model name.
mt5-small: The checkpoint identifier for the small version of the mT5 model.
Checkpoint: A snapshot of a model's weights and configuration saved at a particular point in time.
Debug: Finding and fixing errors in code.
Iterate: Repeating a series of steps to solve a problem.
End-to-end Workflow: A complete workflow from start to finish.
Scale Up: Increasing the size of the model or the amount of training data.
input_ids: The unique numerical IDs of the tokens produced by the tokenizer.
attention_mask: A binary mask that tells the model which tokens to attend to and which (padding) tokens to ignore.
Decode: Converting a numerical representation (for example, input IDs) back into human-readable text.
convert_ids_to_tokens() Function: A tokenizer method that converts input IDs back into tokens.
Unicode Character ▁: The special character (U+2581) that the SentencePiece tokenizer uses to represent whitespace, and hence to mark the start of a word.
End-of-sequence Token </s>: A special token that marks the end of a sequence.
SentencePiece Tokenizer: A subword tokenizer created by Google that does not require the input text to be pre-tokenized.
Unigram Segmentation Algorithm: A subword tokenization algorithm.
Agnostic: Not depending on, or not concerned with, a particular detail.
Accents: Marks placed on letters in some languages to indicate changes in pronunciation.
Punctuation: Marks used in text (for example, comma, period, question mark).
Whitespace Characters: Blank characters (space, tab, newline).
Maximum Context Size: The maximum number of tokens a model can process at once.
Truncation: Cutting an input sequence down to the model's maximum context size.
text_target Argument: A tokenizer argument used to tokenize the label text alongside the input text.
max_input_length: The maximum number of tokens allowed for the input sequence.
max_target_length: The maximum number of tokens allowed for the target (label) sequence.
preprocess_function(): A function defined to preprocess the data.
Dataset.map() Function: A method in the 🤗 Datasets library that applies a function to each element, or each batch, of a dataset.
batched=True: An argument of the map() method that applies the function to many elements of the dataset at once.
Multithreading Capabilities: The ability of a program to run several threads at the same time.
ROUGE Score (Recall-Oriented Understudy for Gisting Evaluation): A metric that measures the performance of text summarization models by comparing a generated summary against reference summaries.
Text Generation Tasks: Tasks in which AI models produce new text.
Exact Match: When two texts or two words are exactly identical.
Precision: A metric that measures how much of the generated summary is relevant.
Recall: A metric that measures how much of the reference summary is captured by the generated summary.
F1-score: A metric computed as the harmonic mean of precision and recall.
rouge_score Package: A Python package for computing ROUGE scores.
evaluate Library: Hugging Face's library for loading and computing metrics.
evaluate.load("rouge"): The command that loads the ROUGE metric.
rouge_score.compute() Function: The function that computes the ROUGE scores.
Confidence Intervals: A range of values that is likely to contain the true value of a parameter.
low, mid, high Attributes: The lower, middle, and upper values of the confidence interval.
Text Granularity: The level at which text is analyzed (for example, words, bigrams, sentences).
Unigrams: The individual words in a text.
Bigrams: Pairs of adjacent words in a text.
rougeL: A ROUGE score based on the Longest Common Subsequence (LCS), computed per sentence and then averaged.
rougeLsum: Similar to rougeL, but the LCS is computed over the whole summary.
Longest Common Substrings: The longest stretch of text shared by two texts.
Baseline: A simple or basic model used as a point of comparison for a model's performance.
Lead-3 Baseline: A text summarization baseline that takes the first three sentences of an article as the summary.
nltk Library (Natural Language Toolkit): A collection of tools and libraries for NLP tasks in Python.
punkt (NLTK Tokenizer Model): The data used by the sentence tokenizer in the nltk library.
sent_tokenize() Function: An nltk function that splits a text into sentences.
three_sentence_summary() Function: A function defined to extract the first three sentences of a text.
evaluate_baseline() Function: A function that evaluates the performance of the baseline with the ROUGE metric.
pandas: An open-source Python library for data analysis and manipulation.
round() Function: A Python function that rounds a number to a given number of decimal places.
Skewed: data ၏ ဖြန့်ဝေမှု (distribution) သည် တစ်ဖက်သို့ စောင်းနေခြင်း။
AutoModelForSeq2SeqLM: A class in the Hugging Face Transformers library that automatically loads a model for sequence-to-sequence language modeling (for example, summarization or translation).
Sequence-to-sequence Task: A type of task that produces an output sequence from an input sequence.
TFAutoModelForSeq2SeqLM: The TensorFlow counterpart of AutoModelForSeq2SeqLM, with the same functionality.
Downstream Task: The specific task that a pretrained model is fine-tuned to perform.
Randomly Initialized Network: A neural network whose weights start out set to random values.
notebook_login() Function: A function used to log in to the Hugging Face Hub from Jupyter or Colab notebooks.
Credentials: Information needed to log in to an account, such as a username and password.
huggingface-cli login: The command used to log in to the Hugging Face Hub from the Hugging Face CLI (command line interface).
Seq2SeqTrainingArguments: A class in the 🤗 Transformers library that defines the training arguments for sequence-to-sequence models.
Seq2SeqTrainer: A high-level API in the 🤗 Transformers library for training sequence-to-sequence models.
Hyperparameters: Parameters that are set before a model is trained (for example, learning rate and batch size).
output_dir: The directory where training outputs are saved.
evaluation_strategy="epoch": Runs evaluation at the end of every epoch during training.
learning_rate: The parameter that controls how much the model's weights change at each update during training.
per_device_train_batch_size: The training batch size on each device (for example, a GPU).
per_device_eval_batch_size: The evaluation batch size on each device (for example, a GPU).
weight_decay: A regularization technique applied in the optimizer to reduce overfitting.
save_total_limit: Limits the number of checkpoints kept during training.
num_train_epochs: The number of times the model is trained over the entire training dataset.
predict_with_generate=True: Tells the Seq2SeqTrainer to produce summaries with the model.generate() method during evaluation.
logging_steps: Sets how many steps pass between each logging of the training loss.
push_to_hub=True: Sets the model to be pushed to the Hugging Face Hub after training.
hub_model_id Argument: The argument that specifies the ID of the repository to push to on the Hugging Face Hub.
Organization: An organization account on the Hugging Face Hub used to group and share models and datasets.
compute_metrics() Function: A function that computes evaluation metrics from the model's predictions and the ground truth labels.
eval_pred: The pair of predictions and labels passed to the compute_metrics function.
predictions: The predictions produced by the model (token IDs).
labels: The ground truth labels (token IDs).
tokenizer.batch_decode(): A tokenizer method that decodes a batch of token IDs into text.
skip_special_tokens=True: Skips special tokens during decoding.
np.where(labels != -100, labels, tokenizer.pad_token_id): A NumPy call that replaces every -100 in the labels array with tokenizer.pad_token_id.
use_stemmer=True: Enables stemming when computing ROUGE scores.
Median Scores: The middle value of the scores.
Data Collator: A function that assembles samples into a batch.
Encoder-decoder Transformer Model: A Transformer architecture with two parts: an encoder that encodes the input sequence and a decoder that decodes the output sequence.
Ground Truth Labels: The correct, actual labels.
Masked Self-attention: A self-attention mechanism in the Transformer in which some input tokens are hidden. In a decoder, the mask hides future tokens so that each position attends only to earlier ones.
DataCollatorForSeq2Seq: A data collator in the Hugging Face Transformers library that performs dynamic padding for sequence-to-sequence models.
Dynamically Pad: Padding the samples in a batch only up to the length of the longest sample in that batch.
tokenized_datasets.remove_columns(): A method that removes columns from the dataset.
column_names: The list of a dataset's column names.
dict: Python's dictionary data structure.
input_ids: The unique numerical IDs of the tokens produced by the tokenizer.
attention_mask: A binary mask that tells the model which tokens to attend to and which (padding) tokens to ignore.
[PAD] Token: The special token used for padding.
pad_token_id: The ID of the padding token.
decoder_input_ids: The IDs passed as input to the decoder. They are the labels shifted to the right.
trainer.train(): The method that starts the training process.
trainer.evaluate(): The method that evaluates the model's performance.
eval_loss: The loss on the evaluation dataset.
eval_rouge1, eval_rouge2, eval_rougeL, eval_rougeLsum: The ROUGE scores on the evaluation dataset.
eval_gen_len: The average length of the generated summaries.
eval_runtime: The time taken to run the evaluation.
eval_samples_per_second: The number of samples processed per second.
eval_steps_per_second: The number of evaluation steps processed per second.
trainer.push_to_hub(): The method that pushes the model weights and configuration files to the Hugging Face Hub.
commit_message: The message for the Git commit.
tags: Tags used to categorize the model on the Hub.
Text Generation Pipeline: A pipeline designed to generate new text.
Git Commit Hash: The unique ID that identifies each commit in a Git repository.
Model Repository: A place that stores model files, tokenizer files, the model card (README.md), and other related files using the Git version control system.
Low-level Features: Functions or methods that give direct control over the detailed workings of a library.
🤗 Accelerate: A library from Hugging Face that makes it easy to run PyTorch code in different training environments (for example, multiple GPUs or distributed training).
tf.data.Dataset: The dataset object used to build data pipelines in TensorFlow.
model.prepare_tf_dataset(): A method that converts a Hugging Face Dataset object into a TensorFlow tf.data.Dataset.
collate_fn: The function (data collator) used to assemble samples into batches for the tf.data.Dataset.
shuffle=True: Sets the dataset to be shuffled.
batch_size: The number of input samples passed to the model in each training step.
optimizer: The algorithm used to update the model's parameters.
schedule: The learning rate schedule.
create_optimizer(): A function in the Hugging Face Transformers library that creates an optimizer and a learning rate schedule.
tensorflow as tf: Imports the TensorFlow library under the name tf.
init_lr: The initial learning rate.
num_warmup_steps: The number of steps at the start of training over which the learning rate is gradually increased.
num_train_steps: The total number of training steps.
weight_decay_rate: The rate applied for weight decay.
model.compile(): The method that prepares a Keras model for training.
tf.keras.mixed_precision.set_global_policy("mixed_float16"): Enables mixed-precision training in TensorFlow.
Mixed-precision Float16: Using the 16-bit floating-point format during training. This reduces memory use and speeds up training.
model.fit(): The method that trains a Keras model.
validation_data: The dataset used for evaluation during training.
callbacks: Functions that run at set points during training (for example, PushToHubCallback).
epochs: The number of times the model is trained over the entire training dataset.
PushToHubCallback: A Keras callback in the Hugging Face Transformers library that pushes the model to the Hub during training.
tqdm: A Python library that displays progress bars for loops.
XLA (Accelerated Linear Algebra): A TensorFlow compiler that improves the performance of deep learning models.
Computation Graph: A graph that describes a TensorFlow model's operations and how they connect.
@tf.function(jit_compile=True) Decorator: Compiles a Python function into a TensorFlow graph function. Setting jit_compile=True enables XLA compilation.
generate_with_xla() Function: A function that calls the model's generate() method with XLA compilation.
max_new_tokens: The maximum number of new tokens to generate.
all_preds: The list that stores the generated predictions.
all_labels: The list that stores the ground truth labels.
DataLoader (PyTorch): A PyTorch utility class that loads data from a dataset in batches.
tokenized_datasets.set_format("torch"): Sets the dataset's output format to PyTorch tensors.
torch.utils.data.DataLoader: PyTorch's DataLoader class.
AdamW: The AdamW optimizer in PyTorch, used to update the model's parameters during training.
model.parameters(): The method that returns the model's parameters (weights and biases).
lr: The learning rate.
Accelerator: A class in the 🤗 Accelerate library that prepares the model, optimizer, and dataloaders for a distributed training setup.
accelerator.prepare(): The Accelerate method that prepares the model, optimizer, and dataloaders for distributed training.
TPU (Tensor Processing Unit): A processor designed by Google specifically for AI/ML workloads.
Learning Rate Schedule: A strategy that defines how the learning rate changes during training.
get_scheduler(): A function in the Hugging Face Transformers library that creates a learning rate scheduler.
num_update_steps_per_epoch: The number of update steps in each epoch.
num_training_steps: The total number of training steps.
postprocess_text() Function: A function that prepares generated summaries for evaluation.
nltk.sent_tokenize(): An nltk function that splits a text into sentences.
🤗 Hub Library: The Python library for interacting with the Hugging Face Hub.
get_full_repo_name() Function: A function that combines a repository ID with the user profile to return the repository's full name on the Hugging Face Hub.
repo_name: The full name of the repository.
Repository Class: A class in the huggingface_hub library for working with Git repositories.
clone_from: The remote repository (a URL or repository ID) to clone into the local repository.
repo.push_to_hub() Method: The method that pushes changes from the Repository object to the Hub.
blocking=False: An argument to push_to_hub() that allows an asynchronous (non-blocking) push.
Asynchronously: Continuing with further work without waiting for an operation to finish.
progress_bar: A progress bar object from the tqdm library.
model.train(): Switches a PyTorch model to training mode.
model.eval(): Switches a PyTorch model to evaluation mode.
outputs = model(**batch): Runs the model on an input batch and returns the outputs.
loss = outputs.loss: Retrieves the loss value from the model output.
accelerator.backward(loss): Backpropagates the loss through Accelerate.
optimizer.step(): Updates the model parameters with the optimizer.
lr_scheduler.step(): Updates the learning rate with the learning rate scheduler.
optimizer.zero_grad(): Resets the optimizer's gradients to zero.
progress_bar.update(1): Advances the progress bar by one step.
torch.no_grad(): Disables gradient computation in PyTorch (for evaluation).
accelerator.unwrap_model(model): Retrieves the underlying model from the model wrapped by Accelerate.
model.generate(): The method that generates an output sequence from the model.
accelerator.pad_across_processes(): The Accelerate method that pads tensors across processes in distributed training.
dim: The dimension to pad along.
pad_index: The ID used for padding.
accelerator.gather(): The Accelerate method that gathers tensors from all processes in distributed training.
cpu().numpy(): Moves a PyTorch tensor to the CPU and converts it to a NumPy array.
isinstance(generated_tokens, tuple): Checks whether generated_tokens is a tuple.
rouge_score.add_batch(): Adds a batch of predictions and references to the ROUGE metric.
rouge_score.compute(): Computes the ROUGE scores for the accumulated predictions and references.
accelerator.wait_for_everyone(): Waits for all processes to finish in distributed training.
unwrapped_model.save_pretrained(): The method that saves the model's weights.
save_function=accelerator.save: Specifies Accelerate's save function as the function used for saving.
accelerator.is_main_process: Checks whether the current process is the main process.
tokenizer.save_pretrained(output_dir): The method that saves the tokenizer's files.
Inference Widget: A tool on the Hugging Face Hub for trying out a model through a web interface.
pipeline() Object: An object in the Hugging Face Transformers library that makes models easy to use.
hub_model_id: The ID of the model on the Hugging Face Hub.
summarizer: The summarization pipeline object.
summary_text: The summary text returned by the pipeline.
print_summary() Function: A function that prints the review, title, and generated summary.
Abstractive Summarization: Producing a summary in which the model expresses the meaning in its own words rather than copying words directly from the input text.
Extractive Summarization: Creating a summary by selecting original sentences or words from the input text.
Bilingual: Involving two languages.
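
The Precision, Recall, F1-score, Unigrams, and rougeL entries above can be made concrete with a small pure-Python sketch. This is only an illustration under simplifying assumptions: it splits on whitespace and applies no stemming, so it is not expected to match the output of the rouge_score package. The helper names unigram_overlap_scores and lcs_length, and the sample sentence pair, are hypothetical and introduced here for illustration.

```python
from collections import Counter


def unigram_overlap_scores(generated: str, reference: str) -> dict:
    """Precision, recall, and F1 over unigram overlap (a simplified ROUGE-1)."""
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    # Clipped overlap: a word counts at most as often as it occurs in both texts.
    overlap = sum((Counter(gen_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(gen_tokens) if gen_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, the quantity behind rougeL."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


# Hypothetical sample pair, used only to exercise the helpers.
generated = "I absolutely loved reading the Hunger Games"
reference = "I loved reading the Hunger Games"

scores = unigram_overlap_scores(generated, reference)
print(scores)
print(lcs_length(generated.lower().split(), reference.lower().split()))
```

For this pair, six of the seven generated words also appear in the reference, so precision is 6/7 while recall is 1.0, and the longest common subsequence has length 6.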
