kainghour committed on
Commit
ed48588
·
verified ·
1 Parent(s): 39342c3

add basic usage

Files changed (1)
  1. README.md +83 -1
README.md CHANGED
@@ -7,7 +7,89 @@ language: km
 
 PrahokBART is a pre-trained sequence-to-sequence model trained from scratch for Khmer on carefully curated Khmer and English corpora. It was trained with Khmer's linguistic issues in mind, incorporating linguistic components such as word segmentation and normalization. The model can be finetuned to build natural language generation applications for Khmer, such as English<->Khmer translation, summarization, and headline generation, and it is more efficient than mBART50. You can read more about PrahokBART in this [paper](https://aclanthology.org/2025.coling-main.87/).
 
- Finetuning codes are avaiable in [GitHub](https://github.com/hour/prahokbart).
+ # Basic Usage
+
+ **Preprocessing**: Input texts should be encoding-normalized and word-segmented. The examples below perform only word segmentation and assume the text has already been normalized; you can normalize texts with [this script](https://github.com/hour/prahokbart/blob/main/utils/khnormal.py).
+
+ ```python
+ from khmernltk import word_tokenize
+
+ def word_segment(text):
+     # join tokens with spaces; mark a space from the original text with "▂"
+     return " ".join(word_tokenize(text)).replace("   ", " ▂ ")
+
+ def word_unsegment(text):
+     # drop the separator spaces, then turn "▂" back into a real space
+     return text.replace(" ", "").replace("▂", " ")
+ ```
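To make the marker convention concrete, here is a toy round trip that swaps in a trivial token list for khmernltk's tokenizer output (the `toy_*` helpers are hypothetical and only illustrate the `▂` convention):

```python
# Toy sketch of the space-marking convention: word tokens are joined
# with spaces, and a space present in the original text becomes "▂".
def toy_segment(tokens):
    return " ".join("▂" if t == " " else t for t in tokens)

def toy_unsegment(text):
    # drop the separator spaces, then turn "▂" back into a real space
    return text.replace(" ", "").replace("▂", " ")

tokens = ["I", " ", "go", " ", "school"]
seg = toy_segment(tokens)
print(seg)                 # I ▂ go ▂ school
print(toy_unsegment(seg))  # I go school
```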
+
+ **Load the model using AutoClass**
+ ```python
+ from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
+
+ model_name = "nict-astrec-att/prahokbart_base"
+
+ tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
+ model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
+ ```
+
+ **I/O format**: The corpus PrahokBART was trained on uses `Sentence </s> <2xx>` for the input and `<2yy> Sentence </s>` for the output, where `xx` and `yy` are language codes (e.g. `km`, `en`).
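The tag placement can be made explicit with two small helpers (hypothetical names, not part of the released code):

```python
def make_input(sentence, src_lang):
    # encoder side: "Sentence </s> <2xx>"
    return f"{sentence} </s> <2{src_lang}>"

def make_target(sentence, tgt_lang):
    # decoder side: "<2yy> Sentence </s>"
    return f"<2{tgt_lang}> {sentence} </s>"

print(make_input("I go to school", "en"))   # I go to school </s> <2en>
print(make_target("I go to school", "en"))  # <2en> I go to school </s>
```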
+
+ **Forward pass**
+ ```python
+ inp = tokenizer(
+     word_segment("ខ្ញុំទៅសាលារៀន </s> <2km>"),
+     add_special_tokens=False,
+     return_tensors="pt",
+     padding=True,
+ )
+
+ out = tokenizer(
+     "<2en> I go to school </s>",
+     add_special_tokens=False,
+     return_tensors="pt",
+     padding=True,
+ ).input_ids
+
+ model_output = model(
+     input_ids=inp.input_ids,
+     attention_mask=inp.attention_mask,
+     labels=out,
+ )  # forward pass
+
+ # For loss
+ model_output.loss  # this is not label smoothed
+
+ # For logits
+ model_output.logits
+ ```
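`model_output.loss` above is plain (unsmoothed) cross-entropy. If a finetuning recipe calls for label smoothing, the standard formula mixes the target's negative log-likelihood with a uniform distribution over the vocabulary; a scalar sketch in plain Python (the real computation runs on the logits tensor):

```python
import math

def label_smoothed_nll(log_probs, target, eps=0.1):
    # log_probs: log-probabilities over the vocab at one output position
    nll = -log_probs[target]                    # standard NLL term
    uniform = -sum(log_probs) / len(log_probs)  # uniform-prior term
    return (1.0 - eps) * nll + eps * uniform

log_probs = [math.log(0.7), math.log(0.2), math.log(0.1)]
print(round(label_smoothed_nll(log_probs, target=0), 4))  # 0.4633
```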
+
+ **Mask prediction**: Let's ask the model to predict the `[MASK]` part of an input sentence.
+ ```python
+ text = "ខ្ញុំទៅសាលារៀន[MASK] </s> <2km>"  # I go to school [MASK]
+ inp = tokenizer(
+     word_segment(text),
+     add_special_tokens=False,
+     return_tensors="pt",
+ ).input_ids
+
+ model_output = model.generate(
+     inp,
+     num_beams=4,
+     decoder_start_token_id=tokenizer._convert_token_to_id_with_added_voc("<2km>"),
+ )
+
+ decoded_output = tokenizer.decode(
+     model_output[0],
+     skip_special_tokens=True,
+     clean_up_tokenization_spaces=False,
+ )
+
+ print(word_unsegment(decoded_output))
+ # Output: ខ្ញុំទៅសាលារៀន ហើយខ្ញុំទៅសាលារៀន ហើយខ្ញុំទៅសាលារៀន
+ ```
+ ***ខ្ញុំទៅសាលារៀន ហើយខ្ញុំទៅសាលារៀន ហើយខ្ញុំទៅសាលារៀន = I go to school and I go to school and I go to school***
+
+ # Finetuning
+ Finetuning code is available on [GitHub](https://github.com/hour/prahokbart).
 
 # Citation
 ```bibtex