PrahokBART is a sequence-to-sequence model pre-trained from scratch for Khmer on carefully curated Khmer and English corpora. It was trained with Khmer's linguistic issues in mind, incorporating linguistic components such as word segmentation and normalization. It can be fine-tuned to build natural language generation applications for Khmer, such as English<->Khmer translation, summarization, and headline generation, and it is more efficient than mBART50. You can read more about PrahokBART in this [paper](https://aclanthology.org/2025.coling-main.87/).
# Basic Usage

**Preprocessing**: Input texts should be encoding-normalized and word-segmented. The snippet below only performs word segmentation and assumes the text has already been normalized; for normalization, please use [this script](https://github.com/hour/prahokbart/blob/main/utils/khnormal.py).

```
from khmernltk import word_tokenize

def word_segment(text):
    # joining tokens with single spaces turns an original space (kept as
    # its own token) into three consecutive spaces; relabel it with "▁"
    return " ".join(word_tokenize(text)).replace("   ", " ▁ ")

def word_unsegment(text):
    # drop token-boundary spaces, then restore original spaces from "▁"
    return text.replace(" ", "").replace("▁", " ")
```
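As a quick sanity check of the marker scheme (a hypothetical sketch, assuming `word_tokenize` keeps space characters as their own tokens), the round trip can be exercised on a toy token list without installing `khmernltk`:

```python
def word_segment_from_tokens(tokens):
    # join tokens with spaces; a space *token* becomes three consecutive
    # spaces, which we relabel with the "▁" marker
    return " ".join(tokens).replace("   ", " ▁ ")

def word_unsegment(text):
    # drop token-boundary spaces, then restore original spaces from "▁"
    return text.replace(" ", "").replace("▁", " ")

# hypothetical tokenizer output for "AB CD": the space survives as a token
tokens = ["A", "B", " ", "C", "D"]
seg = word_segment_from_tokens(tokens)
print(seg)                   # A B ▁ C D
print(word_unsegment(seg))   # AB CD
```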

**Load the model using AutoClass**
```
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "nict-astrec-att/prahokbart_base"

tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```

**I/O format**: PrahokBART was trained on corpora formatted as `Sentence </s> <2xx>` for the input and `<2yy> Sentence </s>` for the output, where `xx` and `yy` are language codes.
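For example, a pair of small helpers (hypothetical, not part of the model card) that wrap raw sentences in this format:

```python
def make_input(sentence, src_lang):
    # encoder side: sentence, end-of-sequence token, source language tag
    return f"{sentence} </s> <2{src_lang}>"

def make_output(sentence, tgt_lang):
    # decoder side: target language tag, sentence, end-of-sequence token
    return f"<2{tgt_lang}> {sentence} </s>"

print(make_input("I go to school", "en"))   # I go to school </s> <2en>
print(make_output("I go to school", "km"))  # <2km> I go to school </s>
```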

**Forward pass**
```
inp = tokenizer(
    word_segment("ខ្ញុំទៅសាលារៀន </s> <2km>"),  # "I go to school"
    add_special_tokens=False,
    return_tensors="pt",
    padding=True
)

out = tokenizer(
    "<2en> I go to school </s>",
    add_special_tokens=False,
    return_tensors="pt",
    padding=True
).input_ids

model_output = model(
    input_ids=inp.input_ids,
    attention_mask=inp.attention_mask,
    labels=out
)  # forward pass

# For loss
model_output.loss  # this is not label smoothed

# For logits
model_output.logits
```
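Since the returned loss is plain (unsmoothed) cross-entropy, label smoothing, which is common when fine-tuning NMT models, would have to be computed from the logits yourself. A framework-free sketch of the label-smoothed loss at a single position (`epsilon` and the toy vocabulary are illustrative assumptions, not values from the paper):

```python
import math

def label_smoothed_nll(log_probs, target_idx, epsilon=0.1):
    # mix the one-hot target with a uniform distribution over the vocab:
    # (1 - eps) * NLL(target) + eps * mean NLL over all vocabulary entries
    vocab = len(log_probs)
    nll = -log_probs[target_idx]
    uniform = -sum(log_probs) / vocab
    return (1 - epsilon) * nll + epsilon * uniform

# toy 4-word vocabulary; the model puts 0.7 on the correct word
log_probs = [math.log(p) for p in [0.7, 0.1, 0.1, 0.1]]
print(label_smoothed_nll(log_probs, target_idx=0))
```

With `epsilon=0` this reduces to the ordinary negative log-likelihood.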

**Mask prediction**: Let's ask the model to predict the `[MASK]` part of an input sentence.
```
text = "ខ្ញុំទៅសាលារៀន[MASK] </s> <2km>"  # I go to school [MASK]
inp = tokenizer(
    word_segment(text),
    add_special_tokens=False,
    return_tensors="pt"
).input_ids

model_output = model.generate(
    inp,
    num_beams=4,
    decoder_start_token_id=tokenizer._convert_token_to_id_with_added_voc("<2km>")
)

decoded_output = tokenizer.decode(
    model_output[0],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

print(word_unsegment(decoded_output))
# Output: ខ្ញុំទៅសាលារៀន ហើយខ្ញុំទៅសាលារៀន ហើយខ្ញុំទៅសាលារៀន
```
***ខ្ញុំទៅសាលារៀន ហើយខ្ញុំទៅសាលារៀន ហើយខ្ញុំទៅសាលារៀន = I go to school and I go to school and I go to school***

# Finetuning

Code is available on [GitHub](https://github.com/hour/prahokbart).

# Citation

```bibtex