---
license: mit
language:
- vi
metrics:
- accuracy
library_name: transformers
tags:
- poem
- vietnamese
- classification
- evaluation
---

# Vietnamese poem classification and evaluation 📜🔍

A Vietnamese poem classifier using [BertForSequenceClassification](https://huggingface.co/trituenhantaoio/bert-base-vietnamese-uncased) with an accuracy of ```99.7%```.

This is a side project from the development of our [Vietnamese poem generator](https://github.com/Anshler/poem_generator).

## Features

* Classify Vietnamese poems into the categories ```4 chu```, ```5 chu```, ```7 chu```, ```luc bat``` and ```8 chu```
* Score the quality of each poem based solely on its conformity to the rigid rules of each type of Vietnamese poem, using 3 criteria: Length (L), Tone (T) and Rhyme (R), as follows: ```score = L/10 + 3T/10 + 6R/10```

The rules for each genre are defined below:

| Genre | Length | Tone | Rhyme |
|------------------|------------------|--------------|------------------------|
| 4 chu | - 4 words per line<br>- 4 lines per stanza (optional) | For each line:<br>- If the 2nd word is uneven (trắc), the 4th word is even (bằng)<br>- Vice versa | Last word (4th) of each line:<br>- Continuous rhyme (gieo vần tiếp)<br>- Alternating rhyme (gieo vần tréo)<br>- Three-line rhyme (gieo vần ba) |
| 5 chu | - 5 words per line<br>- 4 lines per stanza (optional) | Same as "4 chu" | Same as "4 chu" |
| 7 chu | - 7 words per line<br>- 4 lines per stanza (optional) | For each line:<br>- If the 2nd word is uneven (trắc), the 4th word is even (bằng), the 6th word is uneven (trắc)<br>- The 5th word and the last word (7th) must have different tones | The last words of the 1st, 2nd and 4th lines of each stanza must have the same tone and rhyme |
| luc bat | - 6 words in odd lines<br>- 8 words in even lines<br>- 4 lines per stanza (optional) | For a 6-word line:<br>- If the 2nd word is uneven (trắc), the 4th word is even (bằng), the 6th word is uneven (trắc)<br><br>For an 8-word line:<br>- Must be the same as the previous 6-word line<br>- The last word (8th) must have the same tone as the 6th word but a different accent | The last word (6th) of a 6-word line must rhyme with the 6th word of the next 8-word line and the 8th word of the previous 8-word line |
| 8 chu | - 8 words per line<br>- 4 lines per stanza (optional) | For each line:<br>- If the 3rd word is uneven (trắc), the 5th word is even (bằng), the 8th word is uneven (trắc) | Same as "4 chu" |

## Data

A collection of 171188 Vietnamese poems in different genres: luc-bat, 5-chu, 7-chu, 8-chu, 4-chu.

Download [here](https://github.com/fsoft-ailab/Poem-Generator/raw/master/dataset/poems_dataset.zip)

For more details, refer to the _Acknowledgments_ section.

## Training

The training code is in our repo [Vietnamese poem generator](https://github.com/Anshler/poem_generator)

Run:

```
python poem_classifier_training.py
```

## Installation

```
pip install vietnamese-poem-classifier
```

Or

```
pip install git+https://github.com/Anshler/vietnamese-poem-classifier
```

## Inference

```python
from vietnamese_poem_classifier.poem_classifier import PoemClassifier

classifier = PoemClassifier()

# A four-line "luc bat" stanza (alternating 6-word and 8-word lines)
poem = '''Người đi theo gió đuổi mây
Tôi buồn nhặt nhạnh tháng ngày lãng quên
Em theo hú bóng kim tiền
Bần thần tôi ngẫm triền miên thói đời.'''

classifier.predict(poem)

#>> [{'label': 'luc bat', 'confidence': 0.9999017715454102, 'poem_score': 0.75, 'l_score': 1.0, 't_score': 1.0, 'r_score': 0.5833333333333333}]
```

## Model

The model's weights are published on Hugging Face at [Anshler/vietnamese-poem-classifier](https://huggingface.co/Anshler/vietnamese-poem-classifier)

## Acknowledgments

_This project was inspired by the evaluation method from ```fsoft-ailab```'s_ [SP-GPT2 Poem-Generator](https://github.com/fsoft-ailab/Poem-Generator)

_The dataset is also taken from their repo_
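
As a worked check of the scoring formula in the _Features_ section, the sketch below recombines the three sub-scores reported in the _Inference_ example. The function name `poem_score` is hypothetical and is not part of the package's API. The sketch assumes each sub-score lies between 0 and 1, as the example output suggests.

```python
def poem_score(l_score: float, t_score: float, r_score: float) -> float:
    """Hypothetical helper: weighted sum L/10 + 3T/10 + 6R/10 from the Features section."""
    return (l_score + 3 * t_score + 6 * r_score) / 10


# Sub-scores copied from the example output in the Inference section.
score = poem_score(l_score=1.0, t_score=1.0, r_score=0.5833333333333333)
print(round(score, 2))  # 0.75, matching 'poem_score' in that output
```

Because rhyme carries six tenths of the weight, a poem with perfect length and tone but imperfect rhyme, like the one above, still loses a quarter of its score.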
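
The tone rules in the _Features_ table depend on telling even (bằng) words from uneven (trắc) words. The sketch below shows one possible way to do that from the Vietnamese tone marks; it is not the package's implementation, and `tone_class` and `alternates_2_4_6` are hypothetical names. It assumes that words with no tone mark or with the huyền (grave) mark are even and that all other tone marks are uneven. It also assumes the mirrored pattern (even, uneven, even) is acceptable, which the table states explicitly only for "4 chu".

```python
import unicodedata

# Combining marks of the four uneven (trắc) tones after NFD decomposition:
# sắc (acute), hỏi (hook above), ngã (tilde), nặng (dot below).
TRAC_MARKS = {"\u0301", "\u0309", "\u0303", "\u0323"}


def tone_class(word: str) -> str:
    """Return 'trac' if the word carries an uneven tone mark, else 'bang'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "trac" if any(ch in TRAC_MARKS for ch in decomposed) else "bang"


def alternates_2_4_6(line: str) -> bool:
    """Check that the 2nd, 4th and 6th words alternate between tone classes."""
    words = line.split()
    if len(words) < 6:
        return False
    t2, t4, t6 = (tone_class(words[i]) for i in (1, 3, 5))
    return t2 != t4 and t4 != t6


# First line of the poem from the Inference section.
line = "Người đi theo gió đuổi mây"
print([tone_class(w) for w in line.split()])
print(alternates_2_4_6(line))
```

Vowel-quality marks such as the circumflex, breve and horn are deliberately left out of `TRAC_MARKS`, since they change the vowel and not the tone.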