Summary
This is a sentence-level text classifier that assigns a JLPT difficulty level (N5, easiest, through N1, hardest) to Japanese sentences. A pre-trained `cl-tohoku-bert-japanese-v3` model is fine-tuned on ~5000 labeled sentences collected from language-learning websites. Performance on held-out data from the same distribution is good:
|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| N5           | 0.88      | 0.88   | 0.88     | 25      |
| N4           | 0.90      | 0.89   | 0.90     | 53      |
| N3           | 0.78      | 0.90   | 0.84     | 62      |
| N2           | 0.71      | 0.79   | 0.75     | 47      |
| N1           | 0.95      | 0.77   | 0.85     | 73      |
| accuracy     |           |        | 0.84     | 260     |
| macro avg    | 0.84      | 0.84   | 0.84     | 260     |
| weighted avg | 0.85      | 0.84   | 0.84     | 260     |
On a test set consisting of official JLPT material, however, performance degrades substantially:
|              | precision | recall | f1-score | support |
|--------------|-----------|--------|----------|---------|
| N5           | 0.62      | 0.66   | 0.64     | 145     |
| N4           | 0.34      | 0.36   | 0.35     | 143     |
| N3           | 0.33      | 0.67   | 0.45     | 197     |
| N2           | 0.26      | 0.20   | 0.23     | 192     |
| N1           | 0.59      | 0.08   | 0.15     | 202     |
| accuracy     |           |        | 0.38     | 879     |
| macro avg    | 0.43      | 0.39   | 0.36     | 879     |
| weighted avg | 0.42      | 0.38   | 0.34     | 879     |
Still, the model can give a ballpark estimate of sentence difficulty, even if it is not very precise.
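For quick experimentation, the snippet below is a minimal inference sketch using the Transformers `pipeline` API. The repo id `your-username/jlpt-sentence-classifier` is a placeholder for this model's actual Hub id, and the label names are assumed to be the JLPT levels shown in the tables above.

```python
from transformers import pipeline

# Placeholder repo id -- replace with this model's actual Hub id.
classifier = pipeline(
    "text-classification",
    model="your-username/jlpt-sentence-classifier",
)

# The model was trained on single sentences, so pass one sentence at a time.
print(classifier("昨日、友達と映画を見に行きました。"))
# e.g. [{'label': 'N4', 'score': 0.87}]  (labels assumed to be N5-N1)
```

Given the out-of-distribution results above, treat the predicted level as a rough estimate rather than an exact placement.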
Cite
@inproceedings{benedetti-etal-2024-automatically,
title = "Automatically Suggesting Diverse Example Sentences for {L}2 {J}apanese Learners Using Pre-Trained Language Models",
author = "Benedetti, Enrico and
Aizawa, Akiko and
Boudin, Florian",
editor = "Fu, Xiyan and
Fleisig, Eve",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-srw.11",
pages = "114--131"
}