HanmunRoBERTa (March 2024 Release)
The Big Data Studies Lab at the University of Hong Kong is delighted to introduce this early release of HanmunRoBERTa, a transformer-based model trained exclusively on texts in literary Sinitic authored by Koreans before the 20th century. This version is an early prototype, optimised with data from the Veritable Records (Sillok 實錄) and the Diary of the Royal Secretariat (Sŭngjŏngwŏn ilgi 承政院日記).
HanmunRoBERTa was pretrained from scratch on 443.5 million characters of data from the Veritable Records, the Diary of the Royal Secretariat, A Compendium of Korean Collected Works (Han’guk munjip ch’onggan 韓國文集叢刊), and various Korean hanmun miscellanies. The century-prediction classifier was fine-tuned on a sample drawn only from the Veritable Records and the Diary of the Royal Secretariat. As a result, it performs exceptionally well (~98% accuracy) on court entries but may give mixed results on Korean munjip or non-Korean texts.
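For readers who prefer to run the century-prediction head programmatically rather than through the web interface, a minimal sketch with the Transformers library might look like the following. The repository ID `your-org/HanmunRoBERTa` is a placeholder, not the published model ID, and the label names should be read from the model's own config.

```python
# Minimal sketch: century prediction with HanmunRoBERTa via Transformers.
# The repository ID below is a placeholder; substitute the actual model ID.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "your-org/HanmunRoBERTa"  # placeholder, not the published ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)

# An unpunctuated hanmun passage; court-style entries work best.
text = "上御經筵講大學衍義"

inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

predicted_id = logits.argmax(dim=-1).item()
# The century labels come from the model's own configuration.
print(model.config.id2label[predicted_id])
```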
The Inference API widget includes the following examples:
- Example 1: Sejo sillok 世祖實錄 1467/6/20
- Example 2: A selection from Yu Hŭich'un's 柳希春 (1513-1577) Miam ilgi 眉巖日記
- Example 3: A selection from Sŭngjŏngwŏn ilgi 承政院日記 1782/5/1
At this stage, HanmunRoBERTa is prone to overfitting and requires further adjustment and refinement for improved performance. To test the model, remove all non-Sinitic characters and special symbols, including punctuation: because HanmunRoBERTa was pretrained and fine-tuned on unpunctuated texts, test samples must be unpunctuated as well. The Hugging Face Inference API does not perform this preprocessing automatically. If you would like to test HanmunRoBERTa, we recommend using our HanmunRoBERTa Century Prediction web app.
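If you do preprocess inputs yourself, a simple Unicode-range filter is one way to keep only Sinitic characters. The sketch below is an approximation covering the main CJK Unified Ideographs blocks, not the lab's own preprocessing pipeline; extension blocks could be added if your source texts require them.

```python
import re

# Keep only characters in the CJK Unified Ideographs and Extension A blocks
# (an approximation; further extension blocks could be added if needed).
NON_SINITIC = re.compile(r"[^\u4e00-\u9fff\u3400-\u4dbf]")

def strip_non_sinitic(text: str) -> str:
    """Remove punctuation, whitespace, and any non-Sinitic characters."""
    return NON_SINITIC.sub("", text)

raw = "上御經筵。講《大學衍義》。"   # punctuated sample
clean = strip_non_sinitic(raw)        # -> "上御經筵講大學衍義"
print(clean)
```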