
HanmunRoBERTa (March 2024 Release)

The Big Data Studies Lab at the University of Hong Kong is delighted to introduce this early release of HanmunRoBERTa, a transformer-based model trained exclusively on texts in literary Sinitic authored by Koreans before the 20th century. This version is an early prototype, optimised with data from the Veritable Records (Sillok 實錄) and the Diary of the Royal Secretariat (Sŭngjŏngwŏn ilgi 承政院日記).

HanmunRoBERTa was pretrained from scratch on 443.5 million characters of data from the Veritable Records, the Diary of the Royal Secretariat, A Compendium of Korean Collected Works (Han’guk munjip ch’onggan 韓國文集叢刊), and various Korean hanmun miscellanies. The century prediction task was fine-tuned on a sample of data from the Veritable Records and the Diary of the Royal Secretariat only. Hence, the model performs exceptionally well (~98% accuracy) on court entries but may give mixed results on Korean munjip or non-Korean texts.
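For those running the model locally, the following is a minimal sketch of century prediction with the transformers library. The sample passage is an illustrative court-style input invented for this example, and the exact label names returned depend on the model's fine-tuned classification head.

```python
# A minimal sketch of running HanmunRoBERTa century prediction locally.
from transformers import pipeline

classifier = pipeline("text-classification", model="bdsl/HanmunRoBERTa")

# An unpunctuated, court-style hanmun passage (illustrative input only).
text = "上御經筵講論語"

# Returns a list like [{'label': ..., 'score': ...}]; label names
# depend on the model's classification head.
print(classifier(text))
```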

Sample inputs are provided in the Inference API widget on this page.

At this stage, HanmunRoBERTa is prone to overfitting and requires further adjustment and refinement for improved performance. To test the model, remove all non-Sinitic characters and special symbols, including punctuation: because HanmunRoBERTa was pretrained and fine-tuned on unpunctuated texts, test samples must be unpunctuated as well. The Hugging Face Inference API does not preprocess your input automatically. If you are interested in testing HanmunRoBERTa, we recommend using our HanmunRoBERTa Century Prediction web app.
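One way to perform this cleaning yourself is sketched below. The character ranges used here (the CJK Unified Ideographs block plus Extension A) are an assumption about what counts as "Sinitic" for this model; adjust them if your texts contain rarer ideographs.

```python
import re

def strip_non_sinitic(text: str) -> str:
    # Keep only CJK Unified Ideographs (U+4E00-U+9FFF) and Extension A
    # (U+3400-U+4DBF); drop punctuation, whitespace, and everything else.
    # These ranges are an assumption, not the lab's documented pipeline.
    return "".join(re.findall(r"[\u3400-\u4DBF\u4E00-\u9FFF]", text))

sample = "上曰：「可。」"       # punctuated example
print(strip_non_sinitic(sample))  # -> 上曰可
```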
