--- language: hu license: apache-2.0 datasets: - common_crawl - wikipedia --- # huBERT base model (cased) ## Model description Cased BERT model for Hungarian, trained on the (filtered, deduplicated) Hungarian subset of the Common Crawl and a snapshot of the Hungarian Wikipedia. ## Intended uses & limitations The model can be used as any other (cased) BERT model. It has been tested on the chunking and named entity recognition tasks and set a new state-of-the-art on the former. ## Training Details of the training data and procedure can be found in the PhD thesis linked below. (With the caveat that it only contains preliminary results based on the Wikipedia subcorpus. Evaluation of the full model will appear in a future paper.) ## Eval results When fine-tuned (via `BertForTokenClassification`) on chunking and NER, the model outperforms multilingual BERT, achieves state-of-the-art results on both tasks. The exact scores are | NER | Minimal NP | Maximal NP | |-----|------------|------------| | **97.62%** | **97.14%** | **96.97%** | ### BibTeX entry and citation info If you use the model, please cite the following papers: [Nemeskey, Dávid Márk (2020). "Natural Language Processing Methods for Language Modeling." PhD Thesis. Eötvös Loránd University.](https://hlt.bme.hu/en/publ/nemeskey_2020) Bibtex: ```bibtex @PhDThesis{ Nemeskey:2020, author = {Nemeskey, Dávid Márk}, title = {Natural Language Processing Methods for Language Modeling}, year = {2020}, school = {E\"otv\"os Lor\'and University} } ``` [Nemeskey, Dávid Márk (2021). "Introducing huBERT." In: XVII. Magyar Számítógépes Nyelvészeti Konferencia (MSZNY 2021). Szeged, pp. 3-14](https://hlt.bme.hu/en/publ/hubert_2021) Bibtex: ```bibtex @InProceedings{ Nemeskey:2021a, author = {Nemeskey, Dávid Márk}, title = {Introducing \texttt{huBERT}}, booktitle = {{XVII}.\ Magyar Sz{\'a}m{\'i}t{\'o}g{\'e}pes Nyelv{\'e}szeti Konferencia ({MSZNY}2021)}, year = 2021, pages = {TBA}, address = {Szeged}, } ```