Back to all models

Unable to determine this model’s pipeline type. Check the docs .

Monthly model downloads

SZTAKI-HLT/hubert-base-cc SZTAKI-HLT/hubert-base-cc
851 downloads
last 30 days

pytorch

tf

Contributed by

Human Language Technology Group at SZTAKI university
1 team member · 1 model

How to use this model directly from the 🤗/transformers library:

			
Copy to clipboard
from transformers import AutoTokenizer, AutoModel tokenizer = AutoTokenizer.from_pretrained("SZTAKI-HLT/hubert-base-cc") model = AutoModel.from_pretrained("SZTAKI-HLT/hubert-base-cc")

huBERT base model (cased)

Model description

Cased BERT model for Hungarian, trained on the (filtered, deduplicated) Hungarian subset of the Common Crawl and a snapshot of the Hungarian Wikipedia.

Intended uses & limitations

The model can be used as any other (cased) BERT model. It has been tested on the chunking and named entity recognition tasks and set a new state-of-the-art on the former.

Training

Details of the training data and procedure can be found in the PhD thesis linked below. (With the caveat that it only contains preliminary results based on the Wikipedia subcorpus. Evaluation of the full model will appear in a future paper.)

Eval results

When fine-tuned (via BertForTokenClassification) on chunking and NER, the model outperforms multilingual BERT, achieves state-of-the-art results on the former task and comes within 0.5% F1 to the SotA on the latter. The exact scores are

NER Minimal NP Maximal NP
97.62% 97.14% 96.97%

BibTeX entry and citation info

The training corpus, parameters and the evaluation methods are discussed in the following PhD thesis:

@PhDThesis{ Nemeskey:2020,
  author = {Nemeskey, Dávid Márk},
  title  = {Natural Language Processing Methods for Language Modeling},
  year   = {2020},
  school = {E\"otv\"os Lor\'and University}
}