language:
  - he
inference: false
tags:
  - BERT
  - HPLT
  - encoder
license: apache-2.0
datasets:
  - HPLT/hplt_monolingual_v1_2

HPLT Bert for Hebrew

This is one of the encoder-only monolingual language models trained as a first release by the HPLT project. It is a so-called masked language model. In particular, we used a modification of the classic BERT model named LTG-BERT.

A monolingual LTG-BERT model is trained for every major language in the HPLT 1.2 data release (75 models total).

All the HPLT encoder-only models use the same hyper-parameters, roughly following the BERT-base setup:

  • hidden size: 768
  • attention heads: 12
  • layers: 12
  • vocabulary size: 32768
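As a quick sanity check on the numbers above, here is a minimal sketch that records the four listed hyper-parameters and derives two quantities from them. The class name `EncoderHyperParams` and its property names are hypothetical and not part of the HPLT code base. The embedding-table size assumes a plain vocabulary-by-hidden token embedding matrix, which is an assumption rather than a statement about the LTG-BERT implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderHyperParams:
    """Hypothetical container for the hyper-parameters listed above."""

    hidden_size: int = 768
    attention_heads: int = 12
    layers: int = 12
    vocab_size: int = 32768

    @property
    def head_dim(self) -> int:
        # Each attention head works on an equal slice of the hidden vector.
        return self.hidden_size // self.attention_heads

    @property
    def token_embedding_entries(self) -> int:
        # Assumes a plain (vocab_size x hidden_size) token embedding table.
        return self.vocab_size * self.hidden_size


hp = EncoderHyperParams()
print(hp.head_dim)                 # 64
print(hp.token_embedding_entries)  # 25165824
```

With 12 heads over a hidden size of 768, each head sees 64 dimensions, the same split as in BERT-base.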

Every model uses its own tokenizer trained on language-specific HPLT data. See sizes of the training corpora, evaluation results and more in our language model training report.

The training code.

The training statistics of all 75 runs

Example usage

This model currently needs a custom wrapper from modeling_ltgbert.py, so you should load it with trust_remote_code=True.

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Note: this snippet loads the English checkpoint. For the Hebrew model, presumably
# substitute "HPLT/hplt_bert_base_he" and use a Hebrew input sentence.
tokenizer = AutoTokenizer.from_pretrained("HPLT/hplt_bert_base_en")
model = AutoModelForMaskedLM.from_pretrained("HPLT/hplt_bert_base_en", trust_remote_code=True)

mask_id = tokenizer.convert_tokens_to_ids("[MASK]")
input_text = tokenizer("It's a beautiful[MASK].", return_tensors="pt")
with torch.no_grad():  # inference only, no gradients needed
    output_p = model(**input_text)
# Replace each [MASK] position with the highest-scoring token; keep all other tokens
output_text = torch.where(input_text.input_ids == mask_id, output_p.logits.argmax(-1), input_text.input_ids)

# should output: '[CLS] It's a beautiful place.[SEP]'
print(tokenizer.decode(output_text[0].tolist()))
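The example above keeps only the single best prediction per masked position. If you want several candidates instead, a sketch along the following lines may help. The helper name `top_k_at_mask` is hypothetical and not part of the model's code. The toy tensors stand in for real model output so that the snippet runs without downloading anything.

```python
import torch


def top_k_at_mask(logits, input_ids, mask_id, k=5):
    """Return the k highest-scoring token ids for every masked position.

    logits:    float tensor of shape (batch, seq_len, vocab)
    input_ids: integer tensor of shape (batch, seq_len)
    """
    mask_positions = input_ids == mask_id         # boolean, (batch, seq_len)
    masked_logits = logits[mask_positions]        # (num_masks, vocab)
    return masked_logits.topk(k, dim=-1).indices  # (num_masks, k), best first


# Toy check with synthetic tensors: pretend token id 4 is [MASK]
input_ids = torch.tensor([[1, 7, 4, 2]])
logits = torch.zeros(1, 4, 10)
logits[0, 2, 9] = 3.0  # best candidate at the masked position
logits[0, 2, 5] = 2.0  # second-best candidate
candidates = top_k_at_mask(logits, input_ids, mask_id=4, k=2)
print(candidates.tolist())  # [[9, 5]]
```

With the real model, you would pass `output_p.logits`, `input_text.input_ids` and `mask_id` from the example above, then map each row of ids back to tokens with `tokenizer.convert_ids_to_tokens`.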

The following classes are currently implemented: AutoModel, AutoModelForMaskedLM, AutoModelForSequenceClassification, AutoModelForTokenClassification, AutoModelForQuestionAnswering and AutoModelForMultipleChoice.

Cite us

@misc{degibert2024new,
      title={A New Massive Multilingual Dataset for High-Performance Language Technologies}, 
      author={Ona de Gibert and Graeme Nail and Nikolay Arefyev and Marta Bañón and Jelmer van der Linde and Shaoxiong Ji and Jaume Zaragoza-Bernabeu and Mikko Aulamo and Gema Ramírez-Sánchez and Andrey Kutuzov and Sampo Pyysalo and Stephan Oepen and Jörg Tiedemann},
      year={2024},
      eprint={2403.14009},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}