metadata
language:
  - ga
license: apache-2.0
tags:
  - irish
  - bert
widget:
  - text: Ceoltóir [MASK] ab ea Johnny Cash.

gaBERT

gaBERT is a BERT-base model trained on 7.9M Irish sentences. For more details, including the hyperparameters and pretraining corpora used, please refer to our paper.

How to use gaBERT with HuggingFace

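from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch

# AutoModelForMaskedLM replaces the deprecated AutoModelWithLMHead for BERT-style models.
tokenizer = AutoTokenizer.from_pretrained("DCU-NLP/bert-base-irish-cased-v1")
model = AutoModelForMaskedLM.from_pretrained("DCU-NLP/bert-base-irish-cased-v1")

sequence = f"Ceoltóir {tokenizer.mask_token} ab ea Johnny Cash."

# Encode the sentence and locate the position of the mask token.
input_ids = tokenizer.encode(sequence, return_tensors="pt")
mask_token_index = torch.where(input_ids == tokenizer.mask_token_id)[1]

# Run the model without tracking gradients; the first output holds the logits.
with torch.no_grad():
    token_logits = model(input_ids)[0]
mask_token_logits = token_logits[0, mask_token_index, :]

# Take the five highest-scoring vocabulary ids at the masked position.
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

for token in top_5_tokens:
    print(sequence.replace(tokenizer.mask_token, tokenizer.decode([token])))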
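The snippet above prints the five best completions but not their scores. As a rough sketch of how the logits at the masked position could be turned into ranked probabilities, the helper below applies a softmax and keeps the top k entries. It is written in plain Python so it runs without downloading the model. The function name `top_k_probabilities`, the `toy_vocab` list, and the `toy_logits` values are illustrative assumptions, not real gaBERT vocabulary or output.

```python
import math

def top_k_probabilities(logits, k):
    """Return the k highest-scoring (index, probability) pairs from raw logits."""
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank vocabulary indices by probability, highest first.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]

# Toy values for illustration only: not real gaBERT vocabulary or logits.
toy_vocab = ["ceoil", "cáiliúil", "Meiriceánach", "mór", "agus"]
toy_logits = [1.0, 2.5, 4.0, 0.5, -1.0]

for index, prob in top_k_probabilities(toy_logits, 3):
    print(f"{toy_vocab[index]}: {prob:.3f}")
```

When working with the tensors from the snippet above, the same result can be obtained with `torch.softmax` followed by `torch.topk`, reading `.values` alongside `.indices`.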

Limitations and bias

Some of the data used to pretrain gaBERT was scraped from the web and may contain ethically problematic text (bias, hate speech, adult content, etc.). Consequently, downstream tasks and applications that use gaBERT should be thoroughly tested with respect to ethical considerations.

BibTeX entry and citation info

If you use this model in your research, please consider citing our paper:

@article{DBLP:journals/corr/abs-2107-12930,
  author    = {James Barry and
               Joachim Wagner and
               Lauren Cassidy and
               Alan Cowap and
               Teresa Lynn and
               Abigail Walsh and
               M{\'{\i}}che{\'{a}}l J. {\'{O}} Meachair and
               Jennifer Foster},
  title     = {gaBERT - an Irish Language Model},
  journal   = {CoRR},
  volume    = {abs/2107.12930},
  year      = {2021},
  url       = {https://arxiv.org/abs/2107.12930},
  archivePrefix = {arXiv},
  eprint    = {2107.12930},
  timestamp = {Fri, 30 Jul 2021 13:03:06 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2107-12930.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}