metadata

language:
  - en
  - ar
datasets:
  - gigaword
  - oscar
  - wikipedia

GigaBERT-v3

GigaBERT-v3 is a customized bilingual BERT for English and Arabic. It was pre-trained on a large-scale corpus (Gigaword, OSCAR, and Wikipedia) of roughly 10B tokens, and the paper below reports state-of-the-art zero-shot transfer performance from English to Arabic on information extraction (IE) tasks. More details can be found in the following paper:

@inproceedings{lan2020gigabert,
  author    = {Lan, Wuwei and Chen, Yang and Xu, Wei and Ritter, Alan},
  title     = {An Empirical Study of Pre-trained Transformers for Arabic Information Extraction},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year      = {2020}
}

Usage

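from transformers import BertTokenizer, BertForTokenClassification

# Load the bilingual tokenizer (input is lowercased) and the model with a token-classification head
tokenizer = BertTokenizer.from_pretrained("lanwuwei/GigaBERT-v3-Arabic-and-English", do_lower_case=True)
model = BertForTokenClassification.from_pretrained("lanwuwei/GigaBERT-v3-Arabic-and-English")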

More code examples can be found here.
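
Fine-tuning a token-classification model for IE usually requires expanding word-level labels to the WordPiece tokens the tokenizer produces. The sketch below shows one common convention (label the first piece of each word, mask the rest); it is not taken from the GigaBERT paper or repository. The names `align_labels` and `toy_tokenize` are hypothetical, and `toy_tokenize` is only a stand-in so the example runs without downloading the model.

```python
from typing import Callable, List, Sequence, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def align_labels(
    words: Sequence[str],
    label_ids: Sequence[int],
    tokenize: Callable[[str], List[str]],
) -> Tuple[List[str], List[int]]:
    """Expand word-level label ids to WordPiece level.

    The first piece of each word keeps the word's label; continuation pieces
    and the [CLS]/[SEP] markers get IGNORE_INDEX so the loss skips them.
    """
    if len(words) != len(label_ids):
        raise ValueError("words and label_ids must have the same length")
    tokens, aligned = ["[CLS]"], [IGNORE_INDEX]
    for word, label in zip(words, label_ids):
        # Fall back to [UNK] if the tokenizer returns nothing for a word.
        pieces = tokenize(word) or ["[UNK]"]
        tokens.extend(pieces)
        aligned.extend([label] + [IGNORE_INDEX] * (len(pieces) - 1))
    tokens.append("[SEP]")
    aligned.append(IGNORE_INDEX)
    return tokens, aligned


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in for tokenizer.tokenize: lowercases, splits every 4 characters."""
    word = word.lower()
    return [word[:4]] + ["##" + word[i:i + 4] for i in range(4, len(word), 4)]


# Assumed label ids for illustration: 0 = O, 1 = B-PER, 2 = B-LOC
tokens, aligned = align_labels(["Obama", "visited", "Washington"], [1, 0, 2], toy_tokenize)
```

With the real tokenizer loaded above, `tokenizer.tokenize` can be passed in place of `toy_tokenize`, and `tokenizer.convert_tokens_to_ids(tokens)` turns the result into model input ids.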