GigaBERT-v3

GigaBERT-v3 is a customized bilingual BERT for English and Arabic. It was pre-trained in a large-scale corpus (Gigaword+Oscar+Wikipedia) with ~10B tokens, showing state-of-the-art zero-shot transfer performance from English to Arabic on information extraction (IE) tasks. More details can be found in the following paper:

@inproceedings{lan2020gigabert,
  author     = {Lan, Wuwei and Chen, Yang and Xu, Wei and Ritter, Alan},
    title      = {An Empirical Study of Pre-trained Transformers for Arabic Information Extraction},
    booktitle  = {Proceedings of The 2020 Conference on Empirical Methods on Natural Language Processing (EMNLP)},
    year       = {2020}
  } 

Usage

from transformers import *
tokenizer = BertTokenizer.from_pretrained("lanwuwei/GigaBERT-v3-Arabic-and-English", do_lower_case=True)
model = BertForTokenClassification.from_pretrained("lanwuwei/GigaBERT-v3-Arabic-and-English")

More code examples can be found here.

Downloads last month
89
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.

Datasets used to train lanwuwei/GigaBERT-v3-Arabic-and-English