---
datasets:
- oscar
language:
- he
- ar
---
# HeArBERT
A bilingual BERT for Arabic and Hebrew, pretrained on the respective parts of the OSCAR corpus.
To process Arabic with this model, first transliterate it to Hebrew script. The code for doing so is available in the [preprocessing file](./preprocessing.py) and can be used as follows:
```python
from transformers import AutoTokenizer
from preprocessing import transliterate_arabic_to_hebrew

tokenizer = AutoTokenizer.from_pretrained("aviadrom/HeArBERT")

text_ar = "مرحبا"  # Arabic for "hello"
text_he = transliterate_arabic_to_hebrew(text_ar)  # map to Hebrew script
tokenizer(text_he)
```
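The actual mapping used by the model is defined in `preprocessing.py`. As a rough illustration of the idea behind mapping tokens onto a shared character space, a character-level Arabic-to-Hebrew cognate transliteration might look like the sketch below. The `AR_TO_HE` table and `transliterate` helper are hypothetical names for illustration only, not the model's real preprocessing:

```python
# Illustrative sketch only: a partial Arabic-to-Hebrew cognate mapping.
# AR_TO_HE and transliterate() are hypothetical; the model's actual
# mapping lives in preprocessing.py.
AR_TO_HE = {
    "ا": "א", "ب": "ב", "ج": "ג", "د": "ד", "ه": "ה",
    "و": "ו", "ز": "ז", "ح": "ח", "ط": "ט", "ي": "י",
    "ك": "כ", "ل": "ל", "م": "מ", "ن": "נ", "س": "ס",
    "ع": "ע", "ف": "פ", "ص": "צ", "ق": "ק", "ر": "ר",
    "ش": "ש", "ت": "ת",
}

def transliterate(text: str) -> str:
    # Map each Arabic character to its Hebrew counterpart,
    # leaving unmapped characters (spaces, punctuation) unchanged.
    return "".join(AR_TO_HE.get(ch, ch) for ch in text)

print(transliterate("مرحبا"))  # → "מרחבא"
```

In practice, use the published `preprocessing.py`, which defines the exact character table the model was trained with.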
# Citation
If you find our work useful in your research, please consider citing:
```
@article{rom2024training,
  title={Training a Bilingual Language Model by Mapping Tokens onto a Shared Character Space},
  author={Rom, Aviad and Bar, Kfir},
  journal={arXiv preprint arXiv:2402.16065},
  year={2024}
}
```