---
datasets:
- oscar
language:
- he
- ar
---
# HeArBERT
A bilingual BERT for Arabic and Hebrew, pretrained on the respective parts of the OSCAR corpus.
To process Arabic with this model, first transliterate it to Hebrew script. The code for doing so is available in the [preprocessing file](./preprocessing.py) and can be used as follows:
```python
from transformers import AutoTokenizer
from preprocessing import transliterate_arabic_to_hebrew

tokenizer = AutoTokenizer.from_pretrained("aviadrom/HeArBERT")

text_ar = "مرحبا"  # Arabic for "hello"
text_he = transliterate_arabic_to_hebrew(text_ar)  # map to Hebrew script
tokenizer(text_he)
```
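The actual mapping used by the model is defined in `preprocessing.py`. As a rough illustration of the idea behind mapping tokens onto a shared character space, a character-level Arabic-to-Hebrew cognate transliteration might look like the sketch below. The `AR_TO_HE` table and `transliterate` helper are hypothetical names for illustration only, not the model's real preprocessing:

```python
# Illustrative sketch only: a partial Arabic-to-Hebrew cognate mapping.
# AR_TO_HE and transliterate() are hypothetical; the model's actual
# mapping lives in preprocessing.py.
AR_TO_HE = {
    "ا": "א", "ب": "ב", "ج": "ג", "د": "ד", "ه": "ה",
    "و": "ו", "ز": "ז", "ح": "ח", "ط": "ט", "ي": "י",
    "ك": "כ", "ل": "ל", "م": "מ", "ن": "נ", "س": "ס",
    "ع": "ע", "ف": "פ", "ص": "צ", "ق": "ק", "ر": "ר",
    "ش": "ש", "ت": "ת",
}

def transliterate(text: str) -> str:
    # Map each Arabic character to its Hebrew counterpart,
    # leaving unmapped characters (spaces, punctuation) unchanged.
    return "".join(AR_TO_HE.get(ch, ch) for ch in text)

print(transliterate("مرحبا"))  # → "מרחבא"
```

In practice, use the published `preprocessing.py`, which defines the exact character table the model was trained with.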
# Citation
If you find our work useful in your research, please consider citing:
```
@article{rom2024training,
  title={Training a Bilingual Language Model by Mapping Tokens onto a Shared Character Space},
  author={Rom, Aviad and Bar, Kfir},
  journal={arXiv preprint arXiv:2402.16065},
  year={2024}
}
```