Arabic version!?

#2
by Ali-C137 - opened

Hi buddy, I'm interested in replicating the work you've done, but for Arabic! Would you be able to help out with pointers and guidance? Your help would save me a ton of time 🙃

Owner

Definitely. The code to reproduce the work with another language is straightforward; the harder part is finding data. You will need to:

  1. Find and prepare Arabic pretraining data, or translate RedPajama/similar datasets to make synthetic data (https://github.com/iocuydi/amharic-llama-llava/tree/main/translation)
  2. Create a custom tokenizer with sentencepiece and merge it into the base vocabulary (https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/Training-Details#preparation-vocabulary-expansion); see the tokenizer sketch after this list
  3. Extend Llama 2 pretraining with the custom tokenizer (https://github.com/iocuydi/amharic-llama-llava/blob/main/run_pt.sh); see the embedding-resize sketch below
  4. Create Arabic finetuning data (https://github.com/iocuydi/amharic-llama-llava/blob/main/data/prepare_amharic_data.py); see the data-formatting sketch below
  5. [Optional, if you want a finetuned text-only model] Finetune (https://github.com/iocuydi/amharic-llama-llava/blob/main/finetune.py)
  6. Create Arabic visual instruction tuning data from the LLaVA dataset (https://github.com/iocuydi/amharic-llama-llava/blob/main/translation/translate_structured.py); see the translation sketch below
  7. Train Arabic LLaVA from your Arabic Llama model (https://github.com/iocuydi/amharic-llama-llava/blob/main/llava/README.md)
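
For step 2, here is a minimal sketch of the vocabulary-expansion recipe from the Chinese-LLaMA-Alpaca wiki linked above: train a sentencepiece model on your Arabic corpus, then append its new pieces to the base LLaMA tokenizer. The file paths, vocab size, and base model id are placeholders, not values from the repo:

```python
# Vocabulary expansion sketch: train an Arabic sentencepiece model, then merge
# its pieces into the base LLaMA tokenizer. Paths and sizes are placeholders.
import sentencepiece as spm
from sentencepiece import sentencepiece_model_pb2 as sp_pb2_model
from transformers import LlamaTokenizer

# 1) Train a BPE model on raw Arabic text (one sentence per line).
spm.SentencePieceTrainer.train(
    input="arabic_corpus.txt",   # placeholder corpus file
    model_prefix="arabic_sp",
    vocab_size=32000,            # tune to your corpus
    model_type="bpe",
    character_coverage=0.9995,
)

# 2) Parse both models as protos and append the new Arabic pieces to LLaMA's.
llama_tokenizer = LlamaTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
llama_spm = sp_pb2_model.ModelProto()
llama_spm.ParseFromString(llama_tokenizer.sp_model.serialized_model_proto())

arabic_spm = sp_pb2_model.ModelProto()
with open("arabic_sp.model", "rb") as f:
    arabic_spm.ParseFromString(f.read())

existing = {p.piece for p in llama_spm.pieces}
for p in arabic_spm.pieces:
    if p.piece not in existing:
        new_piece = sp_pb2_model.ModelProto().SentencePiece()
        new_piece.piece = p.piece
        new_piece.score = 0
        llama_spm.pieces.append(new_piece)

# 3) Save the merged model for use as the new tokenizer.
with open("merged_tokenizer.model", "wb") as f:
    f.write(llama_spm.SerializeToString())
```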
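For step 3, run_pt.sh drives the continued-pretraining run itself; the one model-side change the expanded tokenizer forces is resizing the embedding layers before training. A hedged sketch with transformers (paths and model id are placeholders):

```python
# Sketch: load Llama 2 with the merged tokenizer and grow its embeddings to
# match the expanded vocabulary before continued pretraining.
from transformers import LlamaForCausalLM, LlamaTokenizer

tokenizer = LlamaTokenizer.from_pretrained("path/to/merged_tokenizer")  # placeholder
model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# New vocabulary rows are randomly initialized and learned during the
# continued-pretraining run on Arabic text.
model.resize_token_embeddings(len(tokenizer))

# ...then hand the model/tokenizer to your causal-LM training loop or Trainer.
```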
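For step 4, the shape of the output matters more than the source. This sketch converts a raw QA dump into alpaca-style records; the input file and its column names are made up here, and prepare_amharic_data.py is the reference for the exact schema the finetuning script expects:

```python
# Sketch: shape raw Arabic QA pairs into alpaca-style instruction records.
# "arabic_qa_raw.jsonl" and its "question"/"answer" fields are placeholders.
import json

records = []
with open("arabic_qa_raw.jsonl", encoding="utf-8") as f:
    for line in f:
        row = json.loads(line)
        records.append({
            "instruction": row["question"],
            "input": "",
            "output": row["answer"],
        })

with open("arabic_finetune_data.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```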
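And for step 6, the LLaVA instruction data is a JSON list of conversations whose text turns can carry an `<image>` placeholder that must survive translation. This is a sketch of the idea behind translate_structured.py, not the repo's actual pipeline; `translate_to_arabic` is a hypothetical stand-in for whatever MT backend you plug in:

```python
# Sketch: translate the text turns of the LLaVA instruction-tuning JSON while
# keeping the structure and the <image> token intact.
import json

IMAGE_TOKEN = "<image>"

def translate_to_arabic(text: str) -> str:
    # Hypothetical stand-in: replace with your MT model or API call.
    return text

with open("llava_instruct_150k.json", encoding="utf-8") as f:
    samples = json.load(f)

for sample in samples:
    for turn in sample["conversations"]:
        had_image = IMAGE_TOKEN in turn["value"]
        # Keep the image token out of the MT input, then restore it.
        text = turn["value"].replace(IMAGE_TOKEN, "").strip()
        translated = translate_to_arabic(text)
        turn["value"] = f"{IMAGE_TOKEN}\n{translated}" if had_image else translated

with open("llava_instruct_150k_ar.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=2)
```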

I would recommend skipping large-scale translation for pretraining to begin with and seeing whether you can get decent results with just public Arabic data; since the language is far more widely spoken than Amharic, there is likely to be much more data available. You will likely need to do some translation for finetuning, but that is much cheaper.

Happy to help with specific questions.

Thank you so much, dear @iocuydi, that was super helpful 🙏🏻
Can I get back to you here or through GitHub issues if I have any other questions?
I went through the paper earlier today and I must say I really enjoyed how it's written 🔥 super straightforward, no fancy expressions!
Thank you for your work 🤗

Owner

Of course!
