Fine-tuning on a new language

#35 opened by AliMirlou

Hello everyone,

Can anyone provide instructions on how to fine-tune this model on a new language, please?

Aside from the fine-tuning code itself, there are a few things I don't know: the format of the texts in the dataset, the approximate minimum number of tokens needed for a fairly satisfying result, and the changes I might need to make to the tokenizer for each language.

Thanks in advance.

I have never used it myself, but it seems possible to fine-tune the model with a new file. Please see https://github.com/rmihaylov/falcontune

As the repository explains, the format of the text is simply the same as alpaca-finetune.json:

[
    {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night."
    },
    {
        "instruction": "What are the three primary colors?",
        "input": "",
        "output": "The three primary colors are red, blue, and yellow. These colors are called primary because they cannot be created by mixing other colors and all other colors can be made by combining them in various proportions. In the additive color system, used for light, the primary colors are red, green, and blue (RGB)."
    }
]
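
For illustration, here is a rough sketch of how Alpaca-style trainers typically turn each record into a single training prompt. The template below follows the common Stanford Alpaca convention and is only an assumption about what falcontune does internally, not its actual code:

import json

# Common Alpaca-style prompt template (assumed; falcontune may differ in details).
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
)

def build_prompt(example):
    # Render one dataset record into the text the model is trained on.
    prompt = TEMPLATE.format(instruction=example["instruction"])
    if example.get("input"):
        prompt += "### Input:\n" + example["input"] + "\n\n"
    prompt += "### Response:\n" + example["output"]
    return prompt

with open("alpaca-finetune.json", encoding="utf-8") as f:
    records = json.load(f)

print(build_prompt(records[0]))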

How to run it:

falcontune finetune \
    --model=falcon-40b \
    --weights=tiiuae/falcon-40b \
    --dataset=./alpaca-finetune.json \
    --data_type=alpaca \
    --lora_out_dir=./falcon-40b-alpaca/ \
    --mbatch_size=1 \
    --batch_size=2 \
    --epochs=3 \
    --lr=3e-4 \
    --cutoff_len=256 \
    --lora_r=8 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --warmup_steps=5 \
    --save_steps=50 \
    --save_total_limit=3 \
    --logging_steps=5 \
    --target_modules='["query_key_value"]'
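
Once the run finishes, the LoRA adapter should end up in the --lora_out_dir given above. As a rough sketch (assuming the paths from the command and the standard transformers + peft APIs, not anything falcontune-specific), you could load it for inference like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base Falcon model and tokenizer.
base = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")

# Attach the LoRA weights produced by the finetune run above.
model = PeftModel.from_pretrained(base, "./falcon-40b-alpaca/")
model.eval()

prompt = "### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))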
Technology Innovation Institute org

You could also have a look at this blog post regarding fine-tuning.

Fine-tuning on a European language should be enough, but languages with a different character set (e.g., Chinese, Arabic) could be difficult.
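
One quick, informal way to judge how well the existing tokenizer covers a new script (an illustrative check, not an official recipe) is to compare how many tokens it spends per word on sample text in each language; a much higher ratio for the target language suggests the vocabulary covers it poorly and that fine-tuning alone may not be enough:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")

# Illustrative sample sentences (rough translations of the same instruction).
samples = {
    "English": "Give three tips for staying healthy.",
    "French": "Donnez trois conseils pour rester en bonne santé.",
    "Arabic": "قدم ثلاث نصائح للبقاء بصحة جيدة.",
}

for lang, text in samples.items():
    tokens = tokenizer.tokenize(text)
    words = text.split()
    print(f"{lang}: {len(tokens)} tokens / {len(words)} words "
          f"= {len(tokens) / len(words):.1f} tokens per word")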

What about programming languages?

Same here: I need to fine-tune on a new language (Gujarati). Can you help me, or please share sample code?
