Fine-tuning on a new language

#35 opened by AliMirlou

Hello everyone,

Can anyone provide instructions on how to fine-tune this model on a new language, please?

Aside from the fine-tuning code itself, there are a few things I don't know: the format of the texts in the dataset, the approximate minimum number of tokens needed for a fairly satisfying result, and the changes I might need to make to the tokenizer for each language.

Thanks in advance.

I have never used it myself, but it seems possible to fine-tune the model with a new file. Please see https://github.com/rmihaylov/falcontune

As the repository explains, the format of the text is simply the same as alpaca-finetune.json:

[
    {
        "instruction": "Give three tips for staying healthy.",
        "input": "",
        "output": "1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.\n\n2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.\n\n3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night."
    },
    {
        "instruction": "What are the three primary colors?",
        "input": "",
        "output": "The three primary colors are red, blue, and yellow. These colors are called primary because they cannot be created by mixing other colors and all other colors can be made by combining them in various proportions. In the additive color system, used for light, the primary colors are red, green, and blue (RGB)."
    }
]
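
For illustration, here is a rough sketch of how Alpaca-style trainers typically turn each record into a single training prompt. The template below follows the common Stanford Alpaca convention and is only an assumption about what falcontune does internally, not its actual code:

import json

# Common Alpaca-style prompt template (assumed; falcontune may differ in details).
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
)

def build_prompt(example):
    # Render one dataset record into the text the model is trained on.
    prompt = TEMPLATE.format(instruction=example["instruction"])
    if example.get("input"):
        prompt += "### Input:\n" + example["input"] + "\n\n"
    prompt += "### Response:\n" + example["output"]
    return prompt

with open("alpaca-finetune.json", encoding="utf-8") as f:
    records = json.load(f)

print(build_prompt(records[0]))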

How to run it:

falcontune finetune \
    --model=falcon-40b \
    --weights=tiiuae/falcon-40b \
    --dataset=./alpaca-finetune.json \
    --data_type=alpaca \
    --lora_out_dir=./falcon-40b-alpaca/ \
    --mbatch_size=1 \
    --batch_size=2 \
    --epochs=3 \
    --lr=3e-4 \
    --cutoff_len=256 \
    --lora_r=8 \
    --lora_alpha=16 \
    --lora_dropout=0.05 \
    --warmup_steps=5 \
    --save_steps=50 \
    --save_total_limit=3 \
    --logging_steps=5 \
    --target_modules='["query_key_value"]'
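
Once the run finishes, the LoRA adapter should end up in the --lora_out_dir given above. As a rough sketch (assuming the paths from the command and the standard transformers + peft APIs, not anything falcontune-specific), you could load it for inference like this:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base Falcon model and tokenizer.
base = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")

# Attach the LoRA weights produced by the finetune run above.
model = PeftModel.from_pretrained(base, "./falcon-40b-alpaca/")
model.eval()

prompt = "### Instruction:\nGive three tips for staying healthy.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(base.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(output[0], skip_special_tokens=True))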
Technology Innovation Institute org

You could also have a look at this blog post regarding fine-tuning.

Fine-tuning on a European language should be enough, but languages with a different character set (e.g., Chinese, Arabic) could be difficult.
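
One quick, informal way to judge how well the existing tokenizer covers a new script (an illustrative check, not an official recipe) is to compare how many tokens it spends per word on sample text in each language; a much higher ratio for the target language suggests the vocabulary covers it poorly and that fine-tuning alone may not be enough:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-40b")

# Illustrative sample sentences (rough translations of the same instruction).
samples = {
    "English": "Give three tips for staying healthy.",
    "French": "Donnez trois conseils pour rester en bonne santé.",
    "Arabic": "قدم ثلاث نصائح للبقاء بصحة جيدة.",
}

for lang, text in samples.items():
    tokens = tokenizer.tokenize(text)
    words = text.split()
    print(f"{lang}: {len(tokens)} tokens / {len(words)} words "
          f"= {len(tokens) / len(words):.1f} tokens per word")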

What about programming languages?

Same here: I need to fine-tune on a new language (Gujarati). Can you help me, or please share sample code?
