---
base_model:
- mistralai/Mistral-7B-v0.3
datasets:
- wikimedia/wikipedia
- FreedomIntelligence/alpaca-gpt4-arabic
language:
- ar
- en
license: apache-2.0
tags:
- text-generation-inference
- transformers
- unsloth
- mistral
- trl
---
An experiment in pre-training on Arabic text and then finetuning on instructions, starting from the quantized `mistralai/Mistral-7B-v0.3` model from `unsloth`. This is a first attempt at pre-training, so expect issues and low-quality outputs. The repo contains the merged, quantized model and a GGUF version.
See the [Spaces demo](https://huggingface.co/spaces/nazimali/mistral-7b-v0.3-instruct-arabic) for an example.
### Example usage
#### llama-cpp-python
```python
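from llama_cpp import Llama

# Same template as the finetuning prompt. In English: "Below is an instruction
# that describes a task. Write a response that appropriately completes the
# request.", followed by "### Instructions:" and "### Answer:".
inference_prompt = """فيما يلي تعليمات تصف مهمة. اكتب استجابة تكمل الطلب بشكل مناسب.
### تعليمات:
{}
### إجابة:
"""

# Download the Q8_0 GGUF file from the Hugging Face Hub and load it.
llm = Llama.from_pretrained(
    repo_id="nazimali/mistral-7b-v0.3-instruct-arabic",
    filename="Q8_0.gguf",
)

llm.create_chat_completion(
    messages=[
        {
            "role": "user",
            "content": inference_prompt.format("السلام عليكم، هيا نموء"),
        }
    ]
)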
```
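`create_chat_completion` returns an OpenAI-style completion dictionary, so the generated text has to be pulled out of it. The sketch below shows one way to do that; `extract_reply` and `example_response` are hypothetical names used only for illustration, and the sample dictionary is hand-written rather than real model output.

```python
from typing import Any, Dict


def extract_reply(response: Dict[str, Any]) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    return response["choices"][0]["message"]["content"].strip()


# Hand-written stand-in shaped like the dictionary create_chat_completion
# returns. This is NOT actual output from the model.
example_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": " وعليكم السلام "},
            "finish_reason": "stop",
        }
    ]
}

print(extract_reply(example_response))
```

With a loaded model, the same helper could be applied to the return value of `llm.create_chat_completion(...)`.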
#### llama.cpp
```shell
./llama-cli \
  --hf-repo "nazimali/mistral-7b-v0.3-instruct-arabic" \
  --hf-file Q8_0.gguf \
  -p "السلام عليكم، هيا نموء" \
  --conversation
```
### Training
#### Pre-training data:
- `wikimedia/wikipedia`
  - `20231101.ar`
  - Used 6,096 rows, 0.05% of the total data
#### Finetuning data:
- `FreedomIntelligence/alpaca-gpt4-arabic`
  - Used 49,969 rows, 100% of all the data
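The two datasets listed above could be loaded with the Hugging Face `datasets` library roughly as follows. This is a sketch under assumptions, not the training script: the card does not say how the 6,096 Wikipedia rows were selected, so taking the first rows of the `train` split is a guess, and `pretrain_split` and `load_training_data` are hypothetical helper names.

```python
def pretrain_split(n_rows: int) -> str:
    """Build a split spec that selects the first n_rows of the train split."""
    return f"train[:{n_rows}]"


def load_training_data(pretrain_rows: int = 6096):
    """Load the pre-training subset and the full finetuning set.

    Assumption: the pre-training rows are simply the first rows of the split.
    """
    # Imported lazily so the helpers above work without `datasets` installed.
    from datasets import load_dataset

    pretrain = load_dataset(
        "wikimedia/wikipedia", "20231101.ar", split=pretrain_split(pretrain_rows)
    )
    finetune = load_dataset("FreedomIntelligence/alpaca-gpt4-arabic", split="train")
    return pretrain, finetune


print(pretrain_split(6096))
```

Calling `load_training_data()` downloads both datasets, so it needs network access and disk space.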
#### Finetuning instruction format:
```python
finetune_prompt = """فيما يلي تعليمات تصف مهمة. اكتب استجابة تكمل الطلب بشكل مناسب.
### تعليمات:
{}
### إجابة:
"""
```
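The template has a single placeholder for the instruction and ends right after the answer header, so a training example presumably consists of the filled template followed by the target response. A minimal sketch under that assumption is shown below; `format_example`, `EOS_TOKEN`, and the choice to append an end-of-sequence token are illustrative and not taken from the training code.

```python
finetune_prompt = """فيما يلي تعليمات تصف مهمة. اكتب استجابة تكمل الطلب بشكل مناسب.
### تعليمات:
{}
### إجابة:
"""

# Assumption: "</s>" is the end-of-sequence token of the Mistral tokenizer.
EOS_TOKEN = "</s>"


def format_example(instruction: str, response: str, eos_token: str = EOS_TOKEN) -> str:
    """Fill the template with the instruction, then append the response and EOS."""
    return finetune_prompt.format(instruction) + response + eos_token


# Placeholder strings stand in for a real instruction/response pair.
print(format_example("<instruction text>", "<response text>"))
```

Appending the end-of-sequence token is a common way to teach a model where a response stops; in practice it would usually be read from the tokenizer rather than hard-coded.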