---
license: mit
---
This model is finetuned from HuggingFaceH4/zephyr-7b-gemma-v0.1 on nine Indian languages (Hindi, Tamil, Punjabi, Bengali, Gujarati, Oriya, Telugu, Kannada, Malayalam) plus English.

To improve its reasoning and maths skills, we first SFT-tune Gemma on Microsoft's Orca datasets.
We utilize the Orca maths Hindi dataset GenVRadmin/Aryabhatta-Orca-Maths-Hindi \
and the original Orca maths dataset microsoft/orca-math-word-problems-200k.

This pushes the MATHS score from 24.3 for Gemma-7B to 25.5 for Zephyr-Gemma and 31.6 for GemmaOrca.
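As an illustration only (not the authors' published pipeline), here is a minimal sketch of assembling such an SFT mix with the `datasets` library; the `question`/`answer` column names are assumptions and may differ, especially for the Hindi set:

```python
from datasets import load_dataset, concatenate_datasets

def to_sft_text(ex):
    # Cast each word problem into the Alpaca-style template shown later in
    # this card. The "question"/"answer" column names are assumptions.
    return {"text": f"### Instruction:\n{ex['question']}\n### Response:\n{ex['answer']}"}

parts = []
for name in ("GenVRadmin/Aryabhatta-Orca-Maths-Hindi",
             "microsoft/orca-math-word-problems-200k"):
    ds = load_dataset(name, split="train")
    # Keep only the shared "text" column so the two sets can be concatenated.
    parts.append(ds.map(to_sft_text, remove_columns=ds.column_names))

sft_data = concatenate_datasets(parts)
```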
The model is then finetuned on GenVR's Samvaad datasets (GenVRadmin/Samvaad-Indic-Positive, GenVRadmin/Samvaad-Tamil-Mixtral, and a subset of GenVRadmin/Samvaad-Mixed-Language-3).

It is then further finetuned on various open-sourced datasets (see the mixing sketch after this list):
- Telugu-LLM-Labs/yahma_alpaca_cleaned_telugu_filtered_and_romanized
- Telugu-LLM-Labs/teknium_GPTeacher_general_instruct_telugu_filtered_and_romanized
- abhinand/tamil-alpaca
- Tensoic/airoboros-3.2_kn
- Tensoic/gpt-teacher_kn
- Tensoic/Alpaca-Gujarati
- HydraIndicLM/bengali_alpaca_dolly_67k
- Open-Orca/OpenOrca
- pankajmathur/alpaca_orca
- OdiaGenAI/Odia_Alpaca_instructions_52k
- OdiaGenAI/gpt-teacher-roleplay-odia-3k
- GenVRadmin/Samvaad-Punjabi-Mini
- pankajmathur/WizardLM_Orca
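One hedged sketch of how such a multilingual mix could be balanced with `interleave_datasets`; the column names and sampling probabilities here are illustrative assumptions, not the authors' actual recipe:

```python
from datasets import load_dataset, interleave_datasets

def normalize(name, instr_col="instruction", out_col="output"):
    # Reduce each set to a single "text" column so interleave_datasets accepts
    # them together; the column names are guesses and may differ per dataset.
    ds = load_dataset(name, split="train")
    return ds.map(
        lambda ex: {"text": f"### Instruction:\n{ex[instr_col]}\n### Response:\n{ex[out_col]}"},
        remove_columns=ds.column_names,
    )

sets = [
    normalize("abhinand/tamil-alpaca"),
    normalize("Tensoic/Alpaca-Gujarati"),
    normalize("GenVRadmin/Samvaad-Punjabi-Mini"),
]
# Oversampling ratios are illustrative only.
mix = interleave_datasets(sets, probabilities=[0.4, 0.4, 0.2], seed=42)
```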
The model achieves the following scores on benchmarks:
| Model | AGIEval | GPT4All | TruthfulQA | BigBench | Average ⬇️ |
|---|---|---|---|---|---|
| AryaBhatta-GemmaOrca | 35.9 | 72.26 | 53.85 | 40.35 | 50.59 |
| zephyr-7b-beta | 37.52 | 71.77 | 55.26 | 39.77 | 51.08 |
| zephyr-7b-gemma-v0.1 | 34.22 | 66.37 | 52.19 | 37.10 | 47.47 |
| mlabonne/Gemmalpaca-7B | 21.6 | 40.87 | 44.85 | 30.49 | 34.45 |
| google/gemma-7b-it | 21.33 | 40.84 | 41.70 | 30.25 | 33.53 |
How to use:
```python
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

hf_token = "<your_hf_token>"  # Hugging Face access token, if the repo requires one

# device_map="auto" places the model on GPU so it matches the inputs moved to
# "cuda" below.
model = AutoPeftModelForCausalLM.from_pretrained(
    "GenVRadmin/AryaBhatta-GemmaOrca",
    load_in_4bit=False,
    device_map="auto",
    token=hf_token,
)
tokenizer = AutoTokenizer.from_pretrained("GenVRadmin/AryaBhatta-GemmaOrca")

# Alpaca-style template the model was finetuned with.
input_prompt = """
### Instruction:
{}
### Input:
{}
### Response:
{}"""

input_text = input_prompt.format(
    "Answer this question about India.",  # instruction
    "Who is the Prime Minister of India?",  # input
    "",  # output - leave this blank for generation!
)

inputs = tokenizer([input_text], return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=300, use_cache=True)
response = tokenizer.batch_decode(outputs)[0]
```
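Note that `generate` returns the prompt tokens along with the completion, so `response` above still contains the template. A small follow-up sketch (not from the original card) to decode only the newly generated answer:

```python
# Slice off the prompt tokens so only the model's answer is decoded.
new_tokens = outputs[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```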