---
license: llama3
language:
- tr
---

CEREBRUM LLM

# CERE-LLMA-3-8b-TR

This model is a fine-tuned version of the Llama 3 8B Large Language Model (LLM) for Turkish. It was trained on high-quality Turkish instruction sets created from various open-source and internal resources. The Turkish instruction dataset was carefully annotated so that the model carries out Turkish instructions in an accurate and organized manner.

## Model Details

- **Base Model**: Llama 3 8B
- **Tokenizer Extension**: Specifically extended for Turkish
- **Training Dataset**: 5 billion tokens of cleaned raw Turkish text, plus custom Turkish instruction sets
- **Training Method**: Initial training with DoRA, followed by fine-tuning with LoRA

## Benchmark Results

- **Winogrande_tr**: 56.16
- **TruthfulQA_tr_v0.2**: 47.46
- **Mmlu_tr_v0.2**: 46.46
- **HellaSwag_tr_v0.2**: 48.87
- **GSM8k_tr_v0.2**: 25.43
- **Arc_tr_v0.2**: 41.97

## Usage Examples

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Cerebrum/cere-llama-3-8b-tr"

# device_map="auto" places the weights on the available device(s)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# English: "How do you print 'Hello World' to the screen in Python?"
prompt = "Python'da ekrana 'Merhaba Dünya' nasıl yazılır?"
messages = [
    # System prompt, in English: "You are a helpful AI produced by Cerebrum Tech
    # that tries to give the best answer by following the given instructions."
    {"role": "system", "content": "Sen, Cerebrum Tech tarafından üretilen ve verilen talimatları takip ederek en iyi cevabı üretmeye çalışan yardımcı bir yapay zekasın."},
    {"role": "user", "content": prompt},
]

# Render the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Move the inputs to the device the model was loaded onto
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,  # passes both input_ids and attention_mask
    do_sample=True,  # needed for temperature, top_k and top_p to take effect
    temperature=0.3,
    top_k=50,
    top_p=0.9,
    max_new_tokens=512,
    repetition_penalty=1.0,
)

# Keep only the newly generated tokens by dropping the prompt prefix
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```
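
The last step of the usage example slices each generated sequence, because `generate` on a decoder-only model returns the prompt tokens followed by the continuation. The sketch below illustrates that slicing on plain Python lists, so it runs without downloading the model. The helper name `strip_prompt_tokens` and the token IDs are hypothetical and chosen only for illustration.

```python
def strip_prompt_tokens(input_ids, output_ids):
    """Return only the newly generated tokens for each sequence in a batch.

    Assumes every output sequence begins with its own prompt tokens, which
    is how decoder-only models return results from `generate`.
    """
    return [
        out[len(inp):]  # drop a prefix as long as the prompt
        for inp, out in zip(input_ids, output_ids)
    ]


# Made-up token IDs, for illustration only.
prompt_batch = [[101, 7, 8], [101, 9]]
generated_batch = [[101, 7, 8, 42, 43], [101, 9, 55]]

new_tokens = strip_prompt_tokens(prompt_batch, generated_batch)
print(new_tokens)  # [[42, 43], [55]]
```

Decoding only these remaining tokens is what keeps the system and user messages out of the final `response` string.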