---
license: llama2
datasets:
  - uonlp/CulturaX
language:
  - tr
  - en
metrics:
  - chrf
  - accuracy
  - bleu
---
SambaLingo Logo

SambaLingo-Turkish-Base

SambaLingo-Turkish-Base is a pretrained bilingual Turkish and English model that adapts Llama 2 to Turkish by training on billions of tokens from the CulturaX dataset. The model achieves state-of-the-art evaluation results in perplexity and FLORES-200 translation. For the chat version of this model, please see sambanovasystems/SambaLingo-Turkish-Chat.

Model Description

  • Developed by: SambaNova Systems
  • Model type: Language Model
  • Language(s): Turkish, English
  • Finetuned from model: Llama 2
  • Blog Post: Will be released soon!

Getting Started

Loading the model with Hugging Face

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sambanovasystems/SambaLingo-Turkish-Base")
model = AutoModelForCausalLM.from_pretrained("sambanovasystems/SambaLingo-Turkish-Base", device_map="auto", torch_dtype="auto")

Suggested Inference Parameters

  • Temperature: 0.8
  • Repetition penalty: 1.0
  • Top-p: 0.9
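
For illustration, here is a minimal generation sketch that applies these parameters to the model and tokenizer loaded above (the prompt text and max_new_tokens value are placeholders, not part of the model card):

inputs = tokenizer("İstanbul, Türkiye'nin en büyük şehridir ve", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,          # suggested temperature
    top_p=0.9,                # suggested top-p
    repetition_penalty=1.0,   # suggested repetition penalty
    max_new_tokens=100,       # illustrative value, adjust as needed
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))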

Suggested Prompting

This model is a pretrained checkpoint, so to use it effectively, please use few-shot prompting with exemplars. If you want to interact with the model through direct questions or queries, please use the chat version, sambanovasystems/SambaLingo-Turkish-Chat, which has been aligned with human preferences.
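
For example, a hypothetical few-shot translation prompt (the exemplars below are illustrative and not from the model card) could be built and passed to the model loaded above as follows:

few_shot_prompt = (
    "English: Good morning.\nTurkish: Günaydın.\n\n"
    "English: Thank you very much.\nTurkish: Çok teşekkür ederim.\n\n"
    "English: Where is the train station?\nTurkish:"
)
inputs = tokenizer(few_shot_prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=True, temperature=0.8, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))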

Evaluation Results

Training Details

All pre-training is done on the CulturaX dataset. We mix the data so that 75% comes from the language we are adapting to and 25% is English, as suggested by Csaki et al. We pack the data into sequences of length 4096 and ensure that, when predicting a token, the model only attends to previous tokens within the corresponding text document. We train with a global batch size of 1024, a sequence length of 4096, a maximum learning rate of 1e-4 with cosine decay, a warmup ratio of 0.01, and a weight decay of 0.1.
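
The hyperparameters above can be summarized as in the sketch below; this is only an illustrative summary, as the actual training framework and code are not included in this model card:

# Illustrative summary of the pre-training setup described above;
# the training framework and code are not part of this model card.
pretraining_config = {
    "dataset": "uonlp/CulturaX",
    "data_mix": {"turkish": 0.75, "english": 0.25},
    "sequence_length": 4096,       # packed, attention restricted to the same document
    "global_batch_size": 1024,
    "max_learning_rate": 1e-4,
    "lr_schedule": "cosine_decay",
    "warmup_ratio": 0.01,
    "weight_decay": 0.1,
}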

Uses

Direct Use

This model is intended for commercial and research use.

Out-of-Scope Use

SambaLingo should NOT be used for:

  • Mission-critical applications
  • Applications that involve the safety of others
  • Making highly important decisions

Bias, Risks, and Limitations

Like all LLMs, SambaLingo has certain limitations:

  • Hallucination: The model may sometimes generate responses that contain plausible-sounding but factually incorrect or irrelevant information.
  • Code Switching: The model might unintentionally switch between languages or dialects within a single response, affecting the coherence and understandability of the output.
  • Repetition: The model may produce repetitive phrases or sentences, leading to less engaging and informative responses.
  • Coding and Math: The model's performance in generating accurate code or solving complex mathematical problems may be limited.
  • Toxicity: The model could inadvertently generate responses containing inappropriate or harmful content.

Acknowledgments

We extend our heartfelt gratitude to the open-source AI community; this endeavor would not have been achievable without open source. SambaNova embraces the open-source community and aspires to actively contribute to this initiative.

We would like to give a special thanks to the following groups:

  • Meta for open sourcing LLama 2 and open sourcing the FLORES-200 dataset
  • Nguyen et al for open sourcing the CulturaX dataset
  • CohereAI for their amazing work with AYA-101 and open sourcing a multilingual instruction tuning dataset
  • EleutherAI for their open source evaluation framework
  • Hugging Face H4 team for open sourcing the zephyr training recipe and alignment handbook repo

Cite SambaLingo

@software{sambalingo,
  title = {{SambaLingo: Language Experts Adapted From Llama}},
  author = {SambaNova Systems},
  url = {https://huggingface.co/sambanovasystems/SambaLingo-Turkish-Base},
  month = {2},
  year = {2024},
  version = {1.0},
}