Open-source base language model pre-trained for Turkish by NeuroTürk


1. Introduction

HYZ-01-0.6B-Base is the base (pre-trained only) version of the HYZ-01 series developed by NeuroTürk. It is a raw language model that has undergone multi-stage Turkish continual pre-training (CPT) on top of a multilingual foundation, without any instruction tuning or alignment. It is intended for researchers and developers who wish to fine-tune the model for their own tasks.

The multilingual foundation covers 119 languages, and continual pre-training focused on Turkish. The tokenizer has been extended with special tokens for Turkish morphological structure and for advanced use cases.

Note: This is the base pre-trained version. For the instruction-tuned version, see: HYZ-01-0.6B


2. Model Summary

Continual Pre-Training

  • Base model: four-stage Turkish continual pre-training (CPT) applied on top of a multilingual foundation.
  • Training stages include a general Turkish web corpus, curated domain data, Wikipedia, and high-quality filtered text.
  • Optimization: bfloat16 precision, FlashAttention-2, AdamW optimizer.

Tokenizer Extension

New special tokens were added to the tokenizer for two purposes:

  • Language-structure tokens: To represent Turkish morphological features more efficiently.
  • Task and structure tokens: To support structural use cases such as chain-of-thought, code blocks, section markers, and language labels.

The following 20 tokens have been added to the vocabulary and are reserved as infrastructure for future advanced capabilities:

Group             Tokens                                     Future Use
Brand             <|neuroturk|> <|hyz01|> <|tr|> <|en|>      Model identity and multilingual control
Chain-of-Thought  <|think|> <|/think|> <|step|> <|answer|>   Step-by-step reasoning (CoT)
Dialogue          <|system|> <|user|> <|assistant|> <|end|>  Multi-turn dialogue and role management
Code              <|code|> <|/code|> <|output|> <|error|>    Structured code generation and debugging
Structure         <|title|> <|section|> <|list|> <|note|>    Long-form and structured text generation (reports, articles, etc.)
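
One way to confirm that the reserved tokens are actually registered is to look them up in the tokenizer's vocabulary. The sketch below is a minimal helper and not part of the NeuroTürk release: `RESERVED_TOKENS` is copied from the table above, while `all_reserved_tokens` and `missing_tokens` are hypothetical names. The helper works on any token-to-id mapping, such as the dictionary returned by a Hugging Face tokenizer's `get_vocab()` method.

```python
from typing import Iterable, List, Mapping

# The 20 reserved tokens from the table above, grouped as in the card.
RESERVED_TOKENS = {
    "brand": ["<|neuroturk|>", "<|hyz01|>", "<|tr|>", "<|en|>"],
    "chain_of_thought": ["<|think|>", "<|/think|>", "<|step|>", "<|answer|>"],
    "dialogue": ["<|system|>", "<|user|>", "<|assistant|>", "<|end|>"],
    "code": ["<|code|>", "<|/code|>", "<|output|>", "<|error|>"],
    "structure": ["<|title|>", "<|section|>", "<|list|>", "<|note|>"],
}


def all_reserved_tokens() -> List[str]:
    """Flatten the groups into a single list, preserving table order."""
    return [tok for group in RESERVED_TOKENS.values() for tok in group]


def missing_tokens(vocab: Mapping[str, int], tokens: Iterable[str]) -> List[str]:
    """Return the tokens that have no entry in the given vocabulary."""
    return [tok for tok in tokens if tok not in vocab]
```

With a tokenizer loaded as in the Usage section, `missing_tokens(tokenizer.get_vocab(), all_reserved_tokens())` would be expected to return an empty list if every token in the table is present.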

3. Model Details

Feature                   Value
Total parameters          595,798,016 (~0.6B)
Non-embedding parameters  440,467,456 (~0.44B)
Hidden dimension          1,024
Number of layers          28
Attention heads (Q)       16
Attention heads (KV)      8 (GQA)
Head dimension            128
Activation                SiLU
Normalization             RMSNorm (ε = 1 × 10⁻⁶)
Positional encoding       RoPE (θ = 1,000,000)
Vocabulary size           151,690
Training context length   4,096 tokens
Theoretical max context   32,768 tokens
Precision                 BFloat16
VRAM usage (fp16)         ~1.11 GB
Disk size                 ~1.11 GB
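
The figures in this table can be cross-checked with a little arithmetic. The sketch below uses only values from the table and rests on three assumptions that the card does not state: the embedding matrix is counted once (as it would be with tied input and output embeddings), weights and cache entries take 2 bytes each, and the KV cache is unquantized with a batch size of 1. Under those assumptions the embedding count accounts exactly for the gap between total and non-embedding parameters, and the weights come to about 1.11 GiB, which matches the table's memory figure if "GB" is read as GiB.

```python
# Values copied from the Model Details table.
TOTAL_PARAMS = 595_798_016
NON_EMBEDDING_PARAMS = 440_467_456
HIDDEN_DIM = 1_024
NUM_LAYERS = 28
NUM_KV_HEADS = 8
HEAD_DIM = 128
VOCAB_SIZE = 151_690

# ASSUMPTION: 16-bit storage (bfloat16 or float16) for weights and KV cache.
BYTES_PER_VALUE = 2

# One embedding matrix of shape (vocab, hidden).
embedding_params = VOCAB_SIZE * HIDDEN_DIM

# Weight memory in GiB (2**30 bytes).
weight_gib = TOTAL_PARAMS * BYTES_PER_VALUE / 2**30


def kv_cache_bytes(num_tokens: int) -> int:
    """Bytes needed to cache keys and values for one sequence.

    Factor 2 covers keys plus values; with GQA only the KV heads are cached.
    """
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
    return per_token * num_tokens


print("embedding parameters:", embedding_params)
print("total minus embedding:", TOTAL_PARAMS - embedding_params)
print("weights (GiB):", round(weight_gib, 2))
print("KV cache at 4,096 tokens (MiB):", kv_cache_bytes(4_096) / 2**20)
print("KV cache at 32,768 tokens (GiB):", kv_cache_bytes(32_768) / 2**30)
```

These are lower bounds for memory use: activations, framework overhead, and any batch size above 1 add to them.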

4. Training Details

Setting           Value
Training type     Continual Pre-Training (CPT)
Number of stages  4
Optimization      AdamW
Precision         BFloat16
LR schedule       Cosine with warmup
Context length    4,096 tokens
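
For readers unfamiliar with the "cosine with warmup" schedule named in the table, the sketch below shows its usual shape: a linear ramp to a peak learning rate, then a half-cosine decay. It illustrates the general technique only. The card does not publish the peak learning rate, warmup length, step count, or minimum learning rate, so every number here is a hypothetical placeholder, and `lr_at_step` is an illustrative name rather than NeuroTürk's training code.

```python
import math


def lr_at_step(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at `step` for linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup phase.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to at most 1.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half-cosine from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# HYPOTHETICAL settings, chosen only to show the shape of the curve.
TOTAL_STEPS, WARMUP_STEPS, PEAK_LR = 10_000, 500, 2e-4

for s in (0, 250, 500, 5_250, 10_000):
    print(s, lr_at_step(s, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR))
```

The learning rate starts at zero, reaches the peak at the end of warmup, passes through half the peak midway through the decay phase, and ends at `min_lr`.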

5. Usage

Warning: This is a base model. It is not instruction-tuned and will not follow instructions reliably. For conversational or task-oriented use, use the instruction-tuned version: HYZ-01-0.6B

Installation

pip install transformers torch accelerate

Text Generation (Completion)

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_name = "neuroturk/HYZ-01-0.6B-Base"

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    trust_remote_code=True,
    fix_mistral_regex=True 
)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

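# A base model continues text, so the prompt is the opening of a Turkish
# sentence (roughly "Artificial intelligence is the ... of computer systems").
prompt = "Yapay zeka, bilgisayar sistemlerinin"

# Tokenize and move the tensors to the device the model was placed on.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sample up to 200 new tokens with nucleus sampling.
outputs = model.generate(
    **inputs,
    max_new_tokens=200,
    temperature=0.8,
    top_p=0.95,
    do_sample=True,
    repetition_penalty=1.1,
)

# generate() returns prompt + continuation; keep only the continuation.
new_tokens = outputs[0][inputs['input_ids'].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))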

Low VRAM (4-bit Quantization)

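# Requires the bitsandbytes package in addition to the packages installed above:
#   pip install bitsandbytes
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
import torch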

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

tokenizer = AutoTokenizer.from_pretrained(
    "neuroturk/HYZ-01-0.6B-Base",
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(
    "neuroturk/HYZ-01-0.6B-Base",
    quantization_config=bnb_config,
    device_map="auto",
)

Fine-Tuning with Unsloth

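# Requires the unsloth package:
#   pip install unsloth
from unsloth import FastLanguageModel

# Load the base model in 4-bit with the training context length (4,096 tokens).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="neuroturk/HYZ-01-0.6B-Base",
    max_seq_length=4096,
    load_in_4bit=True,
)

# Attach LoRA adapters to all attention and feed-forward projections.
model = FastLanguageModel.get_peft_model(
    model,
    r=32,               # LoRA rank
    lora_alpha=64,      # LoRA scaling numerator (alpha / r = 2)
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    use_gradient_checkpointing="unsloth",
)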
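
To get a feel for how many parameters the LoRA configuration above would train, the sketch below counts adapter weights for rank 32 on the seven targeted projections. The attention shapes come from the Model Details table. The feed-forward width is not given in this card, so `INTERMEDIATE_DIM = 3_072` is an assumption made for illustration, as is the standard LoRA layout (one `rank x in` matrix and one `out x rank` matrix per projection, no bias terms). The count reported by Unsloth or PEFT for the real model may differ.

```python
# Shapes taken from the Model Details table.
HIDDEN_DIM = 1_024
NUM_LAYERS = 28
NUM_Q_HEADS = 16
NUM_KV_HEADS = 8
HEAD_DIM = 128

# ASSUMPTION: feed-forward width is not stated in the card; 3,072 is a placeholder.
INTERMEDIATE_DIM = 3_072

# LoRA settings from the fine-tuning example.
LORA_RANK = 32
LORA_ALPHA = 64

q_dim = NUM_Q_HEADS * HEAD_DIM    # width of the query projection output
kv_dim = NUM_KV_HEADS * HEAD_DIM  # width of the key/value projection output

# (in_features, out_features) for each targeted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN_DIM, q_dim),
    "k_proj": (HIDDEN_DIM, kv_dim),
    "v_proj": (HIDDEN_DIM, kv_dim),
    "o_proj": (q_dim, HIDDEN_DIM),
    "gate_proj": (HIDDEN_DIM, INTERMEDIATE_DIM),
    "up_proj": (HIDDEN_DIM, INTERMEDIATE_DIM),
    "down_proj": (INTERMEDIATE_DIM, HIDDEN_DIM),
}


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """Adapter size: A is (rank x in_features), B is (out_features x rank)."""
    return rank * (in_features + out_features)


per_layer = sum(lora_params(i, o, LORA_RANK) for i, o in PROJECTIONS.values())
total_trainable = per_layer * NUM_LAYERS
scaling = LORA_ALPHA / LORA_RANK  # standard LoRA scaling factor alpha / r

print("adapter parameters per layer:", per_layer)
print("adapter parameters in total:", total_trainable)
print("LoRA scaling factor:", scaling)
```

Under these assumptions the adapters amount to roughly 20 million trainable parameters, a few percent of the full model.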

6. Limitations

  • This is a base model without instruction tuning; it will not follow instructions reliably.
  • Complex multi-step reasoning may be limited with 0.6B parameters.
  • Biases present in the training data may be reflected in outputs.
  • Performance drops significantly in languages other than Turkish.
  • Human verification of outputs is recommended for critical applications.

7. Citation

@misc{neuroturk2026hyz01,
  author       = {NeuroTürk},
  title        = {HYZ-01-0.6B: A Lightweight Turkish Base Model},
  year         = 2026,
}

NeuroTürk · HYZ01 · 2026