NEDOQwen 0.8B Pretrained SFT

NEDOQwen 0.8B Pretrained SFT is an experimental Turkish instruction-tuned checkpoint built from the NEDOQwen 0.8B Base Pretrained model.

This repository contains a custom PyTorch checkpoint. It is not yet a Hugging Face Transformers-native model.

Summary

  • Parameters: 824,256,000
  • Language: Turkish
  • Architecture: Qwen/Llama-style decoder-only causal language model
  • Base model: Ethosoft/nedoqwen_0.8b_base_pretrained
  • SFT dataset: Ethosoft/nedo-turkish-sft-mixtures
  • SFT file used: data/tr_sft_clean_20k.jsonl
  • SFT examples: 20,000
  • Tokenizer: NEDO Turkish Tokenizer, 65K typed_surface vocabulary
  • Checkpoint file: checkpoint.pt
  • Checkpoint dtype: bfloat16
  • Checkpoint format: model-only PyTorch state dict
  • Status: experimental research checkpoint

Important compatibility note

This model is not yet compatible with AutoModelForCausalLM.from_pretrained.

Use the included custom scripts for loading and generation.

Files

  • checkpoint.pt: bf16 model-only PyTorch checkpoint
  • config.json: architecture configuration
  • metadata/model_info.json: model metadata
  • scripts/30_sample_qwen_style.py: sampling script
  • scripts/20_train_qwen_style.py: model definition and training script
  • scripts/42_sft_instruction_smoke.py: SFT training script
  • tokenizer/vocab_65536.jsonl: NEDO Turkish tokenizer vocabulary
  • nedo_turkish_tokenizer/: tokenizer Python package

Why bf16 model-only?

The original SFT checkpoint was stored in float32 and was about twice as large. This release stores bfloat16 model-only weights to make the checkpoint smaller and easier to load.

The released checkpoint keeps the same architecture and parameter count.
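
The size difference follows directly from the parameter count. The short calculation below is only a back-of-the-envelope sketch: it counts raw tensor bytes and ignores any serialization overhead in the .pt file, and the helper name is illustrative rather than part of this repository.

```python
# Back-of-the-envelope weight size from the parameter count in the Summary.
N_PARAMS = 824_256_000

# Storage cost per value for the two dtypes mentioned in this section.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weights_size_bytes(n_params: int, dtype: str) -> int:
    """Raw tensor bytes for n_params values stored in the given dtype."""
    return n_params * BYTES_PER_PARAM[dtype]


for dtype in ("float32", "bfloat16"):
    size = weights_size_bytes(N_PARAMS, dtype)
    print(f"{dtype:>8}: {size / 1e9:.2f} GB ({size / 2**30:.2f} GiB)")
```

This gives roughly 3.30 GB of raw weights in float32 versus 1.65 GB in bfloat16, which matches the "about twice as large" figure above.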

Training setup

The model was fine-tuned from:

Ethosoft/nedoqwen_0.8b_base_pretrained

using the clean Turkish SFT subset:

Ethosoft/nedo-turkish-sft-mixtures
data/tr_sft_clean_20k.jsonl

The SFT prompt format was:

Kullanıcı talimatı:
{instruction}

Asistan cevabı:
{output}

If the input field was non-empty, an additional context section was used:

Ek bilgi:
{input}

During training, the prompt portion was masked and loss was computed only on assistant answer tokens.
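
The masking described above can be sketched as follows. This is a minimal illustration, not the code from scripts/42_sft_instruction_smoke.py: the character-level `toy_encode` tokenizer and the `build_masked_example` helper are hypothetical, and the use of -100 as the ignore value is an assumption based on the default `ignore_index` of PyTorch's cross-entropy loss. The usual one-position shift between inputs and targets, and any end-of-sequence token, are left out for brevity.

```python
IGNORE_INDEX = -100  # assumption: PyTorch's default ignore_index for cross-entropy


def toy_encode(text: str) -> list[int]:
    """Hypothetical stand-in tokenizer: one id per character (not the NEDO tokenizer)."""
    return [ord(ch) for ch in text]


def build_masked_example(prompt: str, answer: str, encode=toy_encode):
    """Return (input_ids, labels) with prompt positions excluded from the loss."""
    prompt_ids = encode(prompt)
    answer_ids = encode(answer)
    input_ids = prompt_ids + answer_ids
    # Prompt tokens get IGNORE_INDEX; only answer tokens keep their real ids.
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return input_ids, labels


prompt = "Kullanıcı talimatı:\nFransa'nın başkenti nedir?\n\nAsistan cevabı:\n"
answer = "Fransa'nın başkenti Paris'tir."
input_ids, labels = build_masked_example(prompt, answer)
n_supervised = sum(label != IGNORE_INDEX for label in labels)
print(f"{len(input_ids)} tokens, {n_supervised} contribute to the loss")
```

With this layout the model still sees the full prompt as context, but gradients come only from the assistant answer positions.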

Example prompt

Kullanıcı talimatı:
Fransa'nın başkenti nedir?

Asistan cevabı:

Expected answer style:

Fransa'nın başkenti Paris'tir.
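
For inference, the prompt should stop right after the "Asistan cevabı:" line so that the model continues with the answer. The helper below is a small sketch of that formatting. `build_prompt` is a hypothetical name, and placing the optional "Ek bilgi:" block between the instruction and the answer marker is an assumption, since the exact position is not specified in this card.

```python
from typing import Optional


def build_prompt(instruction: str, context: Optional[str] = None) -> str:
    """Format an inference prompt that ends where the answer should begin."""
    parts = [f"Kullanıcı talimatı:\n{instruction.strip()}"]
    if context and context.strip():
        # Assumption: the context block sits between instruction and answer.
        parts.append(f"Ek bilgi:\n{context.strip()}")
    parts.append("Asistan cevabı:\n")
    # Sections are separated by one blank line, as in the SFT format.
    return "\n\n".join(parts)


print(build_prompt("Fransa'nın başkenti nedir?"))
```

The printed string can be passed as the --prompt argument of the sampling script.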

Local usage

Install dependencies:

pip install torch numpy

Run sampling:

PYTHONPATH=. python3 scripts/30_sample_qwen_style.py --ckpt checkpoint.pt --vocab tokenizer/vocab_65536.jsonl --prompt "Kullanıcı talimatı:
Fransa'nın başkenti nedir?

Asistan cevabı:
" --temperature 0.5 --top-p 0.85 --max-new-tokens 30

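The --temperature and --top-p flags in the command above correspond to two common sampling controls. The sketch below shows the usual definition of temperature scaling followed by nucleus (top-p) filtering in plain Python. It is an illustration under that assumption, not the implementation in scripts/30_sample_qwen_style.py, and `sample_top_p` and the toy logits are made up for the example.

```python
import math
import random


def sample_top_p(logits, temperature=0.5, top_p=0.85, rng=random):
    """Sample one token index using temperature scaling and top-p filtering."""
    # Temperature below 1 sharpens the distribution; subtract the max for stability.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices normalizes the kept weights implicitly.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
print(sample_top_p(toy_logits, rng=random.Random(0)))
```

Lower temperature and lower top-p both make the output more conservative, which tends to suit short factual answers like the example prompt.
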
Hardware note

This is a raw custom PyTorch checkpoint, not an optimized GGUF, Ollama, or Transformers release.

CPU-only inference can be slow and may require significant RAM during checkpoint loading. A CUDA GPU is recommended.
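
To check memory needs before running generation, the checkpoint can be loaded on the CPU and summarized. The sketch below assumes that checkpoint.pt unpickles directly to a flat mapping of parameter names to tensors, which is how "model-only state dict" is read here. Both function names are hypothetical. `torch.load` with `map_location` and `weights_only` is standard PyTorch, but `weights_only` requires a reasonably recent release.

```python
def summarize_state_dict(state_dict):
    """Return (parameter count, raw bytes, dtype names) for a name-to-tensor mapping."""
    n_params = 0
    n_bytes = 0
    dtypes = set()
    for tensor in state_dict.values():
        n_params += tensor.numel()
        n_bytes += tensor.numel() * tensor.element_size()
        dtypes.add(str(tensor.dtype))
    return n_params, n_bytes, dtypes


def load_on_cpu(path="checkpoint.pt"):
    """Load the checkpoint onto the CPU; assumes it unpickles to a flat state dict."""
    import torch  # imported lazily so the summary helper works without PyTorch

    # weights_only=True restricts unpickling to tensors and plain containers.
    return torch.load(path, map_location="cpu", weights_only=True)
```

Calling `summarize_state_dict(load_on_cpu())` should report a parameter count close to the 824,256,000 listed in the Summary if the file contains only model weights. Any extra buffers stored in the state dict would raise the count.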

Qualitative behavior

Compared to the base pretrained checkpoint, this SFT model follows Turkish instructions more reliably. This is a qualitative observation; no systematic benchmark has been run yet.

Internal project notes:

  • Performs better than the broader, noisy HF-mix SFT run
  • Can answer simple factual Turkish QA prompts
  • Has learned the basic Turkish instruction-answer format
  • Is still weak on structured multi-part responses
  • Still needs domain-specific SFT for stronger SLM, LLM, and agent behavior

Recommended use

This checkpoint is useful for:

  • Turkish SFT research
  • continued instruction tuning
  • domain-SFT experiments
  • comparing base vs SFT behavior
  • studying small Turkish language models

Not intended as

  • a production assistant
  • a safety-aligned chatbot
  • a fully benchmarked model
  • a model with guaranteed factual accuracy
  • a Transformers-native checkpoint

Known limitations

  • Not Hugging Face Transformers-compatible yet
  • No production inference wrapper
  • No systematic benchmark suite yet
  • Can still repeat itself or produce incomplete structured answers
  • Domain-specific behavior is limited
  • Upstream SFT dataset licenses should be checked by downstream users

License note

The model weights are released as a research artifact.

The SFT data is a mixture/derivative of upstream Turkish instruction datasets. Downstream users are responsible for checking upstream dataset licenses and terms before commercial use or redistribution.

Related repositories

  • Base model: Ethosoft/nedoqwen_0.8b_base_pretrained
  • SFT datasets: Ethosoft/nedo-turkish-sft-mixtures
  • Pretraining dataset: Ethosoft/nedo-turkish-65k-tokenized-60b

Citation and attribution

If you use this checkpoint, please attribute:

  • NEDO Turkish SLM project
  • NEDOQwen 0.8B Base Pretrained
  • NEDO Turkish SFT Mixtures
  • NEDO Turkish 65K tokenizer

Suggested attribution:

NEDOQwen 0.8B Pretrained SFT.
Experimental Turkish instruction-tuned checkpoint based on NEDOQwen 0.8B Base Pretrained.