NEDOQwen 0.8B Pretrained SFT
NEDOQwen 0.8B Pretrained SFT is an experimental Turkish instruction-tuned checkpoint built from the NEDOQwen 0.8B Base Pretrained model.
This repository contains a custom PyTorch checkpoint. It is not yet a Hugging Face Transformers-native model.
Summary
- Parameters: 824,256,000
- Language: Turkish
- Architecture: Qwen/Llama-style decoder-only causal language model
- Base model: Ethosoft/nedoqwen_0.8b_base_pretrained
- SFT dataset: Ethosoft/nedo-turkish-sft-mixtures
- SFT file used: data/tr_sft_clean_20k.jsonl
- SFT examples: 20,000
- Tokenizer: NEDO Turkish Tokenizer, 65K typed_surface vocabulary
- Checkpoint file: checkpoint.pt
- Checkpoint dtype: bfloat16
- Checkpoint format: model-only PyTorch state dict
- Status: experimental research checkpoint
Important compatibility note
This model is not yet compatible with AutoModelForCausalLM.from_pretrained.
Use the included custom scripts for loading and generation.
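As a minimal sketch of what inspecting the checkpoint looks like outside the included scripts, the snippet below reads the file with torch.load and reports the parameter count and dtypes. It assumes checkpoint.pt is a flat mapping from parameter names to tensors; if the file wraps the weights under another key, adjust accordingly. The helper names load_state_dict and summarize are illustrative and are not part of this repository.

```python
from collections import Counter


def load_state_dict(path="checkpoint.pt"):
    """Load the checkpoint onto CPU (assumed: a flat name -> tensor mapping)."""
    import torch  # imported lazily so summarize() can be used without torch

    # weights_only=True restricts unpickling to tensors and plain containers
    # (available in PyTorch 1.13 and later).
    return torch.load(path, map_location="cpu", weights_only=True)


def summarize(state):
    """Return (total element count, {dtype name: tensor count})."""
    total = sum(t.numel() for t in state.values())
    dtypes = Counter(str(t.dtype) for t in state.values())
    return total, dict(dtypes)


# Example (requires torch and the checkpoint file):
#   total, dtypes = summarize(load_state_dict("checkpoint.pt"))
#   print(total, dtypes)
```

The total may differ from the parameter count in the Summary if the state dict also stores buffers or tied weights. For generation, the weights still have to be loaded into the model class defined in scripts/20_train_qwen_style.py.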
Files
- checkpoint.pt: bf16 model-only PyTorch checkpoint
- config.json: architecture configuration
- metadata/model_info.json: model metadata
- scripts/30_sample_qwen_style.py: sampling script
- scripts/20_train_qwen_style.py: model definition and training script
- scripts/42_sft_instruction_smoke.py: SFT training script
- tokenizer/vocab_65536.jsonl: NEDO Turkish tokenizer vocabulary
- nedo_turkish_tokenizer/: tokenizer Python package
Why bf16 model-only?
The original SFT checkpoint was stored in float32 and was about twice as large. This release stores only the model weights, in bfloat16, which makes the checkpoint smaller and easier to load.
The released checkpoint keeps the same architecture and parameter count.
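The size difference follows from bytes per parameter. The sketch below is a back-of-the-envelope estimate using the parameter count from the Summary; actual file sizes will differ slightly because of serialization overhead, and the function name is illustrative.

```python
# Rough weight-only size estimate from the parameter count in the Summary.
PARAMS = 824_256_000

BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}


def weights_size_gb(num_params: int, dtype: str) -> float:
    """Return the raw weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for dtype in ("float32", "bfloat16"):
        # Prints about 3.30 GB for float32 and about 1.65 GB for bfloat16.
        print(f"{dtype}: {weights_size_gb(PARAMS, dtype):.2f} GB")
```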
Training setup
The model was fine-tuned from:
Ethosoft/nedoqwen_0.8b_base_pretrained
using the clean Turkish SFT subset:
Ethosoft/nedo-turkish-sft-mixtures
data/tr_sft_clean_20k.jsonl
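For readers who want to inspect the SFT file, here is a small sketch for reading it. It assumes one JSON object per line with instruction, input, and output fields, which is inferred from the prompt template in this card rather than from a published schema. The function names are illustrative.

```python
import json


def read_sft_records(path):
    """Read a JSONL file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def count_with_context(records):
    """Count records whose 'input' field is present and non-empty."""
    return sum(1 for r in records if str(r.get("input") or "").strip())


# Example (path relative to the dataset repository):
#   records = read_sft_records("data/tr_sft_clean_20k.jsonl")
#   print(len(records), count_with_context(records))
```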
The SFT prompt format was:
Kullanıcı talimatı:
{instruction}
Asistan cevabı:
{output}
If the input field was non-empty, an additional context section was used:
Ek bilgi:
{input}
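The template above can be sketched as a small formatting helper. Two details are assumptions, not confirmed by this card: the "Ek bilgi:" section is placed between the instruction and the answer header, and sections are separated by single newlines (matching the sampling command under Local usage). The function name build_prompt is illustrative; the SFT script remains the reference for the exact layout.

```python
def build_prompt(instruction: str, context: str = "", output: str = "") -> str:
    """Format one example in the SFT prompt style.

    Leave `output` empty to get an inference prompt that ends right after
    the answer header.
    """
    parts = ["Kullanıcı talimatı:", instruction.strip()]
    if context.strip():
        # Assumed position of the optional context section.
        parts += ["Ek bilgi:", context.strip()]
    parts.append("Asistan cevabı:")
    return "\n".join(parts) + "\n" + output


if __name__ == "__main__":
    print(build_prompt("Fransa'nın başkenti nedir?"))
```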
During training, the prompt portion was masked and loss was computed only on assistant answer tokens.
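A common way to implement this kind of masking is to set the label of every prompt token to an ignore value so that the loss skips it. The sketch below uses -100, the default ignore_index of PyTorch's cross-entropy loss. Whether the SFT script does exactly this is an assumption, and the token ids are made up for illustration.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def build_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer tokens and mask the prompt in the labels.

    Returns (input_ids, labels) of equal length. Positions labelled
    IGNORE_INDEX contribute nothing to the loss.
    """
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


if __name__ == "__main__":
    # Hypothetical token ids: three prompt tokens, two answer tokens.
    ids, labels = build_labels([5, 6, 7], [8, 9])
    print(ids)     # [5, 6, 7, 8, 9]
    print(labels)  # [-100, -100, -100, 8, 9]
```

In a causal language model the labels are additionally shifted by one position relative to the logits before the loss is computed; that step is omitted here.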
Example prompt
Kullanıcı talimatı:
Fransa'nın başkenti nedir?
Asistan cevabı:
Expected answer style:
Fransa'nın başkenti Paris'tir.
Local usage
Install dependencies:
pip install torch numpy
Run sampling:
PYTHONPATH=. python3 scripts/30_sample_qwen_style.py --ckpt checkpoint.pt --vocab tokenizer/vocab_65536.jsonl --prompt "Kullanıcı talimatı:
Fransa'nın başkenti nedir?
Asistan cevabı:
" --temperature 0.5 --top-p 0.85 --max-new-tokens 30
Hardware note
This is a raw custom PyTorch checkpoint, not an optimized GGUF, Ollama, or Transformers release.
CPU-only inference can be slow and may require significant RAM during checkpoint loading. A CUDA GPU is recommended.
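For context on the --temperature and --top-p flags in the sampling command under Local usage, here is a generic pure-Python sketch of temperature scaling followed by nucleus (top-p) sampling for a single step. It illustrates the general technique only; the sampling script's actual implementation may differ, and the function name is illustrative.

```python
import math
import random


def sample_top_p(logits, temperature=0.5, top_p=0.85, rng=random):
    """Sample one token index using temperature and nucleus (top-p) filtering.

    `temperature` must be > 0. Lower values sharpen the distribution.
    """
    # Temperature scaling, then a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # random.choices renormalizes the kept weights internally.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


if __name__ == "__main__":
    print(sample_top_p([2.0, 1.0, 0.5, -1.0]))
```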
Qualitative behavior
Compared to the base pretrained checkpoint, this SFT model follows Turkish instructions better in qualitative checks. No systematic benchmark has been run yet.
Internal project notes:
- Better than the broader noisy HF-mix SFT run
- Can answer simple factual Turkish QA prompts
- Has learned the basic Turkish instruction-answer format
- Still weak on structured multi-part responses
- Still needs domain-specific SFT for stronger SLM, LLM, and agent behavior
Recommended use
This checkpoint is useful for:
- Turkish SFT research
- continued instruction tuning
- domain-SFT experiments
- comparing base vs SFT behavior
- studying small Turkish language models
Not intended as
- a production assistant
- a safety-aligned chatbot
- a fully benchmarked model
- a model with guaranteed factual accuracy
- a Transformers-native checkpoint
Known limitations
- Not Hugging Face Transformers-compatible yet
- No production inference wrapper
- No systematic benchmark suite yet
- Can still repeat itself or produce incomplete structured answers
- Domain-specific behavior is limited
- Upstream SFT dataset licenses should be checked by downstream users
License note
The model weights are released as a research artifact.
The SFT data is a mixture/derivative of upstream Turkish instruction datasets. Downstream users are responsible for checking upstream dataset licenses and terms before commercial use or redistribution.
Related repositories
- Base model: Ethosoft/nedoqwen_0.8b_base_pretrained
- SFT datasets: Ethosoft/nedo-turkish-sft-mixtures
- Pretraining dataset: Ethosoft/nedo-turkish-65k-tokenized-60b
Citation and attribution
If you use this checkpoint, please attribute:
- NEDO Turkish SLM project
- NEDOQwen 0.8B Base Pretrained
- NEDO Turkish SFT Mixtures
- NEDO Turkish 65K tokenizer
Suggested attribution:
NEDOQwen 0.8B Pretrained SFT.
Experimental Turkish instruction-tuned checkpoint based on NEDOQwen 0.8B Base Pretrained.