
A 0.5B parameter draft (speculative decoding) model for use with deepseek-ai/DeepSeek-R1.

NOTE: This is a draft model for the full-sized DeepSeek-R1 model and not the smaller "distilled" models!

See jukofyork/DeepSeek-R1-DRAFT-0.5B-v1.0-GGUF for the models in GGUF format.
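For readers new to speculative decoding, the role of a draft model can be sketched roughly as follows. This is a toy greedy version, not how llama.cpp or any other runtime implements it; `draft_next` and `target_next` are hypothetical stand-ins for the two models, each mapping a list of token ids to the next token id.

```python
from typing import Callable, List

# Hypothetical model interface: token ids in, greedy next token id out.
NextToken = Callable[[List[int]], int]


def speculative_step(tokens: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int = 4) -> List[int]:
    """One round of greedy speculative decoding; returns the extended sequence."""
    # 1. The small draft model cheaply proposes k tokens.
    proposal = list(tokens)
    for _ in range(k):
        proposal.append(draft_next(proposal))

    # 2. The large target model checks each proposed position. A real runtime
    #    does this in a single batched forward pass; here it is a plain loop.
    accepted = list(tokens)
    for i in range(len(tokens), len(proposal)):
        expected = target_next(accepted)
        accepted.append(expected)        # the target's own token is always kept
        if expected != proposal[i]:      # first mismatch: drop the rest of the draft
            break
    else:
        # Every draft token matched, so the target contributes one extra token.
        accepted.append(target_next(accepted))
    return accepted


# Toy "models": the target counts modulo 5, the draft goes wrong after 3 tokens.
toy_target = lambda t: len(t) % 5
toy_draft = lambda t: len(t) % 5 if len(t) < 3 else 99
partial = speculative_step([0], toy_draft, toy_target)
full = speculative_step([0], toy_target, toy_target)
```

The output always equals what the target model would have produced on its own; the draft only affects how many tokens are gained per round. The comparison `expected != proposal[i]` is also why the draft has to share the target's vocabulary.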


How the model was created

1. The initial model was created from Qwen/Qwen2.5-0.5B-Instruct using transplant-vocab:

python ./transplant_vocab.py \
    Qwen2.5-0.5B-Instruct \
    DeepSeek-R1-BF16 \
    DeepSeek-R1-DRAFT-0.5B-UNTRAINED \
    --trim-hidden-size 768 \
    --override "<|▁pad▁|>" "<|endoftext|>" \
    --override "<|fim▁hole|>" "<|fim_middle|>" \
    --override "<|fim▁begin|>" "<|fim_prefix|>" \
    --override "<|fim▁end|>" "<|fim_suffix|>" \
    --override "<|User|>" "<|im_start|>user\\n" \
    --override "<|Assistant|>" "<|im_start|>assistant\\n" \
    --override "<|EOT|>" "<|endoftext|>" \
    --override "<|tool▁calls▁begin|>" "<tool_call>" \
    --override "<|tool▁call▁begin|>" "<tool_call>" \
    --override "<|tool▁outputs▁begin|>" "<tool_call>" \
    --override "<|tool▁output▁begin|>" "<tool_call>" \
    --override "<|tool▁calls▁end|>" "</tool_call>" \
    --override "<|tool▁call▁end|>" "</tool_call>" \
    --override "<|tool▁outputs▁end|>" "</tool_call>" \
    --override "<|tool▁output▁end|>" "</tool_call>" \
    --override "<|tool▁sep|>" "</tool_call>"
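The general idea behind a vocabulary transplant with overrides can be illustrated with a toy example. This is only a sketch of the concept and not the transplant-vocab implementation: the averaging rule, the function name `transplant_embeddings`, and the tiny donor tokenizer are all assumptions made for illustration.

```python
from typing import Callable, Dict, List

Vector = List[float]


def transplant_embeddings(
    target_vocab: List[str],
    donor_encode: Callable[[str], List[int]],  # hypothetical donor tokenizer
    donor_embeddings: List[Vector],
    overrides: Dict[str, str],
) -> List[Vector]:
    """Build one embedding row per target token from donor embedding rows."""
    dim = len(donor_embeddings[0])
    new_rows = []
    for token_text in target_vocab:
        # An override replaces the text that is looked up in the donor vocabulary.
        donor_text = overrides.get(token_text, token_text)
        rows = [donor_embeddings[i] for i in donor_encode(donor_text)]
        if not rows:
            raise ValueError(f"no donor tokens for {token_text!r}")
        # Assumption: average the donor rows (the real tool may use another rule).
        new_rows.append([sum(r[d] for r in rows) / len(rows) for d in range(dim)])
    return new_rows


# Toy donor model: two ordinary tokens and one special token.
donor_vocab = {"a": 0, "b": 1, "<|endoftext|>": 2}
donor_embeddings = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]


def donor_encode(text: str) -> List[int]:
    # Special tokens map to a single id; everything else is split per character.
    if text in donor_vocab:
        return [donor_vocab[text]]
    return [donor_vocab[ch] for ch in text]


rows = transplant_embeddings(
    ["ab", "<|EOT|>"], donor_encode, donor_embeddings,
    overrides={"<|EOT|>": "<|endoftext|>"},
)
```

Without the override, the toy tokenizer would have no sensible way to encode `<|EOT|>`, which is the kind of situation the `--override` pairs in the command above address for special tokens.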

NOTE: The hidden size was trimmed to 768 (and the number of heads to 12) so that the more advanced GGUF quants can be used. After fine-tuning, the trimmed model was only around 2% behind on top-1 eval (71% vs 73%), and the untrimmed model's small advantage would be lost anyway by being forced to use Q4_0, which has a much higher PPL.
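The trimming arithmetic can be checked quickly. The sketch below assumes that llama.cpp's k-quants operate on super-blocks of 256 values and Q4_0 on blocks of 32, and that Qwen2.5-0.5B has a hidden size of 896 with 64-dimensional heads. None of these numbers are stated in this card, so treat them as assumptions.

```python
from typing import Tuple

QK_K = 256      # assumed super-block size of the GGUF k-quants
QK4_0 = 32      # assumed block size of Q4_0
HEAD_DIM = 64   # assumed head dimension of Qwen2.5-0.5B (896 hidden / 14 heads)


def describe(hidden_size: int) -> Tuple[int, bool, bool]:
    """Return (number of heads, k-quants usable?, Q4_0 usable?) for a hidden size."""
    heads = hidden_size // HEAD_DIM
    return heads, hidden_size % QK_K == 0, hidden_size % QK4_0 == 0


for hidden_size in (896, 768, 512):
    heads, k_quants_ok, q4_0_ok = describe(hidden_size)
    print(f"hidden={hidden_size} heads={heads} k-quants={k_quants_ok} Q4_0={q4_0_ok}")
```

Under these assumptions the original hidden size of 896 is not a multiple of 256, while 768 and 512 both are.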

NOTE: I also tried trimming the hidden-size to 512 (and the heads to 8), but the top-1 eval was significantly lower (63%).
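The card does not show how the "top-1 eval" figures are computed. Assuming they are plain next-token top-1 accuracy (the fraction of positions where the model's highest-scoring token is the true next token), the metric can be sketched like this; `top1_accuracy` is a hypothetical helper, not part of qlora-pipe.

```python
from typing import Sequence


def top1_accuracy(logits: Sequence[Sequence[float]], targets: Sequence[int]) -> float:
    """Fraction of positions whose argmax logit equals the target token id."""
    if len(logits) != len(targets) or not targets:
        raise ValueError("logits and targets must be non-empty and the same length")
    hits = 0
    for row, target in zip(logits, targets):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax over the vocab
        hits += int(predicted == target)
    return hits / len(targets)


# Four positions over a two-token vocabulary; the third prediction is wrong.
example_logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
example_targets = [1, 0, 0, 0]
accuracy = top1_accuracy(example_logits, example_targets)
```

For a draft model this kind of metric is a natural proxy, since a drafted token is only useful when it matches what the target model would have picked.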

2. The following datasets were merged to create a fine-tuning dataset of ~5B tokens:

NOTE: The first two datasets were formatted as raw text placed between <|end▁of▁sentence|> tags, and the last two datasets using the proper deepseek-r1 Jinja template (with <think> tags added around the reasoning, etc).
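The raw-text formatting can be sketched as below. The exact layout is an assumption (one tag before the first document, one after each document), and `wrap_raw_documents` is a hypothetical helper. The chat-template formatting of the other datasets is not reproduced here because the template itself is not shown in this card.

```python
from typing import Iterable

EOS_TAG = "<|end▁of▁sentence|>"


def wrap_raw_documents(documents: Iterable[str]) -> str:
    """Place every raw document between end-of-sentence tags.

    Documents are left untouched (no stripping), so code indentation survives.
    """
    return EOS_TAG + "".join(doc + EOS_TAG for doc in documents)


sample = wrap_raw_documents(["def f():\n    return 1\n", "Some plain text."])
```

The resulting string could then be written to one of the text files matched by the `mixed_data/*.txt` pattern used in the training config.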

This mix of data was chosen based on the ideas presented in FastDraft: How to Train Your Draft. My first attempt did not include the raw-code data from bigcode/the-stack-smol-xl and performed worse as a result, which agrees with their findings:

[Figure: image/png]

3. The model was then trained using qlora-pipe for 1 epoch with a batch size of 120 and a sequence length of 32k (~4M tokens per step):

# Resume a prior run
resume_from_checkpoint = false

# Paths
model = 'DeepSeek-R1-DRAFT-0.5B-UNTRAINED'
output_dir = 'DeepSeek-R1-DRAFT-0.5B'

# Optimization configuration
full_fine_tune = true
epochs = 1
lr_scheduler = 'cosine'
warmup_steps = 100

# Performance settings
pipeline_stages = 1
logging_steps = 1
eval_steps = 100
save_steps = 100
checkpoint_every_n_minutes = 60
eval_before_first_step = true
eval_after_last_step = true
model_weight_dtype = 'bfloat16'
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'

# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 20

[optimizer]
type = 'adamw_kahan'
lr = 1e-4
beta1 = 0.9
beta2 = 0.999
weight_decay = 0.01

[[datasets]]
name = 'mixed_data'
dataset_type = 'textfile'
dataset_path = 'mixed_data/*.txt'
sequence_len = 32768
eval_size = 0.01

{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 20,
    "gradient_clipping": 1.0,
    "steps_per_print": 1
}
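The `lr_scheduler = 'cosine'`, `warmup_steps = 100` and `lr = 1e-4` settings describe a warmup followed by a cosine decay. A minimal sketch of such a schedule is below; the linear warmup shape, the decay to zero, and the `total_steps` value are assumptions, and qlora-pipe's actual scheduler may differ in these details.

```python
import math


def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 1e-4, warmup_steps: int = 100) -> float:
    """Linear warmup to peak_lr, then cosine decay towards zero (assumed shape)."""
    if step < warmup_steps:
        # Ramp up linearly; the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# total_steps=1000 is a made-up value for illustration only.
for step in (0, 99, 550, 1000):
    print(step, learning_rate(step, total_steps=1000))
```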

I used six RTX A6000 GPUs across three nodes, hence the batch size of 120 (6 GPUs x 20 gradient accumulation steps = 120).
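The batch-size arithmetic can be spelled out from the config values above. The step count is only a rough estimate: it assumes the "~5B tokens" figure is exactly 5 billion and that the 1% `eval_size` split is removed before training.

```python
NUM_GPUS = 6                  # six RTX A6000s, one pipeline stage each
MICRO_BATCH_PER_GPU = 1       # train_micro_batch_size_per_gpu
GRAD_ACCUM_STEPS = 20         # gradient_accumulation_steps
SEQUENCE_LEN = 32768          # sequence_len
DATASET_TOKENS = 5_000_000_000  # assumption: "~5B" taken as exactly 5e9
EVAL_FRACTION = 0.01          # eval_size

sequences_per_step = NUM_GPUS * MICRO_BATCH_PER_GPU * GRAD_ACCUM_STEPS
tokens_per_step = sequences_per_step * SEQUENCE_LEN
approx_steps = round(DATASET_TOKENS * (1 - EVAL_FRACTION) / tokens_per_step)

print(f"{sequences_per_step} sequences/step, {tokens_per_step:,} tokens/step")
print(f"roughly {approx_steps} optimizer steps for one epoch")
```

This gives 120 sequences and 3,932,160 tokens per step, which matches the "~4M tokens per step" figure quoted earlier.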

[Figure: image/png]

As you can see, 5B tokens was overkill and around 1-1.5B would have been enough (although the 8-headed 0.33B model needed at least 2-3B tokens to recover performance).

Model size: 501M params (Safetensors), tensor type BF16