0399 Team Victory Full-Parameter SFT Checkpoint

This repository contains intermediate full-parameter SFT checkpoints from LLM-jp experiment 0399, trained on Team Victory valid-clean math reasoning data.

Model Variants

Two models, published in separate repositories, were trained on the same data with the same recipe. They differ only in initialization (thinking vs. base).

Current uploaded checkpoint:

  • checkpoint-500/
  • The Hub upload excludes the large optimizer state and the duplicate FSDP model state.
  • The full local training state, including optimizer state, is retained under the ABCI experiment directory.

Experiment Links

Intended Use

These checkpoints are intended for research on Japanese/English mathematical reasoning and reinforcement-learning initialization.

Primary intended uses:

  • Compare thinking initialization vs base initialization after identical Team Victory SFT.
  • Use as candidate initial checkpoints for later RL / GRPO-style math reasoning experiments.
  • Evaluate whether Team Victory SFT improves AIME-style pass@1 while preserving broader ability.

Not intended uses:

  • Production deployment without additional evaluation.
  • Safety-critical mathematical, financial, legal, medical, or educational grading decisions.
  • Claims of benchmark superiority before full evaluation is complete.

Training Data

Source dataset:

Filtered dataset:

  • TV_valid_clean
  • Rows: 5,461,079
  • Clean parquet SHA-256: b1fbc4d5c05dbacbf9366055200a034551a46d70bfb2bff4c6432f2175a10d9b
  • Filter rule:
    • is_valid == 1
    • contamination quarantine applied against target benchmark problems
  • Tokenized reusable dataset:

The tokenized dataset was prepared on CPU and uploaded to Hugging Face to avoid repeated expensive tokenization on GPU nodes.
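
The checksum and filter rule above can be checked with a few lines of standard-library Python. This is a minimal sketch, not the experiment's actual pipeline: the `problem` column name and the exact-match quarantine comparison are assumptions made for illustration, and only `is_valid` and the SHA-256 value come from this card.

```python
import hashlib
from pathlib import Path

# SHA-256 of the clean parquet file, as listed in this card.
EXPECTED_SHA256 = "b1fbc4d5c05dbacbf9366055200a034551a46d70bfb2bff4c6432f2175a10d9b"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so a large parquet never sits fully in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def keep_row(row, quarantined_problems):
    """Apply the filter rule to one row (a dict).

    Assumption: the problem text lives in a "problem" column and the
    quarantine is an exact-match set of benchmark problem strings.
    """
    return row["is_valid"] == 1 and row["problem"] not in quarantined_problems
```

A downloaded copy can then be compared with `sha256_of_file(local_path) == EXPECTED_SHA256` before it is used for training or re-tokenization.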

Training Procedure

Training mode:

  • Full-parameter SFT
  • No LoRA / PEFT
  • Hugging Face Trainer
  • FSDP: full_shard auto_wrap
  • Loss is computed only on assistant response tokens.
  • Prompt tokens are masked by setting their labels to -100.
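
The masking scheme in the last two items can be sketched as follows. This is an illustrative single-turn example, not the experiment's tokenization code: `build_labels` and its arguments are hypothetical names, and truncating from the right at the maximum length is an assumption.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def build_labels(prompt_ids, response_ids, max_length=4096):
    """Concatenate prompt and response, masking prompt positions in the labels.

    Assumption: one prompt followed by one assistant response, truncated
    from the right to max_length tokens.
    """
    input_ids = (list(prompt_ids) + list(response_ids))[:max_length]
    labels = ([IGNORE_INDEX] * len(prompt_ids) + list(response_ids))[:max_length]
    return input_ids, labels


input_ids, labels = build_labels([11, 12, 13], [21, 22])
# input_ids keeps every token; labels keeps only the response tokens.
```

Because the labels at prompt positions are -100, those positions contribute nothing to the loss, so gradients come only from predicting the assistant response.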

Core hyperparameters:

Parameter                      Value
Epochs                         1
Max length                     4096
Per-device train batch size    1
Gradient accumulation steps    16
Learning rate                  2.0e-5
Warmup ratio                   0.03
LR scheduler                   cosine
Precision                      bf16
Optimizer                      adamw_torch
Save steps                     500
Save total limit               3
Seed                           20260629
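
For orientation, the hyperparameters above map onto Hugging Face `TrainingArguments` keyword names roughly as shown below, followed by a small re-implementation of a linear-warmup cosine schedule. This is a sketch under assumptions, not the experiment's config files: the total step count and the number of GPUs are not stated in this card, so `total_steps` and `world_size` are hypothetical inputs.

```python
import math

# Keyword names follow transformers.TrainingArguments; values are from the table above.
sft_args = {
    "num_train_epochs": 1,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "learning_rate": 2.0e-5,
    "warmup_ratio": 0.03,
    "lr_scheduler_type": "cosine",
    "bf16": True,
    "optim": "adamw_torch",
    "save_steps": 500,
    "save_total_limit": 3,
    "seed": 20260629,
    "fsdp": "full_shard auto_wrap",
}
MAX_LENGTH = 4096  # applied at tokenization time, not a TrainingArguments field


def sequences_per_optimizer_step(world_size):
    """Global batch size in sequences; world_size (GPU count) is not given in this card."""
    return (sft_args["per_device_train_batch_size"]
            * sft_args["gradient_accumulation_steps"] * world_size)


def lr_at_step(step, total_steps, base_lr=2.0e-5, warmup_ratio=0.03):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With a per-device batch of 1 and 16 accumulation steps, each GPU contributes 16 sequences per optimizer step, so the global batch scales linearly with the number of GPUs.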

Infrastructure:

  • ABCI 3.0
  • Group: gcg51557
  • Reserved queue: R9920261000
  • Experiment directory: /groups/gcg51557/experiments/0399_tv_sft
  • SFT jobs:
    • Thinking: 2004401.pbs1
    • Base: 2004402.pbs1
  • HF checkpoint sync job:
    • 2004479.pbs1

Monitoring Status

Both initial production SFT runs were monitored past checkpoint-500.

Observed stability:

Run         Step observed    Memory plateau        Error status
Thinking    600+             ~303 GB / 1.92 TB     No OOM/NCCL/Traceback observed
Base        590+             ~294 GB / 1.92 TB     No OOM/NCCL/Traceback observed

Notes:

  • Both runs skipped GPU-side JSONL regeneration.
  • Both runs used the uploaded tokenized dataset.
  • W&B online logging was confirmed.
  • checkpoint-500 was written locally and uploaded to Hugging Face Hub.
  • The base run logged grad_norm=inf in early steps while the loss remained finite. Take this into account during downstream quality review.
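
One way to locate the affected steps is to scan the `log_history` list in the uploaded trainer_state.json for non-finite `grad_norm` values. The helper below is a hypothetical sketch; it assumes each log entry is a dict with `step` and a numeric `grad_norm` key, which depends on the transformers version that wrote the file.

```python
import json
import math


def nonfinite_grad_norm_steps(trainer_state):
    """Return the logged steps whose grad_norm is inf or NaN."""
    steps = []
    for entry in trainer_state.get("log_history", []):
        grad_norm = entry.get("grad_norm")
        if isinstance(grad_norm, (int, float)) and not math.isfinite(grad_norm):
            steps.append(entry.get("step"))
    return steps


# Python's json module parses the non-standard tokens Infinity and NaN by default.
example_state = json.loads(
    '{"log_history": ['
    '{"step": 10, "loss": 1.2, "grad_norm": Infinity},'
    '{"step": 20, "loss": 1.1, "grad_norm": 0.9}]}'
)
flagged = nonfinite_grad_norm_steps(example_state)
```

For a real checkpoint, load the file with `json.load` and pass the resulting dict to the same function.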

Uploaded Checkpoint Contents

Each checkpoint-500/ directory on Hub contains:

  • model.safetensors
  • config.json
  • generation_config.json
  • tokenizer files
  • trainer_state.json
  • training_args.bin
  • RNG states
  • scheduler.pt

The following large local training-state files are intentionally not uploaded to Hub:

  • optimizer.bin
  • pytorch_model_fsdp.bin

They are retained in the ABCI experiment directory for local recovery/debugging.
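
The exclusion rule can be expressed as a simple filename filter. The sketch below uses only the standard library and is not the sync job's actual code; `files_to_upload` is a hypothetical helper. In practice the same patterns could be passed as `ignore_patterns` to `huggingface_hub`'s `upload_folder`.

```python
from fnmatch import fnmatch

# Files kept local only, as listed above.
EXCLUDED_PATTERNS = ["optimizer.bin", "pytorch_model_fsdp.bin"]


def files_to_upload(filenames, ignore_patterns=EXCLUDED_PATTERNS):
    """Keep every filename that matches none of the ignore patterns."""
    return [
        name for name in filenames
        if not any(fnmatch(name, pattern) for pattern in ignore_patterns)
    ]


local_files = [
    "model.safetensors",
    "config.json",
    "optimizer.bin",
    "pytorch_model_fsdp.bin",
    "scheduler.pt",
]
uploaded = files_to_upload(local_files)
```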

Evaluation

Full benchmark evaluation is not yet included in this model card.

Planned gates for experiment 0399:

Primary math gates:

  • AIME 2024
  • AIME 2025
  • AIME 2026
  • MATH-500

Regression gates:

  • LiveCodeBench
  • IFEval
  • MT-Bench

Final-candidate-only gates:

  • GPQA Diamond
  • BBH
  • MMLU-Pro

Do not treat this checkpoint as validated until these evaluations are complete.
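
For the AIME-style gates, pass@1 is commonly estimated as the fraction of sampled answers that are correct, averaged over problems. The helper below is a minimal sketch of that estimate. The experiment's actual evaluation harness, sampling settings, and answer-matching rules are not described in this card, and the input format is hypothetical.

```python
def pass_at_1(results):
    """Mean per-problem fraction of correct samples.

    results maps a problem id to a list of booleans, one per sampled answer.
    """
    per_problem = [
        sum(samples) / len(samples) for samples in results.values() if samples
    ]
    if not per_problem:
        raise ValueError("no graded samples provided")
    return sum(per_problem) / len(per_problem)


score = pass_at_1({"p1": [True, False], "p2": [True, True]})
```

Averaging per problem first keeps problems with more samples from dominating the score.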

Limitations

  • This is an intermediate full-parameter SFT checkpoint from an active experiment.
  • The checkpoint is optimized for math-reasoning-style data and may regress on non-math tasks.
  • Training data may contain long chain-of-thought-style solutions, so generated outputs may be verbose.
  • Benchmark contamination mitigation was applied, but no decontamination process is perfect.
  • The uploaded checkpoint-500 is early in a longer training run and should not be interpreted as final model quality.
  • Safety alignment was not the primary target of this experiment.

Citation / Attribution

Base models:

@misc{llmjp4,
  title = {LLM-jp-4 8B Models},
  author = {LLM-jp},
  year = {2026},
  url = {https://huggingface.co/llm-jp}
}

Experiment tracking:

Reproducibility Metadata

Experiment ID: 0399

Experiment slug: tv_sft

Canonical experiment directory:

/groups/gcg51557/experiments/0399_tv_sft

Key manifests:

/groups/gcg51557/experiments/0399_tv_sft/manifests/tokenized_sft_tv_valid_clean_llmjp4_8b.json
/groups/gcg51557/experiments/0399_tv_sft/manifests/sft_stability_monitor_20260701.json
/groups/gcg51557/experiments/0399_tv_sft/manifests/hf_checkpoint_sync_2004479.pbs1.json

Training configs:

/groups/gcg51557/experiments/0399_tv_sft/configs/sft_full_thinking.yaml
/groups/gcg51557/experiments/0399_tv_sft/configs/sft_full_base.yaml