gpt-oss-20b-uav-doctrine (LoRA adapter)

A LoRA adapter that fine-tunes openai/gpt-oss-20b for Q&A on Unmanned Aerial Vehicle (UAV) combat doctrine, drawing on publicly released US military, NATO, and allied doctrinal publications.

🚀 Live demo: try it on HuggingFace Spaces. Ask v4 a UAV-doctrine question, or compare it side by side with the un-adapted base model.

⚠️ Scope and intent. This is a research artifact for studying domain-specific SFT on a 20B MoE model. It is trained exclusively on publicly-released, unclassified doctrinal publications and academic / think-tank analysis. It must not be used for operational decision-making. Outputs may be confidently wrong; doctrine evolves and this adapter is a snapshot.

TL;DR

  • Base: openai/gpt-oss-20b (MoE, 32 experts)
  • Method: LoRA, r=16, α=32 (scaling 2.0), dropout=0.0, lr=2e-4, 2 epochs, gpt_oss_no_sysprompt harmony renderer (answers are emitted directly in the final channel, with no analysis/CoT trace)
  • Training data: 18,730 synthetic Q&A pairs derived from a 58-document public-domain UAV/UAS doctrine corpus
  • Eval: 82.6% win rate (95% CI 80.9–84.4%) over the un-adapted base model on a 1,873-example held-out test set, judged by GPT-5.4 on factual_accuracy / completeness / grounding (1–5). Position-bias-corrected via randomized-position re-judging: 80.7% (95% CI 76.3–85.0%).
  • Adapter size: ~895 MB (fp32 LoRA tensors; 4,802 tensors across attention q/k/v/o, all 32 per-layer experts' gate/up/down, and lm_head)

Quick start

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "openai/gpt-oss-20b"
ADAPTER = "meftun/gpt-oss-20b-uav-doctrine"

# System prompt MUST match the one used during training.
SYSTEM_PROMPT = (
    "You are an expert assistant on Unmanned Aerial Vehicle (UAV) combat "
    "doctrine. Provide accurate, concise answers grounded in US, NATO, and "
    "allied doctrinal publications."
)

tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is operational design in joint planning?"},
]
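# return_dict=True yields input_ids plus attention_mask, which generate() takes as keyword arguments.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=300, temperature=0.3, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))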

# Sample answer (v4, temperature=0.3):
# "Operational design is the process of developing a campaign or operation
#  plan by understanding the problem, visualizing the desired end state, and
#  determining the steps needed to get there."

Loading note (MoE LoRA). This adapter places LoRA on the per-expert projections (experts.{0..31}.{gate,up,down}_proj). The attention and lm_head LoRA load cleanly under vanilla peft+transformers. The per-expert LoRA is in the layout produced by tinker-cookbook's build_lora_adapter, intended for serving stacks with native per-expert LoRA (vLLM --lora-modules, SGLang --lora-paths). If your runtime represents gpt-oss experts as a single fused module, you may need to merge the adapter or use a serving framework that supports per-expert LoRA. The model was trained and evaluated through the Tinker sampling API.

Training data

The corpus is 58 publicly-released documents (~3.39M extracted tokens, ~3.22M after cleaning, i.e. 95.2% retention). It combines a 30-document Phase-1 baseline with a documented 28-document expansion. The expansion spans four tiers:

  • Tier 1 (official doctrine): 12 documents. US Army FMs/ATPs (FM 3-60 Targeting, FM 3-81, ATP 3-04.x), UK JDPs/JDNs (JDP 2-00, JDN 3/19), USAF AFDPs (3-01 Counterair, 3-03 Counterland), NATO AJPs (3.9 Targeting, 3.10 Info Ops, 3.20 Cyberspace).
  • Tier 2 (academic & war college): 5 documents. NPS theses (swarm-vs-swarm, MAGTF swarm comms), SAMS monograph, DTIC, HDIAC.
  • Tier 3 (think tank): 6 documents. CRS (R48477 DoD C-UAS, IN12661 law-enforcement C-UAS), ICRC AWS/IHL position papers (2022, 2025, 2026), CNAS ethical-autonomy paper.
  • Tier 4 (modern case studies): 5 documents. RUSI Ukraine lessons (2022, 2024, 2025), US Army Infantry Magazine, LIIA drone-revolution paper.

The 30-document baseline additionally covers US Joint Pubs (JP 3-0, 3-01, 3-09, 3-30, 3-52, 3-60, 3-85, 5-0, 2-0, 3-13.1), more NATO AJPs, UK air-power doctrine, EASA/ICAO/FAA civil-aviation UAS material, the DoD Dictionary, and DoDD 3000.09. All source documents are publicly available; the corpus contains no classified, FOUO, CUI, or NOFORN material.

Documents were cleaned (header/footer detection, hyphenation repair, ligature normalization) and chunked at a 1,200-token target with 150-token overlap (max 1,500) using the openai/gpt-oss-20b tokenizer, producing 4,051 chunks. A multi-signal quality filter (TOC, bibliography, low-content, page-ref, cover-metadata detectors; every detector except the <3-sentence floor requires ≥2 independent signals) passed 3,747 / 4,051 chunks (92.5%); only passing chunks were used for Q&A generation.
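The windowing arithmetic behind the chunking step can be sketched as follows. This is a simplified illustration, not the project's chunker: `chunk_tokens` is a hypothetical helper, it works on any list of token ids, and it ignores the 1,500-token maximum and any sentence-boundary handling the real pipeline may apply.

```python
def chunk_tokens(token_ids, target=1200, overlap=150):
    """Split a token-id list into windows of `target` tokens sharing `overlap` tokens.

    Simplified sketch: fixed-size windows only, no sentence-boundary logic.
    """
    if not 0 <= overlap < target:
        raise ValueError("overlap must be non-negative and smaller than target")
    stride = target - overlap  # each new window starts this many tokens later
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + target])
        if start + target >= len(token_ids):
            break  # the window already reached the end of the document
        start += stride
    return chunks


# Stand-in for a tokenized document of 3,000 tokens.
doc = list(range(3000))
chunks = chunk_tokens(doc)
print([len(c) for c in chunks])
print(chunks[0][-150:] == chunks[1][:150])  # adjacent windows share 150 tokens
```

With a 1,200-token target and 150-token overlap the stride is 1,050 tokens, so the final window of a document is usually shorter than the target.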

GPT-5.4 (via Azure OpenAI) generated 5 Q&A pairs per passing chunk under a strict rubric (no yes/no, no document-structure references, paraphrase rather than verbatim quote). Yield: 18,730 pairs (one chunk of JP 3-09 Joint Fire Support material was lost to the Azure content filter). Question types are balanced: 26.8% factual / 23.5% definitional / 24.6% procedural / 25.1% conceptual.

Train/val/test split is chunk-level (group-aware), 85 / 5 / 10: 15,905 train / 935 val / 1,875 test pairs. Two entire small sources were held out as an additional out-of-distribution eval: crs_2026_law_enforcement_cuas_IN12661 (counter-UAS) and faa_notice_uas_lost_link (lost-link procedures), 15 pairs total. Because all pairs generated from a chunk land in the same split, no chunk contributes Q&A pairs to more than one split. Adjacent chunks still share a 150-token overlap, so this guarantee holds at the pair level rather than the sentence level.
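A group-aware split of this kind can be sketched in a few lines. The record layout (a `chunk_id` key on each pair), the function name `split_by_chunk`, and the seed are assumptions made for illustration; the project's actual splitting code is not shown in this card.

```python
import random
from collections import defaultdict


def split_by_chunk(pairs, fractions=(0.85, 0.05, 0.10), seed=0):
    """Assign whole chunks to train/val/test so no chunk feeds two splits."""
    by_chunk = defaultdict(list)
    for pair in pairs:
        by_chunk[pair["chunk_id"]].append(pair)

    chunk_ids = sorted(by_chunk)
    random.Random(seed).shuffle(chunk_ids)  # deterministic shuffle of chunk ids

    n_train = round(fractions[0] * len(chunk_ids))
    n_val = round(fractions[1] * len(chunk_ids))
    groups = {
        "train": chunk_ids[:n_train],
        "val": chunk_ids[n_train:n_train + n_val],
        "test": chunk_ids[n_train + n_val:],  # remainder goes to test
    }
    return {
        name: [pair for cid in ids for pair in by_chunk[cid]]
        for name, ids in groups.items()
    }


# Toy data: 100 chunks with 5 Q&A pairs each.
pairs = [{"chunk_id": c, "question": f"q{c}-{i}"} for c in range(100) for i in range(5)]
splits = split_by_chunk(pairs)
print({name: len(items) for name, items in splits.items()})
```

Splitting on chunk ids rather than on individual pairs is what keeps the five questions written about one passage from appearing on both sides of the train/test boundary.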

Training procedure

  • Platform: Tinker (Thinking Machines Lab), managed LoRA training on distributed GPU
  • Method: LoRA, r=16, α=32 (Tinker default of 2×rank; the training config set rank only), dropout=0.0. Trained on all-linear modules (attention + MLP experts + unembedding); the exported PEFT target_modules resolve to q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, lm_head.
  • Optimizer: Adam (β1=0.9, β2=0.95, ε=1e-8), gradient clip 1.0, weight decay 0.0
  • Learning rate: 2e-4, linear warmup 100 steps → linear decay
  • Epochs: 2 (994 steps, batch size 32, max sequence length 512; corpus p99 is ~240 tokens)
  • Effective training tokens: ~3.98M; 31,743 loss tokens
  • Wall time: 1h06m on Tinker infrastructure
  • The training data contains no chain-of-thought traces; the model was fine-tuned to answer directly in the final channel without using the analysis channel.
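The step count, LoRA scaling, and learning-rate schedule listed above can be reproduced with a little arithmetic. The sketch below makes two assumptions that the card does not state: the final partial batch is dropped, and the linear decay ends at zero on the last step. `lr_at` is a hypothetical helper, not a Tinker API.

```python
TRAIN_PAIRS = 15_905
EPOCHS = 2
BATCH_SIZE = 32
PEAK_LR = 2e-4
WARMUP_STEPS = 100
LORA_RANK = 16
LORA_ALPHA = 32

# Assumption: the last partial batch is dropped (31,810 examples // 32).
total_steps = TRAIN_PAIRS * EPOCHS // BATCH_SIZE

# LoRA scales the low-rank update by alpha / rank.
lora_scale = LORA_ALPHA / LORA_RANK


def lr_at(step, total=total_steps, peak=PEAK_LR, warmup=WARMUP_STEPS):
    """Linear warmup to `peak`, then linear decay (assumed to reach zero at `total`)."""
    if step < warmup:
        return peak * (step + 1) / warmup
    return peak * max(0.0, (total - step) / (total - warmup))


print(total_steps, lora_scale)
for step in (0, 99, 500, total_steps):
    print(step, lr_at(step))
```

The integer division gives 994 steps, matching the figure in the list, and α/r gives the 2.0 scaling quoted in the TL;DR.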

Why these hyperparameters

A four-variant single-axis ablation chose the production checkpoint. Test/held-out NLL are computed on the inference path (compute_logprobs), not the training-time forward/backward path:

| variant | change vs v1 | test NLL | held-out NLL | verdict |
|---|---|---|---|---|
| v1 | baseline (r=16, 2 epochs, lr 1e-4) | 2.004 | 1.990 | baseline |
| v2 | rank 16 → 32 | 2.003 | 2.006 | rank capacity not the bottleneck |
| v3 | 2 → 3 epochs | 2.081 | 2.039 | overfit on the inference path |
| v4 | lr 1e-4 → 2e-4 | 1.987 | 1.987 | winner |

v4 had the lowest NLL on both test and held-out (no overfitting signature) and is what this repo publishes.
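For readers unfamiliar with the metric, the sketch below shows one common way to turn per-token log-probabilities into a single NLL figure. It assumes a token-weighted mean in nats over answer tokens; the card does not state the exact normalization, and `mean_nll` is a hypothetical helper rather than part of the Tinker API.

```python
import math


def mean_nll(per_example_logprobs):
    """Token-weighted mean negative log-likelihood, in nats."""
    total_tokens = sum(len(lps) for lps in per_example_logprobs)
    if total_tokens == 0:
        raise ValueError("no tokens to score")
    total_logprob = sum(sum(lps) for lps in per_example_logprobs)
    return -total_logprob / total_tokens


# Toy log-probabilities for two answers (2 tokens and 1 token).
toy = [[-2.0, -1.0], [-3.0]]
nll = mean_nll(toy)
print(nll)            # mean NLL in nats per token
print(math.exp(nll))  # the same quantity expressed as perplexity
```

Under this definition the differences in the table are small: v4's 1.987 and v1's 2.004 differ by under 0.02 nats per token.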

Evaluation

Eval compares v4 against the un-adapted base model (gpt-oss-20b, no adapter, same renderer, same 300-token budget). For every test example, both models generate one answer; GPT-5.4 then judges the pair side-by-side on three 1–5 axes plus a winner.

Aggregate scores (GPT-5.4 judge, 1–5 each axis; delta = v4 − baseline)

| axis | split | v4 | baseline | delta | 95% CI on delta |
|---|---|---|---|---|---|
| factual_accuracy | test | 2.719 | 1.081 | +1.638 | [+1.585, +1.691] |
| completeness | test | 2.340 | 1.061 | +1.279 | [+1.231, +1.324] |
| grounding | test | 2.613 | 1.054 | +1.559 | [+1.504, +1.611] |
| factual_accuracy | held-out | 2.200 | 1.000 | +1.200 | [+0.667, +1.800] |
| completeness | held-out | 1.733 | 1.000 | +0.733 | [+0.333, +1.267] |
| grounding | held-out | 2.200 | 1.000 | +1.200 | [+0.667, +1.800] |

Win rates

| eval set | n | v4 wins | baseline wins | tie | 95% CI on v4 win rate |
|---|---|---|---|---|---|
| test (original) | 1,873 | 82.6% | 1.3% | 16.1% | 80.9–84.4% |
| test (bias-corrected) | 300 | 80.7% | 1.7% | 17.7% | 76.3–85.0% |
| held-out source | 15 | 66.7% | 0.0% | 33.3% | 40.0–93.3% |
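The card does not say how the 95% intervals were computed. One common choice for win rates is a percentile bootstrap, sketched below as an assumption rather than a description of the project's method. `bootstrap_win_rate_ci`, the resample count, and the seed are all illustrative.

```python
import random


def bootstrap_win_rate_ci(outcomes, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of a 0/1 outcome list."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample with replacement and record the win rate of each resample.
    rates = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return rates[lo_idx], rates[hi_idx]


# Held-out source: 10 v4 wins out of 15 examples (66.7%); ties count as non-wins.
held_out = [1] * 10 + [0] * 5
low, high = bootstrap_win_rate_ci(held_out)
print(f"win rate {sum(held_out) / len(held_out):.1%}, 95% CI {low:.1%} to {high:.1%}")
```

With only 15 examples every resampled win rate is a multiple of 1/15, so the interval endpoints are coarse. The reported 40.0–93.3% bounds are themselves multiples of 1/15 (6/15 and 14/15), which is consistent with a resampling-style interval, though exact endpoints depend on the method and seed.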

Bias correction: a 300-example stratified sample (75 per question type) was re-judged with model positions swapped. The position-A preference rate was 48.5%, indistinguishable from chance, and within-example agreement between original and swapped judgments was 94.7%. The original win rate is therefore not materially inflated by position bias.
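The two statistics in this paragraph can be computed from paired judgments as sketched below. The record layout (`v4_slot`, `original`, `swapped`) and the exact definitions are assumptions: here the position-A rate is the share of non-tie verdicts, across both passes, that went to whichever answer sat in slot A, and agreement is the share of examples whose winner label did not change after the swap.

```python
def position_bias_stats(records):
    """Return the position-A preference rate and original-vs-swapped agreement."""
    slot_a_wins = decided = agree = 0
    for rec in records:
        swapped_slot = "B" if rec["v4_slot"] == "A" else "A"
        passes = ((rec["original"], rec["v4_slot"]), (rec["swapped"], swapped_slot))
        for winner, v4_slot in passes:
            if winner == "tie":
                continue  # ties express no positional preference
            decided += 1
            baseline_slot = "B" if v4_slot == "A" else "A"
            winner_slot = v4_slot if winner == "v4" else baseline_slot
            slot_a_wins += winner_slot == "A"
        agree += rec["original"] == rec["swapped"]
    return {
        "position_a_rate": slot_a_wins / decided if decided else float("nan"),
        "agreement": agree / len(records) if records else float("nan"),
    }


# Toy judgments: `v4_slot` is where v4's answer sat in the original pass.
records = [
    {"v4_slot": "A", "original": "v4", "swapped": "v4"},
    {"v4_slot": "B", "original": "v4", "swapped": "v4"},
    {"v4_slot": "A", "original": "v4", "swapped": "tie"},
    {"v4_slot": "B", "original": "baseline", "swapped": "baseline"},
]
print(position_bias_stats(records))
```

A judge with no positional preference yields a position-A rate near 50% regardless of how often v4 wins, because each answer occupies slot A in exactly one of the two passes.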

Per-question-type win rates (test)

| question_type | n | v4 win rate |
|---|---|---|
| conceptual | 480 | 90.0% |
| definitional | 434 | 86.4% |
| procedural | 455 | 81.5% |
| factual | 504 | 73.2% |

Limitations and known failure modes

  1. Specificity drift. v4 reliably gets the shape of an answer right but routinely drops doctrinal specifics: named codes, specific cargo categories, certificate-lookup procedures. Judge scores cluster at 2–3 on the factual axis for this reason. Treat outputs as a starting point that needs verification against the source document, not a finished answer.

  2. Baseline budget exhaustion confounds the headline win rate. The un-adapted base model exhausts its 300-token budget inside the analysis channel 91.8% of the time on this rubric, producing no final answer. The 82.6% win rate includes this effect. Restricted to the 154 test cases where the baseline produced a final answer, v4 still wins 77.9%; that is the cleaner doctrine-knowledge signal.

  3. Held-out-source generalization is anecdotal. Only 15 held-out examples; the wide CI (40.0–93.3%) reflects this.

  4. No safety / red-teaming pass. This adapter has not been red-teamed for refusal robustness, jailbreak resistance, or capability misuse. It is a research model.

  5. No CoT in training data. The training Q&A pairs are direct answers without reasoning traces. The adapter has effectively learned to suppress the analysis channel (mean output dropped from ~297 tokens for the base model to ~49 tokens for v4). If you need reasoning behavior, this is not the adapter for you.

  6. English only. Training data is English-only.

  7. Doctrine evolves. This snapshot reflects publications available as of May 2026. Some targeted sources (several USMC MCDPs, several AFDPs, some CRS reports) were blocked by host-level WAFs during corpus collection and are not represented.

Intended use and out-of-scope use

Intended: research on domain-specific SFT methodology; studying how a 20B MoE responds to LoRA fine-tuning on technical-document Q&A; blog/paper material.

Out-of-scope: operational military decision support; training material substituting for doctrine itself; any claim of doctrinal authority. The model is a paraphrase generator over public doctrine; it is not doctrine.

Methodological findings worth noting

  1. Training-time val NLL is on a different scale than inference-path NLL. v4's training-time val NLL (1.28, forward/backward path) sat ~0.71 nats below its inference-path test NLL (1.99, compute_logprobs) on the same checkpoint. Do not use training-time val NLL for cross-variant decisions; always re-evaluate via the inference path.
  2. Eval NLL did not track judge-perceived quality. Procedural questions had among the highest NLL but were not the lowest-scoring under the judge; factual questions were the lowest-scoring (73.2% v4 win rate, highest tie rate).
  3. The largest practical win from SFT on a reasoning-tuned base with non-CoT data was teaching the model to skip the analysis channel entirely (mean output 297 → 49 tokens, quality up).
  4. Position bias in LLM-as-judge appears to be a property of free-form preference judging that does not transfer to strict structured-rubric judging. On a 300-example randomized-position re-judging pass, the position-A preference rate was 48.5% (indistinguishable from chance).

Reproducibility

  • Training/eval journals: docs/journal/01..11 in the project repository (private at time of release), covering data inventory, corpus expansion, cleaning/chunking, quality filtering, Q&A generation, ChatML formatting, the SFT run, the ablation sweep, the full eval, the position-bias control, and this HF push.
  • Base model: openai/gpt-oss-20b
  • Adapter export: tinker-cookbook build_lora_adapter (Tinker LoRA → PEFT format).

Citation

@misc{gpt-oss-20b-uav-doctrine,
  author = {meftun},
  title = {gpt-oss-20b-uav-doctrine: LoRA adapter for UAV combat doctrine Q&A},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/meftun/gpt-oss-20b-uav-doctrine}
}

Acknowledgments

  • OpenAI for the open-weight gpt-oss-20b base model
  • Thinking Machines Lab for the Tinker training API
  • Authors and publishers of the source doctrinal documents (full list in docs/journal/02-data-expansion.md)