gpt-oss-20b-uav-doctrine (LoRA adapter)

A LoRA adapter that fine-tunes openai/gpt-oss-20b for Q&A on unmanned aerial vehicle (UAV) combat doctrine, drawing on publicly released US military, NATO, and allied doctrinal publications.

🚀 Live demo: Try it on HuggingFace Spaces — ask v4 a UAV-doctrine question, or compare it side by side with the un-adapted base model.

⚠️ Scope and intent. This is a research artifact for studying domain-specific SFT on a 20B MoE model. It is trained exclusively on publicly-released, unclassified doctrinal publications and academic / think-tank analysis. It must not be used for operational decision-making. Outputs may be confidently wrong; doctrine evolves and this adapter is a snapshot.

TL;DR

Base: openai/gpt-oss-20b (MoE, 32 experts)
Method: LoRA, r=16, α=32 (scaling 2.0), dropout=0.0, lr=2e-4, 2 epochs, gpt_oss_no_sysprompt harmony renderer (answers are emitted directly in the final channel — no analysis/CoT trace)
Training data: 18,730 synthetic Q&A pairs derived from a 58-document public-domain UAV/UAS doctrine corpus
Eval: 82.6% win rate (95% CI 80.9–84.4%) over the un-adapted base model on a 1,873-example held-out test set, judged by GPT-5.4 on factual_accuracy / completeness / grounding (1–5). Position-bias-corrected via randomized-position re-judging: 80.7% (95% CI 76.3–85.0%).
Adapter size: ~895 MB (fp32 LoRA tensors; 4,802 tensors across attention q/k/v/o, all 32 per-layer experts' gate/up/down, and lm_head)
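The LoRA scaling factor and the tensor count quoted above can be checked with a few lines of arithmetic. This is a sanity-check sketch, not part of the training code. The layer count of 24 is an assumption taken from the base model's published configuration rather than from this card, as is the convention of one lora_A and one lora_B tensor per adapted module.

```python
# Hyperparameters as stated in the TL;DR.
rank, alpha = 16, 32
scaling = alpha / rank  # LoRA multiplies the low-rank update B @ A by alpha / r

# Assumption (not stated in this card): gpt-oss-20b has 24 transformer layers.
n_layers, n_experts = 24, 32
tensors_per_module = 2  # assumed: one lora_A and one lora_B per adapted linear module

attention = n_layers * 4 * tensors_per_module            # q, k, v, o projections
experts = n_layers * n_experts * 3 * tensors_per_module  # gate, up, down per expert
lm_head = 1 * tensors_per_module

total = attention + experts + lm_head
print(scaling, attention, experts, lm_head, total)  # 2.0 192 4608 2 4802
```

Under these assumptions the per-expert projections account for 4,608 of the 4,802 tensors, which is why runtime support for per-expert LoRA matters when loading this adapter.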

Quick start

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "openai/gpt-oss-20b"
ADAPTER = "meftun/gpt-oss-20b-uav-doctrine"

# System prompt MUST match the one used during training.
SYSTEM_PROMPT = (
    "You are an expert assistant on Unmanned Aerial Vehicle (UAV) combat "
    "doctrine. Provide accurate, concise answers grounded in US, NATO, and "
    "allied doctrinal publications."
)

tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()

messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is operational design in joint planning?"},
]
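# return_dict=True yields input_ids plus attention_mask, which generate() accepts via **inputs.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(model.device)
out = model.generate(**inputs, max_new_tokens=300, temperature=0.3, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tok.decode(out[0, prompt_len:], skip_special_tokens=True))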

# Sample answer (v4, temperature=0.3):
# "Operational design is the process of developing a campaign or operation
#  plan by understanding the problem, visualizing the desired end state, and
#  determining the steps needed to get there."

Loading note (MoE LoRA). This adapter places LoRA on the per-expert projections (experts.{0..31}.{gate,up,down}_proj). The attention and lm_head LoRA load cleanly under vanilla peft+transformers. The per-expert LoRA is in the layout produced by tinker-cookbook's build_lora_adapter, intended for serving stacks with native per-expert LoRA (vLLM --lora-modules, SGLang --lora-paths). If your runtime represents gpt-oss experts as a single fused module, you may need to merge the adapter or use a serving framework that supports per-expert LoRA. The model was trained and evaluated through the Tinker sampling API.

Training data

The corpus is 58 publicly-released documents (~3.39M extracted tokens, ~3.22M after cleaning — 95.2% retention). It combines a 30-document Phase-1 baseline with a documented 28-document expansion. The expansion spans four tiers:

Tier 1 — Official doctrine: 12 documents. US Army FMs/ATPs (FM 3-60 Targeting, FM 3-81, ATP 3-04.x), UK JDPs/JDNs (JDP 2-00, JDN 3/19), USAF AFDPs (3-01 Counterair, 3-03 Counterland), NATO AJPs (3.9 Targeting, 3.10 Info Ops, 3.20 Cyberspace).
Tier 2 — Academic & war college: 5 documents. NPS theses (swarm-vs-swarm, MAGTF swarm comms), SAMS monograph, DTIC, HDIAC.
Tier 3 — Think tank: 6 documents. CRS (R48477 DoD C-UAS, IN12661 law-enforcement C-UAS), ICRC AWS/IHL position papers (2022, 2025, 2026), CNAS ethical-autonomy paper.
Tier 4 — Modern case studies: 5 documents. RUSI Ukraine lessons (2022, 2024, 2025), US Army Infantry Magazine, LIIA drone-revolution paper.

The 30-document baseline additionally covers US Joint Pubs (JP 3-0, 3-01, 3-09, 3-30, 3-52, 3-60, 3-85, 5-0, 2-0, 3-13.1), more NATO AJPs, UK air-power doctrine, EASA/ICAO/FAA civil-aviation UAS material, the DoD Dictionary, and DoDD 3000.09. All source documents are publicly available; the corpus contains no classified, FOUO, CUI, or NOFORN material.

Documents were cleaned (header/footer detection, hyphenation repair, ligature normalization) and chunked at a 1,200-token target with 150-token overlap (max 1,500) using the openai/gpt-oss-20b tokenizer, producing 4,051 chunks. A multi-signal quality filter (TOC, bibliography, low-content, page-ref, cover-metadata detectors; every detector except the <3-sentence floor requires ≥2 independent signals) passed 3,747 / 4,051 chunks (92.5%); only passing chunks were used for Q&A generation.
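The overlapping-window chunking described above can be sketched as a fixed-stride split over token IDs. This is a minimal illustration, not the project's chunker: it takes the 1,200-token target and 150-token overlap from the text, but it does not model the 1,500-token maximum or any boundary-aware logic the real pipeline may use. The function name chunk_tokens is hypothetical.

```python
from typing import List


def chunk_tokens(token_ids: List[int], target: int = 1200, overlap: int = 150) -> List[List[int]]:
    """Split a token sequence into windows of `target` tokens that share `overlap` tokens."""
    if not 0 <= overlap < target:
        raise ValueError("overlap must be non-negative and smaller than target")
    stride = target - overlap
    chunks = []
    start = 0
    while start < len(token_ids):
        chunks.append(token_ids[start:start + target])
        if start + target >= len(token_ids):
            break  # the window just emitted already reached the end of the document
        start += stride
    return chunks


# Toy example with integers standing in for token IDs.
doc = list(range(3000))
chunks = chunk_tokens(doc)
print([len(c) for c in chunks])  # [1200, 1200, 900]
```

With these defaults each window starts 1,050 tokens after the previous one, so the last 150 tokens of one chunk reappear at the start of the next.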

GPT-5.4 (via Azure OpenAI) generated 5 Q&A pairs per passing chunk under a strict rubric (no yes/no, no document-structure references, paraphrase rather than verbatim quote). Yield: 18,730 pairs (one chunk — JP 3-09 Joint Fire Support material — was lost to the Azure content filter). Question types are balanced: 26.8% factual / 23.5% definitional / 24.6% procedural / 25.1% conceptual.

Train/val/test split is chunk-level (group-aware), 85 / 5 / 10: 15,905 train / 935 val / 1,875 test pairs. Two entire small sources were held out as an additional out-of-distribution eval: crs_2026_law_enforcement_cuas_IN12661 (counter-UAS) and faa_notice_uas_lost_link (lost-link procedures), 15 pairs total. Because all pairs generated from a chunk land in the same split, no chunk contributes pairs to more than one split. Adjacent chunks still share the 150-token overlap, so some shared context between splits may remain.
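A group-aware split of this kind can be sketched as follows: chunk IDs are shuffled and assigned to splits, and each pair follows its chunk. This is an illustrative sketch rather than the project's splitting code. The record field chunk_id, the function name split_by_chunk, and the seed are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_chunk(pairs, fractions=(0.85, 0.05, 0.10), seed=0):
    """Assign whole chunks to train/val/test so pairs from one chunk never straddle splits."""
    by_chunk = defaultdict(list)
    for pair in pairs:
        by_chunk[pair["chunk_id"]].append(pair)

    chunk_ids = sorted(by_chunk)
    random.Random(seed).shuffle(chunk_ids)

    n = len(chunk_ids)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)
    groups = {
        "train": chunk_ids[:n_train],
        "val": chunk_ids[n_train:n_train + n_val],
        "test": chunk_ids[n_train + n_val:],  # remainder goes to test
    }
    return {name: [p for cid in ids for p in by_chunk[cid]] for name, ids in groups.items()}


# Toy data: 100 chunks with 5 pairs each (field names are hypothetical).
pairs = [{"chunk_id": c, "question": f"q{c}-{i}"} for c in range(100) for i in range(5)]
splits = split_by_chunk(pairs)
print({name: len(items) for name, items in splits.items()})
```

Splitting on chunk IDs rather than on individual pairs is what keeps questions about the same passage from appearing on both sides of the train/test boundary.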

Training procedure

Platform: Tinker (Thinking Machines Lab), managed LoRA training on distributed GPUs
Method: LoRA, r=16, α=32 (Tinker default of 2×rank; the training config set rank only), dropout=0.0. Trained on all-linear modules (attention + MLP experts + unembedding); the exported PEFT target_modules resolve to q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, lm_head.
Optimizer: Adam (β1=0.9, β2=0.95, ε=1e-8), gradient clip 1.0, weight decay 0.0
Learning rate: 2e-4, linear warmup 100 steps → linear decay
Epochs: 2 (994 steps, batch size 32, max sequence length 512 — corpus p99 is ~240 tokens)
Effective training tokens: ~3.98M; 31,743 loss tokens
Wall time: 1h06m on Tinker infrastructure
The training data contains no chain-of-thought traces; the model was fine-tuned to answer directly in the final channel without using the analysis channel.
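The step count and learning-rate schedule listed above can be reproduced with a short sketch. Several details are assumptions rather than facts from the training config: that the final partial batch of each epoch is dropped (inferred because 2 × floor(15,905 / 32) = 994), that warmup starts near zero, and that the linear decay ends at zero. The function names are hypothetical.

```python
def steps_per_run(n_train: int = 15_905, batch_size: int = 32, epochs: int = 2) -> int:
    """Optimizer steps, assuming the final partial batch of each epoch is dropped."""
    return (n_train // batch_size) * epochs


def lr_at(step: int, peak: float = 2e-4, warmup: int = 100, total: int = 994) -> float:
    """Linear warmup to `peak`, then linear decay (assumed to reach zero at `total`)."""
    if step < warmup:
        return peak * (step + 1) / warmup
    return peak * max(0.0, (total - step) / (total - warmup))


print(steps_per_run())  # 994
for step in (0, 99, 547, 993):
    print(step, lr_at(step))
```

Under these assumptions the rate peaks at 2e-4 by step 99 and has fallen to half of that by step 547, the midpoint of the decay phase.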

Why these hyperparameters

A four-variant single-axis ablation chose the production checkpoint. Test and held-out NLL values are computed on the inference path (compute_logprobs), not the training-time forward/backward path:

variant	change vs v1	test NLL	held-out NLL	verdict
v1	baseline (r=16, 2 epochs, lr 1e-4)	2.004	1.990	baseline
v2	rank 16 → 32	2.003	2.006	rank capacity not the bottleneck
v3	2 → 3 epochs	2.081	2.039	overfit on the inference path
v4	lr 1e-4 → 2e-4	1.987	1.987	winner

v4 had the lowest NLL on both test and held-out (no overfitting signature) and is what this repo publishes.
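For readers unfamiliar with the metric, the sketch below shows one common way to turn per-token log-probabilities into an NLL figure. It assumes the reported numbers are token-level averages in nats over answer tokens, which the card does not state explicitly. The function name and input data are hypothetical, and the sketch does not call the Tinker API.

```python
from typing import Sequence


def mean_token_nll(logprobs_per_example: Sequence[Sequence[float]]) -> float:
    """Average negative log-probability over all answer tokens (in nats)."""
    total, count = 0.0, 0
    for logprobs in logprobs_per_example:
        total -= sum(logprobs)
        count += len(logprobs)
    if count == 0:
        raise ValueError("no tokens to score")
    return total / count


# Hypothetical per-token log-probabilities for two answers.
examples = [[-2.0, -1.0, -3.0], [-2.0]]
print(mean_token_nll(examples))  # 2.0
```

Averaging over tokens rather than over examples keeps long answers from being under-weighted; whichever convention is used, it has to be the same across variants for the comparison to be meaningful.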

Evaluation

Eval compares v4 against the un-adapted base model (gpt-oss-20b, no adapter, same renderer, same 300-token budget). For every test example, both models generate one answer; GPT-5.4 then judges the pair side-by-side on three 1–5 axes plus a winner.

Aggregate scores (GPT-5.4 judge, 1–5 each axis; delta = v4 − baseline)

axis	split	v4	baseline	delta	95% CI on delta
factual_accuracy	test	2.719	1.081	+1.638	[+1.585, +1.691]
completeness	test	2.340	1.061	+1.279	[+1.231, +1.324]
grounding	test	2.613	1.054	+1.559	[+1.504, +1.611]
factual_accuracy	held-out	2.200	1.000	+1.200	[+0.667, +1.800]
completeness	held-out	1.733	1.000	+0.733	[+0.333, +1.267]
grounding	held-out	2.200	1.000	+1.200	[+0.667, +1.800]
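The card does not say how the confidence intervals on the deltas were computed. A percentile bootstrap over paired score differences is one plausible method, sketched below with made-up scores. The function name, resample count, and seed are assumptions.

```python
import random
from statistics import mean


def bootstrap_delta_ci(v4_scores, base_scores, n_resamples=2000, seed=0):
    """Percentile bootstrap 95% CI for the mean paired difference v4 - baseline."""
    deltas = [a - b for a, b in zip(v4_scores, base_scores)]
    rng = random.Random(seed)
    n = len(deltas)
    # Resample the paired differences with replacement and record each resample mean.
    means = sorted(mean(rng.choices(deltas, k=n)) for _ in range(n_resamples))
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return mean(deltas), lo, hi


# Hypothetical 1-5 judge scores for 15 paired examples (not the real eval data).
v4 = [3, 2, 2, 1, 3, 2, 4, 2, 1, 2, 3, 2, 2, 1, 3]
base = [1] * 15
point, lo, hi = bootstrap_delta_ci(v4, base)
print(f"delta={point:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

Resampling the paired differences, rather than each model's scores independently, preserves the per-example pairing that the side-by-side judging setup creates.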

Win rates

eval set	n	v4 wins	baseline wins	tie	95% CI on v4 win rate
test (original)	1,873	82.6%	1.3%	16.1%	80.9–84.4%
test (bias-corrected)	300	80.7%	1.7%	17.7%	76.3–85.0%
held-out source	15	66.7%	0.0%	33.3%	40.0–93.3%

Bias correction: a 300-example stratified sample (75 per question type) was re-judged with model positions swapped. The position-A preference rate was 48.5% — indistinguishable from chance — and within-example agreement between original and swapped judgments was 94.7%. The original win rate is therefore not materially inflated by position bias.
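The two statistics in this check can be sketched as below. The exact definitions used for the 48.5% and 94.7% figures are not given here, so the sketch assumes the position-A preference rate is computed over non-tie verdicts from both passes, and agreement is computed per example. All field names and the toy records are hypothetical.

```python
def position_bias_stats(records):
    """Return (position-A preference rate, original-vs-swapped agreement rate).

    Each record holds the winner by model name ("v4", "baseline", or "tie") for the
    original and swapped passes, plus which model sat in position A in each pass.
    """
    decided = 0   # non-tie verdicts across both passes
    picked_a = 0  # verdicts that chose whichever model was shown in position A
    agree = 0     # examples where both passes named the same winner
    for r in records:
        for winner, in_a in ((r["orig_winner"], r["orig_a"]), (r["swap_winner"], r["swap_a"])):
            if winner != "tie":
                decided += 1
                picked_a += winner == in_a
        agree += r["orig_winner"] == r["swap_winner"]
    if decided == 0:
        raise ValueError("no non-tie verdicts to score")
    return picked_a / decided, agree / len(records)


records = [
    {"orig_winner": "v4", "orig_a": "v4", "swap_winner": "v4", "swap_a": "baseline"},
    {"orig_winner": "v4", "orig_a": "baseline", "swap_winner": "v4", "swap_a": "v4"},
    {"orig_winner": "tie", "orig_a": "v4", "swap_winner": "tie", "swap_a": "baseline"},
    # A position-biased verdict: the judge picks position A in both passes.
    {"orig_winner": "v4", "orig_a": "v4", "swap_winner": "baseline", "swap_a": "baseline"},
]
a_rate, agreement = position_bias_stats(records)
print(f"position-A rate={a_rate:.3f}, agreement={agreement:.2f}")
```

A judge with no position bias picks position A about half the time once positions are balanced, so a rate near 50% together with high agreement is the pattern the paragraph above describes.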

Per-question-type win rates (test)

question_type	n	v4 win rate
conceptual	480	90.0%
definitional	434	86.4%
procedural	455	81.5%
factual	504	73.2%

Limitations and known failure modes

Specificity drift. v4 reliably gets the shape of an answer right but routinely drops doctrinal specifics: named codes, specific cargo categories, certificate-lookup procedures. Judge scores cluster at 2–3 on the factual axis for this reason. Treat outputs as a starting point that needs verification against the source document, not a finished answer.
Baseline budget exhaustion confounds the headline win rate. The un-adapted base model exhausts its 300-token budget inside the analysis channel 91.8% of the time on this rubric, producing no final answer. The 82.6% win rate includes this effect. Restricted to the 154 test cases where the baseline produced a final answer, v4 still wins 77.9% — that is the cleaner doctrine-knowledge signal.
Held-out-source generalization is anecdotal. Only 15 held-out examples; the wide CI (40.0–93.3%) reflects this.
No safety / red-teaming pass. This adapter has not been red-teamed for refusal robustness, jailbreak resistance, or capability misuse. It is a research model.
No CoT in training data. The training Q&A pairs are direct answers without reasoning traces. The adapter has effectively learned to suppress the analysis channel (mean output dropped from ~297 tokens for the base model to ~49 tokens for v4). If you need reasoning behavior, this is not the adapter for you.
English only. Training data is English-only.
Doctrine evolves. This snapshot reflects publications available as of May 2026. Some targeted sources (several USMC MCDPs, several AFDPs, some CRS reports) were blocked by host-level WAFs during corpus collection and are not represented.
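The budget-exhaustion confound noted above can be controlled for by recomputing the win rate on the subset of pairs where the baseline actually produced a final answer. The sketch below shows that filter on toy records; the field names and the function name are hypothetical and the data is not the real eval output.

```python
def win_rate(records, only_answered_baseline=False):
    """Share of judged pairs won by v4, optionally restricted to pairs where the
    baseline emitted a final answer instead of exhausting its token budget."""
    if only_answered_baseline:
        records = [r for r in records if r["baseline_has_final"]]
    if not records:
        raise ValueError("no records to score")
    return sum(r["winner"] == "v4" for r in records) / len(records)


# Toy records; field names are hypothetical.
records = (
    [{"winner": "v4", "baseline_has_final": False}] * 6
    + [{"winner": "tie", "baseline_has_final": False}] * 2
    + [{"winner": "v4", "baseline_has_final": True}] * 1
    + [{"winner": "baseline", "baseline_has_final": True}] * 1
)
print(win_rate(records), win_rate(records, only_answered_baseline=True))  # 0.7 0.5
```

Reporting both numbers separates "v4 answers when the baseline does not" from "v4 answers better when both do", which are different claims.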

Intended use and out-of-scope use

Intended: research on domain-specific SFT methodology; studying how a 20B MoE responds to LoRA fine-tuning on technical-document Q&A; blog/paper material.

Out-of-scope: operational military decision support; training material substituting for doctrine itself; any claim of doctrinal authority. The model is a paraphrase generator over public doctrine; it is not doctrine.

Methodological findings worth noting

Training-time val NLL is on a different scale than inference-path NLL. v4's training-time val NLL (1.28, forward/backward path) sat ~0.71 nats below its inference-path test NLL (1.99, compute_logprobs) on the same checkpoint. Do not use training-time val NLL for cross-variant decisions — always re-evaluate via the inference path.
Eval NLL did not track judge-perceived quality. Procedural questions had some of the highest NLL values but were not the lowest-scoring under the judge; factual questions were the lowest-scoring (73.2% v4 win rate, highest tie rate).
The largest practical win from SFT on a reasoning-tuned base with non-CoT data was teaching the model to skip the analysis channel entirely (mean output 297 → 49 tokens, quality up).
Position bias in LLM-as-judge, a known concern in free-form preference judging, did not appear under the strict structured-rubric judging used here. On a 300-example randomized-position re-judging pass, the position-A preference rate was 48.5% (indistinguishable from chance). This is a single-judge, single-rubric result and may not generalize.

Reproducibility

Training/eval journals: docs/journal/01..11 in the project repository (private at time of release), covering data inventory, corpus expansion, cleaning/chunking, quality filtering, Q&A generation, ChatML formatting, the SFT run, the ablation sweep, the full eval, the position-bias control, and this HF push.
Base model: openai/gpt-oss-20b
Adapter export: tinker-cookbook build_lora_adapter (Tinker LoRA → PEFT format).

Citation

@misc{gpt-oss-20b-uav-doctrine,
  author = {meftun},
  title = {gpt-oss-20b-uav-doctrine: LoRA adapter for UAV combat doctrine Q&A},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/meftun/gpt-oss-20b-uav-doctrine}
}

Acknowledgments

OpenAI for the open-weight gpt-oss-20b base model
Thinking Machines Lab for the Tinker training API
Authors and publishers of the source doctrinal documents (full list in docs/journal/02-data-expansion.md)
