gpt-oss-20b-uav-doctrine (LoRA adapter)
LoRA adapter fine-tuning openai/gpt-oss-20b for Q&A on Unmanned Aerial Vehicle combat doctrine, drawing on US military, NATO, and allied publicly-released doctrinal publications.
Live demo: Try it on HuggingFace Spaces. Ask v4 a UAV-doctrine question, or compare it side by side with the un-adapted base model.
⚠️ Scope and intent. This is a research artifact for studying domain-specific SFT on a 20B MoE model. It is trained exclusively on publicly-released, unclassified doctrinal publications and academic / think-tank analysis. It must not be used for operational decision-making. Outputs may be confidently wrong; doctrine evolves and this adapter is a snapshot.
TL;DR
- Base: openai/gpt-oss-20b (MoE, 32 experts)
- Method: LoRA, r=16, α=32 (scaling 2.0), dropout=0.0, lr=2e-4, 2 epochs, gpt_oss_no_sysprompt harmony renderer (answers are emitted directly in the final channel, with no analysis/CoT trace)
- Training data: 18,730 synthetic Q&A pairs derived from a 58-document public-domain UAV/UAS doctrine corpus
- Eval: 82.6% win rate (95% CI 80.9–84.4%) over the un-adapted base model on a 1,873-example held-out test set, judged by GPT-5.4 on factual_accuracy / completeness / grounding (1–5). Position-bias-corrected via randomized-position re-judging: 80.7% (95% CI 76.3–85.0%).
- Adapter size: ~895 MB (fp32 LoRA tensors; 4,802 tensors across attention q/k/v/o, all 32 per-layer experts' gate/up/down, and lm_head)
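The tensor count in the last bullet can be sanity-checked with a little arithmetic. The sketch below assumes that gpt-oss-20b has 24 transformer layers and that each adapted module contributes two tensors (a LoRA A matrix and a LoRA B matrix). Neither figure is stated in this card, so treat both as assumptions.

```python
# Assumptions (not stated in the card): 24 transformer layers in
# gpt-oss-20b, and two LoRA tensors (A and B) per adapted module.
NUM_LAYERS = 24
NUM_EXPERTS = 32          # from the card: 32 experts per layer
TENSORS_PER_MODULE = 2    # one A matrix and one B matrix

attention_modules = NUM_LAYERS * 4               # q, k, v, o projections
expert_modules = NUM_LAYERS * NUM_EXPERTS * 3    # gate, up, down per expert
lm_head_modules = 1

total_tensors = TENSORS_PER_MODULE * (
    attention_modules + expert_modules + lm_head_modules
)
print(total_tensors)  # 4802
```

Under those assumptions the count matches the 4,802 tensors reported above, and almost all of them belong to the per-expert projections.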
Quick start
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
BASE = "openai/gpt-oss-20b"
ADAPTER = "meftun/gpt-oss-20b-uav-doctrine"
# System prompt MUST match the one used during training.
SYSTEM_PROMPT = (
    "You are an expert assistant on Unmanned Aerial Vehicle (UAV) combat "
    "doctrine. Provide accurate, concise answers grounded in US, NATO, and "
    "allied doctrinal publications."
)
tok = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What is operational design in joint planning?"},
]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=300, temperature=0.3, do_sample=True)
print(tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True))
# Sample answer (v4, temperature=0.3):
# "Operational design is the process of developing a campaign or operation
# plan by understanding the problem, visualizing the desired end state, and
# determining the steps needed to get there."
Loading note (MoE LoRA). This adapter places LoRA on the per-expert projections (experts.{0..31}.{gate,up,down}_proj). The attention and lm_head LoRA load cleanly under vanilla peft + transformers. The per-expert LoRA is in the layout produced by tinker-cookbook's build_lora_adapter, intended for serving stacks with native per-expert LoRA (vLLM --lora-modules, SGLang --lora-paths). If your runtime represents gpt-oss experts as a single fused module, you may need to merge the adapter or use a serving framework that supports per-expert LoRA. The model was trained and evaluated through the Tinker sampling API.
Training data
The corpus is 58 publicly-released documents (~3.39M extracted tokens, ~3.22M after cleaning, 95.2% retention). It combines a 30-document Phase-1 baseline with a documented 28-document expansion. The expansion spans four tiers:
- Tier 1 (official doctrine): 12 documents. US Army FMs/ATPs (FM 3-60 Targeting, FM 3-81, ATP 3-04.x), UK JDPs/JDNs (JDP 2-00, JDN 3/19), USAF AFDPs (3-01 Counterair, 3-03 Counterland), NATO AJPs (3.9 Targeting, 3.10 Info Ops, 3.20 Cyberspace).
- Tier 2 (academic & war college): 5 documents. NPS theses (swarm-vs-swarm, MAGTF swarm comms), SAMS monograph, DTIC, HDIAC.
- Tier 3 (think tank): 6 documents. CRS (R48477 DoD C-UAS, IN12661 law-enforcement C-UAS), ICRC AWS/IHL position papers (2022, 2025, 2026), CNAS ethical-autonomy paper.
- Tier 4 (modern case studies): 5 documents. RUSI Ukraine lessons (2022, 2024, 2025), US Army Infantry Magazine, LIIA drone-revolution paper.
The 30-document baseline additionally covers US Joint Pubs (JP 3-0, 3-01, 3-09, 3-30, 3-52, 3-60, 3-85, 5-0, 2-0, 3-13.1), more NATO AJPs, UK air-power doctrine, EASA/ICAO/FAA civil-aviation UAS material, the DoD Dictionary, and DoDD 3000.09. All source documents are publicly available; the corpus contains no classified, FOUO, CUI, or NOFORN material.
Documents were cleaned (header/footer detection, hyphenation repair, ligature normalization) and chunked at a 1,200-token target with 150-token overlap (max 1,500) using the openai/gpt-oss-20b tokenizer, producing 4,051 chunks. A multi-signal quality filter (TOC, bibliography, low-content, page-ref, cover-metadata detectors; every detector except the <3-sentence floor requires ≥2 independent signals) passed 3,747 / 4,051 chunks (92.5%); only passing chunks were used for Q&A generation.
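The overlapping-window idea behind the chunking step can be sketched as follows. This is a simplified illustration, not the project's chunker: it cuts fixed 1,200-token windows with a 150-token overlap and ignores the 1,500-token maximum, which presumably lets the real pipeline extend a chunk to a natural boundary. The function name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, target=1200, overlap=150):
    """Split a token list into windows of `target` tokens that share
    `overlap` tokens with their predecessor (simplified sketch)."""
    if not 0 <= overlap < target:
        raise ValueError("overlap must be non-negative and smaller than target")
    stride = target - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + target])
        if start + target >= len(tokens):
            break  # this window already reached the end of the document
        start += stride
    return chunks


# Usage with stand-in token ids; a real run would use the tokenizer's ids.
demo_chunks = chunk_tokens(list(range(3000)))
print([len(c) for c in demo_chunks])  # [1200, 1200, 900]
```

Each chunk after the first starts with the last 150 tokens of the previous one, so sentences that straddle a boundary appear intact in at least one chunk.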
GPT-5.4 (via Azure OpenAI) generated 5 Q&A pairs per passing chunk under a strict rubric (no yes/no, no document-structure references, paraphrase rather than verbatim quote). Yield: 18,730 pairs (one chunk of JP 3-09 Joint Fire Support material was lost to the Azure content filter). Question types are balanced: 26.8% factual / 23.5% definitional / 24.6% procedural / 25.1% conceptual.
Train/val/test split is chunk-level (group-aware), 85 / 5 / 10: 15,905 train / 935 val / 1,875 test pairs. Two entire small sources were held out as an additional out-of-distribution eval: crs_2026_law_enforcement_cuas_IN12661 (counter-UAS) and faa_notice_uas_lost_link (lost-link procedures), 15 pairs total. Chunk-level splitting keeps every pair generated from a given chunk in the same split, so no chunk contributes pairs to more than one split.
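A group-aware split of this kind can be sketched in a few lines. The code below is an illustration under assumptions, not the project's splitter: the `chunk_id` field, the function name, and the shuffle-then-slice strategy are all hypothetical.

```python
import random
from collections import defaultdict


def split_by_chunk(pairs, fractions=(0.85, 0.05, 0.10), seed=0):
    """Assign whole chunks (not individual pairs) to train/val/test."""
    by_chunk = defaultdict(list)
    for pair in pairs:
        by_chunk[pair["chunk_id"]].append(pair)

    chunk_ids = sorted(by_chunk)              # deterministic starting order
    random.Random(seed).shuffle(chunk_ids)    # seeded, reproducible shuffle

    n_train = round(fractions[0] * len(chunk_ids))
    n_val = round(fractions[1] * len(chunk_ids))
    groups = {
        "train": chunk_ids[:n_train],
        "val": chunk_ids[n_train:n_train + n_val],
        "test": chunk_ids[n_train + n_val:],  # remainder goes to test
    }
    return {
        name: [pair for cid in ids for pair in by_chunk[cid]]
        for name, ids in groups.items()
    }
```

Because the assignment happens at the chunk level, all five pairs generated from one chunk always travel together, which is the property the paragraph above relies on.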
Training procedure
- Platform: Tinker (Thinking Machines Lab), managed LoRA training on distributed GPUs
- Method: LoRA, r=16, α=32 (Tinker default of 2×rank; the training config set rank only), dropout=0.0. Trained on all-linear modules (attention + MLP experts + unembedding); the exported PEFT target_modules resolve to q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, lm_head.
- Optimizer: Adam (β1=0.9, β2=0.95, ε=1e-8), gradient clip 1.0, weight decay 0.0
- Learning rate: 2e-4, linear warmup 100 steps, then linear decay
- Epochs: 2 (994 steps, batch size 32, max sequence length 512; corpus p99 is ~240 tokens)
- Effective training tokens: ~3.98M; 31,743 loss tokens
- Wall time: 1h06m on Tinker infrastructure
- The training data contains no chain-of-thought traces; the model was fine-tuned to answer directly in the final channel without using the analysis channel.
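The step count and learning-rate schedule listed above can be sketched as plain arithmetic. Two things here are assumptions rather than facts from the training config: that the last partial batch of each epoch is dropped (which is what makes 15,905 training pairs at batch size 32 come out to 994 steps), and that the linear decay runs to zero at the final step. The function name is hypothetical.

```python
TRAIN_PAIRS = 15_905
BATCH_SIZE = 32
EPOCHS = 2

# Assumption: the final partial batch of each epoch is dropped.
steps_per_epoch = TRAIN_PAIRS // BATCH_SIZE
total_steps = steps_per_epoch * EPOCHS
print(total_steps)  # 994


def learning_rate(step, peak_lr=2e-4, warmup_steps=100, total=total_steps):
    """Linear warmup to peak_lr, then linear decay (assumed to reach zero)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    remaining = max(total - step, 0)
    return peak_lr * remaining / (total - warmup_steps)
```

With these assumptions the peak of 2e-4 is reached at the end of warmup and held for exactly one step before the decay begins.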
Why these hyperparameters
A four-variant single-axis ablation chose the production checkpoint. Test/held-out NLL are computed on the inference path (compute_logprobs), not the training-time forward/backward path:
| variant | change vs v1 | test NLL | held-out NLL | verdict |
|---|---|---|---|---|
| v1 | baseline (r=16, 2 epochs, lr 1e-4) | 2.004 | 1.990 | baseline |
| v2 | rank 16 → 32 | 2.003 | 2.006 | rank capacity not the bottleneck |
| v3 | 2 → 3 epochs | 2.081 | 2.039 | overfit on the inference path |
| v4 | lr 1e-4 → 2e-4 | 1.987 | 1.987 | winner |
v4 had the lowest NLL on both test and held-out (no overfitting signature) and is what this repo publishes.
Evaluation
Eval compares v4 against the un-adapted base model (gpt-oss-20b, no adapter, same renderer, same 300-token budget). For every test example, both models generate one answer; GPT-5.4 then judges the pair side-by-side on three 1–5 axes plus a winner.
Aggregate scores (GPT-5.4 judge, 1–5 each axis; delta = v4 minus baseline)
| axis | split | v4 | baseline | delta | 95% CI on delta |
|---|---|---|---|---|---|
| factual_accuracy | test | 2.719 | 1.081 | +1.638 | [+1.585, +1.691] |
| completeness | test | 2.340 | 1.061 | +1.279 | [+1.231, +1.324] |
| grounding | test | 2.613 | 1.054 | +1.559 | [+1.504, +1.611] |
| factual_accuracy | held-out | 2.200 | 1.000 | +1.200 | [+0.667, +1.800] |
| completeness | held-out | 1.733 | 1.000 | +0.733 | [+0.333, +1.267] |
| grounding | held-out | 2.200 | 1.000 | +1.200 | [+0.667, +1.800] |
Win rates
| eval set | n | v4 wins | baseline wins | tie | 95% CI on v4 win rate |
|---|---|---|---|---|---|
| test (original) | 1,873 | 82.6% | 1.3% | 16.1% | 80.9–84.4% |
| test (bias-corrected) | 300 | 80.7% | 1.7% | 17.7% | 76.3–85.0% |
| held-out source | 15 | 66.7% | 0.0% | 33.3% | 40.0–93.3% |
Bias correction: a 300-example stratified sample (75 per question type) was re-judged with model positions swapped. The position-A preference rate was 48.5%, indistinguishable from chance, and within-example agreement between original and swapped judgments was 94.7%. The original win rate is therefore not materially inflated by position bias.
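The card does not say how the 95% intervals in the win-rate table were computed. One common choice is a percentile bootstrap over per-example outcomes. The sketch below assumes that method; the function name and the 0/1 outcome encoding (1 for a v4 win, 0 for a tie or a baseline win) are illustrative.

```python
import random


def bootstrap_win_rate_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for a win rate.

    outcomes: list of 0/1 flags, one per judged example.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample the examples with replacement and recompute the win rate.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Usage: 10 wins out of 15, the shape of the held-out row above.
held_out = [1] * 10 + [0] * 5
print(bootstrap_win_rate_ci(held_out))
```

With only 15 examples every resampled rate is a multiple of 1/15, which is why intervals on such a small set are both wide and coarse.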
Per-question-type win rates (test)
| question_type | n | v4 win rate |
|---|---|---|
| conceptual | 480 | 90.0% |
| definitional | 434 | 86.4% |
| procedural | 455 | 81.5% |
| factual | 504 | 73.2% |
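As a consistency check, the per-type rows above can be recombined into the headline figure by weighting each win rate by its example count. The snippet uses only numbers from the table; the rates are rounded to one decimal place in the source, so the result is approximate.

```python
# (n, v4 win rate) per question type, copied from the table above.
per_type = {
    "conceptual": (480, 0.900),
    "definitional": (434, 0.864),
    "procedural": (455, 0.815),
    "factual": (504, 0.732),
}

total_n = sum(n for n, _ in per_type.values())
overall = sum(n * rate for n, rate in per_type.values()) / total_n

print(total_n)                  # 1873
print(round(100 * overall, 1))  # 82.6
```

The counts sum to the 1,873 judged test examples and the weighted average reproduces the 82.6% overall win rate.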
Limitations and known failure modes
Specificity drift. v4 reliably gets the shape of an answer right but routinely drops doctrinal specifics: named codes, specific cargo categories, certificate-lookup procedures. Judge scores cluster at 2–3 on the factual axis for this reason. Treat outputs as a starting point that needs verification against the source document, not a finished answer.
Baseline budget exhaustion confounds the headline win rate. The un-adapted base model exhausts its 300-token budget inside the analysis channel 91.8% of the time on this rubric, producing no final answer. The 82.6% win rate includes this effect. Restricted to the 154 test cases where the baseline produced a final answer, v4 still wins 77.9%; that is the cleaner doctrine-knowledge signal.
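The subset size quoted above follows from the exhaustion rate. The short calculation below uses the percentages from this section; the win count it derives is implied by rounding, not reported in the card.

```python
TEST_EXAMPLES = 1873
EXHAUSTION_RATE = 0.918   # baseline runs out of budget before a final answer
SUBSET_WIN_RATE = 0.779   # v4 win rate where the baseline did answer

baseline_answered = round(TEST_EXAMPLES * (1 - EXHAUSTION_RATE))
implied_v4_wins = round(baseline_answered * SUBSET_WIN_RATE)  # derived, not reported

print(baseline_answered)  # 154
print(implied_v4_wins)    # 120
```

The cleaner signal therefore rests on about 8% of the test set, which is worth keeping in mind when comparing it with the headline number.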
Held-out-source generalization is anecdotal. Only 15 held-out examples; the wide CI (40.0–93.3%) reflects this.
No safety / red-teaming pass. This adapter has not been red-teamed for refusal robustness, jailbreak resistance, or capability misuse. It is a research model.
No CoT in training data. The training Q&A pairs are direct answers without reasoning traces. The adapter has effectively learned to suppress the analysis channel (mean output dropped from ~297 tokens for the base model to ~49 tokens for v4). If you need reasoning behavior, this is not the adapter for you.
English only. Training data is English-only.
Doctrine evolves. This snapshot reflects publications available as of May 2026. Some targeted sources (several USMC MCDPs, several AFDPs, some CRS reports) were blocked by host-level WAFs during corpus collection and are not represented.
Intended use and out-of-scope use
Intended: research on domain-specific SFT methodology; studying how a 20B MoE responds to LoRA fine-tuning on technical-document Q&A; blog/paper material.
Out-of-scope: operational military decision support; training material substituting for doctrine itself; any claim of doctrinal authority. The model is a paraphrase generator over public doctrine; it is not doctrine.
Methodological findings worth noting
- Training-time val NLL is on a different scale than inference-path NLL. v4's training-time val NLL (1.28, forward/backward path) sat ~0.71 nats below its inference-path test NLL (1.99, compute_logprobs) on the same checkpoint. Do not use training-time val NLL for cross-variant decisions; always re-evaluate via the inference path.
- Eval NLL did not track judge-perceived quality. Procedural questions had among the highest NLL but were not the lowest-scoring under the judge; factual questions were the lowest-scoring (73.2% v4 win rate, highest tie rate).
- The largest practical win from SFT on a reasoning-tuned base with non-CoT data was teaching the model to skip the analysis channel entirely (mean output 297 → 49 tokens, quality up).
- Position bias in LLM-as-judge appears to be a property of free-form preference judging that does not transfer to strict structured-rubric judging. On a 300-example randomized-position re-judging pass, the position-A preference rate was 48.5% (indistinguishable from chance).
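The two position-bias statistics mentioned in the last finding can be computed along these lines. This is a sketch under assumptions: the record layout (`v4_slot`, `picked`), the function names, and the choice to exclude ties from the position-A rate are all hypothetical, since the card does not define how its figures were derived.

```python
def model_verdict(judgment):
    """Translate a slot-level pick ("A", "B", "tie") into a model-level verdict."""
    if judgment["picked"] == "tie":
        return "tie"
    return "v4" if judgment["picked"] == judgment["v4_slot"] else "base"


def position_a_rate(judgments):
    """Share of non-tie judgments that picked slot A, whichever model sat there."""
    decided = [j for j in judgments if j["picked"] != "tie"]
    return sum(j["picked"] == "A" for j in decided) / len(decided)


def agreement_rate(original, swapped):
    """Share of examples whose model-level verdict survives the position swap."""
    same = sum(
        model_verdict(o) == model_verdict(s) for o, s in zip(original, swapped)
    )
    return same / len(original)
```

A position-A rate near 50% together with high agreement across the swap is the pattern that indicates the judge is responding to the answers rather than to their order.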
Reproducibility
- Training/eval journals: docs/journal/01..11 in the project repository (private at time of release), covering data inventory, corpus expansion, cleaning/chunking, quality filtering, Q&A generation, ChatML formatting, the SFT run, the ablation sweep, the full eval, the position-bias control, and this HF push.
- Base model: openai/gpt-oss-20b
- Adapter export: tinker-cookbook build_lora_adapter (Tinker LoRA → PEFT format).
Citation
@misc{gpt-oss-20b-uav-doctrine,
author = {meftun},
title = {gpt-oss-20b-uav-doctrine: LoRA adapter for UAV combat doctrine Q&A},
year = {2026},
publisher = {HuggingFace},
url = {https://huggingface.co/meftun/gpt-oss-20b-uav-doctrine}
}
Acknowledgments
- OpenAI for the open-weight gpt-oss-20b base model
- Thinking Machines Lab for the Tinker training API
- Authors and publishers of the source doctrinal documents (full list in docs/journal/02-data-expansion.md)