mistral-axel-1

A LoRA adapter that teaches Ministral-3-8B-Instruct how to write COBOL. It lifts CobolEval pass@5 from 0.68% to 21.23% (+20.55pp), a ~31× improvement over the base model on a 146-task HumanEval-style COBOL benchmark.

TL;DR

Property	Value
Base model	`unsloth/Ministral-3-8B-Instruct-2512` (BF16; trained on 4-bit BnB quant)
Adapter type	LoRA (PEFT)
Rank / α	16 / 16
Trainable params	~28 M (≈ 0.35 % of base)
Adapter size	107 MB (`adapter_model.safetensors`, 812 tensors, BF16)
Training data	105 examples of COBOL string-processing (custom, see below)
Training time	65 s on 1× H100 80GB (Modal)
Task	COBOL code completion in fixed-format (Area B / column 7)
License	Apache-2.0 (matches base model)

What problem does it solve?

The base Ministral-3-8B-Instruct-2512 scores 0.68% pass@5 on CobolEval: out of 146 tasks, exactly one succeeds. The failure mode is almost universal. The model starts its COBOL at column 1, inside the sequence-number area (columns 1-6), instead of indenting past the indicator column (column 7) into Area A/B, so GnuCOBOL's fixed-format parser rejects nearly every program before tests can even run.

Compile rate on CobolEval (sample 1 of 5): base ≈ 5.5% → axel-1 ≈ 63.7%

A 105-example LoRA covering the column convention plus a few string-processing idioms is enough to fix this, without touching the base weights and without a measured regression on Component C.
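The column failure described above can be caught before the compiler is ever invoked. The sketch below is a heuristic, not GnuCOBOL's actual parser. It assumes the standard fixed-format layout (columns 1-6 sequence area, column 7 indicator, program text from column 8 onward) and flags lines whose first seven columns hold program text. The function name and the accepted indicator set are illustrative assumptions, and tabs are not expanded.

```python
# Heuristic pre-check for fixed-format COBOL (a sketch, stricter than the compiler).
INDICATORS = set(" *-/Dd")  # assumed set of acceptable column-7 indicator characters


def fixed_format_violations(source: str) -> list[int]:
    """Return 1-based line numbers whose columns 1-7 look like program text."""
    bad = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if not line.strip():
            continue  # blank lines are always fine
        padded = line.ljust(7)
        sequence_area, indicator = padded[:6], padded[6]
        # Assumption: the sequence area should be blank or numeric.
        sequence_ok = all(ch == " " or ch.isdigit() for ch in sequence_area)
        if not sequence_ok or indicator not in INDICATORS:
            bad.append(lineno)
    return bad


flush_left = "IDENTIFICATION DIVISION.\nPROGRAM-ID. DEMO."
indented = "       IDENTIFICATION DIVISION.\n      * a comment\n           DISPLAY 'HI'."
print(fixed_format_violations(flush_left))  # [1, 2]
print(fixed_format_violations(indented))    # []
```

A check like this is cheap enough to run on every sample before compilation, which separates formatting failures from genuine syntax or logic errors.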

The benchmarks (one line each)

CobolEval — a 146-task HumanEval-style benchmark for COBOL code completion. Each task is a function signature in a COBOL skeleton; the model fills in the body. Compiled with GnuCOBOL, run against held-out test stdins.
Component B — 24 hand-curated COBOL → Java translation tasks across 8 hard mainframe failure modes (comp3_precision, OCCURS DEPENDING, REDEFINES, PIC formatting…). Frontier-only difficulty: Devstral-2 (123B) scores 8.3% pass@5.
Component C — 107 easier COBOL → Java tasks across 16 string-processing failure modes (case change, char check, palindrome, concat, accumulator, arithmetic…). Calibrated for small-model discrimination: Ministral-3-8B base scores 24.3%, Mistral-Medium 89.5%.
Component D — 18 medium-difficulty COBOL → Java tasks across 8 skills (loop accumulate-filter, multi-format input, string search, boundary branches…). Sits between B and C in difficulty.

All four use the same Modal inference harness; only the LoRA flag changes between base / axel-1 rows.

Headline evaluation (pass@5, temperature 0.7, k=5)

Benchmark	n	Base	mistral-axel-1	Δ vs base
CobolEval (HumanEval-style COBOL)	146	0.68%	21.23%	+20.55pp
Component C (string-processing, 16 failure modes)	107	24.30%	24.30%	+0.00pp

For reference, on the same harness:

Mistral-Medium: 30.5% CobolEval
Devstral-2 (123B): 31.6% CobolEval
Claude Sonnet 4.6: 65.8% CobolEval

A 107 MB LoRA on an 8B model closes roughly two thirds of the gap between the base 8B model and Mistral-Medium on CobolEval.
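The card does not say which pass@k estimator the harness uses. The sketch below assumes the standard unbiased estimator from HumanEval-style evaluation. With n = k = 5 samples per task it reduces to "at least one of the five samples passed", so a 21.23% score is arithmetically consistent with 31 of 146 tasks solved. The per-task counts in the demo are made up to reproduce that figure; they are not the real results.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts: list[int], n: int, k: int) -> float:
    """Average the per-task estimate over the whole benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Illustrative counts only: 31 tasks with one passing sample, 115 with none.
counts = [1] * 31 + [0] * 115
print(f"{100 * benchmark_pass_at_k(counts, n=5, k=5):.2f}%")  # 21.23%
```

The same function with k=1 gives the pass@1 figure from the same five samples, which is why pass@1 can sit well below pass@5.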

Why Component C is flat

Component C measures something different from CobolEval. Its 16 failure modes (case change, char check, palindrome, concat, accumulator, arithmetic…) are COBOL → Java translation tasks, where the fixed-format column convention is not the bottleneck. Most of axel-1's CobolEval lift comes from teaching column formatting, which Component C doesn't need. Net effect on Component C: zero, neither a lift nor a regression. As far as Component C can show, the training did not damage capabilities outside the SFT distribution.

Training data

105 examples of COBOL string-processing instruction-completion pairs (95 train / 10 val), built specifically to address CobolEval failure modes.

sub-skill	count	source mix
`case_change` (upper/lower transforms)	26	synthetic + 8 real CobolEval
`char_check` (per-char predicates)	27	synthetic + 2 real
`palindrome` (string reversal / detection)	24	synthetic + 1 real
`string_concat` (build, join, append)	23	synthetic + 11 real
non-string holdouts (anti-forgetting)	15	from existing CobolEval coverage

Composition by provenance: 68 synthetic (Claude Opus reasoning in-session, validated locally with cobc), 12 real CobolEval, 15 holdout, 10 hand-written edge cases. Every example compiles under GnuCOBOL fixed-format and has at least one passing reference test (9/9 verifier passes; 0 rejected).

No API spend on data generation. Synthesis was zero-cost reasoning; validation was a local cobc toolchain.
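A local compile gate of the kind described above might look like the following sketch. It is an assumption about the shape of the check, not the author's verifier: it only runs GnuCOBOL's syntax pass in fixed format (`cobc -fixed -fsyntax-only`) and does not execute reference tests. The function names are hypothetical, and the check returns `None` when `cobc` is not installed.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def syntax_check_command(path) -> list[str]:
    """Build the cobc invocation: fixed source format, syntax check only."""
    return ["cobc", "-fixed", "-fsyntax-only", str(path)]


def compiles_fixed_format(source: str):
    """Return True/False for the syntax check, or None if cobc is unavailable."""
    if shutil.which("cobc") is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "example.cob"
        path.write_text(source)
        result = subprocess.run(
            syntax_check_command(path), capture_output=True, text=True
        )
        return result.returncode == 0


def keep_valid(examples: list[str]) -> list[str]:
    """Drop examples that fail the syntax check; keep everything if cobc is missing."""
    return [src for src in examples if compiles_fixed_format(src) is not False]
```

Filtering on a compiler exit code keeps the data pipeline free of API calls, in line with the zero-cost claim above.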

Training procedure

# modal_app/finetune.py (single H100 80GB)
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,                       # Ministral-3-8B-Instruct, 4-bit BnB quant
    args=SFTConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=1,
        max_steps=40,
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        seed=12627998,
    ),
    peft_config=LoraConfig(
        r=16, lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.0, bias="none",
    ),
)

Framework: Unsloth + TRL SFTTrainer, PEFT 0.19.1, Transformers 5.8
Steps: 40 (~1.7 epochs over 95 train examples)
Precision: BF16 weights, 4-bit BnB during training, BF16 adapter on disk
Hardware: 1× NVIDIA H100 80GB on Modal
Wall clock: 65 seconds
Estimated cost: < $0.50 of compute
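The epoch figure above follows directly from the hyperparameters. The short calculation below restates that arithmetic using only values from this card; it also shows the LoRA scaling factor implied by rank 16 and α 16 under the standard `lora_alpha / r` convention (assumed here, since the config does not enable any alternative scaling).

```python
# Values copied from the training configuration on this card.
per_device_batch = 4
grad_accum = 1
max_steps = 40
train_examples = 95
lora_r, lora_alpha = 16, 16

# One GPU, so the effective batch is per_device_batch * grad_accum.
examples_seen = per_device_batch * grad_accum * max_steps
epochs = examples_seen / train_examples

# Standard LoRA scaling applied to the low-rank update.
lora_scale = lora_alpha / lora_r

print(examples_seen)     # 160
print(round(epochs, 2))  # 1.68
print(lora_scale)        # 1.0
```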

Usage

This is a PEFT/LoRA adapter — you load the base model first, then apply the adapter on top.

With PEFT + Transformers

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Ministral-3-8B-Instruct-2512",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "axeltta/mistral-axel-1")
tokenizer = AutoTokenizer.from_pretrained("axeltta/mistral-axel-1")

prompt = "Write a COBOL program that reads a string from stdin and prints it reversed."
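inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_dict=True,       # dict with input_ids and attention_mask
    return_tensors="pt",
).to(model.device)

# Unpack the dict so generate() receives input_ids and attention_mask by name
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)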
print(tokenizer.decode(out[0], skip_special_tokens=True))

With Unsloth (faster, 4-bit)

from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="axeltta/mistral-axel-1",
    max_seq_length=2048,
    load_in_4bit=True,
)

With vLLM

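pip install vllm
# This repo holds only the adapter: serve the base model and attach the LoRA.
vllm serve mistralai/Ministral-3-8B-Instruct-2512 \
    --enable-lora --lora-modules mistral-axel-1=axeltta/mistral-axel-1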

Intended use

Primary: COBOL code completion / generation, especially fixed-format programs.
Secondary: Research baseline for low-data domain SFT on legacy programming languages.
Companion adapter: axeltta/mistral-axel-2 — same base, trained for COBOL→Java translation (mixed results, see its card).

Limitations and risks

Narrow training distribution. 105 examples is very small; the lift comes mostly from teaching column convention, not deep COBOL semantics. The model can still generate plausible-looking but incorrect business logic.
Not for production COBOL. This is research-grade. Always compile, run real tests, and have a human review the output before touching mainframe code.
No safety tuning beyond the base model. Inherits all biases and risks of Ministral-3-8B-Instruct-2512.
CobolEval lift is partly compilation, not full correctness. pass@5 of 21.23% is real, but compile rate (~64%) is higher than pass rate — many programs compile and run yet produce the wrong output.
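The gap between compile rate and pass rate noted above is easy to quantify once each sample is recorded with both outcomes. The sketch below assumes a hypothetical record shape (`SampleResult`); the harness's real output format is not documented on this card, and the demo rows are invented.

```python
from dataclasses import dataclass


@dataclass
class SampleResult:
    """Hypothetical per-sample record: did it compile, and did it pass the tests?"""
    task_id: str
    compiled: bool
    passed: bool


def rates(results: list[SampleResult]) -> tuple[float, float]:
    """Return (compile_rate, pass_rate) over one sample per task."""
    n = len(results)
    compile_rate = sum(r.compiled for r in results) / n
    # A sample can only count as passing if it compiled in the first place.
    pass_rate = sum(r.compiled and r.passed for r in results) / n
    return compile_rate, pass_rate


demo = [
    SampleResult("t1", True, True),
    SampleResult("t2", True, False),
    SampleResult("t3", True, False),
    SampleResult("t4", False, False),
]
print(rates(demo))  # (0.75, 0.25)
```

The difference between the two rates is the share of programs that compile and run yet produce the wrong output.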

Reproducibility

All artifacts in this repo are deterministic byproducts of:

seed: 12627998
95 train / 10 val examples (sha256 in funnex_manifest.json)
40 steps with the hyperparameters above

A second training run from the same seed produced byte-identical adapter weights (verified during the v1 audit).

Upload was independently verified on Modal: a fresh snapshot_download of this repo loaded via PEFT against base BF16 and emitted valid Area-B COBOL on one generate call.
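A byte-identical claim like the one above can be checked by comparing SHA-256 digests of the two adapter files. The sketch below shows one way to do that with the standard library; the helper names are illustrative, and the layout of funnex_manifest.json is not documented here, so the sketch does not try to parse it.

```python
import hashlib


def sha256_of(path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large safetensors files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(path_a, path_b) -> bool:
    """True when both files have the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

For example, `same_bytes("run1/adapter_model.safetensors", "run2/adapter_model.safetensors")` would compare the outputs of two training runs; both paths are placeholders.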

Citation

@software{mistral_axel_1_2026,
  author = {Axelsson, A.},
  title  = {mistral-axel-1: A LoRA adapter for COBOL code completion on Ministral-3-8B-Instruct},
  year   = {2026},
  url    = {https://huggingface.co/axeltta/mistral-axel-1},
}

Framework versions

PEFT 0.19.1
Transformers 5.8
TRL (SFTTrainer)
Unsloth (training pipeline)
PyTorch ≥ 2.7

Evaluation results (self-reported)

Metric	Benchmark	Value (%)
pass@5	CobolEval (HumanEval-style COBOL, 146 tasks)	21.23
pass@1	CobolEval (HumanEval-style COBOL, 146 tasks)	9.04
pass@5	Component C (107 tasks, 16 failure modes)	24.30
pass@1	Component C (107 tasks, 16 failure modes)	6.73