mistral-axel-1

A LoRA adapter that teaches Ministral-3-8B-Instruct how to write COBOL. Lifts CobolEval pass@5 from 0.68% → 21.23% (+20.55pp) — a ~31× improvement over the base model on a 146-task HumanEval-style COBOL benchmark.

[Figure: pass@5 across the four benchmarks (CobolEval, Components B, C, D)]


TL;DR

| Property | Value |
| --- | --- |
| Base model | unsloth/Ministral-3-8B-Instruct-2512 (BF16; trained on 4-bit BnB quant) |
| Adapter type | LoRA (PEFT) |
| Rank / α | 16 / 16 |
| Trainable params | ~28 M (≈ 0.35 % of base) |
| Adapter size | 107 MB (adapter_model.safetensors, 812 tensors, BF16) |
| Training data | 105 examples of COBOL string-processing (custom, see below) |
| Training time | 65 s on 1× H100 80GB (Modal) |
| Task | COBOL code completion in fixed-format (Area B / column 7) |
| License | Apache-2.0 (matches base model) |

What problem does it solve?

The base Ministral-3-8B-Instruct-2512 scores 0.68% pass@5 on CobolEval: out of 146 tasks, exactly one succeeds. The failure mode is almost universal. The model emits COBOL statements starting at column 1 instead of indenting them past the column-7 indicator area (in fixed format, Area A starts at column 8 and Area B at column 12), so GnuCOBOL's fixed-format parser rejects nearly every program before tests can even run.
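The layout problem described above can be caught before compiling. The following is a minimal sketch, assuming the standard fixed-format reference layout (columns 1-6 sequence area, column 7 indicator, columns 8-72 code) and assuming the generated programs carry no sequence numbers. The helper names `first_code_column` and `misplaced_lines` are hypothetical and not part of this repo.

```python
from typing import List, Optional

INDICATORS = "*/-Dd"  # characters commonly allowed in the column-7 indicator area


def first_code_column(line: str) -> Optional[int]:
    """Return the 1-based column of the first non-blank character, or None if blank."""
    for idx, ch in enumerate(line):
        if ch != " ":
            return idx + 1
    return None


def misplaced_lines(source: str) -> List[int]:
    """Return 1-based line numbers that start before column 8 or run past column 72.

    Simplification (assumption): sequence numbers in columns 1-6 are treated
    as misplaced text, since model output is not expected to contain them.
    """
    bad = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        col = first_code_column(line)
        if col is None:
            continue  # blank line
        if col == 7 and line[6] in INDICATORS:
            continue  # comment, continuation, or debug line
        if col < 8 or len(line.rstrip()) > 72:
            bad.append(lineno)
    return bad


good = "       IDENTIFICATION DIVISION.\n       PROGRAM-ID. REV.\n      * a comment\n"
bad = "IDENTIFICATION DIVISION.\n       PROGRAM-ID. REV.\n"
print(misplaced_lines(good), misplaced_lines(bad))
```

A check like this only flags layout; it says nothing about whether the program is otherwise valid COBOL.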

Compile rate on CobolEval (sample 1 of 5): base ≈ 5.5% → axel-1 ≈ 63.7%

A 105-example LoRA covering the column convention plus a few string-processing idioms is enough to fix this, without touching the base weights and without a measured regression on Component C (see the headline evaluation).

[Figure: CobolEval outcome breakdown]


The benchmarks (one-line each)

  • CobolEval — a 146-task HumanEval-style benchmark for COBOL code completion. Each task is a function signature in a COBOL skeleton; the model fills in the body. Compiled with GnuCOBOL, run against held-out test stdins.
  • Component B — 24 hand-curated COBOL → Java translation tasks across 8 hard mainframe failure modes (comp3_precision, OCCURS DEPENDING, REDEFINES, PIC formatting…). Frontier-only difficulty: Devstral-2 (123B) scores 8.3% pass@5.
  • Component C — 107 easier COBOL → Java tasks across 16 string-processing failure modes (case change, char check, palindrome, concat, accumulator, arithmetic…). Calibrated for small-model discrimination: Ministral-3B base scores 24.3%, Mistral-Medium 89.5%.
  • Component D — 18 medium-difficulty COBOL → Java tasks across 8 skills (loop accumulate-filter, multi-format input, string search, boundary branches…). Sits between B and C in difficulty.

All four use the same Modal inference harness; only the LoRA flag changes between base / axel-1 rows.
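The compile-then-run loop that CobolEval describes can be sketched as follows. This is not the Modal harness from this repo; it is a minimal local version, assuming `cobc` (GnuCOBOL) is on the PATH and that fixed format is its default source format. The function names, outcome labels, and the whitespace-insensitive comparison are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path
from typing import List


def compile_command(source: Path, executable: Path) -> List[str]:
    """Build the cobc invocation: -x produces an executable, -o names it."""
    return ["cobc", "-x", "-o", str(executable), str(source)]


def outputs_match(actual: str, expected: str) -> bool:
    """Compare program output to the reference, ignoring trailing whitespace."""
    def norm(text: str) -> List[str]:
        return [ln.rstrip() for ln in text.strip().splitlines()]
    return norm(actual) == norm(expected)


def run_task(source: Path, stdin_text: str, expected: str,
             timeout_s: float = 10.0) -> str:
    """Compile one candidate program, feed it stdin, and classify the outcome."""
    exe = source.with_suffix("")
    build = subprocess.run(compile_command(source, exe),
                           capture_output=True, text=True)
    if build.returncode != 0:
        return "compile_error"
    try:
        run = subprocess.run([str(exe.resolve())], input=stdin_text,
                             capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    if run.returncode != 0:
        return "runtime_error"
    return "pass" if outputs_match(run.stdout, expected) else "wrong_output"
```

Separating `compile_error` from `wrong_output` is what makes a compile-rate figure and a pass-rate figure reportable side by side.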

Headline evaluation (pass@5, temperature 0.7, k=5)

| Benchmark | n | Base | mistral-axel-1 | Δ vs base |
| --- | --- | --- | --- | --- |
| CobolEval (HumanEval-style COBOL) | 146 | 0.68% | 21.23% | +20.55pp |
| Component C (string-processing, 16 failure modes) | 107 | 24.30% | 24.30% | +0.00pp |
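For readers unfamiliar with pass@k, here is the unbiased estimator commonly used with HumanEval-style benchmarks. Whether this repo's harness uses exactly this estimator is an assumption; with 5 samples per task and k=5 it reduces to "at least one of the five samples passes".

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Mean pass@k over tasks, as a percentage."""
    scores = [pass_at_k(n, c, k) for c in correct_counts]
    return 100.0 * sum(scores) / len(scores)


# Toy example with 4 hypothetical tasks and 5 samples each.
toy_counts = [0, 1, 5, 0]
print(benchmark_pass_at_k(toy_counts, n=5, k=5))
print(benchmark_pass_at_k(toy_counts, n=5, k=1))
```

The same per-task counts yield both pass@1 and pass@5, which is why pass@1 is always the lower of the two.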

For reference, on the same harness:

  • Mistral-Medium: 30.5% CobolEval
  • Devstral-2 (123B): 31.6% CobolEval
  • Claude Sonnet 4.6: 65.8% CobolEval

A 107 MB LoRA on an 8B model closes roughly two thirds of the gap between the base 8B model and Mistral-Medium on CobolEval.
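The headline figures can be re-derived from task counts. The base count of 1 solved task is stated in this card; the adapter count of 31 is an inference (31 / 146 rounds to the reported 21.23%), not a number published here.

```python
TASKS = 146
base_solved = 1       # stated: exactly one task succeeds for the base model
adapter_solved = 31   # assumption: inferred from the reported 21.23%

base = 100 * base_solved / TASKS
adapter = 100 * adapter_solved / TASKS
mistral_medium = 30.5  # reference score quoted above

lift_pp = adapter - base                        # percentage-point lift
ratio = adapter / base                          # multiplicative improvement
gap_closed = lift_pp / (mistral_medium - base)  # share of the gap to Mistral-Medium

print(f"base={base:.2f}% adapter={adapter:.2f}% lift={lift_pp:.2f}pp")
print(f"ratio={ratio:.0f}x gap_closed={gap_closed:.0%}")
```

The gap share comes out near 69%, which is the basis for "roughly two thirds".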

Why Component C is flat

Component C measures something different from CobolEval. Its tasks are COBOL → Java translations across 16 string-processing failure modes (case change, char check, palindrome, concat, accumulator, arithmetic…), so the model writes Java and the fixed-format column convention never applies to its output. Most of axel-1's CobolEval lift comes from teaching column formatting, which Component C doesn't need. Net effect on Component C: zero, neither a lift nor a regression. On this benchmark at least, the training did not damage capabilities outside the SFT distribution.


Training data

105 examples of COBOL string-processing instruction-completion pairs (95 train / 10 val), built specifically to address CobolEval failure modes.

| Sub-skill | Count | Source mix |
| --- | --- | --- |
| case_change (upper/lower transforms) | 26 | synthetic + 8 real CobolEval |
| char_check (per-char predicates) | 27 | synthetic + 2 real |
| palindrome (string reversal / detection) | 24 | synthetic + 1 real |
| string_concat (build, join, append) | 23 | synthetic + 11 real |
| non-string holdouts (anti-forgetting) | 15 | from existing CobolEval coverage |

Composition by provenance: 68 synthetic (Claude Opus reasoning in-session, validated locally with cobc), 12 real CobolEval, 15 holdout, 10 hand-written edge cases. Every example compiles under GnuCOBOL fixed-format and has at least one passing reference test (9/9 verifier passes; 0 rejected).

No API spend on data generation. Synthesis was zero-cost reasoning; validation was a local cobc toolchain.


Training procedure

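# modal_app/finetune.py (single H100 80GB)
# Excerpt: model loading and dataset arguments are omitted here.
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,                       # Ministral-3-8B-Instruct, 4-bit BnB quant
    args=SFTConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=1,  # effective batch size of 4
        max_steps=40,                   # 40 steps x 4 = 160 samples (~1.7 epochs of 95)
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        seed=12627998,
    ),
    peft_config=LoraConfig(
        r=16, lora_alpha=16,            # scaling factor alpha / r = 1
        # All attention and MLP projections receive an adapter.
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.0, bias="none",
    ),
)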
  • Framework: Unsloth + TRL SFTTrainer, PEFT 0.19.1, Transformers 5.8
  • Steps: 40 (~1.7 epochs over 95 train examples)
  • Precision: BF16 weights, 4-bit BnB during training, BF16 adapter on disk
  • Hardware: 1× NVIDIA H100 80GB on Modal
  • Wall clock: 65 seconds
  • Estimated cost: < $0.50 of compute
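The "~1.7 epochs" figure and the LoRA scaling follow directly from the hyperparameters above. A small sketch of the arithmetic is given below; the 4096 × 4096 projection size in the parameter-count example is a hypothetical value chosen for illustration, not the base model's actual dimension.

```python
def epochs_seen(max_steps: int, per_device_batch: int, grad_accum: int,
                n_train: int, n_devices: int = 1) -> float:
    """Number of passes over the training set implied by a fixed step budget."""
    samples = max_steps * per_device_batch * grad_accum * n_devices
    return samples / n_train


def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


epochs = epochs_seen(max_steps=40, per_device_batch=4, grad_accum=1, n_train=95)
scaling = 16 / 16  # lora_alpha / r, the standard (non-rsLoRA) scaling factor

print(f"epochs ~ {epochs:.2f}, LoRA scaling = {scaling}")
print(lora_params(4096, 4096, 16))  # hypothetical 4096 x 4096 projection
```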

Usage

This is a PEFT/LoRA adapter — you load the base model first, then apply the adapter on top.

With PEFT + Transformers

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Ministral-3-8B-Instruct-2512",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "axeltta/mistral-axel-1")
tokenizer = AutoTokenizer.from_pretrained("axeltta/mistral-axel-1")

prompt = "Write a COBOL program that reads a string from stdin and prints it reversed."
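inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    return_tensors="pt", add_generation_prompt=True,
    return_dict=True,  # always return input_ids + attention_mask, whatever the library default
).to(model.device)

# Unpack the encoding so generate() receives input_ids and attention_mask.
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(out[0], skip_special_tokens=True))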

With Unsloth (faster, 4-bit)

from unsloth import FastModel
model, tokenizer = FastModel.from_pretrained(
    model_name="axeltta/mistral-axel-1",
    max_seq_length=2048,
    load_in_4bit=True,
)

With vLLM

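pip install vllm
# The repo holds only the LoRA adapter, so serve the base model and attach the adapter to it.
vllm serve mistralai/Ministral-3-8B-Instruct-2512 \
    --enable-lora \
    --lora-modules mistral-axel-1=axeltta/mistral-axel-1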

Intended use

  • Primary: COBOL code completion / generation, especially fixed-format programs.
  • Secondary: Research baseline for low-data domain SFT on legacy programming languages.
  • Companion adapter: axeltta/mistral-axel-2 — same base, trained for COBOL→Java translation (mixed results, see its card).

Limitations and risks

  • Narrow training distribution. 105 examples is very small; the lift comes mostly from teaching column convention, not deep COBOL semantics. The model can still generate plausible-looking but incorrect business logic.
  • Not for production COBOL. This is research-grade. Always compile, run real tests, and have a human review the output before touching mainframe code.
  • No safety tuning beyond the base model. Inherits all biases and risks of Ministral-3-8B-Instruct-2512.
  • CobolEval lift is partly compilation, not full correctness. pass@5 of 21.23% is real, but compile rate (~64%) is higher than pass rate — many programs compile and run yet produce the wrong output.

Reproducibility

All artifacts in this repo are deterministic byproducts of:

  • seed: 12627998
  • 95 train / 10 val examples (sha256 in funnex_manifest.json)
  • 40 steps with the hyperparameters above

A second training run from the same seed produced byte-identical adapter weights (verified during the v1 audit).
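One way to check a "byte-identical" claim like the one above is to compare SHA-256 digests of the two adapter files. This is a generic sketch; the file paths in the usage comment are hypothetical, and the format of funnex_manifest.json is not assumed here.

```python
import hashlib
from pathlib import Path
from typing import Union

PathLike = Union[str, Path]


def sha256_of(path: PathLike, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large weight files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_bytes(a: PathLike, b: PathLike) -> bool:
    """True when both files have identical content."""
    return sha256_of(a) == sha256_of(b)


# Usage (hypothetical paths):
# same_bytes("run1/adapter_model.safetensors", "run2/adapter_model.safetensors")
```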

Upload was independently verified on Modal: a fresh snapshot_download of this repo loaded via PEFT against base BF16 and emitted valid Area-B COBOL on one generate call.


Citation

@software{mistral_axel_1_2026,
  author = {Axelsson, A.},
  title  = {mistral-axel-1: A LoRA adapter for COBOL code completion on Ministral-3-8B-Instruct},
  year   = {2026},
  url    = {https://huggingface.co/axeltta/mistral-axel-1},
}

Framework versions

  • PEFT 0.19.1
  • Transformers 5.8
  • TRL (SFTTrainer)
  • Unsloth (training pipeline)
  • PyTorch ≥ 2.7

Evaluation results

All scores are self-reported.

| Benchmark | Metric | Score (%) |
| --- | --- | --- |
| CobolEval (HumanEval-style COBOL, 146 tasks) | pass@5 | 21.23 |
| CobolEval (HumanEval-style COBOL, 146 tasks) | pass@1 | 9.04 |
| Component C (107 tasks, 16 failure modes) | pass@5 | 24.30 |
| Component C (107 tasks, 16 failure modes) | pass@1 | 6.73 |