mistral-axel-2 (v2.1)

A LoRA adapter for COBOL → Java translation, trained on Ministral-3-8B-Instruct.

This card reports a negative result in full. axel-2 did learn the v2.1 training distribution (+19pp on perturbed train-set translations vs base), but that learning did not generalize the way we hoped: it regressed Component C by 1.87pp and lost most of axel-1's CobolEval lift. We publish the full numbers because the negative result is itself a useful data point about narrow-domain SFT.

TL;DR

Field	Value
Base model	`unsloth/Ministral-3-8B-Instruct-2512` (BF16; trained on 4-bit BnB quant)
Adapter type	LoRA (PEFT)
Rank / α	16 / 16
Trainable params	~28 M (≈ 0.35 % of base)
Adapter size	107 MB (`adapter_model.safetensors`, 812 tensors, BF16)
Training data	137 examples of COBOL→Java translation + anti-forgetting holdouts
Training time	84 s on 1× H100 80GB (Modal)
Task	COBOL→Java translation (preserve stdin/stdout behavior)
License	Apache-2.0 (matches base model)

The benchmarks (one-line each)

CobolEval: a 146-task HumanEval-style COBOL completion benchmark. Solutions are compiled with GnuCOBOL and run against held-out test stdins.
Component B: 24 hand-curated COBOL → Java translation tasks across 8 hard mainframe failure modes (comp3_precision, OCCURS DEPENDING, REDEFINES, PIC formatting, EVALUATE WHEN OTHER…). Frontier-only difficulty: Devstral-2 (123B) scores 8.3% pass@5.
Component C: 107 easier COBOL → Java tasks across 16 string-processing failure modes (case change, char check, palindrome, concat, accumulator, arithmetic, min/max…). Calibrated for small-model discrimination: the Ministral-3-8B base scores 24.3%, Mistral-Medium 89.5%.
Component D: 18 medium-difficulty COBOL → Java tasks across 8 skills (loop accumulate-filter, multi-format input, string search, boundary branches…). Sits between B and C in difficulty; Mistral-Large hits 94.4% here.

All four use the same Modal inference harness; only the LoRA flag changes between rows.
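The training system prompt quoted later in this card asks for stdin/stdout equivalence "under loose whitespace comparison and numeric tolerance 1e-3". The harness's actual comparator is not shown here, so the following is only one plausible reading of that rule. The function name `outputs_match` and the use of an absolute tolerance are assumptions.

```python
def outputs_match(expected: str, actual: str, tol: float = 1e-3) -> bool:
    """Compare two stdout captures token by token, ignoring whitespace layout."""
    exp_tokens, act_tokens = expected.split(), actual.split()
    if len(exp_tokens) != len(act_tokens):
        return False
    for e, a in zip(exp_tokens, act_tokens):
        if e == a:
            continue
        try:
            # Numeric tokens may differ by at most tol (absolute difference).
            if abs(float(e) - float(a)) > tol:
                return False
        except ValueError:
            return False  # non-numeric tokens must match exactly
    return True


print(outputs_match("TOTAL:  12.50\n", "TOTAL: 12.5"))  # True
print(outputs_match("HELLO", "hello"))                   # False
```

Splitting on whitespace makes the check indifferent to trailing newlines and column padding, which is where COBOL DISPLAY output and Java println output most often differ.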

Headline evaluation (pass@5, temperature 0.7)

Benchmark	n	Base	axel-1 (v1)	mistral-axel-2 (v2.1)	axel-2 vs base
CobolEval (completion)	146	0.68%	21.23%	2.05%	+1.37pp
Component C (string-processing)	107	24.30%	24.30%	22.43%	−1.87pp
Component B (hard COBOL idioms → Java)	24	0.00%	n/a	4.17%	+4.17pp
Component D (medium difficulty)	18	0.00%	n/a	0.00%	+0.00pp

The headline target — lifting Component C from 24.30% to 40%+ — was missed. Component B gained 1 task (occurs_depending_00), making axel-2 the only Ministral-3-8B variant we ran that scores non-zero on Component B.
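For readers who want to recompute the scores, pass@k is commonly calculated with the unbiased estimator popularized by HumanEval. The sketch below assumes that estimator and 5 samples per task; the card does not state the sample count, and the function names are illustrative rather than taken from the harness. With n = k = 5 the estimator reduces to "at least one sample passed".

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(correct_counts: list[int], n: int, k: int) -> float:
    """Mean pass@k over all tasks, as a percentage."""
    return 100.0 * sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical per-task results: 24 tasks, one of which passed on 2 of 5 samples.
counts = [2] + [0] * 23
print(f"{benchmark_score(counts, n=5, k=5):.2f}%")  # 4.17%
```

One solved task out of 24 gives 4.17%, which is the shape of the Component B row: on a benchmark this small, a single task moves the score by more than 4pp.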

The interesting finding: catastrophic forgetting between sibling string skills

The full Component C delta is a clean trade between two sub-skills:

failure mode	base passes	axel-2 passes	delta
`string_concat` (n=19)	0/19	19/19	+100pp
`string_palindrome` (n=23)	23/23	3/23	−87pp
`string_case_change` (n=7)	1/7	0/7	−14pp
all other failure modes (n=58)	2/58	2/58	0pp

axel-2 perfectly learned string concatenation (every single Component C concat task passes) and catastrophically forgot palindrome detection (which base already had). Net: −2 tasks.

This is a tidy example of how a narrow SFT distribution can pull the model's prior toward one pattern at the expense of a sibling pattern — even when 35 of the 137 training examples were explicitly chosen as "anti-forgetting" holdouts.
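The trade can be checked by summing the per-failure-mode table. The short tally below uses only the counts reported above. The size of the "other" bucket is inferred as 107 - 19 - 23 - 7 = 58 and is not stated in the card.

```python
# (failure mode, n tasks, base passes, axel-2 passes), copied from the table above.
# The "other" row's n is inferred (107 - 19 - 23 - 7 = 58), not reported directly.
ROWS = [
    ("string_concat",      19,  0, 19),
    ("string_palindrome",  23, 23,  3),
    ("string_case_change",  7,  1,  0),
    ("other",              58,  2,  2),
]

total = sum(n for _, n, _, _ in ROWS)
base_pass = sum(b for _, _, b, _ in ROWS)
axel2_pass = sum(a for _, _, _, a in ROWS)

# Per-mode delta in percentage points of that mode's own task count.
for name, n, b, a in ROWS:
    print(f"{name:20s} {100 * (a - b) / n:+7.1f}pp")

print(f"base   {base_pass}/{total} = {100 * base_pass / total:.2f}%")     # 26/107 = 24.30%
print(f"axel-2 {axel2_pass}/{total} = {100 * axel2_pass / total:.2f}%")  # 24/107 = 22.43%
print(f"net    {axel2_pass - base_pass:+d} tasks")                       # -2 tasks
```

The totals reproduce the headline Component C row, so the 1.87pp regression is fully accounted for by the concat gain, the palindrome loss, and one case-change task.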

Memorization audit (clean)

We perturbed 10 v2.1 training examples (rename PROGRAM-ID + 1 working-storage variable), re-translated with axel-2, then ran the resulting Java against the original test stdins.

metric	base	axel-2
test cases passed (of 21 valid)	17	21
pass rate	80.95%	100.00%
byte-identical to gold Java	n/a	0/10

axel-2 generalizes within the training distribution (+19pp lift after perturbation, 0 byte-identical copies). The narrow result is therefore not a memorization artifact — the training distribution itself is just narrow.
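A minimal sketch of the perturbation step, assuming a regex-based rename. The audit script itself is not shown in this card, so `perturb_cobol`, `is_byte_identical`, and the sample program are hypothetical.

```python
import re


def perturb_cobol(source: str, new_program_id: str, old_var: str, new_var: str) -> str:
    """Rename the PROGRAM-ID and one data name, leaving everything else untouched."""
    # Replace the name that follows "PROGRAM-ID." (COBOL keywords are case-insensitive).
    out = re.sub(
        r"(PROGRAM-ID\.\s+)[A-Za-z0-9-]+",
        lambda m: m.group(1) + new_program_id,
        source,
        count=1,
        flags=re.IGNORECASE,
    )
    # Rename whole names only. COBOL names may contain hyphens, so a plain \b
    # boundary would wrongly match the WS-TOTAL prefix inside WS-TOTAL-2.
    pattern = rf"(?<![A-Za-z0-9-]){re.escape(old_var)}(?![A-Za-z0-9-])"
    return re.sub(pattern, lambda m: new_var, out, flags=re.IGNORECASE)


def is_byte_identical(generated_java: str, gold_java: str) -> bool:
    """Strict copy check: True only if the two files match byte for byte."""
    return generated_java.encode("utf-8") == gold_java.encode("utf-8")


example = """       IDENTIFICATION DIVISION.
       PROGRAM-ID. CONCAT.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01 WS-TOTAL   PIC 9(4).
       01 WS-TOTAL-2 PIC 9(4).
       PROCEDURE DIVISION.
           ADD 1 TO WS-TOTAL.
"""
print(perturb_cobol(example, "CONCATX", "WS-TOTAL", "WS-SUM"))
```

Because the rename changes surface tokens but not program behavior, the original test stdins remain valid, which is what lets the perturbed translations be scored against them.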

Training procedure

# modal_app/finetune_axel2.py (excerpt; single H100 80GB)
# `model` is loaded earlier in the script; dataset arguments are omitted from this excerpt.
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,                       # Ministral-3-8B-Instruct, 4-bit BnB quant
    args=SFTConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=1,
        max_steps=60,                  # 60 steps x batch 4 = 240 samples, ~1.94 epochs over 124 train
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        seed=12627998,
    ),
    peft_config=LoraConfig(
        r=16, lora_alpha=16,
        # All attention and MLP projections are adapted.
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.0, bias="none",
    ),
)

System prompt at training time:

"You are a senior COBOL modernization engineer. Translate the user's standalone COBOL program into Java. Preserve stdin/stdout behavior exactly under loose whitespace comparison and numeric tolerance 1e-3. Output only a single Java file with public class Solution. Do not use markdown fences."

metric	value
steps	60
learning rate	1e-4 cosine, warmup 3%
batch size	4
train / val examples	124 / 13
precision	BF16
hardware	1× H100 80GB on Modal
wall clock	83.5 s
train loss	1.05 → 0.16
eval loss step 15 / 30 / 45 / 60	0.34 / 0.17 / 0.14 / 0.13
estimated cost	< $0.50

Training data

137 examples of COBOL → Java translation pairs (124 train / 13 val), generated and calibrated against base-model pass-rate to avoid both "too easy" and "too hard" extremes.

failure mode	count
`comp3_precision`	19
`redefines_aliasing`	16
`evaluate_when_other`	15
`packed_vs_display`	15
`occurs_depending`	12
`multi_format_input`	11
`eighty_eight_levels`	8
`perform_varying_complex`	3
`string_pic_width`	3
anti-forgetting holdouts (from SFT v1.5)	35

Calibration: every candidate example was run through the base model 5 times. Examples passing all 5 (44) or none of 5 (93) were rejected, leaving 137 in the difficulty-1/2/3 sweet spot.
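The calibration rule can be sketched as a simple filter over per-example pass counts. This is an illustration of the rule as described, not the generation pipeline's code; the function name, the example ids, and the toy counts are all made up.

```python
from collections import Counter


def calibrate(pass_counts: dict[str, int], runs: int = 5) -> tuple[list[str], Counter]:
    """Keep examples the base model solves sometimes but not always.

    pass_counts maps a (hypothetical) example id to how many of `runs`
    base-model translations passed that example's tests.
    """
    kept, rejected = [], Counter()
    for example_id, passes in pass_counts.items():
        if passes == runs:
            rejected["too_easy"] += 1   # passed every run
        elif passes == 0:
            rejected["too_hard"] += 1   # passed no run
        else:
            kept.append(example_id)
    return kept, rejected


# Toy pool: invented ids and counts, only to show the filter's behavior.
pool = {"ex-a": 5, "ex-b": 0, "ex-c": 2, "ex-d": 4, "ex-e": 0}
kept, rejected = calibrate(pool)
print(kept)            # ['ex-c', 'ex-d']
print(dict(rejected))  # {'too_easy': 1, 'too_hard': 2}
```

Filtering on the base model's own pass rate keeps examples where the gradient signal is informative: the model can sometimes reach the right answer, so SFT has something to reinforce.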

Source mix: 102 v2.1 hard-synthesis + 35 anti-forgetting holdouts from SFT v1.5 (axel-1's parent dataset).

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Ministral-3-8B-Instruct-2512",
    torch_dtype=torch.bfloat16, device_map="auto",
)
model = PeftModel.from_pretrained(base, "axeltta/mistral-axel-2")
tokenizer = AutoTokenizer.from_pretrained("axeltta/mistral-axel-2")

# Same wording as the training-time system prompt quoted above.
SYSTEM = ("You are a senior COBOL modernization engineer. "
          "Translate the user's standalone COBOL program into Java. "
          "Preserve stdin/stdout behavior exactly under loose whitespace "
          "comparison and numeric tolerance 1e-3. "
          "Output only a single Java file with public class Solution. "
          "Do not use markdown fences.")
COBOL = """       IDENTIFICATION DIVISION.
       PROGRAM-ID. CONCAT.
       ...
"""

# return_dict=True yields input_ids and attention_mask, which generate() takes as keyword arguments.
inputs = tokenizer.apply_chat_template(
    [{"role": "system", "content": SYSTEM},
     {"role": "user",   "content": COBOL}],
    return_tensors="pt", return_dict=True, add_generation_prompt=True,
).to(model.device)

out = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the echoed prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))

Match the training system prompt. axel-2 is sensitive to system-prompt style; using a different prompt at inference time pulls it back toward base behavior (this is part of why CobolEval — which uses a completion prompt, not a translation prompt — regressed from axel-1's lift).

What this rules out, and what it doesn't

Rules out:

"axel-2 didn't train" → eval loss decreased monotonically to 0.13; perturbed train-set lift +19pp.
"axel-2 memorized" → 0/10 byte-identical outputs after perturbation.
"the eval pipeline is broken" → same harness, base 24.30% / axel-1 24.30% / axel-2 22.43% — only the LoRA flag differs.

Doesn't rule out:

System-prompt mismatch. v2.1 trained on a translation system prompt; Component C harness uses a different one.
Distribution gap. The 9 hard failure modes in v2.1 cover Component B's structure, not Component C's (which is dominated by string_concat / string_palindrome / string_case_change).
Step count too high. 60 steps × batch 4 on 124 examples ≈ 1.94 epochs. v1 used 40 steps on 95 (~1.7 epochs) and got the CobolEval lift; v2.1 may have over-trained on a narrow distribution.
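The epoch figures in the last point follow from steps × batch size ÷ training examples. The sketch below reproduces them; batch size 4 for v1 is an assumption carried over from the v2.1 config, and the 30-step row is a hypothetical run at the step count suggested later in this card.

```python
def epochs(steps: int, batch_size: int, n_train: int, grad_accum: int = 1) -> float:
    """Passes over the training set for a fixed-step run on a single device."""
    return steps * batch_size * grad_accum / n_train


v21 = epochs(steps=60, batch_size=4, n_train=124)
v1 = epochs(steps=40, batch_size=4, n_train=95)    # batch size 4 assumed for v1
v3 = epochs(steps=30, batch_size=4, n_train=124)   # hypothetical ~30-step run

print(f"v2.1: {v21:.2f} epochs")      # 1.94
print(f"v1:   {v1:.2f} epochs")       # 1.68
print(f"30 steps: {v3:.2f} epochs")   # 0.97
```

Measured in epochs, v2.1 and v1 are not far apart, so if over-training is the cause, the narrower v2.1 distribution is likely doing more of the work than the raw step count.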

Recommended directions for a hypothetical v3

If you're considering training a v3 on this base, the audit suggests:

Match the harness system prompt at training time — not the "senior engineer" variant.
Explicitly include string_palindrome examples to preserve the base capability.
Lower step count (~30) to avoid over-training a narrow distribution.
Re-mix CobolEval-style completion examples to keep the v1 +20pp CobolEval lift.

Companion adapter

axeltta/mistral-axel-1 — same base, trained for COBOL code completion (positive result, +20.55pp on CobolEval).

Limitations and risks

Negative result on the headline metric. axel-2 should not be assumed to improve every COBOL task; on Component C it costs 2 tasks compared to base.
Narrow training distribution. 137 examples is small; the model strongly fits the 9 v2.1 failure modes and weakens elsewhere.
Not for production mainframe modernization. Research-grade. Always compile, run real tests, and have a human review.
Inherits base biases. No safety tuning beyond Ministral-3-8B-Instruct-2512.

Citation

@software{mistral_axel_2_2026,
  author = {Axelsson, A.},
  title  = {mistral-axel-2: A LoRA adapter for COBOL→Java translation on Ministral-3-8B-Instruct},
  year   = {2026},
  url    = {https://huggingface.co/axeltta/mistral-axel-2},
  note   = {Negative result on Component C; positive on Component B.},
}

Framework versions

PEFT 0.19.1
Transformers 5.8
TRL (SFTTrainer)
Unsloth (training pipeline)
PyTorch ≥ 2.7


Evaluation results

All values are self-reported.

metric	benchmark	value
pass@5	Component B (24 tasks, 9 failure modes)	4.17%
pass@5	Component C (107 tasks)	22.43%
pass@1	Component C (107 tasks)	7.66%
pass@5	CobolEval (146 tasks)	2.05%
pass@1	CobolEval (146 tasks)	1.37%