mistral-axel-2 (v2.1)

A LoRA adapter for COBOL → Java translation, trained on Ministral-3-8B-Instruct.

This card includes the full negative result: axel-2 did learn the v2.1 training distribution (+19pp on perturbed train-set translations vs base), but the distribution did not generalize the way we hoped — it regressed Component C by 1.87pp and lost most of axel-1's CobolEval lift. We publish the full numbers because the negative result is itself a useful data point about narrow-domain SFT.

[Figure: Pass@5 across four COBOL→Java benchmarks]


TL;DR

Field | Value
Base model | unsloth/Ministral-3-8B-Instruct-2512 (BF16; trained on 4-bit BnB quant)
Adapter type | LoRA (PEFT)
Rank / α | 16 / 16
Trainable params | ~28 M (≈ 0.35 % of base)
Adapter size | 107 MB (adapter_model.safetensors, 812 tensors, BF16)
Training data | 137 examples (102 v2.1 hard-synthesis COBOL→Java translations + 35 anti-forgetting holdouts)
Training time | 84 s on 1× H100 80GB (Modal)
Task | COBOL→Java translation (preserve stdin/stdout behavior)
License | Apache-2.0 (matches base model)

The benchmarks (one line each)

  • CobolEval: 146-task HumanEval-style COBOL completion benchmark. Compiled with GnuCOBOL, run against held-out test stdins.
  • Component B: 24 hand-curated COBOL → Java translation tasks across 8 hard mainframe failure modes (comp3_precision, OCCURS DEPENDING, REDEFINES, PIC formatting, EVALUATE WHEN OTHER…). Frontier-only difficulty: Devstral-2 (123B) scores 8.3% pass@5.
  • Component C: 107 easier COBOL → Java tasks across 16 string-processing failure modes (case change, char check, palindrome, concat, accumulator, arithmetic, min/max…). Calibrated for small-model discrimination: Ministral-3-8B base scores 24.3%, Mistral-Medium 89.5%.
  • Component D: 18 medium-difficulty COBOL → Java tasks across 8 skills (loop accumulate-filter, multi-format input, string search, boundary branches…). Sits between B and C in difficulty; Mistral-Large hits 94.4% here.

All four use the same Modal inference harness; only the LoRA flag changes between rows.

Headline evaluation (pass@5, temperature 0.7, k=5)

Benchmark | n | Base | axel-1 (v1) | mistral-axel-2 (v2.1) | axel-2 vs base
CobolEval (completion) | 146 | 0.68% | 21.23% | 2.05% | +1.37pp
Component C (string-processing) | 107 | 24.30% | 24.30% | 22.43% | −1.87pp
Component B (hard COBOL idioms → Java) | 24 | 0.00% | n/a | 4.17% | +4.17pp
Component D (medium difficulty) | 18 | 0.00% | n/a | 0.00% | +0.00pp

The headline target — lifting Component C from 24.30% to 40%+ — was missed. Component B gained 1 task (occurs_depending_00), making axel-2 the only Ministral-3-8B variant we ran that scores non-zero on Component B.
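
For readers who want to reproduce figures like those in the table above, here is a minimal sketch of a pass@k computation. It uses the standard unbiased pass@k estimator and assumes (the card does not say so explicitly) that each task is sampled n = 5 times, in which case pass@5 reduces to "at least one of the five samples passed". The function names and the per-task counts in the example are illustrative, not taken from the harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_score(correct_counts: list[int], n: int = 5, k: int = 5) -> float:
    """Mean pass@k over all tasks, expressed as a percentage."""
    total = sum(pass_at_k(n, c, k) for c in correct_counts)
    return 100.0 * total / len(correct_counts)

# Hypothetical per-task results: 24 tasks, one of which has 2 passing samples.
component_b = [2] + [0] * 23
print(f"{benchmark_score(component_b):.2f}%")  # prints 4.17%
```

With n = k, a single passing sample is enough to count the task as solved, which is why one solved task out of 24 yields 4.17%.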


The interesting finding: catastrophic forgetting between sibling string skills

Almost the entire Component C delta comes from a trade between two sub-skills, with one further task lost on case change:

Component C by failure mode

failure mode | base passes | axel-2 passes | delta
string_concat (n=19) | 0/19 | 19/19 | +100pp
string_palindrome (n=23) | 23/23 | 3/23 | −87pp
string_case_change (n=7) | 1/7 | 0/7 | −14pp
every other failure mode | 2 | 2 | 0pp

axel-2 learned string concatenation perfectly (all 19 Component C concat tasks pass, up from 0) and catastrophically forgot palindrome detection, which base solved on all 23 tasks (axel-2: 3/23). Together with the one lost case-change task, the net is +19 − 20 − 1 = −2 tasks.

This is a tidy example of how a narrow SFT distribution can pull the model's prior toward one pattern at the expense of a sibling pattern — even when 35 of the 137 training examples were explicitly chosen as "anti-forgetting" holdouts.
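
The arithmetic behind the trade can be checked with a few lines. The pass counts below are copied from the failure-mode table above; the "other" bucket size (58) is derived here as 107 − 19 − 23 − 7 and is not stated in the card.

```python
# (base passes, axel-2 passes, number of tasks) per failure mode.
results = {
    "string_concat": (0, 19, 19),
    "string_palindrome": (23, 3, 23),
    "string_case_change": (1, 0, 7),
    "other": (2, 2, 58),  # derived bucket size: 107 - 19 - 23 - 7
}

base_total = sum(b for b, _, _ in results.values())
axel2_total = sum(a for _, a, _ in results.values())
n_total = sum(n for _, _, n in results.values())

for mode, (b, a, n) in results.items():
    print(f"{mode:20s} {a - b:+3d} tasks ({100 * (a - b) / n:+.0f}pp)")

print(f"net {axel2_total - base_total:+d} tasks: "
      f"{100 * base_total / n_total:.2f}% -> {100 * axel2_total / n_total:.2f}%")
```

The totals (26 and 24 of 107) reproduce the 24.30% and 22.43% headline figures for Component C.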


Memorization audit (clean)

We perturbed 10 v2.1 training examples (rename PROGRAM-ID + 1 working-storage variable), re-translated with axel-2, then ran the resulting Java against the original test stdins.

metric | base | axel-2
test cases passed (of 21 valid) | 17 | 21
pass rate | 80.95% | 100.00%
byte-identical to gold Java | n/a | 0/10

axel-2 generalizes within the training distribution (+19pp lift after perturbation, 0 byte-identical copies). The narrow result is therefore not a memorization artifact — the training distribution itself is just narrow.
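
The perturbation described above (rename PROGRAM-ID plus one working-storage variable) could look roughly like the following. This is an illustrative sketch, not the audit script itself: the function name perturb, the regular expressions, and the sample program are all assumptions.

```python
import re

NAME_CHARS = "A-Za-z0-9-"  # characters allowed in a COBOL user-defined word

def perturb(cobol: str, new_program_id: str, old_var: str, new_var: str) -> str:
    """Rename the PROGRAM-ID and every occurrence of one variable."""
    # Swap the name that follows "PROGRAM-ID." (the trailing period is kept).
    out = re.sub(
        rf"(PROGRAM-ID\.\s+)[{NAME_CHARS}]+",
        lambda m: m.group(1) + new_program_id,
        cobol, count=1, flags=re.IGNORECASE,
    )
    # COBOL names contain hyphens, so \b is not a safe boundary: require that
    # the match is not preceded or followed by another name character.
    pattern = rf"(?<![{NAME_CHARS}]){re.escape(old_var)}(?![{NAME_CHARS}])"
    return re.sub(pattern, lambda _: new_var, out, flags=re.IGNORECASE)

# Hypothetical sample program, for illustration only.
src = (
    "       IDENTIFICATION DIVISION.\n"
    "       PROGRAM-ID. CONCAT.\n"
    "       DATA DIVISION.\n"
    "       WORKING-STORAGE SECTION.\n"
    "       01 WS-NAME PIC X(10).\n"
    "       01 WS-NAME-LEN PIC 9(2).\n"
    "       PROCEDURE DIVISION.\n"
    "           ACCEPT WS-NAME.\n"
    "           DISPLAY WS-NAME.\n"
)
print(perturb(src, "JOINER", "WS-NAME", "WS-LABEL"))
```

Because the program's logic is untouched, the original test stdins remain valid, while any byte-for-byte copy of the gold Java would be easy to spot.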


Training procedure

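# modal_app/finetune_axel2.py (single H100 80GB)
# Excerpt: model loading and dataset arguments are omitted here.
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(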
    model=model,                       # Ministral-3-8B-Instruct, 4-bit BnB quant
    args=SFTConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=1,
        max_steps=60,                  # ~1.94 epochs over 124 train
        learning_rate=1e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        seed=12627998,
    ),
    peft_config=LoraConfig(
        r=16, lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.0, bias="none",
    ),
)

System prompt at training time:

"You are a senior COBOL modernization engineer. Translate the user's standalone COBOL program into Java. Preserve stdin/stdout behavior exactly under loose whitespace comparison and numeric tolerance 1e-3. Output only a single Java file with public class Solution. Do not use markdown fences."

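The prompt's "loose whitespace comparison and numeric tolerance 1e-3" can be read as a token-wise check like the one below. This is only a sketch of one plausible interpretation: the harness's real comparison is not shown in this card, and the function name outputs_match and the choice of an absolute (rather than relative) tolerance are assumptions.

```python
def _as_float(token: str):
    """Return the token as a float, or None if it is not numeric."""
    try:
        return float(token)
    except ValueError:
        return None

def outputs_match(expected: str, actual: str, tol: float = 1e-3) -> bool:
    """Compare two stdout captures token by token.

    Whitespace is loose: split() collapses runs of spaces and newlines.
    Tokens that both parse as numbers are compared with absolute tolerance
    tol; all other tokens must match exactly.
    """
    exp_tokens, act_tokens = expected.split(), actual.split()
    if len(exp_tokens) != len(act_tokens):
        return False
    for e, a in zip(exp_tokens, act_tokens):
        e_num, a_num = _as_float(e), _as_float(a)
        if e_num is not None and a_num is not None:
            if abs(e_num - a_num) > tol:
                return False
        elif e != a:
            return False
    return True

print(outputs_match("TOTAL: 12.500\n", "  TOTAL:   12.5004 "))  # True
print(outputs_match("TOTAL: 12.5", "TOTAL: 12.6"))              # False
```
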
metric | value
steps | 60
learning rate | 1e-4, cosine schedule, 3% warmup
batch size | 4
train / val examples | 124 / 13
precision | BF16
hardware | 1× H100 80GB on Modal
wall clock | 83.5 s
train loss | 1.05 → 0.16
eval loss at steps 15 / 30 / 45 / 60 | 0.34 / 0.17 / 0.14 / 0.13
estimated cost | < $0.50
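
To make the schedule concrete, the sketch below evaluates a linear-warmup plus cosine-decay learning rate at the steps where eval loss was logged. It mirrors the usual form of this schedule; rounding the warmup length up (60 × 0.03 gives 2 steps) is an assumption about the trainer's behavior, and lr_at is a hypothetical helper, not part of the training script.

```python
import math

def lr_at(step: int, base_lr: float = 1e-4, total_steps: int = 60,
          warmup_ratio: float = 0.03) -> float:
    """Learning rate after `step` optimizer steps."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)  # assumed rounding: up
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Cosine decay from base_lr down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

for step in (0, 1, 2, 15, 30, 45, 60):
    print(f"step {step:2d}: lr = {lr_at(step):.2e}")
```

Under these assumptions the peak rate is reached after two steps, so effectively the whole 60-step run is spent on the cosine decay.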

[Figure: eval loss curve]


Training data

137 examples of COBOL → Java translation pairs (124 train / 13 val), generated and calibrated against base-model pass-rate to avoid both "too easy" and "too hard" extremes.

failure mode | count
comp3_precision | 19
redefines_aliasing | 16
evaluate_when_other | 15
packed_vs_display | 15
occurs_depending | 12
multi_format_input | 11
eighty_eight_levels | 8
perform_varying_complex | 3
string_pic_width | 3
anti-forgetting holdouts (from SFT v1.5) | 35

Calibration: every candidate example was run through the base model 5 times. Examples passing all 5 (44) or none of 5 (93) were rejected, leaving 137 in the difficulty-1/2/3 sweet spot.
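
The calibration rule above amounts to a simple filter on per-example pass counts. A minimal sketch, assuming each candidate has already been scored as "passes out of 5 base-model runs"; the function name calibrate and the example IDs are hypothetical.

```python
def calibrate(pass_counts: dict[str, int], runs: int = 5) -> dict[str, list[str]]:
    """Bucket candidates by how many of `runs` base-model attempts passed."""
    buckets: dict[str, list[str]] = {"too_easy": [], "too_hard": [], "kept": []}
    for example_id, passes in pass_counts.items():
        if not 0 <= passes <= runs:
            raise ValueError(f"{example_id}: {passes} passes out of {runs} runs")
        if passes == runs:
            buckets["too_easy"].append(example_id)   # base already solves it
        elif passes == 0:
            buckets["too_hard"].append(example_id)   # no signal at this scale
        else:
            buckets["kept"].append(example_id)       # partial success: keep
    return buckets

# Hypothetical candidates with their base-model pass counts.
counts = {"ex_a": 5, "ex_b": 0, "ex_c": 1, "ex_d": 4, "ex_e": 3}
buckets = calibrate(counts)
print(buckets["kept"])  # ['ex_c', 'ex_d', 'ex_e']
```

Keeping only partially solved examples concentrates training on cases where the base model is close to correct but not yet reliable.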

Source mix: 102 v2.1 hard-synthesis + 35 anti-forgetting holdouts from SFT v1.5 (axel-1's parent dataset).


Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Ministral-3-8B-Instruct-2512",
    torch_dtype=torch.bfloat16, device_map="auto",
)
model = PeftModel.from_pretrained(base, "axeltta/mistral-axel-2")
tokenizer = AutoTokenizer.from_pretrained("axeltta/mistral-axel-2")

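# Same text as the training-time system prompt; keep it verbatim.
SYSTEM = ("You are a senior COBOL modernization engineer. "
          "Translate the user's standalone COBOL program into Java. "
          "Preserve stdin/stdout behavior exactly under loose whitespace "
          "comparison and numeric tolerance 1e-3. "
          "Output only a single Java file with public class Solution. "
          "Do not use markdown fences.")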
COBOL = """       IDENTIFICATION DIVISION.
       PROGRAM-ID. CONCAT.
       ...
"""

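inputs = tokenizer.apply_chat_template(
    [{"role": "system", "content": SYSTEM},
     {"role": "user",   "content": COBOL}],
    return_tensors="pt", add_generation_prompt=True,
    return_dict=True,  # return input_ids and attention_mask together
).to(model.device)

out = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))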

Match the training system prompt. axel-2 is sensitive to system-prompt style; using a different prompt at inference time pulls it back toward base behavior (this is part of why CobolEval — which uses a completion prompt, not a translation prompt — regressed from axel-1's lift).


What this rules out, and what it doesn't

Rules out:

  • "axel-2 didn't train" → eval loss decreased monotonically to 0.13; perturbed train-set lift +19pp.
  • "axel-2 memorized" → 0/10 byte-identical outputs after perturbation.
  • "the eval pipeline is broken" → same harness, base 24.30% / axel-1 24.30% / axel-2 22.43% — only the LoRA flag differs.

Doesn't rule out:

  • System-prompt mismatch. v2.1 trained on a translation system prompt; Component C harness uses a different one.
  • Distribution gap. The 9 hard failure modes in v2.1 cover Component B's structure, not Component C's (which is dominated by string_concat / string_palindrome / string_case_change).
  • Step count too high. 60 steps × batch 4 on 124 examples ≈ 1.94 epochs. v1 used 40 steps on 95 (~1.7 epochs) and got the CobolEval lift; v2.1 may have over-trained on a narrow distribution.
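
The epoch figures in the last bullet follow from steps × batch size ÷ training examples. The sketch below reproduces them; the batch size of 4 for v1 is an assumption (the card gives only v1's step and example counts, and batch 4 is what yields ~1.7 epochs), and both helper names are hypothetical.

```python
def epochs(steps: int, batch_size: int, n_train: int, grad_accum: int = 1) -> float:
    """Number of passes over the training set for a given step budget."""
    return steps * batch_size * grad_accum / n_train

def steps_for(target_epochs: float, batch_size: int, n_train: int,
              grad_accum: int = 1) -> int:
    """Step budget that gives approximately `target_epochs` passes."""
    return round(target_epochs * n_train / (batch_size * grad_accum))

print(f"v2.1: {epochs(60, 4, 124):.2f} epochs")          # 1.94
print(f"v1:   {epochs(40, 4, 95):.2f} epochs")           # 1.68 (batch 4 assumed)
print(f"30 steps on 124: {epochs(30, 4, 124):.2f} epochs")
print(f"steps for 1 epoch on 124: {steps_for(1.0, 4, 124)}")
```

At the same batch size, a budget of about 30 steps corresponds to roughly one pass over the 124 training examples.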

Recommended directions for a hypothetical v3

If you're considering training a v3 on this base, the audit suggests:

  1. Match the harness system prompt at training time — not the "senior engineer" variant.
  2. Explicitly include string_palindrome examples to preserve the base capability.
  3. Lower step count (~30) to avoid over-training a narrow distribution.
  4. Re-mix CobolEval-style completion examples to keep the v1 +20pp CobolEval lift.

Companion adapter

  • axeltta/mistral-axel-1 — same base, trained for COBOL code completion (positive result, +20.55pp on CobolEval).

Limitations and risks

  • Negative result on the headline metric. axel-2 should not be assumed to improve every COBOL task; on Component C it costs 2 tasks compared to base.
  • Narrow training distribution. 137 examples is small; the model strongly fits the 9 v2.1 failure modes and weakens elsewhere.
  • Not for production mainframe modernization. Research-grade. Always compile, run real tests, and have a human review.
  • Inherits base biases. No safety tuning beyond Ministral-3-8B-Instruct-2512.

Citation

@software{mistral_axel_2_2026,
  author = {Axelsson, A.},
  title  = {mistral-axel-2: A LoRA adapter for COBOL→Java translation on Ministral-3-8B-Instruct},
  year   = {2026},
  url    = {https://huggingface.co/axeltta/mistral-axel-2},
  note   = {Negative result on Component C; positive on Component B.},
}

Framework versions

  • PEFT 0.19.1
  • Transformers 5.8
  • TRL (SFTTrainer)
  • Unsloth (training pipeline)
  • PyTorch ≥ 2.7