EuroLLM-9B-Teletype

A LoRA adapter that teaches EuroLLM-9B-Instruct to operate a POSIX shell synchronously, as a self-directed user. It lands in a session with no task in the prompt, finds its assignment in the environment, carries it out, and ends with exit or panic. The adapter installs an operating mechanism; it adds no world knowledge.

This is not a tool-using model. It is handed no typed API of functions to call. It writes plain-text shell commands at a real prompt; its action space is the entire system, discovered the way a person discovers it (--help, man, ls), not given to it as a schema.

EuroLLM-9B is the second base model in the experiment, and the awkward one. It is a multilingual European-language model whose training mass is natural-language prose across 35+ languages rather than the English-and-code web the other subjects share. That makes it the distributionally distant case, which tests whether operate-and-terminate is a property of the conversational frame or of the training diet.

Trained on tiararodney/posix-sdc v1.2.2 (787 verified, self-terminating shell trajectories whose labels come from a checker run against real filesystem state), via the sekft pipeline. It accompanies the experiment From seed to weights.

This is an adapter. The base model is referenced, not redistributed.

Why this model: from priming to weights

In the scrollback-priming study, EuroLLM-9B was the distributionally distant subject. Primed with synthetic scrollback alone, it operated the shell readily (command-mode went from 0 to 5/5 under the standalone-prompt seed; structure alone held a European-prose model in consistent POSIX syntax), but it almost never left: one clean exit in 35 runs. Its assistant persona kept it from reaching an ending that its embedding geometry already carries. See The flatness of exit: EuroLLM holds a clean exit-as-action basin (the act of leaving) across European languages.

This adapter tests the next step: whether fine-tuning installs the termination that priming did not reach. The representation is present and operation primes broadly; what priming could not do, on a model this far from the data, was close the session. The open question is whether the weights can.

The mechanism

In every session, whatever tools are present, the model runs one routine: expect an announcement of where directives live (a motd, an env var, a file, a provider program's --help), read that provider's self-documentation, retrieve the directives, carry them out, and stop.

A session ends in one of two ways. exit means the work is done. panic means the model is genuinely blocked and says so instead of faking a success. Both are trained behaviours rather than a stop token or a step cap.

The thesis (and how to falsify it)

This adapter is offered as evidence that operate-and-terminate is an archetype-independent, base-model-portable mechanism. Fine-tuning installs it so that it fires on task types never seen in training, even where task competence stays archetype-local. EuroLLM tests the portability prediction the first adapter raised: that base models differ in how readily they acquire the mechanism. A multilingual model with a small code share is the hard case for that prediction.

One hypothesis for why it transfers: it builds on a pretraining disposition that treats exit as a flat, ordinary ending and panic as the loaded one. That disposition is shared across models in the embedding geometry (exit a shared action basin, panic a shared non-basin), so fine-tuning supplies the behavioural permission to use it, which the persona otherwise withholds. The representation is already there.

How it was made

The data is generated rather than scraped or hand-written. A teacher model authors each scenario world and an operator model works inside it; the verifier is code. A trajectory is kept only if a checker, run against the container's final filesystem state, confirms the effect is present and the session ended cleanly. The transcript and the model's own claims are never used as the label.
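
To make "the verifier is code" concrete, here is a minimal sketch of state-based checkers for the two held-out task types named later in this card. It is an illustration only, not the posix-sdc checker: the function names, the file name and the rule that a clean ending means `exit` are all assumptions. The point it shows is that the label comes from the filesystem, never from the transcript.

```python
import os
import stat
import tempfile
from pathlib import Path
from typing import Optional


def check_text_replace(root: Path, rel_path: str, old: str, new: str) -> bool:
    """Pass only if the file exists, contains `new`, and no longer contains `old`."""
    target = root / rel_path
    if not target.is_file():
        return False
    text = target.read_text()
    return new in text and old not in text


def check_mode(root: Path, rel_path: str, expected_mode: int) -> bool:
    """Pass only if the permission bits equal `expected_mode` (for example 0o640)."""
    target = root / rel_path
    if not target.exists():
        return False
    return stat.S_IMODE(target.stat().st_mode) == expected_mode


def keep_trajectory(effect_ok: bool, terminal: Optional[str]) -> bool:
    """Assumed keep rule: effect verified on disk AND the session ended with `exit`."""
    return effect_ok and terminal == "exit"


if __name__ == "__main__":
    # Hypothetical scenario state, built in a throwaway directory (POSIX host assumed).
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        (root / "motd.txt").write_text("hello bar\n")
        os.chmod(root / "motd.txt", 0o640)
        print(check_text_replace(root, "motd.txt", "foo", "bar"))
        print(check_mode(root, "motd.txt", 0o640))
```

How the real pipeline labels trajectories that end in panic is not described here, so the sketch leaves that case out.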

The render contract: train = serve

The serving harness (ccpty) emits no text markers. It speaks the OpenAI chat-completions protocol and sends structured {role, content} messages (system orientation, environment output as user, the model's commands as assistant); the inference endpoint applies the model's own chat template. So this adapter is rendered with EuroLLM-9B-Instruct's default ChatML template, and training renders the trajectories the identical way.
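
The message shape described above can be sketched as follows. This is not ccpty's code: the event tuples, the function names and the model name in the usage line are hypothetical. Only the {role, content} structure and the role mapping (orientation as system, environment output as user, commands as assistant) come from this card.

```python
import json
from typing import Dict, List, Tuple

Message = Dict[str, str]


def to_messages(orientation: str, events: List[Tuple[str, str]]) -> List[Message]:
    """Map session events to chat messages.

    `events` is a hypothetical list of (source, text) pairs where source is
    "env" for terminal output and "model" for a command the model typed.
    """
    messages: List[Message] = [{"role": "system", "content": orientation}]
    for source, text in events:
        role = "assistant" if source == "model" else "user"
        messages.append({"role": role, "content": text})
    return messages


def to_request(model_name: str, messages: List[Message]) -> dict:
    """Build a chat-completions request body; no text markers are added here."""
    return {"model": model_name, "messages": messages}


if __name__ == "__main__":
    msgs = to_messages(
        "You are at a shell prompt.",  # placeholder orientation text
        [("env", "alice@sek:~$ "), ("model", "cat ~/ASSIGNMENTS")],
    )
    print(json.dumps(to_request("eurollm-teletype", msgs), indent=2))
```

Because the harness sends structure rather than rendered text, the chat template is applied in exactly one place, the inference endpoint.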

EuroLLM's ChatML does define a system role, unlike Mistral's template. For train/serve parity with the rest of the pipeline the same canonicalisation runs (normalize_for_template): the orientation is folded into the first user turn and consecutive environment turns are merged, so the render is identical whether the template's system role is used or not. Only the assistant turns (commands plus the terminal exit / panic) carry loss; environment turns are context. The render check confirmed the assistant-only mask derives cleanly on EuroLLM's tokenizer (no additivity violation, 31% of tokens trained).
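
The canonicalisation described above can be sketched in a few lines. This is a hedged re-creation, not the pipeline's normalize_for_template: the function name, the newline separator and the handling of a system message with no following user turn are assumptions.

```python
from typing import Dict, List

Message = Dict[str, str]


def normalize_sketch(messages: List[Message], sep: str = "\n") -> List[Message]:
    """Fold system text into the next user turn and merge consecutive user turns."""
    out: List[Message] = []
    pending_system: List[str] = []
    for m in messages:
        role, content = m["role"], m["content"]
        if role == "system":
            pending_system.append(content)
            continue
        if pending_system:
            if role == "user":
                # Orientation becomes the head of the first user turn.
                content = sep.join(pending_system + [content])
            else:
                # Assumption: orientation with no user turn becomes its own user turn.
                out.append({"role": "user", "content": sep.join(pending_system)})
            pending_system = []
        if out and out[-1]["role"] == "user" and role == "user":
            out[-1]["content"] += sep + content
        else:
            out.append({"role": role, "content": content})
    if pending_system:
        out.append({"role": "user", "content": sep.join(pending_system)})
    return out


if __name__ == "__main__":
    with_system = [
        {"role": "system", "content": "orientation"},
        {"role": "user", "content": "motd"},
        {"role": "user", "content": "alice@sek:~$ "},
        {"role": "assistant", "content": "ls"},
    ]
    print(normalize_sketch(with_system))
```

After this step no system message remains, which is why the render no longer depends on whether a template supports that role.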

Training


base	`utter-project/EuroLLM-9B-Instruct` (Apache-2.0, 9.15B)
method	QLoRA, 4-bit nf4 (the 9B base in 4-bit leaves most of the V100's 32 GB free for training)
LoRA	r=32, alpha=64, dropout=0.05, target `q_proj k_proj v_proj o_proj` + MLP `gate_proj up_proj down_proj`
objective	causal LM, assistant-only loss mask (commands + terminal token; labels for environment turns set to -100)
schedule	3 epochs, lr 2e-4, effective batch 8 (bsz 1 x accum 8), warmup 0.03, max len 4096
data	`tiararodney/posix-sdc` v1.2.2, 787 trajectories (785 usable; held-out archetypes excluded from the corpus)
hardware	single NVIDIA Tesla V100 32 GB (sm_70, fp16/4-bit; no bf16)
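
As a rough guide to what the schedule row implies, the arithmetic below turns it into optimizer steps. It is a sketch under stated assumptions: 785 usable trajectories with one example each, a partial final accumulation batch counted as a step, and warmup rounded up. The actual trainer may round differently, so treat the step counts as approximate.

```python
import math


def schedule(n_examples: int, per_device_batch: int, grad_accum: int,
             epochs: int, warmup_ratio: float) -> dict:
    """Derive step counts from the table's schedule values."""
    effective_batch = per_device_batch * grad_accum
    # Assumption: the last, partial batch of each epoch still takes a step.
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


if __name__ == "__main__":
    # bsz 1 x accum 8, 3 epochs, warmup 0.03, 785 usable trajectories
    print(schedule(785, 1, 8, 3, 0.03))
```

Under these assumptions the run is about 99 steps per epoch and 297 steps in total, with roughly 9 warmup steps.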

The first cut used the Mistral recipe verbatim (attention-only r=16). On this distant base that adapter operated but under-committed: it reached the tasks yet rarely terminated. The training loss showed it early. It started low (the corpus is not alien to EuroLLM) but floored high (~0.55), the signature of a model that is uncommitted rather than confused. Widening the adapter to r=32 with the MLP projections (about 3-4x the trainable parameters) barely moved the loss floor (to ~0.48) but lifted the behaviour sharply (see the eval below). Capacity was the first limiter, not the data. Computing the loss only on the assistant turns carries the rest: feed the environment turns into the loss and the model learns to hallucinate command output instead of producing commands.
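
The assistant-only mask mentioned above can be illustrated with plain lists. This is a toy sketch, not the sekft implementation: the token ids are made up, and it assumes each turn's tokens concatenate to the full render. The value -100 is the label PyTorch's cross-entropy loss ignores by default.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(turns: List[Tuple[str, List[int]]]) -> Tuple[List[int], List[int]]:
    """Concatenate turns; copy ids into labels only for assistant turns."""
    input_ids: List[int] = []
    labels: List[int] = []
    for role, ids in turns:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)  # commands and the terminal exit / panic carry loss
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # environment is context only
    return input_ids, labels


def trained_fraction(labels: List[int]) -> float:
    """Share of tokens that contribute to the loss."""
    if not labels:
        return 0.0
    return sum(1 for x in labels if x != IGNORE_INDEX) / len(labels)


if __name__ == "__main__":
    # Made-up token ids for a two-command session.
    turns = [
        ("user", [11, 12, 13, 14]),
        ("assistant", [21, 22]),
        ("user", [31, 32, 33]),
        ("assistant", [41]),
    ]
    ids, labels = build_labels(turns)
    print(labels)
    print(trained_fraction(labels))
```

With the mask in place the model is never rewarded for predicting command output, only for choosing the next command.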

Evaluation: held-out generalization

The metric that matters is behavioural, and held out by whole archetype. Two task types (text_replace, permissions) are excluded from training entirely; the adapter is then dropped into them with no scaffold, and a checker grades the final filesystem state.
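
Holding out by whole archetype, rather than by random sample, can be sketched as a filter on a per-trajectory label. The `archetype` field name and the non-held-out archetype names in the usage lines are hypothetical; only text_replace and permissions are named in this card.

```python
from typing import Dict, FrozenSet, List, Tuple

HELD_OUT: FrozenSet[str] = frozenset({"text_replace", "permissions"})


def split_by_archetype(
    trajectories: List[Dict[str, str]],
    held_out: FrozenSet[str] = HELD_OUT,
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]]]:
    """Return (train, heldout); no held-out archetype may appear in train."""
    train = [t for t in trajectories if t["archetype"] not in held_out]
    heldout = [t for t in trajectories if t["archetype"] in held_out]
    return train, heldout


if __name__ == "__main__":
    corpus = [
        {"id": "a", "archetype": "file_move"},      # made-up archetype name
        {"id": "b", "archetype": "text_replace"},
        {"id": "c", "archetype": "archive_unpack"},  # made-up archetype name
        {"id": "d", "archetype": "permissions"},
    ]
    train, heldout = split_by_archetype(corpus)
    print([t["id"] for t in train], [t["id"] for t in heldout])
```

A random split would let the model see examples of every task type, so it could not separate the mechanism from memorised task competence.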

On 16 held-out scenarios (8 per archetype):

metric	base	+ adapter
operate_rate (reaches command-mode and drives the shell)	0.00	1.00
terminate_rate (emits `exit` / `panic`)	0.00	0.69
verified_rate (checker passes)	0.44	0.75
clean (success or correct-panic)	0 / 16	6 / 16

Reading it. Under scrollback priming, EuroLLM operated the shell readily but almost never terminated (one clean exit in 35 runs). Fine-tuning installed it: terminate_rate rises from ~0 to 0.69 (11/16 emit a terminal). The weights gave the multilingual-prose model the ending its persona withheld under priming, the ending its embedding geometry already carried.

operate_rate 1.0 matches Mistral's: dropped into two task types it never trained on, with no scaffold, EuroLLM drove the shell every time. The operate half of the mechanism is fully base-model-portable, even to a European-language model with a small code share.

Competence is real and improving, and the lift came from adapter capacity rather than more data. Widening to r=32 with the MLP projections took text_replace, the harder multi-step archetype, from 1/8 to 4/8 success, and raised overall clean from 4/16 to 6/16 and verified_rate to 0.75. On permissions the model achieves the effect on all 8 (every run verified); the losses there are mis-termination, not incompetence. It commits to ending but picks the wrong ending: it panics on 3 scenarios it had already completed, exits cleanly on 2, and runs to the cap on 3. The live frontier has moved from "won't terminate" to exit-vs-panic discrimination on work that is already done.
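
The permissions breakdown above can be re-tallied with a small aggregation sketch. The record fields and the rule for "clean" are assumptions made for illustration (an exit counts as clean only if the checker passed, a panic only if the run was genuinely blocked); sekft-eval's actual record format is not shown in this card.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    operated: bool            # reached command-mode and drove the shell
    terminal: Optional[str]   # "exit", "panic", or None if it hit the step cap
    verified: bool            # checker passed on the final filesystem state
    blocked: bool = False     # hypothetical flag: the task was truly impossible


def is_clean(run: Run) -> bool:
    """Success (verified exit) or correct panic (panic while genuinely blocked)."""
    if run.terminal == "exit":
        return run.verified
    if run.terminal == "panic":
        return run.blocked
    return False


def summarize(runs: List[Run]) -> dict:
    n = len(runs)
    return {
        "operate_rate": sum(r.operated for r in runs) / n,
        "terminate_rate": sum(r.terminal is not None for r in runs) / n,
        "verified_rate": sum(r.verified for r in runs) / n,
        "clean": sum(is_clean(r) for r in runs),
        "n": n,
    }


if __name__ == "__main__":
    # The 8 permissions runs as described in the text: all verified,
    # 3 panicked after finishing, 2 exited cleanly, 3 ran to the cap.
    permissions = (
        [Run(True, "panic", True)] * 3
        + [Run(True, "exit", True)] * 2
        + [Run(True, None, True)] * 3
    )
    print(summarize(permissions))
```

Under these assumptions the permissions archetype contributes 2 clean runs, 5 terminations and 8 verified effects, which is consistent with the overall 6/16 clean once the 4 text_replace successes are added.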

For the base/adapter contrast: the bare base (EuroLLM-9B-Instruct, no adapter, same harness, same 16 scenarios) scores 0/16 clean, operate_rate 0.00, terminate_rate 0.00. It never reaches clean command-mode and never terminates; it chatters prose and runs to the step cap on all 16. Its one non-zero column is verified_rate 0.44, entirely permissions (7/8 verified, text_replace 0/8): a one-line chmod effect that even prose-contaminated output stumbles onto. The adapter installs operation (0 to 1.00), termination (0 to 0.69), and clean task completion (0 to 6/16). The adapter is the only thing that changed between the two runs.

Use with transformers + PEFT

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "utter-project/EuroLLM-9B-Instruct"
tok = AutoTokenizer.from_pretrained(BASE)
# Load the referenced base model in fp16, then attach the LoRA adapter on top.
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.float16,
                                            device_map="auto")
model = PeftModel.from_pretrained(base, "tiararodney/EuroLLM-9B-Teletype")
model.eval()

# The first user turn is what the session shows on login: banner, motd, prompt.
# No task is stated; the motd only announces where the assignments live.
messages = [
    {"role": "user",
     "content": "sek 0.1.0  host: sek  user: alice  shell: /bin/dash\n"
                "Welcome, alice. Your assignments live in ~/ASSIGNMENTS.\n"
                "alice@sek:~$ "},
]
# Render with the base model's own chat template, the same way training did.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
ids = tok(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; one short command is expected per turn.
out = model.generate(**ids, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens.
print(tok.decode(out[0, ids.input_ids.shape[1]:], skip_special_tokens=True))
# -> the next command, e.g. `cat ~/ASSIGNMENTS`

Drive it in a loop: render history with the chat template, generate one command, run it in a real shell, append the output as a user turn, repeat until the model emits exit or panic.
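
The loop just described can be sketched with the model call abstracted away. Everything here is a simplification under stated assumptions: `generate` is any callable you write that wraps the snippet above and returns one command string, each command runs in a fresh /bin/sh (so `cd` and shell variables do not persist, unlike a real pty session), the prompt string is a placeholder, and the step cap of 32 is arbitrary.

```python
import subprocess
from typing import Callable, Dict, List, Optional, Tuple

Message = Dict[str, str]


def run_in_shell(command: str, timeout: float = 30.0) -> str:
    """Run one command in a fresh POSIX shell and return stdout plus stderr."""
    proc = subprocess.run(["/bin/sh", "-c", command],
                          capture_output=True, text=True, timeout=timeout)
    return proc.stdout + proc.stderr


def terminal_of(command: str) -> Optional[str]:
    """Return "exit" or "panic" if the command's first word is a terminal."""
    words = command.strip().split()
    if words and words[0] in ("exit", "panic"):
        return words[0]
    return None


def drive(generate: Callable[[List[Message]], str],
          first_screen: str,
          run: Callable[[str], str] = run_in_shell,
          prompt: str = "$ ",
          max_steps: int = 32) -> Tuple[Optional[str], List[Message]]:
    """Alternate model commands and shell output until exit, panic, or the cap."""
    messages: List[Message] = [{"role": "user", "content": first_screen}]
    for _ in range(max_steps):
        command = generate(messages).strip()
        messages.append({"role": "assistant", "content": command})
        terminal = terminal_of(command)
        if terminal is not None:
            return terminal, messages  # the model ended the session itself
        output = run(command)
        messages.append({"role": "user", "content": output + prompt})
    return None, messages  # ran to the step cap without terminating


if __name__ == "__main__":
    # Scripted stand-in for the model, so the loop can be tried without a GPU.
    script = iter(["echo hello", "exit"])
    terminal, history = drive(lambda msgs: next(script), "alice@sek:~$ ")
    print(terminal, len(history))
```

Only run model-generated commands inside a disposable container or VM; the loop executes whatever the model types.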

Use with Ollama

The included Modelfile applies this adapter over the base as a GGUF LoRA and relies on the base's default ChatML template and EOS. The converted adapter, teletype-lora-f16.gguf, ships in this repo (regenerate it with llama.cpp convert_lora_to_gguf.py if you prefer), so just:

ollama create eurollm-teletype -f Modelfile

Reproduction

# train (pulls the corpus from the Hub; held-out archetypes are already excluded)
sekft-train --hub --base utter-project/EuroLLM-9B-Instruct --out ./ckpt \
            --load-4bit --epochs 3

# evaluate behaviourally on held-out scenarios
sekft-eval --base utter-project/EuroLLM-9B-Instruct --adapter ./ckpt \
           --scenarios ./holdout-scenarios --n 16

The figures in figures/ regenerate from their committed sources (*.puml via PlantUML, *.gp via gnuplot).

Limitations

Small evaluation: n=16 held-out, two archetypes. The numbers are a signal, not a benchmark.
One dataset, one teacher / operator; a single training run per base model.
Installs the mechanism, not competence. It reliably operates and, less reliably, terminates; it does not make the base solve arbitrary unseen task types correctly.
Trained in dash on Alpine; command semantics may differ on another target.
The render must match between training and serving. The adapter is served with the base model's default ChatML template over the OpenAI protocol (via ccpty), so any further fine-tuning should use that same template (apply_chat_template), not a custom one, or behaviour degrades.
Trained with 4-bit QLoRA in fp16 on a V100 (no bf16). The base is multilingual, but the trajectories are English-prompted, so non-English shell operation is untested.

License and citation

The adapter weights are released under Apache-2.0, consistent with the base model. The training data (posix-sdc) is CC-BY-4.0; attribute "posix-sdc by Tiara Rodney" if you build on it.

@misc{eurollm-teletype,
  title  = {EuroLLM-9B-Teletype: a self-directed shell-operation adapter for EuroLLM-9B},
  author = {Rodney, Tiara},
  year   = {2026},
  howpublished = {Hugging Face PEFT adapter, tiararodney/EuroLLM-9B-Teletype}
}
