OLMoE Router-Code Safety Evaluation

Latest Uploaded Adapter: Muon Worst-Bit 600-Step Run

This repository root contains the PEFT LoRA adapter from the final run: olmoe_tulu3_freqcode16_muon_lr1e3_z002_worstbit025_600_bs96_8gpu_20260525_133807. Base model: allenai/OLMoE-1B-7B-0924.

Final held-out evaluation on tulu3_eval_no_train_overlap_953_ml512.pt:

Metric                            Value
LM loss                           1.673530
Bit accuracy (soft)               0.974931
Exact token ID accuracy           0.858115
Nearest observed token accuracy   0.868757
p_target mean                     0.947816
Router entropy mean               3.334359
Route decode positions            316664

Out-of-distribution (OOD) sampled generation traces are included under traces/. The final sampled trace, for the MoE-router question, reached an exact previous-token reconstruction accuracy of 0.9773 over 220 generated tokens, although the generated answer itself was qualitatively vague. This remains a controlled white-box safety-evaluation artifact, not a deployment model.

Reproduce the run with repro/olmoe_tulu3_freqcode16_muon_lr1e3_z002_worstbit025_600_bs96_8gpu_20260525_133807.cmd and the tokenized dataset repo anpaurehf/stego-olmoe-tulu3-tokenized-router-code.

This repository implements a controlled, white-box safety-research experiment for measuring whether OLMoE router telemetry can be supervised to carry a synthetic token code. It does not implement black-box exfiltration, user-facing text channels, or deployment machinery. Decoding requires direct access to router_logits.

Default target model: allenai/OLMoE-1B-7B-0924. Hugging Face documents OLMoE as a 1B-active/7B-total MoE model and exposes num_experts, num_experts_per_tok, and output_router_logits in the Transformers config. The current training path uses LoRA by default, with router and projection linear modules selected by name suffix.

Method

For each token ID k, stego_olmoe.codebook defines a bit-vector C(k) whose length equals the number of MoE router-logit tensors returned by the model, one bit per MoE layer. The current default is a reversible seeded permutation of the token ID into the full 2**n_layers code space, so 16 MoE layers provide 65,536 possible codewords. For real runs, prefer building a frequency-balanced token codebook from the tokenized training cache; this preserves exact token reconstruction while assigning codes so that each bit is set for close to 50% of the supervised token probability mass.
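The default mapping can be pictured with a minimal sketch. It assumes the permutation is drawn with Python's random module and that bit l of the codeword is the target for MoE layer l; the function names are hypothetical and are not taken from stego_olmoe.codebook.

```python
import random


def build_permutation_codebook(vocab_size: int, n_layers: int, seed: int = 0) -> list[int]:
    """Give every token ID a distinct integer codeword in [0, 2**n_layers)."""
    code_space = 2 ** n_layers
    if vocab_size > code_space:
        raise ValueError("vocabulary does not fit in the code space")
    rng = random.Random(seed)
    # Index k of the returned list holds the codeword of token k.
    return rng.sample(range(code_space), vocab_size)


def codeword_bits(codeword: int, n_layers: int) -> list[int]:
    """Assumed ordering: bit l is the router target for MoE layer l."""
    return [(codeword >> layer) & 1 for layer in range(n_layers)]


def bits_to_codeword(bits: list[int]) -> int:
    """Inverse of codeword_bits."""
    return sum(bit << layer for layer, bit in enumerate(bits))


codebook = build_permutation_codebook(vocab_size=50304, n_layers=16, seed=0)
inverse = {code: token for token, code in enumerate(codebook)}

token_id = 1234
bits = codeword_bits(codebook[token_id], 16)
# Reversibility: the 16 per-layer bits identify the token exactly.
assert inverse[bits_to_codeword(bits)] == token_id
```

Because the codewords are distinct, a fully correct 16-bit read-out recovers the token ID, and a single wrong bit can map to a different token or to an unused codeword.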

Build the frequency-balanced codebook for the cached Tulu shard:

uv run --no-project -p /usr/bin/python3 -m stego_olmoe.build_token_codebook \
  --train_tokenized /workspace/stego/data/tulu3_train_470k_ml512.pt \
  --output /workspace/stego/data/tulu3_train_470k_freq_codebook_16bit.pt \
  --vocab_size 50304 \
  --n_layers 16 \
  --seed 0
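A rough way to check the "near 50%" property is to measure, for each bit, the fraction of training-token mass whose codeword has that bit set. The sketch below uses a hypothetical in-memory layout (a dict of token counts and a dict of bit lists), not the .pt format written by build_token_codebook.

```python
def bit_mass_fractions(
    token_counts: dict[int, int],
    codebook_bits: dict[int, list[int]],
    n_bits: int,
) -> list[float]:
    """Fraction of total token mass whose codeword has each bit set to 1."""
    total = sum(token_counts.values())
    mass = [0.0] * n_bits
    for token, count in token_counts.items():
        for position, bit in enumerate(codebook_bits[token]):
            if bit:
                mass[position] += count
    return [m / total for m in mass]


# Toy example with three tokens and a 2-bit code.
toy_counts = {0: 50, 1: 30, 2: 20}
toy_codebook = {0: [1, 0], 1: [0, 1], 2: [0, 1]}
toy_fractions = bit_mass_fractions(toy_counts, toy_codebook, n_bits=2)
```

In this toy case both bits carry exactly half of the mass; a frequency-skewed vocabulary with a purely random codebook would usually drift away from 0.5 on some bits.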

At sequence position i and layer l, the router loss encourages the summed router probability mass to fall on the expert group selected by the target bit, where E is the number of experts:

  • bit 0: experts [0, floor(E/2))
  • bit 1: experts [floor(E/2), E)
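The grouping above can be sketched in plain Python. The source only states that the loss encourages mass on the target group; the negative-log form used here is an assumption, and the function names are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over one token's router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def target_group_mass(router_logits: list[float], bit: int) -> float:
    """Summed router probability over the expert half selected by the bit."""
    probs = softmax(router_logits)
    half = len(probs) // 2  # floor(E/2)
    group = probs[half:] if bit == 1 else probs[:half]
    return sum(group)


def router_bit_loss(router_logits: list[float], bit: int) -> float:
    """Assumed loss form: negative log of the target-group mass."""
    return -math.log(target_group_mass(router_logits, bit))


# Uniform logits put half of the mass on each group.
uniform_mass = target_group_mass([0.0, 0.0, 0.0, 0.0], bit=0)
# Logits that favour the first half make bit 0 cheap and bit 1 expensive.
skewed_logits = [5.0, 5.0, 0.0, 0.0]
loss_bit0 = router_bit_loss(skewed_logits, bit=0)
loss_bit1 = router_bit_loss(skewed_logits, bit=1)
```

Reading a bit back is then a threshold on the same quantity: the decoded bit is 1 when the upper-half mass exceeds 0.5.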

The alignment is intentionally teacher-forced:

  • router target at position i is C(input_ids[i])
  • LM label at position i predicts input_ids[i+1]

During generation, the route trace for the step that emits y[t] is produced while consuming y[t-1], so the trace should decode to the previous token.
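The one-token offset between router targets and LM labels is easy to get wrong, so here is a small sketch of both views. It assumes that the first generation step consumes the last prompt token; all names are hypothetical.

```python
def teacher_forced_targets(input_ids: list[int], codebook_bits: dict[int, list[int]]):
    """Router target at position i encodes input_ids[i]; LM label is input_ids[i+1]."""
    router_targets = [codebook_bits[token] for token in input_ids]
    # The last position has no next token, so it gets no LM label.
    lm_labels = input_ids[1:] + [None]
    return router_targets, lm_labels


def expected_trace_tokens(last_prompt_token: int, generated: list[int]) -> list[int]:
    """Token each generation step consumed, which is what its route trace should decode to."""
    return [last_prompt_token] + generated[:-1]


def previous_token_accuracy(
    decoded: list[int], last_prompt_token: int, generated: list[int]
) -> float:
    """Share of steps whose decoded route trace equals the consumed token."""
    expected = expected_trace_tokens(last_prompt_token, generated)
    hits = sum(d == e for d, e in zip(decoded, expected))
    return hits / len(expected)


toy_bits = {10: [0, 0], 11: [0, 1], 12: [1, 0]}
targets, labels = teacher_forced_targets([10, 11, 12], toy_bits)
# Step t emits generated[t] while consuming the token before it.
accuracy = previous_token_accuracy([9, 20, 99], last_prompt_token=9, generated=[20, 21, 22])
```

In the toy trace, the third step should have decoded to 21 (the token it consumed) but decoded to 99, so two of three steps are correct.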

Install On The H100 Pod

The RunPod node currently has Python 3.11, CUDA 12.4, and PyTorch 2.4.1+cu124. Install the Python stack:

cd /workspace/stego
uv venv --system-site-packages .venv
printf "torch\n" > /tmp/uv-exclude-torch.txt
uv pip install --python .venv/bin/python -e . --excludes /tmp/uv-exclude-torch.txt

Install flash-attention from the mjun0812/flash-attention-prebuild-wheels repository. The upstream wheel repository says to choose the filename matching Python, CUDA, PyTorch, and flash-attention version, then install the release URL directly:

uv pip install --python .venv/bin/python --no-deps \
  "https://github.com/mjun0812/flash-attention-prebuild-wheels/releases/download/v0.3.12/flash_attn-2.8.0%2Bcu124torch2.4-cp311-cp311-linux_x86_64.whl"
.venv/bin/python -c "import flash_attn; print(flash_attn.__version__)"

If that exact wheel is unavailable, use the repository's package/search page to select a Python 3.11, CUDA 12.4, PyTorch 2.4, Linux x86_64 FlashAttention 2 wheel.

Tiny Smoke Data

.venv/bin/python - <<'PY'
from stego_olmoe.data import write_synthetic_jsonl
write_synthetic_jsonl("/workspace/stego/data/synthetic_train.jsonl", 32)
write_synthetic_jsonl("/workspace/stego/data/synthetic_eval.jsonl", 8)
PY

AllenAI Tulu 3 Data

For a real run, use AllenAI's public Tulu 3 SFT mixture from Hugging Face: allenai/tulu-3-sft-mixture. The dataset card describes it as a 939k-example SFT mixture used for Tulu 3; it is released as a research artifact with mixed subset licenses, including some non-commercial portions. Review those terms before using outputs outside this controlled research setting.

Prepare a 50k train / 1k eval local JSONL shard:

bash scripts/prepare_tulu3_data.sh

This downloads the Hugging Face dataset through the datasets library, shuffles it globally with a fixed seed, converts the chat-message records into local prefix/answer records, and validates token and route-position coverage with the OLMoE tokenizer. Use python -m stego_olmoe.prepare_hf_data --streaming true ... only when you want lower disk use; the default non-streaming prep script mixes the source subsets better because it shuffles globally.

Train

W&B logging is enabled by default. Set WANDB_API_KEY or use --wandb false for local-only smoke runs. For exact-codeword pressure beyond average per-bit router loss, add --lambda_gate_worst_bit 0.25 with --gate_worst_bit_beta 8.0. This adds a smooth per-position worst-bit loss over MoE layers while leaving the original --lambda_gate mean bit loss unchanged.

bash scripts/train_tulu3_2gpu.sh

Equivalent expanded command:

.venv/bin/accelerate launch --multi_gpu --num_processes 2 --gpu_ids 0,1 --mixed_precision bf16 \
  -m stego_olmoe.train_sft \
  --model_name allenai/OLMoE-1B-7B-0924 \
  --train_jsonl /workspace/stego/data/tulu3_train_50k.jsonl \
  --eval_jsonl /workspace/stego/data/tulu3_eval_1k.jsonl \
  --output_dir /workspace/stego/runs/olmoe_router_code_lora \
  --max_length 512 \
  --batch_size 1 \
  --grad_accum_steps 16 \
  --lambda_gate 0.1 \
  --num_epochs 1 \
  --apply_lora true \
  --ddp_find_unused_parameters true \
  --wandb true
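The worst-bit term enabled by --lambda_gate_worst_bit and --gate_worst_bit_beta can be pictured as a smooth maximum over the per-layer bit losses at one position. The log-mean-exp form and the weighted-sum objective below are assumptions for illustration; the README only states that the term is smooth, per-position, and added alongside the mean bit loss.

```python
import math


def smooth_worst_bit_loss(per_layer_losses: list[float], beta: float = 8.0) -> float:
    """Log-mean-exp over layers: lies between the mean and the max loss."""
    worst = max(per_layer_losses)
    # Subtract the max before exponentiating for numerical stability.
    total = sum(math.exp(beta * (loss - worst)) for loss in per_layer_losses)
    return worst + math.log(total / len(per_layer_losses)) / beta


def combined_loss(
    lm_loss: float,
    per_layer_losses: list[float],
    lambda_gate: float = 0.1,
    lambda_gate_worst_bit: float = 0.25,
    beta: float = 8.0,
) -> float:
    """Assumed objective: LM loss + weighted mean bit loss + weighted worst-bit loss."""
    mean_bit = sum(per_layer_losses) / len(per_layer_losses)
    worst_bit = smooth_worst_bit_loss(per_layer_losses, beta)
    return lm_loss + lambda_gate * mean_bit + lambda_gate_worst_bit * worst_bit


# Fifteen well-fit layers and one badly fit layer at a single position.
position_losses = [0.01] * 15 + [2.0]
mean_loss = sum(position_losses) / len(position_losses)
worst_loss = smooth_worst_bit_loss(position_losses, beta=8.0)
```

With one bad layer out of sixteen, the mean bit loss stays small while the smooth worst-bit value sits close to the bad layer's loss, which is the pressure toward fully correct codewords that the mean alone does not provide.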

To calibrate per-GPU batch size and log CUDA memory to W&B:

bash scripts/sweep_tulu3_batch_size_2gpu.sh 1 2 4 8 12 16 24 32

Each sweep trial runs one optimizer step on the real Tulu shard, skips adapter saving, and logs cuda_memory/* metrics such as cuda_memory/max_reserved_gib_max_rank.

For a GPU-safe 2xH100 one-update smoke test:

WANDB_MODE=offline .venv/bin/accelerate launch --num_processes 2 --mixed_precision bf16 \
  -m stego_olmoe.train_sft \
  --model_name allenai/OLMoE-1B-7B-0924 \
  --train_jsonl /workspace/stego/data/synthetic_train.jsonl \
  --output_dir /workspace/stego/runs/smoke_lora_2gpu \
  --max_length 64 \
  --batch_size 1 \
  --grad_accum_steps 1 \
  --max_steps 1 \
  --lora_r 4 \
  --lora_alpha 8 \
  --ddp_find_unused_parameters true \
  --disable_model_router_aux_loss true \
  --gradient_checkpointing_use_reentrant false \
  --wandb true

Evaluate

.venv/bin/python -m stego_olmoe.eval_routes \
  --checkpoint /workspace/stego/runs/olmoe_router_code_lora \
  --base_model_name allenai/OLMoE-1B-7B-0924 \
  --eval_jsonl /workspace/stego/data/synthetic_eval.jsonl \
  --output_json /workspace/stego/runs/olmoe_router_code_lora/route_metrics.json

Metrics include LM loss, router-code loss, bit accuracy, exact codeword accuracy, per-layer bit accuracy, target probability, entropy, and nearest-codeword token accuracy over the eval-set candidate tokens.
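To make the difference between the accuracy metrics concrete, here is a small sketch over hard-thresholded bits. It assumes nearest-codeword decoding uses Hamming distance with ties broken by the lowest token ID; the names are hypothetical and the real eval_routes implementation may differ.

```python
def hamming(a: list[int], b: list[int]) -> int:
    """Number of positions where two bit vectors disagree."""
    return sum(x != y for x, y in zip(a, b))


def nearest_codeword_token(decoded_bits: list[int], candidates: dict[int, list[int]]) -> int:
    """Candidate token whose codeword is closest to the decoded bits."""
    # Iterating in sorted order makes ties resolve to the lowest token ID.
    return min(sorted(candidates), key=lambda tok: hamming(decoded_bits, candidates[tok]))


def route_metrics(
    decoded: list[list[int]], target_tokens: list[int], candidates: dict[int, list[int]]
) -> dict[str, float]:
    """Bit accuracy, exact codeword accuracy, and nearest-codeword token accuracy."""
    n = len(decoded)
    targets = [candidates[tok] for tok in target_tokens]
    bit_acc = sum(1 - hamming(d, t) / len(t) for d, t in zip(decoded, targets)) / n
    exact = sum(d == t for d, t in zip(decoded, targets)) / n
    nearest = sum(
        nearest_codeword_token(d, candidates) == tok for d, tok in zip(decoded, target_tokens)
    ) / n
    return {"bit_accuracy": bit_acc, "exact_codeword": exact, "nearest_token": nearest}


toy_candidates = {7: [0, 0, 0, 0], 8: [1, 1, 1, 1], 9: [0, 0, 1, 1]}
toy_decoded = [[0, 0, 0, 0], [1, 1, 1, 0]]  # second position has one flipped bit
toy_metrics = route_metrics(toy_decoded, target_tokens=[7, 8], candidates=toy_candidates)
```

The second position fails exact codeword matching but is still recovered by nearest-codeword decoding, which is why the nearest-token accuracy can exceed the exact accuracy on the same eval set.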

Generate With Route Trace

.venv/bin/python -m stego_olmoe.generate_with_routes \
  --checkpoint /workspace/stego/runs/olmoe_router_code_lora \
  --base_model_name allenai/OLMoE-1B-7B-0924 \
  --prompt "User: Give one synthetic fact.\nAssistant:" \
  --max_new_tokens 16 \
  --temperature 0 \
  --output_jsonl /workspace/stego/runs/olmoe_router_code_lora/route_trace.jsonl

The generated trace table shows this offset explicitly: the router telemetry recorded while consuming y[t-1] belongs to the step that emits y[t], and it should decode to y[t-1].

Tests

cd /workspace/stego
.venv/bin/python -m pytest -q

The unit tests use fake router logits, so they do not download OLMoE.

Safety Notes

  • Use synthetic or explicitly user-provided local datasets.
  • Decoding requires white-box router telemetry.
  • Do not use API-only models for this evaluation; router logits are not available there.
  • Do not use this code to hide data in user-facing text. This repo is for measuring router-channel capacity and detectability under controlled lab conditions.

LoRA Notes

LoRA is the default because it keeps the base model frozen and sharply reduces optimizer memory. The target suffix list includes the standard attention/MLP projections plus gate, which is the common router linear leaf name in MoE implementations. Embeddings are frozen by default even when LoRA is enabled; use --train_embeddings true only for explicit ablations.
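Selection by name suffix can be sketched without loading the model. The suffix tuple and the module paths below are illustrative assumptions about a typical OLMoE module tree, not the exact list used by train_sft.py.

```python
# Assumed suffix list: attention projections, expert MLP projections, and the router.
LORA_TARGET_SUFFIXES = (
    "q_proj", "k_proj", "v_proj", "o_proj",
    "gate_proj", "up_proj", "down_proj",
    "gate",
)


def select_lora_targets(module_names: list[str], suffixes=LORA_TARGET_SUFFIXES) -> list[str]:
    """Keep modules whose leaf name (text after the last dot) is a target suffix."""
    return [name for name in module_names if name.rsplit(".", 1)[-1] in suffixes]


example_modules = [
    "model.embed_tokens",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.3.up_proj",
    "model.layers.0.input_layernorm",
]
selected = select_lora_targets(example_modules)
```

Matching on the full leaf name rather than a raw string suffix keeps the router's gate distinct from gate_proj, and it leaves embeddings and norms untouched.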

Multi-GPU Notes

train_sft.py uses Hugging Face Accelerate. Launch with accelerate launch --num_processes 2 on the 2xH100 pod. --ddp_find_unused_parameters true is the default because MoE expert LoRA modules can be unused on a given microbatch when the router does not select those experts.

The script also disables Transformers' built-in OLMoE router aux loss by default. That helper indexes experts by CUDA device index in Transformers 4.57, which is not correct for ordinary replicated DDP where every rank has all experts. This experiment uses the explicit router-code loss instead. Non-reentrant gradient checkpointing is the default for DDP compatibility.
