Paraphraser Checkpoints

This model collection contains paraphraser checkpoints trained for research on cost-escalation attacks against LLM routers. Each model rewrites an input prompt so that its meaning is preserved while the router is steered toward a stronger, more expensive model.

All checkpoints are based on humarin/chatgpt_paraphraser_on_T5_base and are saved in standard transformers format.

Models

Uploaded checkpoints are expected under checkpoints/<MODEL_ID>/.

| Model ID | Type | Main difference from base training config |
| --- | --- | --- |
| BASE | Baseline | Untrained upstream paraphraser used as baseline. |
| FINETUNED | Final model | Final multi-router RL model using the base config. |
| AGGRESSIVE | Aggressive model | Aggressive variant tuned for higher internal attack success. |
| ADDITIVE_ABLATION | Reward ablation | reward.sim_gate=false, making the routing and similarity reward additive. |
| LOW_W_LEN_ABLATION | Reward ablation | reward.w_len=0.1 instead of the default 0.3. |
| LOW_SIM_FLOOR_ABLATION | Reward ablation | reward.similarity.sim_floor=0.8. |
| HIGH_SIM_FLOOR_ABLATION | Reward ablation | reward.similarity.sim_floor=0.95. |
| NO_FLIP_BONUS_ABLATION | Reward ablation | reward.routing.flip_bonus=0. |
| NO_NLI_ABLATION | Reward ablation | reward.similarity.use_nli=false. |
| ZERO_W_SIM_ADDITIVE_ABLATION | Reward ablation | Removes the similarity term with reward.w_sim=0 and reward.sim_gate=false. |
| NO_CURRICULUM_ABLATION | Training ablation | curriculum.enabled=false. |
| SIMULTANEOUS_RL_ABLATION | Training ablation | router_schedule.mode=simultaneous, so all training routers contribute to the reward at once. |
| BERT_ONLY | Router-specific model | Best checkpoint trained against the RouteLLM BERT router. Sweep label: nli_simfloor0.9_wlen2_continuous_lowertemp. |
| CHAYAN_ONLY | Router-specific model | Best checkpoint trained against the Chayan router. Sweep label: sim_gate. |
| CAUSAL_ONLY | Router-specific model | Conservative best checkpoint trained against the RouteLLM causal router. Sweep label: beta001_floor085. |
| CAUSAL_ONLY_AGGRESSIVE | Router-specific model | Aggressive checkpoint trained against the RouteLLM causal router. |
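
The dotted overrides in the table read naturally as paths into a nested training config. The sketch below shows one way to apply them on top of the defaults listed under Training Setup. The nesting of keys that the card gives without a prefix (for example `w_route` under `reward`), the `"epoch"` schedule name, and the helper names `apply_overrides` and `config_for` are assumptions made for illustration, not the project's actual config code.

```python
import copy

# Defaults taken from the Training Setup table. The exact nesting is an
# assumption inferred from the dotted keys used in the model table.
BASE_CONFIG = {
    "reward": {
        "w_route": 1.0,
        "w_sim": 1.0,
        "w_len": 0.3,
        "sim_gate": True,
        "similarity": {"sim_floor": 0.9, "use_nli": True},
        "routing": {"mode": "score_delta", "flip_bonus": 1.0},
    },
    "curriculum": {"enabled": True, "order": "easy_to_hard"},
    # The card does not name the default schedule mode; "epoch" is a placeholder.
    "router_schedule": {"mode": "epoch"},
}

# Overrides copied from the ablation rows of the model table.
ABLATION_OVERRIDES = {
    "ADDITIVE_ABLATION": {"reward.sim_gate": False},
    "LOW_W_LEN_ABLATION": {"reward.w_len": 0.1},
    "LOW_SIM_FLOOR_ABLATION": {"reward.similarity.sim_floor": 0.8},
    "HIGH_SIM_FLOOR_ABLATION": {"reward.similarity.sim_floor": 0.95},
    "NO_FLIP_BONUS_ABLATION": {"reward.routing.flip_bonus": 0},
    "NO_NLI_ABLATION": {"reward.similarity.use_nli": False},
    "ZERO_W_SIM_ADDITIVE_ABLATION": {"reward.w_sim": 0, "reward.sim_gate": False},
    "NO_CURRICULUM_ABLATION": {"curriculum.enabled": False},
    "SIMULTANEOUS_RL_ABLATION": {"router_schedule.mode": "simultaneous"},
}


def apply_overrides(config, overrides):
    """Return a deep copy of config with each dotted-key override applied."""
    result = copy.deepcopy(config)
    for dotted_key, value in overrides.items():
        *parents, leaf = dotted_key.split(".")
        node = result
        for name in parents:
            node = node.setdefault(name, {})
        node[leaf] = value
    return result


def config_for(model_id):
    """Build the config for a model ID; IDs without overrides get the base config."""
    return apply_overrides(BASE_CONFIG, ABLATION_OVERRIDES.get(model_id, {}))
```

For example, `config_for("LOW_SIM_FLOOR_ABLATION")["reward"]["similarity"]["sim_floor"]` evaluates to `0.8`, while `BASE_CONFIG` itself is left unchanged.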

Loading

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

repo_id = "your-org/your-model-repo"
model_id = "FINETUNED"

tokenizer = AutoTokenizer.from_pretrained(
    repo_id,
    subfolder=f"checkpoints/{model_id}",
)
model = AutoModelForSeq2SeqLM.from_pretrained(
    repo_id,
    subfolder=f"checkpoints/{model_id}",
)

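# The "paraphrase: " prefix matches the input prefix used in training.
prompt = "paraphrase: What country hosted the 2014 FIFA World Cup?"
# Truncate to the 128-token input length used in training.
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=128)
# Sample with the same temperature and top-p as the training rollouts.
outputs = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=True,
    temperature=1.1,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))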

For the baseline model, load the upstream model directly:

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

repo_id = "humarin/chatgpt_paraphraser_on_T5_base"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForSeq2SeqLM.from_pretrained(repo_id)
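
When looping over several model IDs, the two loading paths above can be folded into one small helper. This is a minimal sketch: `resolve_checkpoint` is a hypothetical name, and `your-org/your-model-repo` is the same placeholder used in the loading example, not a real repository.

```python
UPSTREAM_REPO = "humarin/chatgpt_paraphraser_on_T5_base"


def resolve_checkpoint(model_id, repo_id="your-org/your-model-repo"):
    """Return (repo, from_pretrained kwargs) for a model ID.

    BASE is loaded from the upstream repository; every other ID is read
    from the checkpoints/<MODEL_ID>/ subfolder of the collection repo.
    """
    if model_id == "BASE":
        return UPSTREAM_REPO, {}
    return repo_id, {"subfolder": f"checkpoints/{model_id}"}


# Usage (downloads weights, so it is left commented out here):
# from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
# repo, kwargs = resolve_checkpoint("FINETUNED")
# tokenizer = AutoTokenizer.from_pretrained(repo, **kwargs)
# model = AutoModelForSeq2SeqLM.from_pretrained(repo, **kwargs)
```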

Training Setup

Unless a model row above lists an override, the main training settings were:

| Setting | Value |
| --- | --- |
| Base model | humarin/chatgpt_paraphraser_on_T5_base |
| Input prefix | paraphrase: |
| Max input length | 128 |
| Max generated tokens | 128 |
| Training data | Reduced training split of router-scored prompts |
| Validation data | Reduced validation split of router-scored prompts |
| RL algorithm | GRPO |
| Generations per prompt | 12 |
| KL beta | 0.05 |
| PPO clip epsilon | 0.3 |
| Epochs | 12 |
| Per-device batch size | 8 |
| Gradient accumulation | 4 |
| Initial learning rate | 5e-5 |
| Precision | bf16 |
| Rollout decoding | temperature 1.1, top-p 0.9 |
| Reward weights | w_route=1.0, w_sim=1.0, w_len=0.3 |
| Similarity gate | reward.sim_gate=true |
| Similarity floor | 0.9 |
| NLI filter | use_nli=true, nli_model=modernce_base |
| Routing reward | mode=score_delta, flip_bonus=1.0 |
| Curriculum | enabled=true, order=easy_to_hard |
| Training routers | routellm_bert, chayan, routellm_causal |
| Router schedule | epoch schedule over BERT, Chayan, and causal routers |
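
The card gives the reward weights and switches but not the reward formula. The sketch below is one plausible reading, intended only to illustrate the difference between the gated reward (sim_gate=true) and the additive reward (sim_gate=false). Everything in it is an assumption: the function names, the `threshold` argument, the flip condition, treating the length term as an already signed input, and counting a similarity equal to the floor as passing.

```python
def routing_term(score_before, score_after, threshold, flip_bonus=1.0):
    """Assumed score_delta routing reward: the change in router score,
    plus a bonus when the paraphrase crosses the (hypothetical) threshold
    at which the router switches to the stronger model."""
    delta = score_after - score_before
    flipped = score_before < threshold <= score_after
    return delta + (flip_bonus if flipped else 0.0)


def combine_reward(route, sim, length_term=0.0, *, w_route=1.0, w_sim=1.0,
                   w_len=0.3, sim_floor=0.9, sim_gate=True):
    """Assumed combination of the three reward terms.

    With sim_gate=True the routing term counts only when the similarity
    reaches sim_floor; with sim_gate=False the terms are simply summed.
    """
    gate = 1.0 if (not sim_gate or sim >= sim_floor) else 0.0
    return gate * w_route * route + w_sim * sim + w_len * length_term


route = routing_term(0.25, 0.75, threshold=0.5)    # delta 0.5 plus flip bonus 1.0
gated_good = combine_reward(route, sim=0.95)       # routing term counts
gated_bad = combine_reward(route, sim=0.5)         # routing term gated out
additive_bad = combine_reward(route, sim=0.5, sim_gate=False)
```

Under this reading, a paraphrase that drifts from the original meaning earns nothing for flipping the router when the gate is on, but still collects the full routing term in the additive setting. Setting w_sim=0 on top of that leaves the routing term as the only non-length signal.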

Evaluation

Evaluation metrics are reported over internal training-family routers and held-out transferability routers.

  • ASR is attack success rate, averaged over the reported routers.
  • Mean sim is the average semantic similarity score.
  • Above floor is the fraction of generations above the configured similarity floor, averaged over routers.
  • Internal evaluation routers: routellm_causal, chayan, routellm_bert.
  • Transferability routers: r2, routellm_mf, routellm_sw.

| Model ID | Internal ASR | Internal mean sim | Internal above floor | Transfer ASR | Transfer mean sim | Transfer above floor |
| --- | --- | --- | --- | --- | --- | --- |
| ADDITIVE_ABLATION | 18.34% | 0.97 | 97.87% | 0.47% | 0.97 | 98.62% |
| AGGRESSIVE | 63.11% | 0.91 | 75.56% | 1.04% | 0.91 | 75.88% |
| BASE | 18.48% | 0.88 | 73.75% | 0.86% | 0.89 | 73.75% |
| BERT_ONLY | 17.29% | 0.93 | 89.16% | 0.65% | 0.93 | 89.69% |
| CAUSAL_ONLY | 18.59% | 0.89 | 70.78% | 0.68% | 0.89 | 69.50% |
| CAUSAL_ONLY_AGGRESSIVE | 57.31% | 0.84 | 34.86% | 0.79% | 0.84 | 34.96% |
| CHAYAN_ONLY | 27.16% | 0.87 | 65.36% | 1.22% | 0.88 | 64.93% |
| FINETUNED | 19.97% | 0.94 | 93.73% | 0.72% | 0.94 | 93.73% |
| HIGH_SIM_FLOOR_ABLATION | 11.92% | 0.99 | 99.26% | 0.18% | 0.98 | 99.36% |
| LOW_SIM_FLOOR_ABLATION | 25.42% | 0.87 | 70.24% | 0.86% | 0.89 | 70.88% |
| LOW_W_LEN_ABLATION | 20.83% | 0.94 | 92.99% | 0.86% | 0.94 | 93.94% |
| NO_CURRICULUM_ABLATION | 19.12% | 0.94 | 93.30% | 0.72% | 0.94 | 93.20% |
| NO_FLIP_BONUS_ABLATION | 20.20% | 0.94 | 92.99% | 0.54% | 0.95 | 94.26% |
| NO_NLI_ABLATION | 20.01% | 0.94 | 92.77% | 0.83% | 0.94 | 94.05% |
| SIMULTANEOUS_RL_ABLATION | 21.64% | 0.94 | 91.60% | 0.68% | 0.94 | 92.99% |
| ZERO_W_SIM_ADDITIVE_ABLATION | 65.92% | 0.27 | 0.21% | 0.00% | 0.27 | 0.43% |
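
The aggregation described in the metric definitions can be sketched as follows. This is an illustration, not the project's evaluation code: the input layout (a success flag and a similarity score per generation, grouped by router), the pooling used for mean sim, the `>=` comparison at the floor, and the router names `router_a` and `router_b` are all assumptions. What counts as a successful attack is decided upstream and enters here only as a boolean.

```python
from statistics import mean


def summarize(per_router, sim_floor=0.9):
    """Aggregate evaluation rows into ASR, mean sim, and above floor.

    per_router maps a router name to a list of (success, similarity)
    pairs, one pair per generated paraphrase.
    """
    asr_by_router = []
    above_by_router = []
    all_sims = []
    for rows in per_router.values():
        # Per-router rates first, so that each router has equal weight.
        asr_by_router.append(mean(1.0 if ok else 0.0 for ok, _ in rows))
        above_by_router.append(mean(1.0 if sim >= sim_floor else 0.0 for _, sim in rows))
        all_sims.extend(sim for _, sim in rows)
    return {
        "asr": mean(asr_by_router),
        "mean_sim": mean(all_sims),
        "above_floor": mean(above_by_router),
    }


example = {
    "router_a": [(True, 0.95), (False, 0.80)],
    "router_b": [(False, 0.92), (False, 0.97)],
}
summary = summarize(example)
```

Averaging per-router rates rather than pooling all rows keeps a router with more evaluation prompts from dominating the reported ASR.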

Intended Use

These checkpoints are intended for controlled research on LLM-router robustness, semantic-preserving paraphrase generation, reward design, and transferability of router attacks. They should not be used to evade production routing, billing, or safety systems.

Limitations

The checkpoints optimize for router behavior under the project reward and evaluation setup. High ASR does not imply good general paraphrasing quality, and some variants intentionally sacrifice semantic preservation for ablation purposes. In particular, ZERO_W_SIM_ADDITIVE_ABLATION shows why the similarity reward is necessary: it reaches 65.92% internal ASR, but its mean similarity is 0.27 and only 0.21% of its generations clear the similarity floor.

Transfer ASR is low across the reported models, at most 1.22% (CHAYAN_ONLY), indicating that behavior learned against the training routers does not strongly transfer to the held-out routers in this evaluation setting.
