PAWN++
Repository: HSE-Team-142/automatic-goggles
PAWN++ is a detector for identifying machine-generated (AI) text. It extends the PAWN architecture, which predicts authorship from the per-token hidden states and probability metrics of a frozen language model. PAWN++ adds an optional second frozen language model, cross-model metrics (including a Binoculars-style cross-perplexity score), second-model token metrics, hidden-state fusion, and aggregated sequence-level features that modulate the representation through FiLM.
This card describes the best-performing configuration (mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full, checkpoint checkpoint-39884).
Model Description
PAWN++ does not fine-tune the backbone LLMs. The two language models are frozen and used only as feature extractors; the only trained parameters are three lightweight MLP heads and a FiLM conditioning layer.
- Model type: Frozen-LLM feature extractor + gated MLP classifier (binary)
- Task: Binary classification, human (0) vs. ai (1)
- Primary model (frozen): meta-llama/Llama-3.2-1B-Instruct
- Second model (frozen): meta-llama/Llama-3.2-1B
- Language: English
- Max sequence length: 512 tokens
- License: MIT
Architecture
For each input the feature extractor runs both frozen LLMs and produces:
- Per-token metrics for each model (entropy, max_log_probs, next_token_log_probs, rank, top_p), plus the cross-perplexity (xppl) between the two models.
- Hidden states from both models, fused across layers with uniform fusion.
- Aggregated sequence-level features. Per model: energy, mean, std, var, skew, kurtosis, mean_diff, std_diff, var_2nd, entropy_2nd, autocorr_2nd. Cross-model: cov, corr, cos_sim, binoculars_score.
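To make the per-token metrics more concrete, here is a minimal stdlib-only sketch of how such quantities can be derived from next-token logits. It is not the repository's extractor: the function names are hypothetical, the exact definitions of rank and top_p are assumptions, and the score follows the general Binoculars recipe (log-perplexity divided by cross-perplexity) rather than any confirmed implementation detail of PAWN++.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def token_metrics(logits, next_token_id):
    """Metrics at one position from a model's next-token logits."""
    log_probs = log_softmax(logits)
    probs = [math.exp(lp) for lp in log_probs]
    target_lp = log_probs[next_token_id]
    return {
        "entropy": -sum(p * lp for p, lp in zip(probs, log_probs)),
        "max_log_probs": max(log_probs),
        "next_token_log_probs": target_lp,
        # Assumed convention: rank 1 means the observed token was the top choice.
        "rank": 1 + sum(lp > target_lp for lp in log_probs),
        # Assumed definition: mass of all tokens at least as likely as the observed one.
        "top_p": sum(p for p, lp in zip(probs, log_probs) if lp >= target_lp),
    }

def cross_entropy(logits_a, logits_b):
    """Cross-entropy between model A's and model B's distributions: -sum p_a * log p_b."""
    probs_a = [math.exp(lp) for lp in log_softmax(logits_a)]
    return -sum(p * lq for p, lq in zip(probs_a, log_softmax(logits_b)))

def binoculars_style_score(logits_a, logits_b, token_ids):
    """Summed NLL of the observed tokens under A over the summed A-vs-B cross-entropy."""
    nll = -sum(log_softmax(l)[t] for l, t in zip(logits_a, token_ids))
    xent = sum(cross_entropy(a, b) for a, b in zip(logits_a, logits_b))
    return nll / xent  # the 1/len averaging factors cancel in the ratio

# Toy vocabulary of three tokens; the observed next token is the second most likely.
m = token_metrics([2.0, 1.0, 0.0], next_token_id=1)
print(m["rank"], round(m["top_p"], 3))
```

In the real pipeline these quantities would be computed over the full vocabulary at every position of the (up to 512-token) input, once per frozen model.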
Three MLP heads process these signals:
- metrics_nn maps the per-token metric vector to a 256-dim feature space.
- gate_nn takes the concatenated current/next hidden states of both models plus a positional scalar and produces 256 gate logits per token; a softmax over the sequence axis yields an attention-style weighting that aggregates the token metric features into a single vector.
- The aggregated vector is modulated by a FiLM layer (gamma, beta) conditioned on the normalized sequence-level aggregate features.
- aggregate_nn maps the result to a single logit.
The output is a single logit; sigmoid(logit) is the probability of the ai class, and the prediction is ai when logit >= 0 (equivalently, when that probability is at least 0.5).
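The gating and FiLM steps described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the repository code: it assumes gate d weights feature d element-wise, that FiLM has the common form gamma * x + beta, and it uses a toy width of 2 instead of 256. The function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gated_aggregate(metric_feats, gate_logits):
    """Pool [seq_len][dim] token features into one [dim] vector.

    The softmax runs over the sequence axis, separately for each gate,
    so every feature dimension gets its own attention-style weighting.
    """
    seq_len, dim = len(metric_feats), len(metric_feats[0])
    pooled = []
    for d in range(dim):
        weights = softmax([gate_logits[t][d] for t in range(seq_len)])
        pooled.append(sum(weights[t] * metric_feats[t][d] for t in range(seq_len)))
    return pooled

def film(x, gamma, beta):
    """Feature-wise linear modulation (assumed form): gamma * x + beta."""
    return [g * xi + b for xi, g, b in zip(x, gamma, beta)]

# Two tokens, two features; equal gate logits reduce to a plain mean over tokens.
feats = [[1.0, 0.0], [3.0, 4.0]]
gates = [[0.0, 0.0], [0.0, 0.0]]
pooled = gated_aggregate(feats, gates)
print(film(pooled, gamma=[2.0, 1.0], beta=[0.0, 1.0]))
```

In the model, gamma and beta would be produced from the normalized sequence-level aggregate features rather than passed in as constants.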
| Hyperparameter | Value |
|---|---|
| metric_features | 256 |
| gates | 256 |
| mlp_hidden_features | 256 |
| mlp_hidden_layers | 3 |
| mlp_dropout | 0.0 |
| token_dropout | 0.15 |
| residual | true |
| hidden_state_fusion | uniform |
Intended Use
- Primary use: Research on machine-generated-text detection and AI-text classification of English passages.
- Out of scope: High-stakes decisions (academic misconduct, hiring, moderation) without human review; non-English text; short texts; and detecting generators or domains far from the training distribution. As with all detectors, predictions should be treated as a signal, not proof.
Training Data
Trained and evaluated on the MAGE benchmark for machine-generated text detection, which spans multiple domains and many generator models, framed as a binary human-vs-AI task.
Training Procedure
- Backbones frozen; only the MLP heads and FiLM layer are trained.
- Objective: Binary cross-entropy with label_smoothing = 0.2 and pos_weight = 0.413.
- Optimizer: AdamW, learning_rate = 1e-3, weight_decay = 1e-2, max_grad_norm = 1.0.
- Schedule: up to 5 epochs (max_steps = 49855), batch size 32, early stopping (patience 5), seed 42.
- Model selection: best checkpoint by validation AUROC (checkpoint-39884, epoch 4, validation AUROC ≈ 0.9933).
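For readers unfamiliar with the objective, the sketch below shows one plausible way to combine label smoothing and a positive-class weight in binary cross-entropy on logits. The weighting follows the convention of PyTorch's BCEWithLogitsLoss (pos_weight scales only the positive-target term). How the repository applies smoothing is not stated in this card, so pulling the hard targets toward 0.5 is an assumption, and the function names are hypothetical.

```python
import math

def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def smoothed_weighted_bce(logit, label, label_smoothing=0.2, pos_weight=0.413):
    """Per-example BCE on a logit with a smoothed target and a positive-class weight."""
    # Assumed smoothing scheme: 1 -> 0.9 and 0 -> 0.1 when label_smoothing = 0.2.
    target = label * (1.0 - label_smoothing) + 0.5 * label_smoothing
    log_p = -softplus(-logit)      # log sigmoid(logit)
    log_not_p = -softplus(logit)   # log(1 - sigmoid(logit))
    return -(pos_weight * target * log_p + (1.0 - target) * log_not_p)

# With ai = 1 as the positive class, pos_weight < 1 shrinks the loss on ai examples.
print(smoothed_weighted_bce(2.0, 1), smoothed_weighted_bce(2.0, 0))
```

Setting label_smoothing to 0 and pos_weight to 1 recovers ordinary binary cross-entropy, which is a quick way to sanity-check the sketch.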
Evaluation
Results on the MAGE test set:
| Metric | Value |
|---|---|
| Accuracy | 0.9515 |
| Macro F1 | 0.9515 |
| ROC AUC | 0.9836 |
| AI Precision | 0.9710 |
| AI Recall | 0.9311 |
| AI F1 | 0.9506 |
| Human Precision | 0.9334 |
| Human Recall | 0.9720 |
| Human F1 | 0.9523 |
| Test loss | 0.2456 |
Runtime (test split): 1619.5 s, 37.5 samples/s, 1.173 steps/s.
The model is more precise on AI text (precision 0.9710 vs. 0.9334, i.e. fewer false AI flags) and has higher recall on human text (0.9720 vs. 0.9311); in other words, it is conservative about labeling text as AI-generated.
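As a quick consistency check, the per-class F1 scores and the macro F1 in the table can be recomputed from the reported precision and recall values. The snippet uses only the numbers from the table; small differences in later decimals are expected because the inputs are already rounded.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

ai_f1 = f1(0.9710, 0.9311)
human_f1 = f1(0.9334, 0.9720)
macro_f1 = (ai_f1 + human_f1) / 2

print(round(ai_f1, 4), round(human_f1, 4), round(macro_f1, 4))
# prints: 0.9506 0.9523 0.9515
```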
How to Use
Inference is provided through inference.py, which loads the frozen backbones plus the trained heads from a checkpoint and a training YAML config:

    uv run PAWN++/inference.py \
      --config PAWN++/experiments/MAGE/configs/pawn/two_models/mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full.yaml \
      --checkpoint PAWN++/checkpoint-39884/pytorch_model.bin \
      --text "Your text to classify here."

The same functions can be called from Python:

    from inference import load_model, predict

    model, device = load_model(
        config_path="PAWN++/experiments/MAGE/configs/pawn/two_models/"
                    "mage_llama_instruct_llama_base_metrics_xppl_hs_uniform_agg_metrics_full.yaml",
        checkpoint_path="PAWN++/checkpoint-39884/pytorch_model.bin",
    )
    results = predict(model, ["Your text to classify here."], device)
    # each result: {"label": "human"|"ai", "prediction": 0|1, "prob_human": float, "logit": float}
Note: The Llama-3.2 backbones are gated on the Hugging Face Hub. Set HF_TOKEN in a .env file to download them. A GPU is recommended; the code falls back to MPS or CPU automatically.
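Because predictions are meant to be a signal rather than proof, it can help to separate confident outputs from borderline ones before acting on them. The sketch below does this on result dictionaries shaped like the ones returned by predict. The triage helper and its margin parameter are hypothetical and not part of the repository, the margin value is uncalibrated, and the example results are made up (trimmed to the fields used here).

```python
def triage(results, margin=1.0):
    """Split results into (confident, needs_review) by the magnitude of the logit.

    A logit close to 0 sits near the decision boundary, so those items are
    routed to human review instead of being trusted automatically.
    """
    confident, needs_review = [], []
    for result in results:
        bucket = needs_review if abs(result["logit"]) < margin else confident
        bucket.append(result)
    return confident, needs_review

# Made-up results for illustration only.
example_results = [
    {"label": "ai", "prediction": 1, "logit": 3.2},
    {"label": "ai", "prediction": 1, "logit": 0.4},
    {"label": "human", "prediction": 0, "logit": -2.7},
]
confident, needs_review = triage(example_results)
print(len(confident), len(needs_review))
```

A sensible margin would have to be chosen on held-out data from the target domain; the default here is only a placeholder.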
Limitations and Bias
- English-only; performance on other languages has not been evaluated and is expected to degrade.
- Detection quality depends on the generators and domains seen during training (MAGE); novel models, prompting styles, paraphrasing or adversarial edits can reduce accuracy.
- Depends on two frozen Llama-3.2-1B backbones, which carry their own data biases.
- Reported metrics reflect the MAGE test distribution and may not transfer out of distribution; see the OOD evaluation utilities in the repository.
Citation
PAWN++ builds on the PAWN detector:
PAWN: Perplexity Attention Weighted Networks for AI-generated text detection. https://www.sciencedirect.com/science/article/pii/S156625352500538X