Style-Aware Paraphraser (Mistral-7B + DPO)
Official model release for Attacks on Machine-Text Detectors Retain Stylistic Fingerprints (ICML).
This is mistralai/Mistral-7B-Instruct-v0.3 after two stages of training on
the Reddit Million Users Dataset:
- SFT: LoRA (r=32, α=64, dropout=0.1), lr=2e-5, 1 epoch, on instruction triples that map Mistral-7B paraphrases of a comment back to the original author's style, conditioned on 16 author exemplars.
- Detector-guided DPO: β=5, lr=1e-6, 3 epochs, cosine decay. The reward signal comes from a roberta-base classifier trained to discriminate the SFT model's outputs from genuine human text. The "most-human" candidate becomes chosen; a random other candidate becomes rejected.
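The pair-construction rule in the DPO step can be sketched as follows. This is a minimal illustration, not the released training code: `build_preference_pair` and `toy_human_score` are hypothetical names, and the toy scorer (text length) merely stands in for the roberta-base classifier's "human" score.

```python
import random
from typing import Callable, Dict, Sequence


def build_preference_pair(
    candidates: Sequence[str],
    human_score: Callable[[str], float],
    rng: random.Random,
) -> Dict[str, str]:
    """Return the most-human candidate as `chosen` and a random other as `rejected`."""
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    scores = [human_score(c) for c in candidates]
    # Index of the candidate the detector considers most human.
    best = max(range(len(candidates)), key=scores.__getitem__)
    # Any other candidate is eligible to be the rejected side.
    others = [i for i in range(len(candidates)) if i != best]
    return {"chosen": candidates[best], "rejected": candidates[rng.choice(others)]}


def toy_human_score(text: str) -> float:
    # ASSUMPTION: placeholder for the detector; longer text counts as "more human".
    return float(len(text))


pair = build_preference_pair(
    ["ok", "honestly that movie was kinda mid", "It is good."],
    toy_human_score,
    random.Random(0),
)
print(pair)
```

In a real run, `human_score` would wrap the classifier's probability for the human class, and the resulting dictionaries would be combined with the prompt to form the preference dataset.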
Intended use
Style-aware paraphrasing of Mistral-7B paraphrases of machine-generated text, in the style of a target human author specified by 16 reference texts. When transforming machine text, the model is designed to be invoked iteratively (3 rounds, 10 candidates per round, keeping the top 5 by SBERT similarity); see the demo notebook in the GitHub release.
Best on social-media-like text. Trained on Reddit comments (32–128 tokens). The paper's evaluation shows transfer to Amazon reviews works well and Blogs less well; for long-form text, split into paragraphs and refine each independently (paper Appendix L, "Chunk & Merge").
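For long-form input, the split-and-refine idea might look like the sketch below. It assumes paragraphs are separated by blank lines and that each paragraph is refined independently; the actual procedure in Appendix L may differ. `chunk_and_merge` and the `refine` callable are hypothetical names, and `str.upper` is only a placeholder for a call into this model.

```python
from typing import Callable, List


def chunk_and_merge(text: str, refine: Callable[[str], str]) -> str:
    """Split on blank lines, refine each paragraph on its own, then rejoin."""
    # ASSUMPTION: a blank line marks a paragraph boundary.
    paragraphs: List[str] = [p.strip() for p in text.split("\n\n") if p.strip()]
    return "\n\n".join(refine(p) for p in paragraphs)


long_text = "first paragraph here.\n\nsecond paragraph here."
# Placeholder refiner; in practice this would run the iterative refinement.
merged = chunk_and_merge(long_text, str.upper)
print(merged)
```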
Inference recipe
Attacking machine-text is a two-stage process:
- Mistral-7B paraphrase: Paraphrase the input machine text P=5 times with mistralai/Mistral-7B-Instruct-v0.3 using the prompt in Appendix H.1 of the paper.
- Iterative refinement: Feed those 5 paraphrases + 16 author exemplars into this model using the prompt in Appendix H.4, take 10 samples, keep the top-5 by SBERT cosine similarity to the originals, repeat for 3 iterations.
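The control flow of the refinement stage can be sketched in plain Python. This is an illustration of the sample-then-rerank loop under stated assumptions, not the repository's `iterative_refine` implementation: `refine_loop`, `sample`, and `embed` are hypothetical names, the toy embedding is a letter-count vector standing in for SBERT, and the toy sampler stands in for generation with this model.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def refine_loop(
    original: str,
    paraphrases: Sequence[str],
    exemplars: Sequence[str],
    sample: Callable[[Sequence[str], Sequence[str], int], List[str]],
    embed: Callable[[str], Vector],
    num_iters: int = 3,
    num_samples: int = 10,
    top_k: int = 5,
) -> List[List[str]]:
    """Each round: draw candidates, keep the top_k closest in meaning to `original`."""
    reference = embed(original)
    current = list(paraphrases)
    history: List[List[str]] = []
    for _ in range(num_iters):
        candidates = sample(current, exemplars, num_samples)
        ranked = sorted(candidates, key=lambda c: cosine(embed(c), reference), reverse=True)
        current = ranked[:top_k]  # survivors seed the next round
        history.append(current)
    return history


def toy_embed(text: str) -> List[float]:
    # ASSUMPTION: letter counts as a stand-in for an SBERT embedding.
    return [float(text.lower().count(ch)) for ch in "abcdefghijklmnopqrstuvwxyz"]


def toy_sample(current: Sequence[str], exemplars: Sequence[str], n: int) -> List[str]:
    # ASSUMPTION: stand-in for sampling n style-conditioned rewrites from the model.
    return [current[i % len(current)] + " " + exemplars[i % len(exemplars)] for i in range(n)]


history = refine_loop(
    "the cat sat", ["a cat sat", "the cat was sitting"], ["lol", "tbh"], toy_sample, toy_embed
)
print(history[-1][0])
```

Reranking against the original text on every round is what keeps the candidates from drifting in meaning while the style moves toward the target author.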
Runnable example pulling the released author bank from the Hub and using
the inference/iterative_refinement.py library function from the
source repository:
import sys; sys.path.insert(0, "release/inference")
from datasets import load_dataset
from iterative_refinement import iterative_refine

# Each row of the author bank already carries the 5 Mistral-7B paraphrases
# of `respond_reddit` plus the 16 target-author exemplars and their
# paraphrases; no second LLM is needed for this path.
bank = load_dataset(
    "rrivera1849/style-aware-paraphraser-author-bank-reddit",
    split="train", streaming=True,
)
row = next(iter(bank))  # first row of the streamed split

iters = iterative_refine(
    initial_paraphrases=row["paraphrase_respond_reddit"],
    target_texts=row["reference_text"],
    target_paraphrases=row["paraphrase_reference_text"],
    original_text=row["respond_reddit"],  # SBERT rerank reference
    model_path="rrivera1849/style-aware-paraphraser-mistral7b",  # this model
    num_iters=3,
)
print(iters[-1][0])  # final iteration, top SBERT pick
For an end-to-end walkthrough (including the Mistral-7B paraphrase pass
needed for inputs that aren't in the bank), see release/inference/demo.ipynb.
Companion artifacts
- Author bank: rrivera1849/style-aware-paraphraser-author-bank-reddit: 36,000 rows (12,000 authors × 3 generators) with 16 exemplars + Mistral-7B paraphrases ready to feed into this model.
- Released outputs across 3 domains: rrivera1849/style-aware-paraphraser-outputs: the {human_text, machine_text, adversarial_paraphrase} triples used for the paper's Table 1 / Figure 1 evaluation.
- Code repository: github.com/rrivera1849/style-aware-paraphrasing: full training, inference, and evaluation pipeline.
Evaluation
At N=1 document per author (Reddit), the strongest detector across the paper's nine-detector suite scores:
| Method | Max AUROC(1), lower = better attack |
|---|---|
| No attack | ≈ 1.00 |
| Ours | ≈ 0.55 |
Source: paper Figure 1, left panel. The same finding holds on Amazon and Blogs (which weren't seen during training); see Figure 1 middle/right panels and Appendix A for the per-detector breakdown.
The attack is the only method evaluated that evades all detectors — including style-based ones — at single-document scale. As N grows past ~25, distributions become separable again and detection recovers.
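To make the "Max AUROC" metric concrete, the sketch below computes a pairwise AUROC per detector and takes the maximum over detectors. The scores are made-up toy numbers, not results from the paper, and the function names and the `machine`/`human` dictionary layout are assumptions for illustration; the paper's exact protocol for aggregating N documents per author is not reproduced here.

```python
from typing import Dict, List, Sequence


def auroc(machine: Sequence[float], human: Sequence[float]) -> float:
    """Chance that a machine text scores above a human text; ties count half."""
    wins = 0.0
    for m in machine:
        for h in human:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(machine) * len(human))


def max_auroc(scores: Dict[str, Dict[str, List[float]]]) -> float:
    """AUROC of the strongest detector; an attack aims to push this toward 0.5."""
    return max(auroc(s["machine"], s["human"]) for s in scores.values())


# Toy detector scores (higher = "more likely machine"); purely illustrative.
toy_scores = {
    "detector_a": {"machine": [0.9, 0.8], "human": [0.1, 0.2]},
    "detector_b": {"machine": [0.4, 0.6], "human": [0.5, 0.5]},
}
print(max_auroc(toy_scores))
```

Taking the maximum is the conservative choice for an attacker: a single detector that still separates the two classes is enough for the attack to count as detected.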
Responsible use
This model is an adversarial paraphraser: trained to make machine-generated text harder for current detectors to identify. We release it to support research on the limits of machine-text detection: detector stress-testing, studying which stylistic features survive aggressive paraphrasing, and building defenses (including the multi-document detection regime the paper argues is necessary, §8).
Out-of-scope / prohibited uses. Producing content to deceive a reader, evaluator, or platform about whether it was written by a human (academic dishonesty, ghostwriting submissions, evading content-moderation or authenticity systems whose terms prohibit it); impersonating a specific real person; mass production of synthetic content for spam, manipulation, or abuse.
The model is dual-use; responsibility for downstream use rests with the user. See the paper's Impact Statement for the full framing.
Training data
Training data derives from the Reddit Million Users Dataset (Khan et al. 2021).
Limitations
- Trained only on Reddit (32–128 token comments). Cross-domain transfer to Amazon reviews and blog posts works (paper Fig 1) but is weaker for blogs.
- Requires 16 exemplars from the target author.
- Detection recovers once roughly 25 or more documents per author are available (N≥25).
Citation
@inproceedings{rivera-soto-etal-2026-attacks,
  title     = {Attacks on Machine-Text Detectors Retain Stylistic Fingerprints},
  author    = {Rivera Soto, Rafael A. and Chen, Barry and Andrews, Nicholas},
  booktitle = {Proceedings of the International Conference on Machine Learning},
  year      = {2026},
  url       = {https://arxiv.org/abs/2505.14608},
}
License
Apache 2.0 — the model weights are a fine-tuned derivative of
mistralai/Mistral-7B-Instruct-v0.3 and inherit its license. The
companion code in the
style-aware-paraphrasing repository is MIT.