Style-Aware Paraphraser (Mistral-7B + DPO)

Official model release for Attacks on Machine-Text Detectors Retain Stylistic Fingerprints (ICML).

This is mistralai/Mistral-7B-Instruct-v0.3 after two stages of training on the Reddit Million Users Dataset:

SFT: LoRA (r=32, α=64, dropout=0.1), lr=2e-5, 1 epoch, on instruction triples that map Mistral-7B paraphrases of a comment back to the original author's style, conditioned on 16 author exemplars.
Detector-guided DPO: β=5, lr=1e-6, 3 epochs, cosine decay. The reward signal comes from a roberta-base classifier trained to discriminate the SFT model's outputs from genuine human text. The candidate the classifier scores as most human becomes the chosen response; a randomly selected other candidate becomes the rejected response.
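
The pair-selection rule in the DPO stage can be sketched as follows. This is a minimal illustration, not the repository's training code: the function name build_preference_pair and the convention that a higher score means "more human" are assumptions made for the example.

```python
import random


def build_preference_pair(candidates, human_scores, rng=None):
    """Return (chosen, rejected) for one prompt.

    human_scores[i] is a detector score for candidates[i]; this sketch
    ASSUMES higher means "more human-like" (hypothetical convention).
    """
    if len(candidates) != len(human_scores):
        raise ValueError("candidates and human_scores must have equal length")
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a pair")
    rng = rng or random.Random()
    # Chosen: the candidate the detector rates as most human.
    chosen_idx = max(range(len(candidates)), key=lambda i: human_scores[i])
    # Rejected: a uniformly random candidate other than the chosen one.
    other_idxs = [i for i in range(len(candidates)) if i != chosen_idx]
    rejected_idx = rng.choice(other_idxs)
    return candidates[chosen_idx], candidates[rejected_idx]


candidates = ["cand A", "cand B", "cand C"]
scores = [0.12, 0.91, 0.40]
chosen, rejected = build_preference_pair(candidates, scores, random.Random(0))
```

Sampling the rejected response at random, rather than always taking the lowest-scoring candidate, matches the description above; the effect of that choice on training is discussed in the paper, not here.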

Intended use

Restyling Mistral-7B paraphrases of machine-generated text into the style of a target human author, who is specified by 16 reference texts. When transforming machine text, the model is designed to be invoked iteratively: 3 rounds, 10 candidates per round, keeping the top 5 by SBERT similarity (see the demo notebook in the GitHub release).

Best on social-media-like text. Trained on Reddit comments (32–128 tokens). The paper's evaluation shows transfer to Amazon reviews works well and Blogs less well; for long-form text, split into paragraphs and refine each independently (paper Appendix L, "Chunk & Merge").
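
A rough sketch of the paragraph-level strategy for long-form text follows. The details of the paper's "Chunk & Merge" procedure are not reproduced here; splitting on blank lines and rejoining with a blank line are assumptions, and refine_fn is a hypothetical callable standing in for whatever refinement routine is applied to each paragraph.

```python
import re


def chunk_and_merge(text, refine_fn):
    """Split text into paragraphs, refine each independently, and rejoin.

    refine_fn is a hypothetical callable: str -> str.
    Paragraph boundaries are ASSUMED to be one or more blank lines.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    refined = [refine_fn(p) for p in paragraphs]
    return "\n\n".join(refined)


# Toy usage with a stand-in refiner that only upper-cases the paragraph.
merged = chunk_and_merge("first paragraph\n\nsecond paragraph", str.upper)
```

Because each paragraph is refined on its own, paragraph lengths should ideally stay near the 32–128 token range the model was trained on.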

Inference recipe

Attacking machine-text is a two-stage process:

Mistral-7B paraphrase: Paraphrase the input machine text P=5 times with mistralai/Mistral-7B-Instruct-v0.3 using the prompt in Appendix H.1 of the paper.
Iterative refinement: Feed those 5 paraphrases + 16 author exemplars into this model using the prompt in Appendix H.4, take 10 samples, keep the top-5 by SBERT cosine similarity to the originals, repeat for 3 iterations.
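
The control flow of the two steps above can be sketched as a sample-then-rerank loop. This is a conceptual illustration only, not the repository's iterative_refine implementation: sample_fn (draws candidates from the paraphraser) and embed_fn (returns a sentence embedding) are hypothetical callables, and plain Python lists stand in for SBERT vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_by_similarity(candidates, cand_vecs, ref_vec, k=5):
    """Keep the k candidates whose embeddings are closest to ref_vec."""
    ranked = sorted(
        zip(candidates, cand_vecs),
        key=lambda pair: cosine(pair[1], ref_vec),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


def refine_loop(paraphrases, sample_fn, embed_fn, original_text,
                num_iters=3, num_samples=10, k=5):
    """Repeat: sample num_samples candidates, keep the top k by similarity.

    sample_fn(current_texts, n) -> list[str] and embed_fn(text) -> list[float]
    are hypothetical stand-ins for the paraphraser and the SBERT encoder.
    """
    ref_vec = embed_fn(original_text)
    current = list(paraphrases)
    for _ in range(num_iters):
        samples = sample_fn(current, num_samples)
        vecs = [embed_fn(s) for s in samples]
        current = top_k_by_similarity(samples, vecs, ref_vec, k)
    return current
```

Reranking against the original text at every round is what keeps the meaning anchored while the style drifts toward the target author.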

Runnable example pulling the released author bank from the Hub and using the inference/iterative_refinement.py library function from the source repository:

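import sys

# Make the repository's inference helpers importable; the relative path
# assumes the working directory contains the release/ folder.
sys.path.insert(0, "release/inference")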
from datasets import load_dataset
from iterative_refinement import iterative_refine

# Each row of the author bank already carries the 5 Mistral-7B paraphrases
# of `respond_reddit` plus the 16 target-author exemplars and their
# paraphrases — no second LLM needed for this path.
bank = load_dataset(
    "rrivera1849/style-aware-paraphraser-author-bank-reddit",
    split="train", streaming=True,
)
row = next(iter(bank))  # first row of the streamed author bank

iters = iterative_refine(
    initial_paraphrases=row["paraphrase_respond_reddit"],
    target_texts=row["reference_text"],
    target_paraphrases=row["paraphrase_reference_text"],
    original_text=row["respond_reddit"],                        # SBERT rerank reference
    model_path="rrivera1849/style-aware-paraphraser-mistral7b", # this model
    num_iters=3,
)
print(iters[-1][0])   # final iteration, top SBERT pick

For an end-to-end walkthrough (including the Mistral-7B paraphrase pass needed for inputs that aren't in the bank), see release/inference/demo.ipynb.

Companion artifacts

Author bank: rrivera1849/style-aware-paraphraser-author-bank-reddit — 36 000 rows (12 000 authors × 3 generators) with 16 exemplars + Mistral-7B paraphrases ready to feed into this model.
Released outputs across 3 domains: rrivera1849/style-aware-paraphraser-outputs — the {human_text, machine_text, adversarial_paraphrase} triples used for the paper's Table 1 / Figure 1 evaluation.
Code repository: github.com/rrivera1849/style-aware-paraphrasing — full training, inference, and evaluation pipeline.

Evaluation

At N=1 document per author (Reddit), the strongest detector across the paper's nine-detector suite scores:

Method	Max AUROC(1), lower = better attack
No attack	≈ 1.00
Ours	≈ 0.55

Source: paper Figure 1, left panel. Same finding holds on Amazon and Blogs (which weren't seen during training); see Figure 1 middle/right panels and Appendix A for the per-detector breakdown.
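
For readers unfamiliar with the metric, the following sketch shows one way a "max AUROC across detectors" number could be computed. It is an illustration under assumptions, not the paper's evaluation code: the function names are hypothetical, detectors are assumed to assign higher scores to machine text, and AUROC is computed directly from its pairwise-ranking definition.

```python
def auroc(machine_scores, human_scores):
    """Probability that a random machine document outscores a random human one.

    Ties count as half a win. ASSUMES higher score = "more machine-like".
    """
    wins = 0.0
    for m in machine_scores:
        for h in human_scores:
            if m > h:
                wins += 1.0
            elif m == h:
                wins += 0.5
    return wins / (len(machine_scores) * len(human_scores))


def max_auroc(detector_scores):
    """Worst case for the attacker: the best AUROC any detector achieves.

    detector_scores maps a detector name to (machine_scores, human_scores).
    """
    return max(auroc(m, h) for m, h in detector_scores.values())
```

An AUROC near 0.5 means the detector ranks attacked text and human text no better than chance, which is why values close to 0.5 indicate a stronger attack.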

The attack is the only method evaluated that evades all detectors — including style-based ones — at single-document scale. As N grows past ~25, distributions become separable again and detection recovers.

Responsible use

This model is an adversarial paraphraser: it is trained to make machine-generated text harder for current detectors to identify. We release it to support research on the limits of machine-text detection, including detector stress-testing, studying which stylistic features survive aggressive paraphrasing, and building defenses (such as the multi-document detection regime the paper argues is necessary, §8).

Out-of-scope / prohibited uses. Producing content to deceive a reader, evaluator, or platform about whether it was written by a human (academic dishonesty, ghostwriting submissions, evading content-moderation or authenticity systems whose terms prohibit it); impersonating a specific real person; mass production of synthetic content for spam, manipulation, or abuse.

The model is dual-use; responsibility for downstream use rests with the user. See the paper's Impact Statement for the full framing.

Training data

Training data derives from the Reddit Million Users Dataset (Khan et al. 2021).

Limitations

Trained only on Reddit (32–128 token comments). Cross-domain transfer to Amazon reviews and blog posts works (paper Fig 1) but is weaker for blogs.
Requires 16 exemplars from the target author.
Detection recovers once roughly 25 or more documents per author are available (N≥25).

Citation

@inproceedings{rivera-soto-etal-2026-attacks,
  title     = {Attacks on Machine-Text Detectors Retain Stylistic Fingerprints},
  author    = {Rivera Soto, Rafael A. and Chen, Barry and Andrews, Nicholas},
  booktitle = {Proceedings of the International Conference on Machine Learning},
  year      = {2026},
  url       = {https://arxiv.org/abs/2505.14608},
}

License

Apache 2.0 — the model weights are a fine-tuned derivative of mistralai/Mistral-7B-Instruct-v0.3 and inherit its license. The companion code in the style-aware-paraphrasing repository is MIT.