Part of the Brie model family — a controlled study of how architecture and scale affect small-data domain adaptation. See also: Brie v2 3B (flagship 3B) · Brie Llama 3.2 3B (cross-architecture) · Brie Qwen 2.5 0.5B (foundational).

Paper: Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation

Brie v2 7B

A LoRA adapter for Qwen/Qwen2.5-7B-Instruct that specializes the model in continental philosophy, speculative reasoning, and contemplative prose, trained on ~1,200 examples authored by hand. It is the 7B point in a controlled study that holds the dataset fixed and varies base architecture and size (0.5B → 3B → 7B; Qwen and Llama) to measure how each affects a model's ability to adopt a specialized register.

The 3B model card closed with a prediction: "Qwen 2.5 7B would likely show further improvements." This is that experiment. On the identical held-out set, the 7B reaches a 95.6% in-domain win rate over its base on the 2025 judge panel and 86.3% on the harder 2026 panel — above the 3B's 78.9% on that same 2026 panel — with no catastrophic forgetting out-of-domain.

What this study is about

The claim is not that supervised fine-tuning improves a model — that is assumed. The questions of interest are narrower and more practical:

  1. Can a small, hand-built corpus move a 7B model's register reliably — and does the effect grow with base scale, holding the data fixed?
  2. At what cost to generality? A specialization is only useful if it doesn't break the model elsewhere, so the design pairs every in-domain measurement with an out-of-domain control.
  3. How do you know the win is real rather than an artifact of a friendly judge, response length, or a single lab's taste? The evaluation is built to survive those objections (see Evaluation methodology).

Dataset

1,213 examples (≈1,153 train / 60 validation), authored by the researcher over several years of philosophical dialogue with language models and curated by hand — not scraped, not synthetically bulk-generated. Each is a single user/assistant turn in chat format. Domain coverage: continental philosophy (phenomenology, existentialism, critical theory), speculative and experimental thinking, conceptual reframing for creative work, and contemplative prose. A 15-example representative sample ships in the repo; the full set is private.
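
For illustration, a single record in a chat format might look like the sketch below. The `messages`/`role`/`content` field names follow the common chat-template convention and are an assumption, since the corpus schema is not given here. The strings are placeholders, not examples from the dataset, and storing one JSON object per line is likewise assumed.

```python
import json

# Hypothetical record: one user turn followed by one assistant turn.
# Field names are an assumption; the strings are placeholders, not corpus data.
record = {
    "messages": [
        {"role": "user", "content": "<user prompt>"},
        {"role": "assistant", "content": "<hand-authored response>"},
    ]
}

def is_single_turn(rec: dict) -> bool:
    """Check that a record holds exactly one user/assistant exchange."""
    roles = [m["role"] for m in rec["messages"]]
    return roles == ["user", "assistant"]

# Split sizes as stated above: 1,153 train + 60 validation = 1,213 total.
TRAIN, VALIDATION = 1153, 60
TOTAL = TRAIN + VALIDATION

# One JSON object per line if the corpus were stored as JSONL (assumed).
line = json.dumps(record)
print(is_single_turn(record), TOTAL)
```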

The corpus is the actual contribution. The whole family is built to test one idea: that quality is a sample-efficiency problem rather than a scale problem, so roughly 1,200 high-signal, hand-authored examples can stand in for a large noisy scrape.

Training

  • Base: Qwen/Qwen2.5-7B-Instruct
  • Method: LoRA — r=16, lora_alpha=32, lora_dropout=0.1, targets q_proj, v_proj
  • Schedule: 1 epoch (deliberate — a larger model on a small corpus overfits quickly; the 0.5B/3B used 2 epochs, the 7B is held to 1), 152 steps, lr 2e-4 linear, warmup 10, effective batch size 8 (batch 1 × grad-accum 8), max_length=2048, gradient checkpointing, fp16
  • Adapter size: ~19 MB
  • Cost: a few dollars of cloud GPU
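
The figures above can be sanity-checked with a little arithmetic, sketched below. The layer shapes are assumptions taken from the public Qwen2.5-7B-Instruct configuration (hidden size 3584, 28 layers, 4 key/value heads of dimension 128) and are not stated in this card. Storing the adapter as 32-bit floats is also an assumption.

```python
import math

# LoRA hyperparameters from the list above.
R, ALPHA = 16, 32

# Assumed Qwen2.5-7B-Instruct shapes (from its public config, not this card).
HIDDEN = 3584       # model width; q_proj maps HIDDEN -> HIDDEN
KV_DIM = 4 * 128    # 4 key/value heads x head_dim 128; v_proj maps HIDDEN -> KV_DIM
LAYERS = 28

def lora_params(d_in: int, d_out: int, r: int) -> int:
    """A LoRA pair adds A (r x d_in) and B (d_out x r) weights."""
    return r * (d_in + d_out)

per_layer = lora_params(HIDDEN, HIDDEN, R) + lora_params(HIDDEN, KV_DIM, R)
total_params = per_layer * LAYERS
size_mib = total_params * 4 / 2**20   # assuming fp32 storage, 4 bytes per weight

scaling = ALPHA / R                   # the LoRA update is scaled by lora_alpha / r

def optimizer_steps(n_examples: int, effective_batch: int, epochs: int = 1) -> int:
    """Optimizer steps when a partial final batch still counts as a step."""
    return math.ceil(n_examples / effective_batch) * epochs

print(f"trainable LoRA params: {total_params:,}")
print(f"approx adapter size:   {size_mib:.2f} MiB")
print(f"LoRA scaling factor:   {scaling}")
print(f"steps, 1,213 examples at batch 8: {optimizer_steps(1213, 8)}")
```

Under these assumptions the estimate comes to about 5.0M trainable parameters and roughly 19 MiB, in line with the stated adapter size. One pass over 1,213 examples at effective batch size 8 gives 152 steps, while the 1,153-example train split alone would give 145. How a trainer counts the final partial batch varies, so treat the step arithmetic as approximate.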

Evaluation

Methodology (the controls, not just the score)

A win rate is meaningless without saying who judged, against what, in what order, and at what length. This evaluation was built to be defensible:

  • Blind A/B, position-randomized. Each judge sees two unlabeled responses; which slot holds the fine-tune is randomized per item and de-randomized only at scoring. Judges are never told a fine-tune is involved.
  • Two judge generations across three independent labs, matching the rest of the family: a 2025 panel (GPT-4o, Gemini 2.5 Flash-Lite) and a 2026 panel (GPT-5, Gemini 3.1 Pro, Claude Sonnet 4.5, Claude Haiku 4.5). The same model outputs are judged by both generations, so a drop from 2025 to 2026 reflects the judges getting harder, not the model changing.
  • Bias-corrected. An earlier in-house pass used a single judge with project context; it was discarded and replaced with the blind, no-context panels here, precisely because that judge could recognize the model it had helped shape.
  • Length-controlled. Across the 2026 panel the longer response won 50.3% of decisive judgments — an essentially perfect coin flip — and in-domain the fine-tune averages about the same length as the base while winning ~86%. The wins are about register, not verbosity.
  • Out-of-domain control on coding/math/practical/factual/creative prompts, to detect catastrophic forgetting rather than report only the flattering half.
  • Identical prompts to the 3B, so the family is directly comparable.
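
The blind, position-randomized protocol described above can be sketched as follows. This is a minimal illustration, not the study's actual harness. The `Trial` type, the function names, and the choice to drop ties from the denominator are assumptions; the last is suggested by the phrase "decisive judgments".

```python
import random
from dataclasses import dataclass
from typing import Optional

@dataclass
class Trial:
    prompt: str
    slot_a: str          # response shown to the judge as "A"
    slot_b: str          # response shown to the judge as "B"
    finetune_slot: str   # "A" or "B"; never shown to the judge

def make_trial(prompt: str, finetune: str, base: str, rng: random.Random) -> Trial:
    """Randomize which slot holds the fine-tune, independently per item."""
    if rng.random() < 0.5:
        return Trial(prompt, finetune, base, "A")
    return Trial(prompt, base, finetune, "B")

def derandomize(trial: Trial, verdict: Optional[str]) -> Optional[str]:
    """Map a judge verdict ("A", "B", or None for a tie) back to a model label."""
    if verdict is None:
        return None
    return "finetune" if verdict == trial.finetune_slot else "base"

def win_rate(outcomes: list) -> float:
    """Fine-tune wins as a share of decisive (non-tie) judgments."""
    decisive = [o for o in outcomes if o is not None]
    if not decisive:
        return float("nan")
    return sum(o == "finetune" for o in decisive) / len(decisive)

def longer_wins_rate(pairs: list) -> float:
    """Share of decisive judgments won by the longer response.

    Each pair is (winner_length, loser_length); a value near 0.5
    suggests the judges are not simply rewarding length.
    """
    return sum(w > l for w, l in pairs) / len(pairs)

# Toy usage with made-up verdicts, only to show the flow.
rng = random.Random(0)
trials = [make_trial(f"prompt {i}", "ft text", "base text", rng) for i in range(4)]
verdicts = [t.finetune_slot for t in trials[:3]] + [None]  # 3 fine-tune picks, 1 tie
outcomes = [derandomize(t, v) for t, v in zip(trials, verdicts)]
print(win_rate(outcomes))
```

Keeping the slot assignment in the trial record, rather than in the text shown to the judge, is what lets the verdicts be de-randomized only at scoring time.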

Results (72 held-out prompts: 57 in-domain + 15 out-of-domain)

2025 panel

| Judge | Lab | In-domain (n=57) | Out-of-domain (n=15) |
| --- | --- | --- | --- |
| Gemini 2.5 Flash-Lite | Google | 98.2% | 57.1% |
| GPT-4o | OpenAI | 92.9% | 69.2% |
| 2-judge average | | 95.6% | 63.2% |

2026 panel

| Judge | Lab | In-domain (n=57) | Out-of-domain (n=15) |
| --- | --- | --- | --- |
| Gemini 3.1 Pro | Google | 89.5% | 53.3% |
| GPT-5 | OpenAI | 87.3% | 60.0% |
| Claude Haiku 4.5 | Anthropic | 87.7% | 66.7% |
| Claude Sonnet 4.5 | Anthropic | 80.7% | 46.7% |
| 4-judge average | | 86.3% | 56.7% |

  • Per-prompt consensus (2026 panel): more judges preferred the fine-tune than the base on 49 of 57 in-domain prompts, 39 unanimously (4/4).
  • No catastrophic forgetting: the out-of-domain panel averages sit near parity in both generations (63.2% in 2025, 56.7% in 2026), in line with the 3B's 47–60%. The one clear regression is coding, which is expected for a philosophy specialization.
  • The 2025→2026 drop (95.6% → 86.3%) is the same temporal-relativism effect the 3B documents: newer judges grade the same outputs harder. On the 2026 panel the 7B (86.3%) still scores well above the 3B (78.9%), which is consistent with the scaling prediction.
  • A separate blind Claude-subagent pass (run via subscription) corroborates the Anthropic judges independently.
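
The panel rows in the tables above are consistent with an unweighted mean of the per-judge win rates, and the per-prompt consensus count is a majority tally across judges. A small sketch of both follows. The per-prompt vote lists are made-up toy data, and the function names are illustrative.

```python
from statistics import mean

# Per-judge in-domain win rates (%) from the tables above.
panel_2025 = {"Gemini 2.5 Flash-Lite": 98.2, "GPT-4o": 92.9}
panel_2026 = {
    "Gemini 3.1 Pro": 89.5,
    "GPT-5": 87.3,
    "Claude Haiku 4.5": 87.7,
    "Claude Sonnet 4.5": 80.7,
}

avg_2025 = mean(list(panel_2025.values()))   # reported as 95.6%
avg_2026 = mean(list(panel_2026.values()))   # reported as 86.3%

def consensus(votes_per_prompt: list) -> tuple:
    """Count prompts where more judges chose the fine-tune than the base,
    and how many of those were unanimous. A tie is recorded as None."""
    majority = unanimous = 0
    for votes in votes_per_prompt:
        ft = sum(v == "finetune" for v in votes)
        base = sum(v == "base" for v in votes)
        if ft > base:
            majority += 1
            if ft == len(votes):
                unanimous += 1
    return majority, unanimous

# Toy data: three prompts, four judges each (not the study's votes).
toy_votes = [
    ["finetune", "finetune", "finetune", "finetune"],
    ["finetune", "finetune", "base", None],
    ["base", "base", "finetune", "base"],
]
print(round(avg_2025, 2), round(avg_2026, 2), consensus(toy_votes))
```

Because the published per-judge rates are already rounded, a mean recomputed from them can differ from the reported average in the last digit.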

Limitations

  1. Specialized, not general. Optimized for philosophy and creative writing. Out-of-domain is parity at best and regresses on coding.
  2. Subjective domain. "Better philosophical prose" is judged by LLMs as proxies for human taste; these are model preferences, not ground truth.
  3. Judge substitution. The 2026 Google judge is gemini-3.1-pro (the current Pro-tier release; gemini-3-pro(-preview) was retired before this run), a one-point-release substitute for the model the 3B used. The 7B's 2025 panel has two judges (GPT-4o, Gemini 2.5 Flash-Lite) rather than four; its full four-judge comparison is the 2026 panel. An independent blind Claude-subagent pass corroborates the Anthropic judges.
  4. Private training data. The full corpus is not released (a 15-example sample is), so the dataset half is illustrative rather than fully reproducible.

Usage

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch

# Load the base model named in the adapter config and apply the LoRA adapter to it.
model = AutoPeftModelForCausalLM.from_pretrained(
    "closestfriend/brie-v2-7b", device_map="auto", torch_dtype=torch.float16,
)
# The tokenizer and chat template come from the base model.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain 'being-toward-death' and why it matters for an account of authenticity."},
]
# Render the conversation with the chat template and open the assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Recommended generation: temperature 0.7–0.75, top_p 0.9, max_new_tokens 512–1024 for creative/philosophical tasks.

Brie family — scaling the same dataset

| Model | Base | Params | In-domain win rate | Out-of-domain |
| --- | --- | --- | --- | --- |
| Brie v2 0.5B | Qwen 2.5 0.5B | 0.5B | 77.0% | ~40% |
| Brie v2 3B | Qwen 2.5 3B | 3B | 80–95% (2025) | 47–60% |
| Brie v2 7B (this model) | Qwen 2.5 7B | 7B | 95.6% (2025) / 86.3% (2026) | 57–63% (parity) |

Identical training data; in-domain win rate rises with base scale while out-of-domain stays at parity — capability added without capability lost.

Citation

@misc{brie2026,
  title  = {Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation},
  author = {Karman, Hunter N.},
  year   = {2026},
  doi    = {10.5281/zenodo.17657737},
  url    = {https://doi.org/10.5281/zenodo.17657737}
}