Part of the Brie model family — a controlled study of how architecture and scale affect small-data domain adaptation. See also: Brie v2 3B (flagship 3B) · Brie Llama 3.2 3B (cross-architecture) · Brie Qwen 2.5 0.5B (foundational).
Paper: Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation
Brie v2 7B
A LoRA adapter for Qwen/Qwen2.5-7B-Instruct that specializes the model in
continental philosophy, speculative reasoning, and contemplative prose, trained
on ~1,200 examples authored by hand. It is the 7B point in a controlled study
that holds the dataset fixed and varies base architecture and size (0.5B → 3B →
7B; Qwen and Llama) to measure how each affects a model's ability to adopt a
specialized register.
The 3B model card closed with a prediction: "Qwen 2.5 7B would likely show further improvements." This is that experiment. On the identical held-out set, the 7B reaches a 95.6% in-domain win rate over its base on the 2025 judge panel and 86.3% on the harder 2026 panel — above the 3B's 78.9% on that same 2026 panel — with no catastrophic forgetting out-of-domain.
What this study is about
The claim is not that supervised fine-tuning improves a model — that is assumed. The questions of interest are narrower and more practical:
- Can a small, hand-built corpus move a 7B model's register reliably — and does the effect grow with base scale, holding the data fixed?
- At what cost to generality? A specialization is only useful if it doesn't break the model elsewhere, so the design pairs every in-domain measurement with an out-of-domain control.
- How do you know the win is real rather than an artifact of a friendly judge, response length, or a single lab's taste? The evaluation is built to survive those objections (see Evaluation methodology).
Dataset
1,213 examples (≈1,153 train / 60 validation), authored by the researcher over several years of philosophical dialogue with language models and curated by hand — not scraped, not synthetically bulk-generated. Each is a single user/assistant turn in chat format. Domain coverage: continental philosophy (phenomenology, existentialism, critical theory), speculative and experimental thinking, conceptual reframing for creative work, and contemplative prose. A 15-example representative sample ships in the repo; the full set is private.
The corpus is the actual contribution. Treating quality as a sample-efficiency problem rather than a scale problem — a few hundred high-signal, hand-authored examples instead of a large noisy scrape — is what the whole family is built to test.
Training
- Base: Qwen/Qwen2.5-7B-Instruct
- Method: LoRA, r=16, lora_alpha=32, lora_dropout=0.1, targets q_proj, v_proj
- Schedule: 1 epoch (deliberate: a larger model on a small corpus overfits quickly; the 0.5B/3B used 2 epochs, the 7B is held to 1), 152 steps, lr 2e-4 linear, warmup 10, effective batch size 8 (batch 1 × grad-accum 8), max_length=2048, gradient checkpointing, fp16
- Adapter size: ~19 MB
- Cost: a few dollars of cloud GPU
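As a rough cross-check on the adapter size, the LoRA settings listed above can be turned into a parameter count. This is a back-of-the-envelope sketch, not taken from the training code: the layer dimensions (hidden size 3584, 28 layers, 4 key/value heads, head dimension 128) are assumed from the published Qwen2.5-7B configuration, and the adapter weights are assumed to be stored as 32-bit floats.

```python
# Back-of-the-envelope LoRA size estimate.
R = 16  # LoRA rank, from the training settings

# Assumptions (not stated in this card): dimensions of the base model.
HIDDEN = 3584
LAYERS = 28
HEAD_DIM = 128
KV_HEADS = 4


def lora_params(in_features: int, out_features: int, rank: int) -> int:
    """A LoRA pair adds A (rank x in) and B (out x rank) to one linear layer."""
    return rank * (in_features + out_features)


q_proj = lora_params(HIDDEN, HIDDEN, R)               # q_proj: hidden -> hidden
v_proj = lora_params(HIDDEN, KV_HEADS * HEAD_DIM, R)  # v_proj: hidden -> kv_heads * head_dim
total_params = LAYERS * (q_proj + v_proj)

BYTES_PER_PARAM = 4  # assumption: adapter saved as float32
size_mib = total_params * BYTES_PER_PARAM / 2**20

print(f"{total_params:,} trainable parameters, about {size_mib:.1f} MiB")
```

Under those assumptions the estimate comes to roughly 5 million trainable parameters and a little over 19 MiB, which is in line with the adapter size quoted above.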
Evaluation
Methodology (the controls, not just the score)
A win rate is meaningless without saying who judged, against what, in what order, and at what length. This evaluation was built to be defensible:
- Blind A/B, position-randomized. Each judge sees two unlabeled responses; which slot holds the fine-tune is randomized per item and de-randomized only at scoring. Judges are never told a fine-tune is involved.
- Two judge generations across three independent labs, matching the rest of the family: a 2025 panel (GPT-4o, Gemini 2.5 Flash-Lite) and a 2026 panel (GPT-5, Gemini 3.1 Pro, Claude Sonnet 4.5, Claude Haiku 4.5). The same model outputs are judged by both generations, so a drop from 2025 to 2026 reflects the judges getting harder, not the model changing.
- Bias-corrected. An earlier in-house pass used a single judge with project context; it was discarded and replaced with the blind, no-context panels here, precisely because that judge could recognize the model it had helped shape.
- Length-controlled. Across the 2026 panel the longer response won 50.3% of decisive judgments — an essentially perfect coin flip — and in-domain the fine-tune averages about the same length as the base while winning ~86%. The wins are about register, not verbosity.
- Out-of-domain control on coding/math/practical/factual/creative prompts, to detect catastrophic forgetting rather than report only the flattering half.
- Identical prompts to the 3B, so the family is directly comparable.
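The position randomization and the length check described in the list above can be sketched in a few lines. This is an illustrative outline only, not the project's evaluation harness: the function names, the "A"/"B"/"tie" verdict format, and the choice to drop ties and equal-length pairs are all assumptions made for the example.

```python
import random


def make_blind_pair(base_text, tuned_text, rng):
    """Assign the two responses to slots A and B at random; return the pair and a hidden key."""
    tuned_slot = rng.choice(["A", "B"])
    if tuned_slot == "A":
        pair = {"A": tuned_text, "B": base_text}
    else:
        pair = {"A": base_text, "B": tuned_text}
    return pair, tuned_slot


def derandomize(verdict, tuned_slot):
    """Map a judge verdict ('A', 'B', or 'tie') back to 'tuned', 'base', or 'tie'."""
    if verdict == "tie":
        return "tie"
    return "tuned" if verdict == tuned_slot else "base"


def win_rate(outcomes):
    """Share of decisive (non-tie) judgments won by the fine-tune."""
    decisive = [o for o in outcomes if o != "tie"]
    return sum(o == "tuned" for o in decisive) / len(decisive)


def longer_won_rate(records):
    """Share of decisive judgments in which the longer response won (length-bias check)."""
    decisive = [
        r for r in records
        if r["outcome"] != "tie" and r["tuned_len"] != r["base_len"]
    ]
    longer_wins = sum(
        (r["outcome"] == "tuned") == (r["tuned_len"] > r["base_len"])
        for r in decisive
    )
    return longer_wins / len(decisive)


# Hypothetical single item: the judge only ever sees pair["A"] and pair["B"].
rng = random.Random(0)
pair, key = make_blind_pair("base answer", "fine-tuned answer", rng)
outcome = derandomize("A", key)  # suppose the judge picked slot A

# Hypothetical records for the length check.
records = [
    {"outcome": "tuned", "tuned_len": 400, "base_len": 300},  # longer won
    {"outcome": "tuned", "tuned_len": 250, "base_len": 300},  # shorter won
    {"outcome": "tie", "tuned_len": 300, "base_len": 310},    # ignored
]
print(outcome, longer_won_rate(records))
```

A rate near 0.5 from `longer_won_rate` is what a judge panel with no systematic length preference would be expected to produce.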
Results (72 held-out prompts: 57 in-domain + 15 out-of-domain)
2025 panel
| Judge | Lab | In-domain (n=57) | Out-of-domain (n=15) |
|---|---|---|---|
| Gemini 2.5 Flash-Lite | Google | 98.2% | 57.1% |
| GPT-4o | OpenAI | 92.9% | 69.2% |
| 2-judge average | | 95.6% | 63.2% |
2026 panel
| Judge | Lab | In-domain (n=57) | Out-of-domain (n=15) |
|---|---|---|---|
| Gemini 3.1 Pro | Google | 89.5% | 53.3% |
| GPT-5 | OpenAI | 87.3% | 60.0% |
| Claude Haiku 4.5 | Anthropic | 87.7% | 66.7% |
| Claude Sonnet 4.5 | Anthropic | 80.7% | 46.7% |
| 4-judge average | | 86.3% | 56.7% |
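The panel averages in the two tables can be reproduced from the per-judge rows. The sketch below assumes each average is an unweighted mean of the per-judge percentages, rounded half-up to one decimal place; that rounding rule is an assumption, chosen because it matches the published figures. `Decimal` is used so that values such as 56.675 round predictably.

```python
from decimal import Decimal, ROUND_HALF_UP


def panel_average(rates):
    """Unweighted mean of per-judge win rates, rounded half-up to one decimal."""
    mean = sum(Decimal(r) for r in rates) / len(rates)
    return mean.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Per-judge win rates (percent), copied from the tables.
panel_2025 = {"in": ["98.2", "92.9"], "out": ["57.1", "69.2"]}
panel_2026 = {
    "in": ["89.5", "87.3", "87.7", "80.7"],
    "out": ["53.3", "60.0", "66.7", "46.7"],
}

for name, panel in (("2025", panel_2025), ("2026", panel_2026)):
    print(name, panel_average(panel["in"]), panel_average(panel["out"]))
```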
- Per-prompt consensus (2026 panel): more judges preferred the fine-tune than the base on 49 of 57 in-domain prompts, 39 unanimously (4/4).
- No catastrophic forgetting: out-of-domain sits at parity in both generations (~57–63%), matching the 3B's 47–60%. The one clear regression is coding — expected for a philosophy specialization.
- The 2025→2026 drop (95.6% → 86.3%) is the same temporal-relativism effect the 3B documents: newer judges grade the same outputs harder. On the 2026 panel the 7B (86.3%) still scores well above the 3B (78.9%), confirming the scaling prediction.
- A separate blind Claude-subagent pass (run via subscription) corroborates the Anthropic judges independently.
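The per-prompt consensus figures are a different aggregation from the per-judge win rates: votes are grouped by prompt rather than by judge. A minimal sketch of that tally follows, assuming each prompt has one de-randomized outcome per judge; the data layout, the function name, and the example votes are hypothetical.

```python
from collections import Counter


def consensus(votes_by_prompt):
    """Count prompts where more judges chose the fine-tune, and where all did.

    votes_by_prompt maps a prompt id to a list of per-judge outcomes,
    each one of 'tuned', 'base', or 'tie'.
    """
    majority = 0
    unanimous = 0
    for votes in votes_by_prompt.values():
        counts = Counter(votes)
        if counts["tuned"] > counts["base"]:
            majority += 1
        if counts["tuned"] == len(votes):
            unanimous += 1
    return majority, unanimous


# Hypothetical votes from a four-judge panel on three prompts.
example_votes = {
    "prompt-01": ["tuned", "tuned", "tuned", "tuned"],  # unanimous
    "prompt-02": ["tuned", "tuned", "base", "tie"],     # majority only
    "prompt-03": ["base", "base", "tuned", "base"],     # base preferred
}
majority, unanimous = consensus(example_votes)
print(f"{majority} of {len(example_votes)} prompts favor the fine-tune, {unanimous} unanimously")
```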
Limitations
- Specialized, not general. Optimized for philosophy and creative writing. Out-of-domain is parity at best and regresses on coding.
- Subjective domain. "Better philosophical prose" is judged by LLMs as proxies for human taste; these are model preferences, not ground truth.
- Judge substitution. The 2026 Google judge is gemini-3.1-pro (the current Pro-tier release; gemini-3-pro(-preview) was retired before this run), a one-point-release substitute for the model the 3B used. The 7B's 2025 panel is two judges (GPT-4o, Gemini 2.5) rather than four; its full four-judge comparison is the 2026 panel. An independent blind Claude-subagent pass corroborates the Anthropic judges.
- Private training data. The full corpus is not released (a 15-example sample is), so the dataset half is illustrative rather than fully reproducible.
Usage
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer
import torch
model = AutoPeftModelForCausalLM.from_pretrained(
"closestfriend/brie-v2-7b", device_map="auto", torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
messages = [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "Explain 'being-toward-death' and why it matters for an account of authenticity."},
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
Recommended generation: temperature 0.7–0.75, top_p 0.9, max_new_tokens
512–1024 for creative/philosophical tasks.
Brie family — scaling the same dataset
| Model | Base | Params | In-domain win rate | Out-of-domain |
|---|---|---|---|---|
| Brie v2 0.5B | Qwen 2.5 0.5B | 0.5B | 77.0% | ~40% |
| Brie v2 3B | Qwen 2.5 3B | 3B | 80–95% (2025) | 47–60% |
| Brie v2 7B (this model) | Qwen 2.5 7B | 7B | 95.6% (2025) / 86.3% (2026) | 57–63% (parity) |
Identical training data; in-domain win rate rises with base scale, while out-of-domain moves from ~40% at 0.5B to rough parity at 3B and 7B, with coding the one clear regression at 7B.
Citation
@misc{brie2026,
title = {Human-Curated Data Authoring with LLMs: A Small-Data Approach to Domain Adaptation},
author = {Karman, Hunter N.},
year = {2026},
doi = {10.5281/zenodo.17657737},
url = {https://doi.org/10.5281/zenodo.17657737}
}