- Qwen3.6-27B-DSV4Pro-Thinking-Distill
- 🇬🇧 English
- Training details
- Attribution (the method is not original — it is a combination of published techniques)
- Evaluation (Q4_K_M, archived harness): same harness, thinking-on, vs. the original Qwen3.6-27B
- Q5_K_M evaluation — distill vs base, streaming harness (re-run)
- MTP (multi-token prediction) single-stream acceleration — measured best config + lossless
- Eval protocol
- Limitations
- Files
- Inference
- Training details
- 🇨🇳 中文版
Qwen3.6-27B-DSV4Pro-Thinking-Distill
🇬🇧 English · 🇨🇳 中文 ⬇️
🇬🇧 English
On Qwen3.6-27B (Dense, 64 layers, Gated DeltaNet linear/full-attention hybrid), we use LoRA to distill the way DeepSeek-V4-Pro reasons (with thinking-on) plus its agentic behavior.
This is the Dense counterpart of the 35B-A3B (MoE) sister model: same R6000 GPU, same teacher, same recipe, swapped onto a Dense architecture — proving the gains come from the distilled thinking style, not an MoE architectural bonus. A native MTP head is welded on for single-stream acceleration.
⚠️ Distilling a thinking style ≠ distilling knowledge/capability: the goal is "learn how to reason and how to converge", not to inject knowledge or raise the capability ceiling.
Training details
- Base: Qwen3.6-27B (Dense, BF16 base)
- Method: LoRA, r = 64, α = 128, dropout = 0.05, targets = all attention + MLP projections
- Optim: paged_adamw_8bit, cosine LR, warmup 0.03, ~1 epoch
- Teacher: DeepSeek-V4-Pro (thinking-on + agentic)
- Data: ~1842 distillation samples (lynn_prod spec). Trajectories = DS-V4-Pro multi-step reasoning under thinking-on (
<think>) + ReAct-style tool calls (think one step → call one tool → observe → loop).- The tool "execution results" are SIMULATED, not actually run: in the multi-turn tool calls, each "execution result" line is improvised by a small, fast model (DeepSeek-V4-Flash) role-playing the "runtime" — not obtained by actually running code in a sandbox. So it differs from real execution.
- Training masks those fabricated results — the model learns only "how to think / how to call tools", not the made-up outputs: because the results are fake, training on them would teach the model the bad habit of fabricating tool return values; so we optimize only the model's own "reasoning + tool-call" tokens.
- Artifacts: merged → BF16 safetensors →
gguf/Q4_K_M-imatrix (with native MTP)
Attribution (the method is not original — it is a combination of published techniques)
- ReAct (interleaved reasoning + acting): Yao et al., 2022, arXiv:2210.03629 (ICLR 2023)
- STaR (bootstrapping reasoning traces): Zelikman et al., 2022, arXiv:2203.14465
- Self-Instruct / Baize self-chat: Wang et al., 2022; Xu et al., 2023, arXiv:2304.01196
- AgentTuning: Zeng et al., 2023, arXiv:2310.12823
- ToolBench / ToolLLM (tool use): Qin et al., 2023, arXiv:2307.16789
- DeepSeek-R1 reasoning distillation: DeepSeek-AI, 2025, arXiv:2501.12948
Evaluation (Q4_K_M, archived harness): same harness, thinking-on, vs. the original Qwen3.6-27B
Quantization parity (important — prevents misreading): this model and the base are both Q4_K_M (imatrix-corrected) GGUF + native MTP, fully same-spec — same Q4_K_M, same imatrix, same MTP. The only variable is "distilled or not". This is not distilled-Q4_K_M vs base-BF16 (which would be unfair); the Δ below is cleanly attributable to distillation itself, with no quantization difference mixed in.
| Dimension | This model (distill) | Original base | Δ |
|---|---|---|---|
| GPQA-Diamond-198 | 80.81% (160/198, 32K) | 73.7% (146/198) | +7.1 |
| MMLU-500 (5-shot) | 91.8% (459/500) | 91.6% | +0.2 |
| GPQA unconverged empty answers (parse_fail) | 0 | 14 | −14 |
| coding-100 (10 langs × 10) | 86/100 | 83/100 | +3 |
| Agentic SOLO (20 complex tasks) | 16/20 | 13/20 | +3 |
Reading: hard reasoning improves markedly (GPQA +7.1pp, 160/198 = 80.81%, 0 error / 0 parse_fail), and knowledge does not drop — it even nudges up (MMLU +0.2pp), while "finish thinking, then converge" holds — GPQA unconverged empty answers fall from 14 to 0 (the base's 14 were all cases that thought to the 32K limit without ever giving an answer; after distillation, zero). Median generation length is compressed to ~3006 tokens. Note: the 35B-A3B distill lost 1.6pp MMLU, whereas the 27B Dense has more capacity — it fits the distillation without crowding out knowledge: GPQA up, MMLU not down, a cleaner "pure gain".
coding-100: same harness, a real sandbox runs the code and checks whether the tests actually pass (objective). distill 86 ≥ base 83 — coding ability did not drop, and is slightly higher.
Agentic SOLO: the model orchestrates + executes 20 complex tasks by itself; judge = the task/harness author (who knows best whether it was "actually done"). distill 16 > base 13. ⚠️ This metric is judge-subjective (a stricter judge ties the two), so treat it as a trend — the hard numbers are GPQA / coding.
Q5_K_M evaluation — distill vs base, streaming harness (re-run)
Protocol (annotated — DIFFERENT from the Q4_K_M section above; do not cross-compare tiers): both this distill and the base are Q5_K_M-imatrix GGUF + native MTP, same-spec, served base-mode (MTP off) for a concurrency eval. Distill harness = SSE streaming (
stream=True), timeout 1800s · concurrency 4 · max_tokens 32000, thinking-on, temp 0.6 / top_p 0.95,finish_reasonlogged per question. Base GPQA result is from the original conc=4 run (non-streaming), butfinish_reasondata confirms 0 errors (zero false timeouts) — the 12lengthhits each showcompletion_tokens=32768, i.e. genuinely unconverged, not harness artifacts. Base MMLU re-run uses the same SSE streaming harness as distill.
| Dimension | Distill Q5_K_M | Base Q5_K_M | Δ |
|---|---|---|---|
| GPQA-Diamond-198 | 81.82% (162/198) | 68.69% (136/198) | +13.13pp |
| MMLU-500 (5-shot) | 90.0% (450/500) | 89.6% (448/500) | +0.4pp |
GPQA finish=stop (converged) |
198 / 198 | 186 / 198 | |
GPQA finish=length (hit 32K wall, never answered) |
0 | 12 | |
| GPQA errors (timeout/etc.) | 0 | 0 |
Reading: under the streaming harness (zero false-timeouts, confirmed by errors=0), the distill converges on every single question (198/198 stop, 0 length), while the base runs into the 32K wall on 12 questions (length, never produces an answer). This is the hardest, cleanest evidence of the distillation's "learn to converge / 收口" effect — now quantified by finish_reason, not just accuracy.
⚠️ Do NOT cross-compare quant tiers: the Q4_K_M table uses an older (non-streaming) harness; this Q5_K_M table uses the streaming harness. Comparing e.g. base-Q4 vs base-Q5 across tiers is meaningless (harness differs). Only the within-tier distill-vs-base Δ is valid.
MTP (multi-token prediction) single-stream acceleration — measured best config + lossless
两条 MTP 通路:GGUF 走 llama.cpp(
--spec-type draft-mtp,见下);BF16 / FP8 safetensors 走 vLLM / SGLang(--speculative-config '{"method":"mtp","num_speculative_tokens":3}')—— BF16 与 FP8 仓现均已焊原生 nextn 头(SGLang 实测 accept 0.76–0.88;BF16 在 Ampere 等无原生 FP8 的卡上更合适)。 Two MTP paths: GGUF via llama.cpp (--spec-type draft-mtp, below); BF16 / FP8 safetensors via vLLM / SGLang (--speculative-config '{"method":"mtp","num_speculative_tokens":3}') — both repos now bundle the native nextn head (SGLang-measured accept 0.76–0.88).
This model's gguf contains a native MTP head (mainline llama.cpp --spec-type draft-mtp; no -md / external draft model needed).
Best config measured (single-stream, Q4_K_M-imatrix; tested on DGX Spark GB10, unified-memory bandwidth-bound — Mac / Blackwell RTX-50 (FP4) can be faster):
--spec-draft-n-max (p-min=0) |
single-stream TPS | draft accept rate | mean accept len |
|---|---|---|---|
| bare no-MTP | 10.4 | — | — |
| n-max=2 | 24.1 | 0.82 | 2.64 |
| n-max=3 ⭐ (recommended) | 26.8 | 0.72 | 3.16 |
| n-max=4 | 27.4 | 0.65 | 3.62 |
- 2.3–2.6× single-stream speedup (vs bare 10.4 TPS); n-max=3 is the throughput/accept-rate balance point.
- Greedy speculative decoding is lossless by construction: it only accepts the target-argmax token. Batched-verify GEMM rounding produces character-level differences on near-tie tokens — this is FP non-determinism, not quality loss (any two independent runs show it, even with MTP off).
- Speculation is a single-stream latency tool; concurrency degrades it (spec tokens take up KV/batch capacity) — for throughput scenarios use bare multi-concurrency mode.
- Recommended launch:
llama-server -m *-MTP-Q4_K_M-imatrix.gguf --spec-type draft-mtp --spec-draft-n-max 3 --jinja
Note: single-stream TPS varies by content — coding prompts accept ~0.72 → 26.8 t/s, reasoning prompts ~0.58 → 24.6 t/s (n-max=3, all measured on DGX Spark GB10). The current MTP is the base's native nextn head grafted on (lossless); the base head predicts a bit weakly on the post-distillation reasoning distribution, so the reasoning accept rate is lower. A distill-specific retrained MTP head (to pull accept back to ~0.8) is on the roadmap.
Eval protocol
thinking-on; temp 0.6 / top_p 0.95 (required for thinking models — greedy loops to death); max_tokens 32768; read-timeout ≥ 2400s. The same spec is applied to every compared model.
Limitations
- Distills thinking style, not capability: black-box SFT cannot raise the knowledge ceiling.
- Tool execution results are "simulated", not actually run:
- This version (compromise): each "execution result" line in the multi-turn tool calls is improvised by a small model (DeepSeek-V4-Flash) role-playing the "runtime", not obtained by actually running code in a sandbox. Chosen purely for cost and speed — real execution needs a full "generate → run in a real sandbox → feed results back to the teacher → continue" agentic harness, which is slow and heavy; one simulated pass is enough. This is an engineering trade-off, not because it is better.
- Cost (sim-to-real gap): a simulated result can be wrong (Flash may optimistically fabricate "tests passed" when that code would actually crash) → the model can learn from "fake-success" trajectories, and may even acquire the tendency to fabricate tool return values itself.
- Optimal approach (coming in the next version) = real-sandbox execution + rejection sampling: every tool call runs in a real environment to get a real result, then a judge keeps only the trajectories that genuinely solved the task and discards the failed ones — eliminating "fake success" at the root. We have already implemented this pipeline (real sandbox + DS judge), but this version's data did not use it; the next distillation will be redone with it.
- Note: simulation ≠ rejection sampling — simulation is about "how the observation is obtained" (fabricate vs. really run); rejection sampling is about "filtering out the wrong ones by real outcome". Because simulation never really runs, it leaves no ground on which rejection sampling could even operate.
Files
*.safetensors— BF16 merged weights (SGLang / vLLM / transformers)gguf/Qwen3.6-27B-DSV4Pro-Distill-MTP-Q4_K_M-imatrix.gguf— the only GGUF, native MTP version (Q4_K_M-imatrix). Add--spec-type draft-mtpfor the fastest single-stream; without that flag it is just a normal Q4_K_M model (MTP head inactive) — so no separate "non-MTP plain version" is provided, to keep anyone from downloading the wrong file and thinking it lacks MTP.- NVFP4 (W4A16-style) — quality-first ModelOpt NVFP4. Language MLP
gate/up/down_projcompressed to FP4; attention, Mamba, vision, embeddings,lm_head, norms kept high-precision. vLLM/SGLang high-concurrency. GPQA 82.83% / MMLU 87.80%. Single-stream MTP → use GGUF.
Inference
thinking-on, always use temp=0.6, top_p=0.95 (never greedy). llama.cpp: gguf + --jinja (MTP version add --spec-type draft-mtp --spec-draft-n-max 3); SGLang / vLLM: safetensors.
🇨🇳 中文版
在 Qwen3.6-27B(Dense,64 层,Gated DeltaNet 线性/全注意力混合)上,用 LoRA 蒸馏 DeepSeek-V4-Pro 在「思考开启(thinking-on)」时的思维方式 + agentic 行为。
这是 35B-A3B(MoE)姊妹版的 Dense 复现:同一台 R6000、同一 teacher、同一套配方,换到 Dense 架构——证明提升来自蒸进去的思维方式,不是 MoE 架构红利。并焊了原生 MTP 做单流加速。
⚠️ 蒸思维方式 ≠ 蒸知识/能力:目标是「学会怎么想、怎么收口」,不是蒸知识或扩能力上限。
训练配置(如实披露)
- 基座 Base:Qwen3.6-27B(Dense,BF16 基座)
- 方法 Method:LoRA,r = 64,α = 128,dropout = 0.05,target = 全部注意力 + MLP 投影
- 优化:paged_adamw_8bit,cosine LR,warmup 0.03,约 1 epoch
- Teacher:DeepSeek-V4-Pro(thinking-on + agentic)
- 数据 Data:~1842 条蒸馏样本(lynn_prod 口径)。轨迹 = DS-V4-Pro 在 thinking-on 下的多步推理(
<think>)+ ReAct 式工具调用(想一步 → 调一次工具 → 看结果,循环)。- 工具的「执行结果」是模拟的,不是真跑的:多轮工具调用里那一行行「执行结果」,是用一个又小又快的模型(DeepSeek-V4-Flash)扮演"运行环境"现编出来的,并不是真的在沙箱里跑代码得到的——所以和真实运行有差距。
- **训练时只学"怎么想、怎么调工具",不学那些编出来的"执行结果"**:因为执行结果是假的,如果让模型去学它,模型就会养成"自己瞎编工具返回值"的坏习惯;所以我们只优化模型自己产出的「思考 + 工具调用」部分。
- 产物:合并 → BF16 safetensors →
gguf/Q4_K_M-imatrix(含原生 MTP 版)
方法非自创,是公开技术的组合(如实归因)
- ReAct(推理+行动交替):Yao et al., 2022, arXiv:2210.03629(ICLR 2023)
- **STaR(reasoning trace 自举)**:Zelikman et al., 2022, arXiv:2203.14465
- Self-Instruct / Baize 自对话:Wang et al., 2022;Xu et al., 2023, arXiv:2304.01196
- AgentTuning:Zeng et al., 2023, arXiv:2310.12823
- **ToolBench / ToolLLM(工具调用)**:Qin et al., 2023, arXiv:2307.16789
- DeepSeek-R1 推理蒸馏:DeepSeek-AI, 2025, arXiv:2501.12948
评测(Q4_K_M,旧 harness):同一 harness,thinking-on,vs 原版 Qwen3.6-27B
量化口径(重要,防误读):本模型与原版 base 都是 Q4_K_M(imatrix 校正)GGUF + 原生 MTP,完全同口径 —— 同 Q4_K_M、同 imatrix、同 MTP,唯一变量是"是否蒸馏"。不是拿蒸馏-Q4_K_M 去比 base-BF16(那样不公平);下面的 Δ 干净地归因于蒸馏本身,不掺量化差异。
| 维度 | 本模型(蒸馏) | 原版 base | Δ |
|---|---|---|---|
| GPQA-Diamond-198 | 80.81%(160/198,32K) | 73.7%(146/198) | +7.1 |
| MMLU-500 (5-shot) | 91.8%(459/500) | 91.6% | +0.2 |
| GPQA 未收口空答 (parse_fail) | 0 | 14 | -14 |
| coding-100 (10 语言×10) | 86/100 | 83/100 | +3 |
| Agentic SOLO (20 复杂任务) | 16/20 | 13/20 | +3 |
解读:硬推理显著提升(GPQA +7.1pp,160/198=80.81%,0 error/0 parse_fail)、知识不降反微涨(MMLU +0.2pp),且「想完就收口」—— GPQA 未收口空答从 14 降到 0(base 那 14 个全是思考到 32K 上限还没给答案;蒸馏后彻底归零)。中位生成长度压到 ~3006 token。注:35B-A3B 蒸馏 MMLU 掉 1.6pp,27B Dense 容量更大、装得下蒸馏而不挤占知识 —— **GPQA 涨、MMLU 不降,更干净的"纯赚"**。
coding-100:同一 harness、真沙箱跑代码看测试是否真过(客观)。distill 86 ≥ base 83,coding 能力没掉、还略高。
Agentic SOLO:模型自己编排+自己执行 20 道复杂任务,判官 = 出题/harness 作者(对"做没做到"最清楚)。distill 16 > base 13。⚠️ 此项判官主观性强(换更严判官两者打平),作趋势参考,硬指标看 GPQA/coding。
Q5_K_M 评测 —— 蒸馏 vs 原版,流式 harness(重测)
口径(已标注 —— 与上方 Q4_K_M 段口径不同,禁止跨档比):本蒸馏与原版**均为 Q5_K_M-imatrix GGUF + 原生 MTP、同规格、base 模式(MTP 关)**跑并发评测。Harness = SSE 流式(
stream=True—— 根除非流式客户端"干等满整段"造成的假超时,否则长思考会被误判超时),timeout 1800s · 并发 4 · max_tokens 32000,thinking-on,temp 0.6 / top_p 0.95。逐题记finish_reason→ stop(收口)/ length(撞 32K 墙、没答案)/ error。
| 维度 | 蒸馏 Q5_K_M | 原版 Q5_K_M | Δ |
|---|---|---|---|
| GPQA-Diamond-198 | 81.82%(162/198) | 68.69%(136/198) | +13.13pp |
| MMLU-500(5-shot) | 90.0%(450/500) | 89.6%(448/500) | +0.4pp |
GPQA finish=stop(收口) |
198 / 198 | 186 / 198 | |
GPQA finish=length(撞 32K 墙、始终没答) |
0 | 12 | |
| GPQA error(超时等) | 0 | 0 |
解读:流式 harness 下(errors=0 证明零假超时),蒸馏每题都收口(198/198 stop、0 length),而原版有 12 题撞 32K 墙(length、始终给不出答案)。这是蒸馏「学会收口」最硬、最干净的证据 —— 由 finish_reason 量化,不只看准确率。
⚠️ 禁止跨量化档比:Q4_K_M 表用旧(非流式)harness,本 Q5_K_M 表用流式 harness。跨档比(如 base-Q4 vs base-Q5)无意义(harness 不同)。只有同档内 蒸馏-vs-原版 的 Δ 有效。
MTP(多 token 预测)单流加速 — 实测最佳配置 + 无损
本模型的 gguf 含原生 MTP 头(mainline llama.cpp --spec-type draft-mtp,无需 -md / 外挂 draft 模型)。
**最佳配置实测(单流,Q4_K_M-imatrix;测于 DGX Spark GB10,统一内存带宽受限 —— Mac / Blackwell RTX-50(FP4) 可更快)**:
--spec-draft-n-max(p-min=0) |
单流 TPS | draft 接受率 | 平均接受长度 |
|---|---|---|---|
| 裸版 no-MTP | 10.4 | — | — |
| n-max=2 | 24.1 | 0.82 | 2.64 |
| n-max=3 ⭐(推荐) | 26.8 | 0.72 | 3.16 |
| n-max=4 | 27.4 | 0.65 | 3.62 |
- 2.3–2.6× 单流加速(vs 裸版 10.4 TPS);n-max=3 是吞吐/接受率平衡点。
- 贪心投机解码构造上无损:only accepts target-argmax token。批量 verify 的 GEMM 舍入会在 near-tie token 上产生字符级差异,这是 FP 非确定性、非质量损失(任意两次独立进程都会,哪怕都不开 MTP)。
- 投机=单流延迟工具,并发会退化(spec token 占 KV/batch 容量)—— 吞吐场景请用多并发裸版。
- 推荐启动:
llama-server -m *-MTP-Q4_K_M-imatrix.gguf --spec-type draft-mtp --spec-draft-n-max 3 --jinja
注:单流 TPS 因内容而异——编码类 prompt 接受率 ~0.72 → 26.8 t/s,推理类 ~0.58 → 24.6 t/s(n-max=3,均测于 DGX Spark GB10)。当前 MTP 为 base 原生 nextn 头嫁接(无损);base 头在蒸馏后偏移的推理分布上预测偏弱,故推理类接受率偏低。蒸馏专属重训 MTP 头(把接受率拉回 ~0.8)在路线图上。
各量化档 MTP 速度对比 / Per-quant MTP speed
所有 gguf 均焊原生 MTP。实测 DGX Spark GB10,单流,coding prompt,thinking-on,--spec-draft-n-max 3:
| 量化档 / Quant | 体积 / Size | 裸版 base TPS | MTP TPS | 加速 / Speedup | 接受率 / Accept |
|---|---|---|---|---|---|
Q4_K_M-imatrix |
~16 GB | 10.4 | 26.8 | 2.6× | 0.72 |
Q5_K_M-imatrix |
18.2 GB | 10.37 | 24.65 | 2.38× | 0.72 |
Q6_K-imatrix |
20.9 GB | 9.18 | 22.07 | 2.40× | 0.73 |
Q8_0 |
29 GB | 7.82 | 17.12 | 2.19× | 0.67 |
越小越快(内存带宽受限);MTP 全档 2.2–2.6× 加速,各档输出实测均正确(回文 / fibonacci 等编码题)。**Q8_0 ≈ BF16 质量**(8-bit 近无损;不带 imatrix——均匀 8-bit,重要性加权对它无意义)。
Smaller = faster (memory-bandwidth-bound); MTP gives 2.2–2.6× across all tiers; Q8_0 ≈ BF16 quality (near-lossless 8-bit).
评测口径 / Eval protocol
thinking-on;temp 0.6 / top_p 0.95(thinking 模型必需,greedy 会重复死循环);max_tokens 32768;read-timeout ≥2400s。同口径作用于所有对比模型。
局限 / Limitations
- 蒸思维方式,非蒸能力:黑盒 SFT 抬不高知识天花板。
- 工具执行结果是"模拟"的,不是真跑出来的:
- 本版(迁就方案):多轮工具调用里那一行行「执行结果」,是用一个小模型(DeepSeek-V4-Flash)扮演"运行环境"现编的,不是真在沙箱里跑代码得到的。选它纯粹是为了省成本、快——真实执行需要一整套"边生成边在真沙箱里跑、再把结果喂回 teacher 继续"的 agentic harness,慢且重;模拟一遍过就行。这是工程上的取舍,不是因为它更好。
- 代价(sim-to-real gap):模拟结果可能是错的(flash 会乐观地编一句"测试通过",但那段代码真跑其实会挂)→ 模型可能从"假成功"的轨迹里学到东西,甚至养成"自己瞎编工具返回值"的倾向。
- 最优方案(下一版补)= 真沙箱执行 + 拒绝采样:每一步工具调用都在真实环境里跑出真结果,再用判官只保留"真正把任务做对"的轨迹、扔掉失败的,从根上消除"假成功"。这条管线我们已经实现(真沙箱 + DS 判官),但本版数据未纳入,下一版蒸馏会用它重做。
- 注:模拟 ≠ 拒绝采样——模拟是"怎么拿到 observation"(编 vs 真跑),拒绝采样是"按真实结果筛掉做错的";模拟因为没真跑,反而让拒绝采样无从谈起。
文件 / Files
*.safetensors— BF16 合并权重(SGLang / vLLM / transformers)gguf/— 4 档原生 MTP GGUF(都焊 MTP;加--spec-type draft-mtp单流最快,不加即当普通 gguf 用,MTP 头不激活):…-MTP-Q4_K_M-imatrix.gguf(~16 GB,最快)…-MTP-Q5_K_M-imatrix.gguf(18.2 GB)…-MTP-Q6_K-imatrix.gguf(20.9 GB)…-MTP-Q8_0.gguf(29 GB,≈ BF16 质量,无 imatrix)
- FP8(block-128 e4m3 + 原生 MTP,SGLang serving)在独立仓
Qwen3.6-27B-DSV4Pro-Thinking-Distill-FP8 - **NVFP4 W4A16 风格**(质量优先,vLLM/SGLang 高并发)—— language MLP gate/up/down_proj NVFP4,其余模块保高精度。GPQA 82.83% / MMLU 87.80%。单流加速请用 GGUF + MTP。
推理 / Inference
thinking-on,务必 temp=0.6, top_p=0.95(切勿 greedy)。llama.cpp 用 gguf + --jinja(MTP 版加 --spec-type draft-mtp --spec-draft-n-max 3);SGLang/vLLM 用 safetensors。
- Downloads last month
- 5,911