Daniel Castillo PRO

Dcas89

AI & ML interests

None yet

Recent Activity

upvoted a paper 2 days ago
s1: Simple test-time scaling
reacted to Kseniase's post with 👍 9 days ago
16 new research works on inference-time scaling

Organizations

None yet

Dcas89's activity

reacted to Kseniase's post with 👍 9 days ago
16 new research works on inference-time scaling:

Over the last couple of weeks, a large number of studies on inference-time scaling have emerged. And it's so cool, because each new paper adds a trick to the toolbox, making LLMs more capable without scaling up the models' parameter count.

So here are 13 new methods + 3 comprehensive studies on test-time scaling:

1. Inference-Time Scaling for Generalist Reward Modeling (2504.02495)
Probably the most popular study. It proposes boosting inference-time scalability by improving reward modeling: DeepSeek-GRM combines adaptive critiques, parallel sampling, a pointwise generative RM, and Self-Principled Critique Tuning (SPCT). (A generic parallel-sampling sketch follows the list.)

2. T1: Tool-integrated Self-verification for Test-time Compute Scaling in Small Language Models (2504.04718)
Lets small models use external tools, such as code interpreters and calculators, to strengthen self-verification. (A toy verification sketch follows the list.)

3. Z1: Efficient Test-time Scaling with Code (2504.00810)
Proposes training LLMs on code-based reasoning paths to make test-time scaling more efficient, limiting unnecessary tokens with a dedicated dataset and a Shifted Thinking Window

4. GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning (2504.00891)
Introduces GenPRM, a generative PRM that uses CoT reasoning and code verification for step-by-step judgment. With only 23K training examples, GenPRM outperforms prior PRMs and larger models

5. Can Test-Time Scaling Improve World Foundation Model? (2503.24320)
The SWIFT test-time scaling framework improves world foundation models' performance without retraining, using strategies like fast tokenization, Top-K pruning, and efficient beam search

6. Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking (2504.07104)
Proposes REBEL for scaling RAG systems: multi-criteria optimization with CoT prompting for better performance-speed tradeoffs as inference compute increases. (A toy multi-criteria reranker follows the list.)

7. φ-Decoding: Adaptive Foresight Sampling for Balanced Inference-Time Exploration and Exploitation (2503.13288)
Proposes the φ-Decoding strategy, which uses foresight sampling, clustering, and adaptive pruning to estimate and select optimal reasoning steps. (A skeleton of the foresight loop follows the list.)
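
To ground item 1, here is a minimal sketch of the generic parallel-sampling pattern that reward-model-based inference-time scaling builds on: sample several candidates, then pick by reward or by vote. `generate_fn`, `reward_fn`, and `extract_answer` are assumed placeholder callables (an LLM and a reward model); this is not DeepSeek-GRM's actual algorithm, which layers adaptive critiques and SPCT on top.

```python
from collections import Counter

def best_of_n(generate_fn, reward_fn, prompt, n=8):
    """Sample n candidates; keep the one the reward model scores highest."""
    candidates = [generate_fn(prompt) for _ in range(n)]  # parallel sampling
    scores = [reward_fn(prompt, c) for c in candidates]   # pointwise reward per candidate
    return candidates[max(range(n), key=scores.__getitem__)]

def majority_vote(generate_fn, extract_answer, prompt, n=8):
    """Alternative: scale compute by voting over extracted final answers."""
    answers = [extract_answer(generate_fn(prompt)) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```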
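
Item 2's idea in miniature: rather than letting a small model judge its own work, re-check it with an external tool. Below is a hedged sketch using a toy calculator as the tool; the expression/answer interface is an illustrative assumption, not T1's actual API.

```python
import ast
import operator

# Tiny calculator "tool": safely evaluates +, -, *, / expressions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calculator(expr: str) -> float:
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

def tool_verified(expression: str, claimed: float, tol: float = 1e-6) -> bool:
    """Keep a sampled answer only if the tool reproduces the claimed result."""
    try:
        return abs(calculator(expression) - claimed) <= tol
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False

# Usage idea: filter parallel samples before any voting/reward step, e.g.
# kept = [s for s in samples if tool_verified(s["calc"], s["answer"])]
```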
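
And a toy version of item 6's multi-criteria reranking: score retrieved passages on several axes and combine them, instead of ranking by relevance alone. The criteria, scorers, and weights below are invented for illustration; REBEL derives its criteria via chain-of-thought prompting rather than hard-coding them.

```python
def rerank(passages, scorers, weights):
    """Rank passages by a weighted sum of criteria, not relevance alone.

    scorers: dict of criterion name -> fn(passage) -> score in [0, 1]
    weights: dict of criterion name -> importance weight
    """
    def total(p):
        return sum(weights[name] * score(p) for name, score in scorers.items())
    return sorted(passages, key=total, reverse=True)

# Made-up example criteria and weights:
ranked = rerank(
    ["passage A", "longer passage B"],
    scorers={"relevance": lambda p: min(len(p) / 20.0, 1.0),  # placeholder scorer
             "recency":   lambda p: 0.5},                     # placeholder scorer
    weights={"relevance": 0.7, "recency": 0.3},
)
```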
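
Finally, a rough skeleton of the foresight-sampling loop that item 7 refines. `step_sampler`, `rollout`, and `value_fn` are assumed interfaces (an LLM proposing next steps, a short simulated continuation, and a scorer); φ-Decoding's clustering and adaptive pruning are only noted in a comment, not implemented here.

```python
def foresight_decode(prompt, step_sampler, rollout, value_fn,
                     k=4, horizon=2, max_steps=16):
    """Build a reasoning trace step by step, picking each step by lookahead."""
    trace = [prompt]
    for _ in range(max_steps):
        candidates = step_sampler(trace, k)  # propose k candidate next steps
        if not candidates:
            break
        # Score each candidate by simulating a short continuation ("foresight").
        scored = [(value_fn(rollout(trace + [c], horizon)), c) for c in candidates]
        # φ-Decoding would cluster similar candidates and adaptively prune
        # low-value ones here before committing to a step.
        trace.append(max(scored, key=lambda sc: sc[0])[1])
    return trace
```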

Read further below 👇

Also, subscribe to the Turing Post https://www.turingpost.com/subscribe
reacted to AdinaY's post with 🔥 11 days ago
Moonshot AI 月之暗面 🌛 @Kimi_Moonshot just dropped an MoE VLM and an MoE Reasoning VLM on the Hub!!

Model: https://huggingface.co/collections/moonshotai/kimi-vl-a3b-67f67b6ac91d3b03d382dd85

✨ 3B with MIT license
✨ Long context window up to 128K
✨ Strong multimodal reasoning (36.8% on MathVision, on par with 10x larger models) and agent skills (34.5% on ScreenSpot-Pro)
reacted to AdinaY's post with 🔥 11 days ago
Shanghai AI Lab's OpenGVLab team just released InternVL3 🔥

OpenGVLab/internvl3-67f7f690be79c2fe9d74fe9d

✨ 1/2/8/9/14/38/78B with MIT license
✨ Stronger perception & reasoning vs InternVL 2.5
✨ Native Multimodal Pre-Training for even better language performance
New activity in microsoft/Phi-4-mini-instruct about 1 month ago

ValueError Rope Scaling

#22 opened about 1 month ago by clawvyrin