Hey!
My team and I recently released two benchmarks on university-level math: U-MATH (for University-MATH) and μ-MATH (for Meta U-MATH).
We work a lot on complex reasoning for LLMs, and we were particularly interested in evaluating university-curriculum math skills (in topics such as differential calculus and linear algebra) for their wide applicability and practicality.
We noticed that the benchmarks available at the time were either at or below high-school level, leaned mainly towards Olympiad-style problems, or were synthetically generated from a set of templates or seeds.
We wanted to focus on university curricula, and we wanted "organic" variety, so we created our own benchmark from problems sourced from actual teaching materials used in top US universities; that is how U-MATH came to be.
We are also very keen on studying and improving evaluations themselves (this is my primary focus in particular), since the standard LLM-as-a-judge approach is known to be noisy and biased, yet that often goes unaccounted for. So we created a U-MATH-derived benchmark for "meta-evaluations", i.e. for evaluating the evaluators, which lets us quantify their error rates, study their behaviors and biases, and so on.
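To make the "evaluate the evaluators" idea concrete, here is a minimal sketch of what such a meta-evaluation can look like: compare each judge verdict against a gold human label and report the judge's error rates. The record layout and field names below are hypothetical illustrations, not the actual μ-MATH schema.

```python
# Minimal meta-evaluation sketch (hypothetical field names, not the real μ-MATH schema):
# each record pairs a gold human label with a judge's verdict on whether a model's
# free-form answer matches the reference solution.
records = [
    {"gold_correct": True,  "judge_correct": True},
    {"gold_correct": False, "judge_correct": True},   # judge false positive
    {"gold_correct": True,  "judge_correct": False},  # judge false negative
    {"gold_correct": False, "judge_correct": False},
]

tp = sum(r["gold_correct"] and r["judge_correct"] for r in records)
fp = sum(not r["gold_correct"] and r["judge_correct"] for r in records)
fn = sum(r["gold_correct"] and not r["judge_correct"] for r in records)
tn = sum(not r["gold_correct"] and not r["judge_correct"] for r in records)

accuracy = (tp + tn) / len(records)
false_positive_rate = fp / (fp + tn)  # judge accepts a wrong answer
false_negative_rate = fn / (fn + tp)  # judge rejects a correct answer
print(accuracy, false_positive_rate, false_negative_rate)
```

Breaking the judge's mistakes down this way (rather than reporting a single agreement number) is what makes it possible to talk about specific biases, e.g. a judge that is systematically too lenient versus one that is too strict.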
I'm super excited to be sharing those publicly!
toloka/u-math
toloka/mu-math
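If you want to poke at the data, both sets should be loadable with the Hugging Face `datasets` library. This is just a sketch: the repository IDs are the ones linked above, but the split name is an assumption on my part, so check the dataset cards for the actual configuration.

```python
from datasets import load_dataset

# Sketch only: the "test" split name is an assumption, see the dataset cards.
u_math = load_dataset("toloka/u-math", split="test")
mu_math = load_dataset("toloka/mu-math", split="test")

print(u_math[0])   # inspect a sample university-level problem
print(mu_math[0])  # inspect a sample meta-evaluation record
```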