AI & ML interests

Human In The Loop - data labeling, model training and hosting, human verification, and more

Recent Activity

toloka's activity

cogwheelhead 
posted an update 10 days ago
Hey!

My team and I recently released two benchmarks on university-level math: U-MATH (for University-MATH) and μ-MATH (for Meta U-MATH).

We work a lot on complex reasoning for LLMs, and we were particularly interested in evaluating university-curriculum math skills, in topics such as differential calculus and linear algebra, given their wide applicability and practicality.

We noticed that the benchmarks available at the time were either at or below high-school level, leaned mainly towards Olympiad-style problems, or were synthetically generated from a set of templates / seeds.

We wanted to focus on university curricula and we wanted "organic" variety, so we built our own benchmark from problems sourced from actual teaching materials used in top US universities. That is how U-MATH came to be.

We are also very eager (this is my primary focus in particular) to study and improve evaluations themselves: the standard LLM-as-a-judge approach is known to be noisy and biased, yet that often goes unaccounted for. So we created a U-MATH-derived benchmark for "meta-evaluations", i.e. for evaluating the evaluators, which lets us quantify their error rates, study their behaviors and biases, and so on.
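To make "evaluating the evaluators" concrete, here is a minimal sketch with made-up data (not the actual μ-MATH protocol or schema): given gold human verdicts on whether a model's solution is correct, and a judge's verdicts on the same solutions, you can quantify the judge's error rate and see whether it skews toward false accepts or false rejects.

```python
# Toy meta-evaluation sketch: compare an LLM judge's verdicts against
# gold human verdicts. The data below is invented for illustration only.

# gold[i] = True if the model's solution to problem i is actually correct
gold  = [True, False, True, True, False, False, True, False]
# judge[i] = True if the LLM judge accepted that solution as correct
judge = [True, True,  True, False, False, True,  True, False]

# Overall error rate: how often the judge disagrees with the gold verdict
errors = sum(g != j for g, j in zip(gold, judge))
error_rate = errors / len(gold)

# Directional biases: accepting wrong solutions vs. rejecting correct ones
false_accepts = sum((not g) and j for g, j in zip(gold, judge))
false_rejects = sum(g and (not j) for g, j in zip(gold, judge))

print(f"error rate:    {error_rate:.2f}")
print(f"false accepts: {false_accepts}")  # judge too lenient
print(f"false rejects: {false_rejects}")  # judge too strict
```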

I'm super excited to be sharing those publicly!

toloka/u-math
toloka/mu-math
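If you want to poke at the data, both sets can be pulled with the datasets library. This is just a generic loading sketch; splits and column names are whatever the dataset cards specify, so inspect them first.

```python
from datasets import load_dataset

# Repo IDs are taken from the links above; check the dataset cards
# for the available splits and fields before building on them.
u_math = load_dataset("toloka/u-math")
mu_math = load_dataset("toloka/mu-math")

print(u_math)
print(mu_math)
```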