arxiv:2402.04494

Grandmaster-Level Chess Without Search

Published on Feb 7
· Featured in Daily Papers on Feb 8

Abstract

The recent breakthrough successes in machine learning are mainly attributed to scale: namely large-scale attention-based architectures and datasets of unprecedented scale. This paper investigates the impact of training at scale for chess. Unlike traditional chess engines that rely on complex heuristics, explicit search, or a combination of both, we train a 270M parameter transformer model with supervised learning on a dataset of 10 million chess games. We annotate each board in the dataset with action-values provided by the powerful Stockfish 16 engine, leading to roughly 15 billion data points. Our largest model reaches a Lichess blitz Elo of 2895 against humans, and successfully solves a series of challenging chess puzzles, without any domain-specific tweaks or explicit search algorithms. We also show that our model outperforms AlphaZero's policy and value networks (without MCTS) and GPT-3.5-turbo-instruct. A systematic investigation of model and dataset size shows that strong chess performance only arises at sufficient scale. To validate our results, we perform an extensive series of ablations of design choices and hyperparameters.
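As a rough illustration of the annotation step described above (not the authors' pipeline: the engine path, the time budget, and the win-probability conversion are illustrative assumptions), one could query Stockfish for an evaluation of every legal move via the python-chess bindings, which is the kind of per-move action-value the dataset stores:

```python
import chess
import chess.engine

# Hypothetical local path to a Stockfish binary; adjust for your system.
STOCKFISH_PATH = "/usr/bin/stockfish"

def action_values(fen: str, time_per_move: float = 0.05) -> dict:
    """Score every legal move of a position with Stockfish.

    Returns a mapping from UCI move strings to expected scores in [0, 1]
    (win probability plus half the draw probability) from the side to
    move's point of view.
    """
    board = chess.Board(fen)
    values = {}
    engine = chess.engine.SimpleEngine.popen_uci(STOCKFISH_PATH)
    try:
        for move in board.legal_moves:
            # Restricting the search to a single root move yields that
            # move's evaluation, i.e. an action-value for (board, move).
            info = engine.analyse(
                board,
                chess.engine.Limit(time=time_per_move),
                root_moves=[move],
            )
            score = info["score"].pov(board.turn)
            values[move.uci()] = score.wdl().expectation()
    finally:
        engine.quit()
    return values

if __name__ == "__main__":
    print(action_values(chess.Board().fen()))  # starting position
```

Annotating every board of the 10 million games in this fashion is what yields the roughly 15 billion data points mentioned in the abstract.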

Community

This tells me that our current architectures are far more powerful than we give them credit for, and that they have a lot of untapped potential that can be unlocked with "smarter, more complex" data. For example, imagine we had the internet-archive data of an advanced alien civilization and trained our current models on it: they would be orders of magnitude better, with the same architecture.

Yeah, that "untapped potential" = thousands of GPUs = $$$

Sure, you can throw GPUs at it, but that's not what I mean. I'm talking about current architectures and methodologies trained on (non-existent) ultra-high-quality data.

" We annotate each board in the dataset with action-values provided by the powerful Stockfish 16 engine"

As I understand it, the hard part, learning the state-action values, was outsourced to a specialist engine (which, by the way, uses search to create those predictions).
While I think it's an interesting experiment, I don't immediately see what new insights it gives us.

This paper contains almost no novelty. To imply that the method does not use search is completely disingenuous: it is trained on Stockfish search targets, and Stockfish not only uses search but also embeds many human-developed heuristics. This method therefore simply performs one stage of expert iteration, except with a highly fine-tuned, sophisticated expert that is not even fully AI. Presumably the only contribution is the neural network architecture, which does not seem considerably novel or different from the ones used in Leela Zero. Would recommend skipping this paper.
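To make the "one stage of expert iteration" point concrete, here is a rough sketch contrasting the full loop with a single imitation step against a fixed expert. Every name here (Policy, SearchExpert, self_play) is a hypothetical placeholder for illustration, not code from the paper or from any AlphaZero/Leela implementation.

```python
class Policy:
    """Stand-in for a learnable policy network."""
    def fit(self, data):
        pass  # supervised update toward the expert's targets

class SearchExpert:
    """Stand-in for a search procedure guided by the current policy
    (e.g. MCTS in AlphaZero-style expert iteration)."""
    def __init__(self, policy):
        self.policy = policy

def self_play(expert):
    """Stand-in for data generation with the search-augmented expert."""
    return []

def expert_iteration(policy, n_iters):
    """Full expert iteration: the expert gets stronger as the policy improves."""
    for _ in range(n_iters):
        expert = SearchExpert(policy)   # search wrapped around the current policy
        games = self_play(expert)       # fresh data from the stronger expert
        policy.fit(games)               # imitation: the policy catches up
    return policy

def single_stage_distillation(policy, stockfish_labelled_games):
    """The comment's reading of the paper: one imitation step against a fixed,
    search-based expert (Stockfish), with no further iteration."""
    policy.fit(stockfish_labelled_games)
    return policy
```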

@eelang I think these days combining human-annotated datasets with pseudo-labelled datasets and feeding them to large models (or even self-training) just works well. It begs the question why we haven't tried this before. This pattern helped develop many foundation models in other domains, like OWLv2 or SAM. I guess this is just yet another adaptation of the same recipe.
So the trend of 2024 seems to be: 1. find a scalable architecture, 2. scale it not only with GPUs (which was last year's trend) but also with data coverage through manually labelled + pseudo-labelled datasets, 3. (optional) distill/quantize. A rough sketch of this recipe follows below.
@theswifter01 A lot of GPUs only gets one so far; data coverage is the 👑
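As a minimal sketch of steps 1-2 of that recipe (the `teacher`/`student` objects and the confidence threshold are hypothetical illustrations, not anything from the paper, where the "teacher" is Stockfish rather than a neural model):

```python
def build_training_set(labeled, unlabeled, teacher, threshold=0.9):
    """Combine manually labelled examples with confident teacher pseudo-labels."""
    pseudo = []
    for x in unlabeled:
        label, confidence = teacher.predict(x)   # hypothetical teacher API
        if confidence >= threshold:              # keep only confident pseudo-labels
            pseudo.append((x, label))
    return list(labeled) + pseudo

def train_student(student, labeled, unlabeled, teacher):
    """Steps 1-2: a scalable student trained on broad (real + pseudo) coverage;
    step 3 (optional) would distill/quantize the result for deployment."""
    data = build_training_set(labeled, unlabeled, teacher)
    student.fit(data)                            # hypothetical student API
    return student
```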
