Add google/recurrentgemma-2b-it

#677
by devingulliver - opened

Hello,
I noticed that google/recurrentgemma-2b was added to the leaderboard despite the architecture not being in a release of transformers yet.
Could you also evaluate the RLHF version google/recurrentgemma-2b-it so that both results are available?
Thank you

Note: I tried submitting the model to the Leaderboard the standard way and got the "trust_remote_code=True" error, even though there's no custom modeling code in the repo...
The original recurrentgemma commit was made by @clefourrier so I guess they're the only one with the power to add the RLHF version to the Leaderboard?
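For reference, a minimal sketch of loading it without trust_remote_code, assuming a transformers build that already includes the RecurrentGemma architecture (its availability is an assumption until it lands in a release):

# Sketch: checking that no custom modeling code is needed, assuming the
# installed transformers build already ships the RecurrentGemma architecture.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

# No trust_remote_code flag should be needed if the architecture lives in transformers itself.
config = AutoConfig.from_pretrained("google/recurrentgemma-2b-it")
print(config.model_type)

model = AutoModelForCausalLM.from_pretrained(
    "google/recurrentgemma-2b-it",
    torch_dtype=torch.float16,
)
print(type(model).__name__)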

Additionally, I'm seeing a surprising amount of variance in the scores reported by various groups, for example on ARC:

  • Open LLM Leaderboard (float16, 25-shot): 31.40
  • Google DeepMind's arXiv preprint (metric not listed): 42.3
  • Myself, installing transformers from source (float16, 25-shot): 47.53

What's the source of error here? Is it simply different versions of the eval suite and data, or is there a discrepancy in one of the recurrentgemma implementations?

Hugging Face H4 org

Hi!
As explained on the About page, we don't report full ARC but only the ARC-Challenge subset, for which we report normalized accuracy (acc_norm).
You should get scores extremely close to ours if you use the same harness and model commit.
The differences you see between what we report (with a fully reproducible setup) and what papers report (often without any details of their evaluation scripts) are exactly why it's important to have a space like the leaderboard, where all models are evaluated in the same way.
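For reference, here is a sketch of how to reproduce a leaderboard-style number with lm-evaluation-harness from Python; the function and result-key names are assumed from the v0.4.x harness and may differ in other versions:

# Sketch: leaderboard-style ARC evaluation — task arc_challenge, 25-shot,
# and the reported metric is acc_norm, not plain acc.
# (lm-evaluation-harness v0.4.x API assumed.)
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=google/recurrentgemma-2b,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=25,
)
print(results["results"]["arc_challenge"])  # read acc_norm, not acc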

> The original recurrentgemma commit was made by @clefourrier

I added the original model because we worked with Google to have the results on the leaderboard on release day, so I had to run the evaluation privately and port the results over on the day.
It will be possible to submit it automatically once the code changes make it into a stable release of transformers (they are only on main at the moment, iirc) and we update the leaderboard's dependencies.
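Once that happens, a quick way to check whether an installed transformers build already exposes the architecture is something like the sketch below (the class name is taken from transformers main, so treat it as an assumption until a stable release documents it):

# Sketch: does this transformers install ship RecurrentGemma?
import transformers

print("transformers version:", transformers.__version__)
print("RecurrentGemma available:", hasattr(transformers, "RecurrentGemmaForCausalLM"))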

clefourrier changed discussion status to closed

In light of the new transformers release, I checked to see if anyone had submitted the recurrentgemma model yet. It was submitted... and it failed.
I'd like to inquire about the logs and see what I can do to successfully submit it myself. (I'm unable to "reopen" this issue, but it seemed improper to create a new one for essentially the same topic.)

Hugging Face H4 org

Hi!
We updated our requirements yesterday for both the frontend and the backend. I just checked the logs: the model evaluation started fine but was SIGTERMed. I'll relaunch it, but we may have to manage our automatic DP/PP (data-parallel/pipeline-parallel) system differently for recurrent models, which could require more memory at inference.

clefourrier changed discussion status to open

Looks like it failed again... This is from the same transformers release as the successful recurrentgemma eval, so I'm not sure what went wrong. Could you send the logs this time?

Hugging Face H4 org

Hi @devingulliver,
Here are the logs, which won't tell you much more than what I said above.

Evaluation {
  Evaluate on 62 tasks.
  Running loglikelihood requests
  0%|          | 0/1 [00:00<?, ?it/s]
[2024-04-19 06:37:43,128] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1869283 closing signal SIGTERM
[2024-04-19 06:37:43,129] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1869288 closing signal SIGTERM
[2024-04-19 06:37:44,294] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 1869282) of binary: .../bin/python

The evaluation was SIGTERMed, i.e. killed. When I launched the model manually, I had to experiment with DP/PP combinations; the default one does not seem to work. We will have to launch the evaluations for this one manually, since it's important for the community.
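For anyone trying to run it locally, here is a hedged sketch of what trading data parallelism for naive pipeline parallelism can look like with plain transformers + accelerate; it is an illustration only, not the leaderboard's actual launch configuration:

# Sketch: shard the model across visible GPUs with device_map="auto"
# (requires accelerate) instead of replicating one copy per GPU, which
# lowers per-device memory at the cost of throughput.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/recurrentgemma-2b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",  # spread layers over the available GPUs
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=10)[0], skip_special_tokens=True))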

cc @alozowski if you have the time :)

Thank you for the clear explanation! Glad you're working on it 🙂

Hugging Face H4 org

Hi! It's been manually relaunched by @alozowski! You'll find it on the leaderboard :)
Closing the issue.

clefourrier changed discussion status to closed
