Add google/recurrentgemma-2b-it
Hello,
I noticed that google/recurrentgemma-2b was added to the leaderboard despite the architecture not being in a release of transformers yet.
Could you also evaluate the RLHF version google/recurrentgemma-2b-it so that both results are available?
Thank you
Note: I tried submitting the model to the Leaderboard the standard way and got the "trust_remote_code=True" error, even though there's no custom modeling code in the repo...
The original recurrentgemma commit was made by @clefourrier, so I guess they're the only one with the power to add the RLHF version to the Leaderboard?
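To illustrate what I mean, here's a minimal check (my own sketch, nothing to do with the leaderboard's submission code): with a transformers build that actually includes the RecurrentGemma architecture, the model loads from the Hub without any custom code, so the trust_remote_code requirement looks spurious.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# There is no custom modeling code in the repo, so this should load cleanly
# without trust_remote_code=True, provided the installed transformers version
# already ships the RecurrentGemma architecture.
tokenizer = AutoTokenizer.from_pretrained("google/recurrentgemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/recurrentgemma-2b-it")
```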
Additionally, I'm seeing a surprising amount of variance in scores reported by various groups, for example ARC:
- Open LLM Leaderboard (float16, 25-shot): 31.40
- Google DeepMind's arXiv preprint (metric not listed): 42.3
- Myself, installing transformers from source (float16, 25-shot): 47.53
What's the source of error here? Is it simply different versions of the eval suite and data, or is there a discrepancy in one of the recurrentgemma implementations?
Hi!
As you can see on the About page, we don't report ARC as a whole, but only the ARC-Challenge subset, for which we report the normalized accuracy.
You should get scores extremely close to ours if you use the same harness and model commit.
The differences you see between what we report (with a completely reproducible setup) and what papers report (without providing anything about their evaluation scripts) is typically an illustration of why it's important to have a space like the leaderboard where all models are evaluated in the same way.
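To make "same harness and model commit" concrete, a run along these lines (a rough sketch, not our exact pinned setup; see the About page for the harness commit we actually use) should land very close to our number:

```python
# Sketch only: assumes a recent pip-installed lm-evaluation-harness (lm-eval)
# and a transformers build with RecurrentGemma support. Pin the same harness
# commit and model revision as the leaderboard to compare numbers directly.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/recurrentgemma-2b-it,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=25,
)
print(results["results"]["arc_challenge"])  # compare acc_norm, not acc
```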
> The original recurrentgemma commit was made by @clefourrier
I had added the original model as we worked with Google to have the results on the leaderboard on release date, so I had to run it privately and port it on the day.
It will be possible to submit automatically once the code modifications make it into a stable release of transformers (they are only on main atm iirc), and we update our leaderboard's dependencies.
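Once that release is out, a quick way to check whether a local install actually has the architecture (just a sketch; the exact version number is whatever the release notes say):

```python
import transformers

# On builds that predate the RecurrentGemma merge this prints False;
# on a release that includes it, the class is exposed at the top level.
print(transformers.__version__)
print(hasattr(transformers, "RecurrentGemmaForCausalLM"))
```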
In light of the new transformers release, I checked to see if anyone had submitted the recurrentgemma model yet. It was submitted... and it failed.
I'd like to inquire about the logs and see what I can do to successfully submit it myself. (I'm unable to "reopen" this issue, but it seemed improper to create a new one for essentially the same topic.)
Hi!
We updated our requirements yesterday for both the frontend and backend. I just checked the logs: the model evaluation started fine but was sigtermed. I'll relaunch it, but maybe we'll have to manage our automatic DP/PP system differently for recurrent models (they could require more memory at inference).
Looks like it failed again... this is from the same release as the successful recurrentgemma eval, so I'm not sure what went wrong. Could you send the logs this time?
Hi @devingulliver,
Here are the logs, which will not tell you anything more than I did above.
Evaluate on 62 tasks.
Running loglikelihood requests
0%| | 0/1 [00:00<?, ?it/s] [2024-04-19 06:37:43,128] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1869283 closing signal SIGTERM
[2024-04-19 06:37:43,129] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1869288 closing signal SIGTERM
[2024-04-19 06:37:44,294] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 1869282) of binary: .../bin/python
The evaluation was sigtermed, i.e. killed. When I launched the model manually, I had to experiment with DP/PP combinations; the default one does not seem to work. We will have to launch the evaluation for this one manually, since it's important for the community.
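For anyone wanting to run it on their own hardware in the meantime, the usual workaround when the default data-parallel setup runs out of memory is to shard the model across GPUs instead; a rough sketch (assuming accelerate is installed; this is not our backend's actual launcher config):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" lets accelerate split layers across the available GPUs
# (naive pipeline parallelism) instead of replicating the whole model per
# process, trading throughput for a lower per-device memory footprint.
model = AutoModelForCausalLM.from_pretrained(
    "google/recurrentgemma-2b-it",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/recurrentgemma-2b-it")
```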
cc @alozowski if you have the time :)
Thank you for the clear explanation! Glad you're working on it 🙂
Hi! It's been manually relaunched by @alozowski! You'll find it on the leaderboard :)
Closing the issue