Add google/recurrentgemma-2b-it
Hello,
I noticed that google/recurrentgemma-2b was added to the leaderboard despite the architecture not being in a release of transformers yet.
Could you also evaluate the RLHF version google/recurrentgemma-2b-it so that both results are available?
Thank you
Note: I tried submitting the model to the Leaderboard the standard way and got the "trust_remote_code=True" error, even though there's no custom modeling code in the repo...
The original recurrentgemma commit was made by @clefourrier, so I guess they're the only one with the power to add the RLHF version to the Leaderboard?
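To illustrate what I mean, here's a minimal check (my own sketch, nothing to do with the leaderboard's submission code): with a transformers build that actually includes the RecurrentGemma architecture, the model loads from the Hub without any custom code, so the trust_remote_code requirement looks spurious.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# There is no custom modeling code in the repo, so this should load cleanly
# without trust_remote_code=True, provided the installed transformers version
# already ships the RecurrentGemma architecture.
tokenizer = AutoTokenizer.from_pretrained("google/recurrentgemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/recurrentgemma-2b-it")
```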
Additionally, I'm seeing a surprising amount of variance in scores reported by various groups, for example ARC:
- Open LLM Leaderboard (float16, 25-shot): 31.40
- Google DeepMind's arXiv preprint (metric not listed): 42.3
- Myself, installing transformers from source (float16, 25-shot): 47.53
What's the source of error here? Is it simply different versions of the eval suite and data, or is there a discrepancy in one of the recurrentgemma implementations?
Hi!
As you can see on the About page, we don't report ARC as a whole, but only the ARC-Challenge subset, for which we report the normalized accuracy.
You should get scores extremely close to ours if you use the same harness and model commit.
The differences you see between what we report (with a completely reproducible setup) and what papers report (without providing anything about their evaluation scripts) is typically an illustration of why it's important to have a space like the leaderboard where all models are evaluated in the same way.
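To make "same harness and model commit" concrete, a run along these lines (a rough sketch, not our exact pinned setup; see the About page for the harness commit we actually use) should land very close to our number:

```python
# Sketch only: assumes a recent pip-installed lm-evaluation-harness (lm-eval)
# and a transformers build with RecurrentGemma support. Pin the same harness
# commit and model revision as the leaderboard to compare numbers directly.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=google/recurrentgemma-2b-it,dtype=float16",
    tasks=["arc_challenge"],
    num_fewshot=25,
)
print(results["results"]["arc_challenge"])  # compare acc_norm, not acc
```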
> The original recurrentgemma commit was made by @clefourrier
I had added the original model as we worked with Google to have the results on the leaderboard on release date, so I had to run it privately and port it on the day.
It will be possible to submit automatically once the code modifications make it into a stable release of transformers (they are only on main atm iirc), and we update our leaderboard's dependencies.
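Once that release is out, a quick way to check whether a local install actually has the architecture (just a sketch; the exact version number is whatever the release notes say):

```python
import transformers

# On builds that predate the RecurrentGemma merge this prints False;
# on a release that includes it, the class is exposed at the top level.
print(transformers.__version__)
print(hasattr(transformers, "RecurrentGemmaForCausalLM"))
```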
In light of the new transformers release, I checked to see if anyone had submitted the recurrentgemma model yet. It was submitted... and it failed.
I'd like to inquire about the logs and see what I can do to successfully submit it myself. (I'm unable to "reopen" this issue, but it seemed improper to create a new one for essentially the same topic.)
Hi!
We updated our requirements yesterday for both the frontend and backend. I just checked the logs: the model evaluation started fine but was sigtermed. I'll relaunch it, but maybe we'll have to manage our automatic DP/PP system differently for recurrent models (they could require more memory at inference).
Looks like it failed again... this is from the same release as the successful recurrentgemma eval, so I'm not sure what went wrong. Could you send the logs this time?
Hi @devingulliver,
Here are the logs, which will not tell you anything more than I did above.
Evaluate on 62 tasks.
Running loglikelihood requests
0%| | 0/1 [00:00<?, ?it/s] [2024-04-19 06:37:43,128] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1869283 closing signal SIGTERM
[2024-04-19 06:37:43,129] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1869288 closing signal SIGTERM
[2024-04-19 06:37:44,294] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -11) local_rank: 0 (pid: 1869282) of binary: .../bin/python
The evaluation was sigtermed, i.e. killed. When I launched the model manually, I had to experiment with DP/PP combinations; the default one does not seem to work. We will have to launch the evaluation for this one manually, since it's important for the community.
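For anyone wanting to run it on their own hardware in the meantime, the usual workaround when the default data-parallel setup runs out of memory is to shard the model across GPUs instead; a rough sketch (assuming accelerate is installed; this is not our backend's actual launcher config):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# device_map="auto" lets accelerate split layers across the available GPUs
# (naive pipeline parallelism) instead of replicating the whole model per
# process, trading throughput for a lower per-device memory footprint.
model = AutoModelForCausalLM.from_pretrained(
    "google/recurrentgemma-2b-it",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/recurrentgemma-2b-it")
```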
cc @alozowski if you have the time :)
Thank you for the clear explanation! Glad you're working on it 🙂
Hi! It's been manually relaunched by @alozowski! You'll find it on the leaderboard :)
Closing the issue