open-llm-leaderboard/open_llm_leaderboard · Evaluation failed `fblgit/UNA-SOLAR-10.7B-Instruct-v1.0`

Dec 19, 2023

I was able to reproduce the tests on the same commit without any problem. What caused the "FAILED" state on the evaluation?

Regards

clefourrier

Open LLM Leaderboard org Dec 19, 2023

Hi @fblgit ,
Please follow the FAQ in the about page and link the request file of your model so I can investigate more easily :)

fblgit

Dec 19, 2023

Sure
https://huggingface.co/datasets/open-llm-leaderboard/requests/commit/20fd1e06e84f5d6363e037e328bc13854f4c9fda

fblgit/UNA-SOLAR-10.7B-Instruct-v1.0

I amended the LICENSE and README.

clefourrier

Open LLM Leaderboard org Dec 19, 2023

Hi! Your model is still running, see here :)
The commit you linked to was a bug on my side (I switched part of our backend to spaces eval, and when checking for finished evals, it failed to find the current results).

clefourrier

Open LLM Leaderboard org Dec 19, 2023

Sorry for the scare ^^
If that works for you, I'll close the issue, feel free to reopen if it actually fails later on!

clefourrier changed discussion status to closed Dec 19, 2023

fblgit

Dec 20, 2023

@clefourrier sorry to bother, any idea when the UNA-SOLAR results will be out? I have the impression that each time I push a model, the eval queue gets frozen. I do encourage as much background checking as possible for leading models.
I wouldn't have any problem waiting as long as needed if I were able to lock the repo until the eval is concluded. Is there any way I can allow only the evaluation pull? I see the private checkbox in the Leaderboard, can you please advise?

Do you need me to open a new issue or something? I would really prefer to have a cordial dialog, can I reach HF Leaderboard staff via Discord?

clefourrier

Open LLM Leaderboard org Dec 20, 2023

Hi, no problem!
Just checked: it was preempted when we transferred the backend from one cluster to another, and we did not pick it up again. I'm passing it back to pending (and since we run models in order of submission, it should be quite high up in the queue).

What you could do is wait for the eval to start again (= job is running), then wait about 1h and switch your repo to private: we should have downloaded it within that time, and we only need one download for the eval. That should work OK as long as your job does not get preempted again (we are running on the spare cycles of the cluster). Would that work for you?

fblgit

Dec 20, 2023

I do not have any issue with the timeline or the preemption, and I understand these evals run on the research cluster and are obviously "best effort".
Can I mark it as a request-access repo and grant granular access to the eval mechanism?

Thanks for your help in triggering the job again, and sorry for all the surrounding noise.

clefourrier

Open LLM Leaderboard org Dec 21, 2023

Hi!
You could, tagging @SaylorTwift since he's the one running the evals atm, so it would have to use his token.
Side note - our eval cluster changed and we are in full debugging mode (connectivity issues) so it might take a couple days for us to come back to you.

fblgit

Dec 21, 2023

Totally fine, for the next model we can try it this way. Thanks for your help.

fblgit

Dec 22, 2023

edited Dec 22, 2023

Hi @clefourrier, can you please remove the second UNA-SOLAR entry and keep just the one with the highest score?

Also, I noticed the changes: great stuff. The automation was nice, and I couldn't believe that the queue was empty.
It may be related to the connectivity issues that you mentioned, but I had a higher failure rate and had to submit the model a few times. You can look at the commit history of the requests repo to see the failed attempts; I think it's not just mine. Maybe this can be helpful.

Thanks and Merry Christmas!

clefourrier

Open LLM Leaderboard org Dec 22, 2023

Hi @fblgit ,
Re the connectivity issues, schematically, our new cluster is basically getting rate limited when connecting to the hub - our provider has to change the network configuration of the cluster's gateways, which is sadly out of our hands.
So long as it's not fixed, anytime we want to launch an eval, when we try to download a model from the hub, it fails and the eval is stopped. (I think your models slipped through the cracks, luckily for you :) ).
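For readers hitting similar rate limits on their own setup, a common client-side mitigation is to retry the download with exponential backoff instead of failing the whole job on the first error. The following is only a minimal sketch: `retry_with_backoff` is a hypothetical helper name, and nothing here reflects how the leaderboard backend is actually implemented.

```shell
#!/bin/bash
# Hypothetical helper (not part of any leaderboard tooling):
# retry a command, doubling the wait between attempts.
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
retry_with_backoff() {
    local max_attempts="$1" delay="$2"
    shift 2
    local attempt=1
    while true; do
        # Run the command; stop retrying as soon as it succeeds.
        if "$@"; then
            return 0
        fi
        # Give up once the attempt budget is used.
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}
```

A call could look like `retry_with_backoff 5 30 huggingface-cli download fblgit/UNA-SOLAR-10.7B-Instruct-v1.0`, assuming the `huggingface-cli` tool is installed; the attempt count and base delay are illustrative values. Backoff only helps when the rate limit is transient; it does not fix a misconfigured gateway.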

Re the model you want removed, can you point to the corresponding results and request files?

clefourrier

Open LLM Leaderboard org Dec 22, 2023

(Merry end of year to you too!)

fblgit

Dec 22, 2023

edited Dec 22, 2023

I kinda suspected this tbh, it gave the impression of saturation.
How about QoS at the node level? I guess the runner/worker doesn't use the same protocol/port, so you could prioritise the job connectivity.

Maybe this can give you an idea of what I mean (and I think he got it right).

Creating a Quality of Service (QoS) script in Linux to prioritize SSH traffic over HTTP/HTTPS requires configuring traffic control settings via the tc command. This example script assumes you're familiar with Linux networking and iptables. It also assumes that SSH is on its default port (22) and HTTP/HTTPS are on ports 80 and 443, respectively.

Here's a basic script to achieve this:

#!/bin/bash

# Define the interface
IFACE="eth0"

# Clear any existing root qdisc and start fresh
# (the delete fails harmlessly if none is configured)
tc qdisc del dev "$IFACE" root 2>/dev/null || true

# Add the root HTB qdisc
tc qdisc add dev "$IFACE" root handle 1: htb

# Add the parent class, capped at the total link rate
tc class add dev "$IFACE" parent 1: classid 1:1 htb rate 10mbit

# Create two subclasses: SSH (prio 0, offered spare bandwidth first)
# and HTTP/HTTPS (prio 1)
tc class add dev "$IFACE" parent 1:1 classid 1:10 htb rate 5mbit ceil 10mbit prio 0
tc class add dev "$IFACE" parent 1:1 classid 1:20 htb rate 5mbit ceil 10mbit prio 1

# Add a filter for SSH traffic
tc filter add dev "$IFACE" protocol ip parent 1:0 prio 1 u32 match ip dport 22 0xffff flowid 1:10

# Add a filter for HTTP/HTTPS traffic
tc filter add dev "$IFACE" protocol ip parent 1:0 prio 2 u32 match ip dport 80 0xffff flowid 1:20
tc filter add dev "$IFACE" protocol ip parent 1:0 prio 2 u32 match ip dport 443 0xffff flowid 1:20

# Attach an SFQ leaf qdisc to each subclass for fair sharing between flows
tc qdisc add dev "$IFACE" parent 1:10 handle 10: sfq perturb 10
tc qdisc add dev "$IFACE" parent 1:20 handle 20: sfq perturb 10

Explanation:

This script sets up a basic hierarchical token bucket (HTB) with two subclasses.
One subclass is for SSH traffic (high priority) and the other for HTTP/HTTPS traffic (lower priority).
Traffic control filters are used to classify traffic into these buckets based on the destination port.
Stochastic Fair Queueing (SFQ) is used to ensure fair bandwidth sharing within each subclass.

Important Notes:

Replace "eth0" with your actual network interface.
Adjust the rate and ceil parameters according to your network's bandwidth.
This script only handles the egress (outgoing) traffic. If you need to control ingress (incoming) traffic, additional configuration is necessary.
Ensure that you have the necessary permissions to execute these commands and that the tc tool is installed.

fblgit

Dec 22, 2023

edited Dec 22, 2023

For the deletion, these:

https://huggingface.co/datasets/open-llm-leaderboard/details_fblgit__UNA-SOLAR-10.7B-Instruct-v1.0/blob/main/results_2023-12-21T16-27-41.332399.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/fblgit/UNA-SOLAR-10.7B-Instruct-v1.0_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/fblgit/UNA-SOLAR-10.7B-Instruct-v1.0/results_2023-12-21T16-27-41.332399.json

I gave the timeouts and congestion some more thought: https://trickled.sourceforge.net/ may be the simplest way.

clefourrier

Open LLM Leaderboard org Dec 22, 2023

Thanks for the links, deleted all the files :)

For the network, I don't have the rights to change things at this level - but thanks for the refs, I'll come back to it if I need it :)