Evaluation failed `fblgit/UNA-SOLAR-10.7B-Instruct-v1.0`
Hi @clefourrier
I was able to reproduce the tests, same commit version without problem. What has caused the "FAILED" state on the evaluation?
Regards
Hi @fblgit,
Please follow the FAQ in the about page and link the request file of your model so I can investigate more easily :)
fblgit/UNA-SOLAR-10.7B-Instruct-v1.0
I amended the LICENSE and README.
Hi! Your model is still running, see here :)
The commit you linked to was a bug on my side (I switched part of our backend to spaces eval, and when checking for finished evals, it failed to find the current results)
Sorry for the scare ^^
If that works for you, I'll close the issue, feel free to reopen if it actually fails later on!
@clefourrier sorry to bother you, any idea when the UNA-SOLAR results will be out? I have the impression that each time I push a model, the eval queue gets frozen. I do encourage as much background checking as possible for leading models.
I wouldn't have any problem waiting as long as needed if I were able to lock the repo until the eval is concluded. Is there any way I can allow only the evaluation pull? I see the private checkbox in the Leaderboard, can you please advise?
Do you need me to open a new issue or something? I would really prefer to have a cordial dialogue; can I reach the HF Leaderboard staff via Discord?
Hi, no problem!
Just checked, it was preempted when we transferred the backend from one cluster to another and we did not pick it up - passing it back to pending again (and since we run models by order of submission, it should be quite high up in the queue).
What you could do is wait for the eval to start again (= job is running), then wait about 1h and switch your repo to private, because we should have downloaded it in that time and we only need one download for the eval (it should work OK if your job does not get preempted again - but we are running on the spare cycles of the cluster) - would that work for you?
I do not have any issue with the timeline or the preemption, and I understand these evals run on the research cluster and are obviously "best effort".
Can I mark it as a request-access repo and provide granular access to the eval mechanism?
Thanks for your help on triggering the job again, and sorry for all the surrounding noise.
Hi!
You could, tagging @SaylorTwift since he's the one running the evals atm, so it would have to use his token.
Side note - our eval cluster changed and we are in full debugging mode (connectivity issues) so it might take a couple days for us to come back to you.
Totally fine, for the next model we can try it this way. Thanks for your help.
Hi @clefourrier can you please remove the second UNA-SOLAR and keep just 1 with the highest score?
Also I noticed the changes, great stuff; the automation was nice and I couldn't believe that the queue was empty.
It may be related to the connectivity issues that you mentioned, but I had a higher failure rate and had to submit the model a few times. You can look at the commit history of the requests repo to see the few failed attempts; I think it's not just mine. Maybe this can be helpful.
Thanks and Merry Christmas!
Hi @fblgit,
Re the connectivity issues, schematically, our new cluster is basically getting rate limited when connecting to the hub - our provider has to change the network configuration of the cluster's gateways, which is sadly out of our hands.
So long as it's not fixed, anytime we want to launch an eval, when we try to download a model from the hub, it fails and the eval is stopped. (I think your models slipped through the cracks, luckily for you :) ).
Re the model you want removed, can you point to the corresponding results and request files?
(Merry end of year to you too!)
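As an illustration of the failure mode described above (a download from the hub fails under rate limiting and the eval stops), a generic client-side mitigation is to retry with exponential backoff. This is only a minimal sketch, not how the leaderboard backend handles downloads; the function name `retry_with_backoff`, the attempt count, and the delays are hypothetical choices.

```shell
#!/bin/bash
# Retry a command with exponential backoff (illustrative sketch).
# Usage: retry_with_backoff MAX_ATTEMPTS BASE_DELAY_SECONDS command [args...]
retry_with_backoff() {
    local max_attempts=$1 delay=$2
    shift 2
    local attempt=1
    while true; do
        # Run the command; stop as soon as it succeeds
        "$@" && return 0
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        # Double the wait before the next attempt
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}

# Example (not executed here): up to 5 attempts, starting with a 10s wait
# retry_with_backoff 5 10 huggingface-cli download fblgit/UNA-SOLAR-10.7B-Instruct-v1.0
```

Backoff only helps when the rate limit is transient; if the gateway configuration itself is the bottleneck, as described above, retries will keep failing until the provider changes it.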
I kinda suspected this tbh.. it gave the impression of saturation.
How about QoS at the node level? I guess the runner/worker doesn't use the same protocol/port, so you can prioritise the job connectivity.
Maybe this can give you an idea of what I mean (and I think he got it right).
Creating a Quality of Service (QoS) script in Linux to prioritize SSH traffic over HTTP/HTTPS requires configuring traffic control settings via the tc command. This example script assumes you're familiar with Linux networking and iptables. It also assumes that SSH is on its default port (22) and HTTP/HTTPS are on ports 80 and 443, respectively.
Here's a basic script to achieve this:
#!/bin/bash
# Define the interface
IFACE="eth0"
# Clear any existing root qdisc and start fresh (ignore the error if none exists)
tc qdisc del dev "$IFACE" root 2>/dev/null || true
# Add root qdisc
tc qdisc add dev "$IFACE" root handle 1: htb
# Add parent class
tc class add dev "$IFACE" parent 1: classid 1:1 htb rate 10mbit
# Create two subclasses: SSH (prio 0, offered spare bandwidth first) and HTTP/HTTPS (prio 1)
tc class add dev "$IFACE" parent 1:1 classid 1:10 htb rate 5mbit ceil 10mbit prio 0
tc class add dev "$IFACE" parent 1:1 classid 1:20 htb rate 5mbit ceil 10mbit prio 1
# Add a filter for SSH traffic
tc filter add dev "$IFACE" protocol ip parent 1:0 prio 1 u32 match ip dport 22 0xffff flowid 1:10
# Add filters for HTTP/HTTPS traffic
tc filter add dev "$IFACE" protocol ip parent 1:0 prio 2 u32 match ip dport 80 0xffff flowid 1:20
tc filter add dev "$IFACE" protocol ip parent 1:0 prio 2 u32 match ip dport 443 0xffff flowid 1:20
# Attach SFQ leaf qdiscs so flows share bandwidth fairly within each class
tc qdisc add dev "$IFACE" parent 1:10 handle 10: sfq perturb 10
tc qdisc add dev "$IFACE" parent 1:20 handle 20: sfq perturb 10
Explanation:
- This script sets up a basic hierarchical token bucket (HTB) with two subclasses.
- One subclass is for SSH traffic (high priority) and the other for HTTP/HTTPS traffic (lower priority).
- Traffic control filters are used to classify traffic into these buckets based on the destination port.
- Stochastic Fair Queueing (SFQ) is used to ensure fair bandwidth sharing within each subclass.
Important Notes:
- Replace "eth0" with your actual network interface.
- Adjust the rate and ceil parameters according to your network's bandwidth.
- This script only handles the egress (outgoing) traffic. If you need to control ingress (incoming) traffic, additional configuration is necessary.
- Ensure that you have the necessary permissions to execute these commands and that the tc tool is installed.
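Since model downloads are incoming traffic, the ingress note above is the part that matters for that case. A minimal sketch of ingress policing with tc follows, assuming HTTPS traffic arrives from source port 443; the function name `limit_ingress_https`, the interface, the rate, and the burst size are all illustrative assumptions. By default it only prints the commands; setting APPLY=1 (as root) would execute them.

```shell
#!/bin/bash
# Sketch: police incoming HTTPS traffic on one interface.
# Dry run by default; set APPLY=1 and run as root to apply the commands.

run() {
    # Execute the command only when APPLY=1, otherwise just print it
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

limit_ingress_https() {
    local iface=$1 rate=$2 burst=${3:-100k}
    # Attach the ingress qdisc (ffff: is the conventional ingress handle)
    run tc qdisc add dev "$iface" handle ffff: ingress
    # Drop packets from source port 443 that exceed the configured rate
    run tc filter add dev "$iface" parent ffff: protocol ip prio 1 u32 \
        match ip sport 443 0xffff police rate "$rate" burst "$burst" drop flowid :1
}

# Assumed values: replace the interface and rate with your own
limit_ingress_https "${IFACE:-eth0}" "${RATE:-5mbit}"
```

Policing drops excess packets rather than queueing them, so it caps incoming traffic but cannot reorder or prioritise it the way the egress classes do.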
For the deletion, these:
https://huggingface.co/datasets/open-llm-leaderboard/details_fblgit__UNA-SOLAR-10.7B-Instruct-v1.0/blob/main/results_2023-12-21T16-27-41.332399.json
https://huggingface.co/datasets/open-llm-leaderboard/requests/blob/main/fblgit/UNA-SOLAR-10.7B-Instruct-v1.0_eval_request_False_float16_Original.json
https://huggingface.co/datasets/open-llm-leaderboard/results/blob/main/fblgit/UNA-SOLAR-10.7B-Instruct-v1.0/results_2023-12-21T16-27-41.332399.json
I gave the timeouts and congestion some more thought: https://trickled.sourceforge.net/ may be the simplest way.
Thanks for the links, deleted all the files :)
For the network, I don't have the rights to change things at this level - but thanks for the refs, I'll come back to it if I need it :)