Join the force

#426
by RichardErkhov - opened

Hello @mradermacher , as you noticed we have been competing for the amount of models for quite a while. So instead of competing, want to join forces? I talked to @nicoboss , he is up for it, and I have my quant server for you with 2 big bananas (E5-2697Av4), 64 gigs of ram, and a 10gbps line ready for you!

Well, "take what I have" and "join forces" are not exactly the same thing. When we talked last about it, I realised we were doing very different things and thought diversity is good, especially when I actually saw what models you quantize and how :) BTW, I am far from beating your amount of models (remember, I have roughly two repos per model, so you have twice the amount), and wasn't in the business of competing, as it was clear I couldn't :)

But of course, I won't say no to such an offer, especially not at this moment (if you have seen my queue recently...).

So how do we go about it? Nico runs some virtualisation solution, and we decided on a linux container to be able to access his graphics cards, but since direct hardware access is not a concern here, a more traditional VM would probably be the simplest option. I could give you an image, or you could create a VM with debian 12/bookworm and my ssh key on it (nico can just copy the authorized_keys file).

Or, if you have any other ideas, let's talk.

Oh, and how much diskspace are you willing to give me? :)

Otherwise, welcome to team mradermacher. Really should have called it something else in the beginning.

Ah, and as for network access, I only need some port to reach ssh, and to be able to get a tunnel out (wireguard, udp). Having a random port go to the vm ssh port and forwarding udp port 7103 to the same vm port would be ideal. I can help with all that, and am open to alternative arrangements, but I have total trust in you that you can figure everything out :)
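The forwarding described above could be sketched roughly as follows. This is only an illustration, assuming the host does NAT with iptables; the thread does not say how the host is actually configured. The container address 10.0.3.10 is hypothetical, and 2222 is used as the external ssh port only because that port is mentioned later in the thread. The function prints the rules instead of applying them, so it is safe to run.

```shell
#!/bin/sh
# print_forward_rules CONTAINER_IP EXTERNAL_SSH_PORT
# Prints (does not apply) two DNAT rules:
#   - external TCP port -> container ssh port 22
#   - UDP port 7103     -> the same port on the container (WireGuard)
# CONTAINER_IP and EXTERNAL_SSH_PORT are illustrative values, not taken from the real setup.
print_forward_rules() {
    ip=$1
    ssh_port=$2
    echo "iptables -t nat -A PREROUTING -p tcp --dport $ssh_port -j DNAT --to-destination $ip:22"
    echo "iptables -t nat -A PREROUTING -p udp --dport 7103 -j DNAT --to-destination $ip:7103"
}

print_forward_rules 10.0.3.10 2222
```

On a real host, IP forwarding would also have to be enabled and the FORWARD chain would have to accept the traffic; both are left out of this sketch.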

No worries, I will help him set up everything infrastructure-wise. He already successfully created a Debian 12 LXC container. While a VM might be easier, those few percent of lost performance bother me, but if you prefer a VM I can also help him with that.

LXC sits perfectly well with me.

this brings me joy

@mradermacher Your new server "richard1" is ready. Make sure to abuse the internet as hard as you can. Details were provided by email by @nicoboss , so check it please as soon as you can

Oh, and how much diskspace are you willing to give me? :)

2 TB of SSD, as this is all he has. Some resources are currently still in use by his own quantize tasks, but they should be free by tomorrow once the models currently being processed are done. Just start your own tasks as soon as the container is ready. He is also running a satellite imagery data processing project for me for the next few weeks, but its resource usage will be minimal. Just go all in and try to use as many resources as you can on this server. For his quantization tasks he usually runs 10 models in parallel and uses an increased number of connections to download them in order to make optimal use of all available resources.

I'm on it. Wow, load average of 700 :)

He needs to reboot the host to solve the issue that is also preventing normal access to the mradermacher LXC container.
For SSH we are getting "Connection refused" and if we try to access the container directly from the host using lxc-attach mradermacher we are getting:

lxc-attach: mradermacher: attach.c: get_attach_context: 405 Connection refused - Failed to get init pid
lxc-attach: mradermacher: attach.c: lxc_attach: 1469 Connection refused - Failed to get attach context

Because of this we are unable to run ./rich1-pause before rebooting the host. So you are the only one capable of doing so thanks to your WireGuard tunnel.

I'm sure we can reschedule the reboot to a time that works better for you, but I think it should be fine as well if you just time the start of the ./rich1-pause script, and we check nload, the status page and CPU activity to make sure nothing is running before rebooting.

no, time is fine. i'll do this, and pray it actually works:

sleep $(( $(TZ=Asia/Nicosia date -d'12:45' +%s) - $(date +%s) ))&&poweroff
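The one-liner above computes the delay as "target epoch minus current epoch". If the target time has already passed, the difference is negative, `sleep` rejects it, and the `&&` keeps `poweroff` from running. A slightly more explicit sketch of the same idea follows; it assumes GNU date, the helper names are made up for illustration, and the dry-run function only prints what it would do.

```shell
#!/bin/sh
# delay_seconds TARGET_EPOCH NOW_EPOCH
# Prints the number of seconds from NOW to TARGET.
# Returns non-zero (and prints nothing on stdout) if TARGET is not in the future.
delay_seconds() {
    target=$1
    now=$2
    diff=$(( target - now ))
    if [ "$diff" -le 0 ]; then
        echo "target time is not in the future" >&2
        return 1
    fi
    echo "$diff"
}

# Dry run using the time and time zone from the post (GNU date assumed).
# poweroff is only echoed here, never executed.
timed_poweroff_dry_run() {
    target=$(TZ=Asia/Nicosia date -d '12:45' +%s) || return 1
    secs=$(delay_seconds "$target" "$(date +%s)") || return 1
    echo "would run: sleep $secs && poweroff"
}
```

Calling `timed_poweroff_dry_run` before 12:45 Nicosia time prints the command it would run; after 12:45 it fails instead of sleeping.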

ps: you should also be able to reach it from nico1, rich1 is 10.28.1.7

ps: you should also be able to reach it from nico1, rich1 is 10.28.1.7

Wow, that is cool. I can confirm this works and I was able to SSH to rich1 from nico1 using 10.28.1.7. So should your timed poweroff not work, I will just manually execute poweroff without first executing ./rich1-pause before Richard reboots the host.

sounds all good :)

pps: i am less concerned about interrupting jobs than i am about having the job status file on stable storage (I do not call fsync anymore), thus the poweroff vs. pausing

Just FYI, but it seems the port forwarding (port 2222) once again stopped working. It doesn't affect quanting negatively.

hello @mradermacher , how are you? Can you please add stats like models processed by each server, traffic consumed, average cpu and ram load and total uptime for each server to satisfy my competitive nature?

Just FYI, but it seems the port forwarding (port 2222) once again stopped working. It doesn't affect quanting negatively.

eventually will be solved, if you need access just let me know we can try solving it on the spot

RichardErkhov changed discussion status to closed
RichardErkhov changed discussion status to open

hello @mradermacher , how are you?

Not good as hf has essentially shut me down.

Can you please add stats like models processed by each server, traffic consumed, average cpu and ram load and total uptime for each server to satisfy my competitive nature?

I don't really have any of that. I can grep a few stats for you, possibly. I can tell you that backup1, db1, db2, db3 had continuous uptime since February until I shut them off yesterday, when I rebooted them into quant mode, and I recently rebooted rain after about 1200 days of uptime, while leia is at 482. kaos and back had issues and had to be rebooted much more recently :)

Let me see...

Ah, right, it's worse, as the logfiles are currently separate for rich1 and nico1. But rich1 has uploaded 13419 individual quants so far, and nico1 24268. And here are the others:

22980 back
20282 backup1
50820 db1
50337 db2
50192 db3
1 Dec <- bug
23973 kaos
31244 leia
6430 marco
21009 rain

But of course, all of very different sizes and over very different timeframes, so comparison is at best for fun.
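For a single combined figure, the per-host counts quoted above can be added up with a short awk filter. This is just a convenience sketch over the numbers as posted: the "1 Dec" row is the line flagged as a bug, so it is skipped, and the rich1 and nico1 counts from the preceding paragraph are included.

```shell
#!/bin/sh
# Per-host upload counts copied from the post; rich1 and nico1 come from the
# paragraph above the list. The "1 Dec" row is the flagged bug line.
counts='22980 back
20282 backup1
50820 db1
50337 db2
50192 db3
1 Dec
23973 kaos
31244 leia
6430 marco
21009 rain
13419 rich1
24268 nico1'

# total_quants: reads "COUNT HOST" lines on stdin and prints the sum of COUNT,
# ignoring the bogus "Dec" row.
total_quants() {
    awk '$2 != "Dec" { sum += $1 } END { print sum }'
}

printf '%s\n' "$counts" | total_quants   # prints 314954
```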

As for traffic, I knew I wanted to check one last time before switching off dbX etc., but, as you can guess, I forgot. But here are some vnstat samples:

       month          rx            tx            total        host
       2024-11     14.62 TiB     39.18 TiB     53.80 TiB     kaos
       2024-11     11.18 TB      87.75 TB      98.93 TB      rich1
       2024-11     59.86 TB     162.90 TB     222.76 TB      nico1

Ah, and total repo size, counted by my maintenance script:

TB 2792.683

eventually will be solved, if you need access just let me know we can try solving it on the spot

It's at worst a minor inconvenience: all automatic stuff goes via rsh/ssh via wireguard, which is unaffected. It only affects llama updates and me logging in. Not to worry.
