open_cn_llm_leaderboard


Repeated failures of various running models

by CombinHorizon - opened Aug 17

Aug 17

As per https://huggingface.co/datasets/open-cn-llm-leaderboard/requests/commits/main,
cognitivecomputations/dolphin-2.9.3-mistral-nemo-12b has been failing repeatedly.
Does something in the back-end maybe need to be updated to better support Nemo models?

examples where it works

CombinHorizon changed discussion title from Repeated failures of dolphin-2.9.3-mistral-nemo-12b to Repeated failures of various running models Sep 12

CombinHorizon

Sep 12

edited Sep 12

Could it be that there isn't enough free RAM held in reserve, so that fluctuations in memory usage while a model runs cause some of the models to hit OOM errors?
If so, it may not be any specific model's fault,
but rather a problem with how the models are queued (maybe too many are running at the same time?).
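
To make the queueing idea concrete, here is a minimal sketch of memory-aware admission. It is not the leaderboard's actual scheduler, which is not described in this thread. The function `admit_jobs`, the per-job peak-memory estimates, and all the numbers are hypothetical.

```python
def admit_jobs(queue, running_gb, total_gb, reserve_gb):
    """Return the names of queued jobs that can start now.

    queue      -- list of (name, estimated_peak_gb) in FIFO order (hypothetical estimates)
    running_gb -- estimated memory already used by running jobs
    total_gb   -- total memory on the machine
    reserve_gb -- headroom kept free to absorb usage fluctuations
    """
    budget = total_gb - reserve_gb - running_gb
    admitted = []
    for name, need_gb in queue:
        if need_gb > budget:
            break  # keep FIFO order: wait rather than over-commit memory
        admitted.append(name)
        budget -= need_gb
    return admitted
```

With a reserve, fewer models start at once, but a temporary spike in one model's usage is less likely to push the machine into an OOM.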

Edit: question. When a model fails and is then restarted with the same settings (same commit, same parameters), does it have to redo all the tasks and tests, or is its progress remembered so that it continues where it left off? If not, would that be easy to implement? It would save some resources. (Do take into account that different commits of the same model aren't necessarily the same, so progress should not be shared between them; it is probably better to keep them separate.)
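
A rough sketch of what such a resume could look like, assuming results can be saved per task. It is only an illustration, not how the leaderboard back-end works. `run_key`, `evaluate_with_resume`, the `run_task` callback, and the JSON cache layout are all hypothetical.

```python
import hashlib
import json
from pathlib import Path


def run_key(model: str, commit: str, params: dict) -> str:
    """Stable key: the same model + commit + params always give the same key."""
    payload = json.dumps(
        {"model": model, "commit": commit, "params": params}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def evaluate_with_resume(model, commit, params, tasks, run_task, cache_dir):
    """Run each task once per (model, commit, params); skip tasks already saved."""
    path = Path(cache_dir) / f"{run_key(model, commit, params)}.json"
    done = json.loads(path.read_text()) if path.exists() else {}
    for task in tasks:
        if task in done:
            continue  # finished before the previous failure, do not redo it
        done[task] = run_task(task)  # may raise, e.g. on OOM
        path.write_text(json.dumps(done))  # save progress after every task
    return done
```

Because the commit is part of the key, a different commit of the same model gets its own cache file and starts from scratch.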

CombinHorizon

28 days ago

It's stuck in a fail loop: it keeps failing and restarting.
