ALL Jamba models failing

#690
by devingulliver - opened

So far every Jamba model submitted to the leaderboard has failed, including the base model. Any clue what's causing this to happen?

Hugging Face H4 org

Hi!
Is the architecture integrated into a stable release of transformers?

Hugging Face H4 org

Could you point us to some of the request files, as indicated in the About section, so we can investigate?

Hugging Face H4 org

Hi!
Thanks a lot for the exhaustive report! Apart from the first model, which seems to have an inherent problem (I've included its log below), it looks like all the other ones are failing because we updated our bitsandbytes version, and that library made breaking changes to how quantization configs are launched. We'll update our code and relaunch everything.
CC @SaylorTwift for the backend and @alozowski for the relaunches once it's fixed.

We'll do this ASAP; hopefully we'll be good by this evening.
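For reference, one example of the kind of config-level change involved (whether this is the exact breaking change isn't stated in this thread) is that recent transformers/bitsandbytes versions expect quantization settings wrapped in an explicit BitsAndBytesConfig rather than bare load_in_8bit / load_in_4bit flags. A minimal sketch of the newer calling convention; the checkpoint name is illustrative and the leaderboard backend's actual launch code isn't shown here:

    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    # Newer versions expect an explicit quantization config object instead of
    # passing load_in_8bit=True directly to from_pretrained.
    quant_config = BitsAndBytesConfig(load_in_8bit=True)

    model = AutoModelForCausalLM.from_pretrained(
        "ai21labs/Jamba-v0.1",              # illustrative checkpoint from this thread
        quantization_config=quant_config,
        device_map="auto",                  # requires accelerate; shown only for completeness
    )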


Other failure:

The fast path is not available because on of `(selective_state_update, selective_scan_fn, causal_conv1d_fn, causal_conv1d_update, mamba_inner_fn)` is None. To install follow https://github.com/state-spaces/mamba/#installation and https://github.com/Dao-AILab/causal-conv1d. If you want to use the naive implementation, set `use_mamba_kernels=False` in the model config
[E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000063 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000057 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:474] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000072 milliseconds before timing out.
... 
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
[E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000063 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:915] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000072 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:915] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=3000000) ran for 3000057 milliseconds before timing out.
[2024-04-19 01:09:34,012] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2081115 closing signal SIGTERM
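
Not a fix for the backend itself, but for reference: the fallback that the fast-path warning above points at is a config flag on the Jamba model. A minimal sketch, assuming the model is loaded through transformers and that config attributes can be overridden via from_pretrained kwargs (the checkpoint name is taken from this thread; the harness's real launch settings aren't shown here). The NCCL watchdog timeouts that follow in the log are presumably downstream of the ranks stalling rather than a separate root cause.

    from transformers import AutoModelForCausalLM

    # Without the mamba-ssm / causal-conv1d kernels installed, the fused "fast path"
    # is unavailable; use_mamba_kernels=False falls back to the slower pure-PyTorch
    # Mamba implementation, as the warning message suggests.
    model = AutoModelForCausalLM.from_pretrained(
        "ai21labs/Jamba-v0.1",
        use_mamba_kernels=False,
        device_map="auto",   # illustrative only
    )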

jondurbin/bagel-jamba-v05_eval_request_False_float16_Original.json has finally failed... I'm guessing it's hitting the same issue as all the others, but since it ran for so much longer than the rest, I'm not sure.

Re: the "inherent" error on the ai21labs/Jamba-v0.1 eval
The warning message that begins "The fast path is not available" is in the custom code from that model repo, but I couldn't find the message anywhere in the transformers-library implementation of JambaForCausalLM.
Is it possible that the model was somehow erroneously run with remote code?
EDIT: I was wrong about this; the message is in the transformers Jamba code. It just didn't come up in GitHub's search tool the first time.
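
For anyone wanting to double-check the remote-code question locally, a minimal sketch, assuming transformers >= 4.40 (which ships the in-library Jamba classes); it only fetches the config, not the weights, and the attribute name is to the best of my knowledge the one the warning refers to:

    from transformers import AutoConfig

    # trust_remote_code defaults to False, so this should resolve to the in-library
    # JambaConfig (transformers.models.jamba) rather than modeling code from the repo.
    config = AutoConfig.from_pretrained("ai21labs/Jamba-v0.1", trust_remote_code=False)
    print(type(config))
    print(config.use_mamba_kernels)   # True by default, which is what triggers the fast-path check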

Hugging Face H4 org

@devingulliver Normally no, as our production environment is pinned and isn't updated with new releases, but it's possible we made a mistake. We'll relaunch them all at the same time anyway.

Hugging Face H4 org

Hi! Our prod was fixed last week and I relaunched all of the above models; feel free to reopen if you need :)

clefourrier changed discussion status to closed

The models are failing again :/
If it's not bitsandbytes, I'm guessing they're all encountering similar failures to the one you posted earlier?

Hugging Face H4 org

Yep, same error message.
I'm a bit at a loss, and we have more pressing priorities at the moment, so I'll put this on hold, but I'm reopening so we keep track of the issue.

clefourrier changed discussion status to open

Perhaps installing the fast Mamba kernels would solve the issue? Provided it doesn't affect the reproducibility of the rest of the environment, of course.
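
If that route is taken: the packages the warning links to are mamba-ssm (state-spaces/mamba) and causal-conv1d (Dao-AILab/causal-conv1d), both of which need a CUDA build toolchain. A minimal post-install sanity check; these are, as far as I can tell, the imports the transformers Jamba code probes for:

    # install first: pip install mamba-ssm causal-conv1d
    # then verify that the fused-kernel imports listed in the warning actually resolve
    from mamba_ssm.ops.selective_scan_interface import mamba_inner_fn, selective_scan_fn
    from mamba_ssm.ops.triton.selective_state_update import selective_state_update
    from causal_conv1d import causal_conv1d_fn, causal_conv1d_update

    print("fused Mamba kernels available:",
          all(fn is not None for fn in (mamba_inner_fn, selective_scan_fn,
                                        selective_state_update,
                                        causal_conv1d_fn, causal_conv1d_update)))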
