How to use Unsloth with multiple GPUs?

#4
by Sayoyo - opened
Training

8x A6000s
Forked version of unsloth for efficient training
Sequence Length: 4096
Effective batch size: 128
Learning Rate: 2e-5 with linear decay
Epochs: 1
Dataset: OpenHermes-2.5
Base model trained with QLoRA (rank 64, alpha 16) and MoE adapters/routers trained in bf16
Num Experts: 16
Top K: 4
Adapter Dim: 512
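
For reference, the quoted model-card settings can be collected into a plain config dict. The key names below are illustrative only, not the actual interface of the repo's train.py:

```python
# Illustrative summary of the quoted training setup.
# Key names are hypothetical; they are not Unsloth's or the
# forked train.py's actual configuration schema.
train_config = {
    "seq_len": 4096,
    "effective_batch_size": 128,
    "learning_rate": 2e-5,
    "lr_schedule": "linear",
    "epochs": 1,
    "dataset": "OpenHermes-2.5",
    # Base model trained with QLoRA; MoE adapters/routers in bf16
    "qlora": {"rank": 64, "alpha": 16},
    "moe": {
        "num_experts": 16,
        "top_k": 4,           # each token routes to 4 of 16 experts
        "adapter_dim": 512,
        "dtype": "bf16",
    },
}

# Sanity check: routing selects at most num_experts experts
assert train_config["moe"]["top_k"] <= train_config["moe"]["num_experts"]
print(train_config["qlora"]["rank"])
```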

I noticed the model description says it was trained on 8x A6000s, but Unsloth only supports a single GPU. When I run `python train.py`, I get this error: `ValueError: Pointer argument (at 2) cannot be accessed from Triton (cpu tensor?)`
It only works if I run `CUDA_VISIBLE_DEVICES=0 python train.py`. How can I use it with multiple GPUs?
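
As a workaround for the Triton pointer error, the process can be pinned to one GPU from inside the script instead of on the command line, as long as this happens before any CUDA library is imported. This is only a sketch of the single-GPU pinning already described above, not a multi-GPU fix; the function name is made up:

```python
# Sketch: pin this process to a single GPU before importing
# torch / triton / unsloth, equivalent to running
#   CUDA_VISIBLE_DEVICES=0 python train.py
# The helper name `pin_single_gpu` is hypothetical.
import os

def pin_single_gpu(index: int = 0) -> None:
    # Must run before the first `import torch`; once CUDA has
    # enumerated devices, changing this variable has no effect.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(index)

pin_single_gpu(0)
# import torch   # only import CUDA libraries after pinning
# import unsloth
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

To use all 8 GPUs you would still need a fork or framework with multi-GPU support; this sketch only reproduces the single-GPU launch that avoids the error.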
