Triton issues when fine-tuning on A10G

#1
by trinhhung - opened

I tried to fine-tune https://huggingface.co/CATIE-AQ/FAT5-small-flan-en, but I got this error on an A10G (AWS EC2):
triton.runtime.errors.OutOfResources: out of resource: shared memory, Required: 163840, Hardware limit: 101376. Reducing block sizes or num_stages may help.
Here are the training parameters:

{"epochs":2,"max_length":2048,"batch_size":16,"per_device_train_batch_size":2,"per_device_eval_batch_size":2,"learning_rate":1e-5,"warmup_steps":150,"evaluation_steps":100,"use_amp":false}
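
For reference, here is a quick check of the two numbers in the error message, as a minimal sketch. The byte counts are copied from the error; the helper names (`shared_memory_report`, `fits_after_scaling`) are hypothetical, and the assumption that the kernel's shared-memory use scales linearly with a block size or with `num_stages` is only a rough guess, not something confirmed for this model's kernels.

```python
# Byte counts copied from the Triton OutOfResources message above.
REQUIRED_BYTES = 163_840
LIMIT_BYTES = 101_376


def shared_memory_report(required: int, limit: int) -> dict:
    """Summarize how far a shared-memory request exceeds the hardware limit."""
    return {
        "required_kib": required / 1024,
        "limit_kib": limit / 1024,
        "overshoot_bytes": required - limit,
        # Largest fraction of the current request that would still fit.
        "max_fraction": limit / required,
    }


def fits_after_scaling(required: int, limit: int, factor: float) -> bool:
    """Check whether the request fits after scaling it by `factor`.

    Assumption (not verified): shared memory shrinks linearly when a block
    size or num_stages is reduced by the same factor.
    """
    return required * factor <= limit


report = shared_memory_report(REQUIRED_BYTES, LIMIT_BYTES)
print(report)
print("fits at 0.75x:", fits_after_scaling(REQUIRED_BYTES, LIMIT_BYTES, 0.75))
print("fits at 0.50x:", fits_after_scaling(REQUIRED_BYTES, LIMIT_BYTES, 0.5))
```

So the kernel asks for 160 KiB while the GPU allows 99 KiB per block. Under the linear-scaling assumption, a 25% reduction would not be enough, but halving one block dimension or `num_stages` would bring the request under the limit.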
