Text-to-Speech
coqui

CUDA Assertion Errors with Concurrent Requests on XTTS-v2

#107
by Chan-Y - opened

Environment

  • Model: coqui/XTTS-v2
  • GPU: NVIDIA RTX 3080 Laptop (16GB VRAM)
  • Operating System: Windows
  • PyTorch: 2.6.0+cu124
  • Torchaudio: 2.6.0+cu124
  • Transformers: 4.46.1
  • CUDA: 12.4

Issue Description

When processing multiple concurrent requests in quick succession, the model consistently encounters CUDA errors that can only be resolved by restarting the program. The error appears to be related to indexing operations and CUBLAS execution.

Error Message

device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call
CUBLAS_STATUS_EXECUTION_FAILED when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

Assertion errors in Indexing.cu:1369:
block: [0,0,0], thread: [92-95,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

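Because device-side asserts are reported asynchronously, the Python traceback above may point at an unrelated later call. Re-running with synchronous kernel launches can localize the op that actually failed. A minimal sketch (the variable must be set before any CUDA context is created):

```python
import os

# Device-side asserts surface at some later API call by default, so the
# traceback is misleading. Forcing synchronous kernel launches makes the
# traceback point at the operation that actually triggered the assert.
# This must be set before torch initializes its CUDA context.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after setting the variable above
```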
Reproduction Steps

  1. Run the XTTS-v2 model on a local machine
  2. Send multiple inference requests within a short time window
  3. The error occurs consistently under load
  4. The only way to recover is to restart the program
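For reference, the failure shows up under a concurrent-request pattern like the following sketch. `synthesize` is a placeholder for the actual XTTS-v2 inference call (stubbed out here so the harness itself is runnable):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def synthesize(text: str) -> str:
    """Placeholder for the real XTTS-v2 inference call.

    In the failing setup this would run GPU inference on shared model
    state; here it just sleeps to stand in for a slow operation.
    """
    time.sleep(0.05)
    return f"audio for: {text}"

# Fire several requests within a short window, as in the repro steps above.
texts = [f"request {i}" for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(synthesize, texts))

print(len(results))  # with the stub, all 8 requests complete
```

With the real model in place of the stub, overlapping calls trigger the assertion errors described above.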

Expected Behavior

  • Model should handle concurrent requests without CUDA assertion errors
  • Graceful handling of multiple requests without requiring program restart

Questions

  1. Are there known issues with concurrent processing using PyTorch 2.6.0?
  2. Are there recommended batch size or request rate limitations for this model?
  3. Is there a way to implement request queuing or rate limiting to prevent these errors?
  4. Could this be related to CUDA 12.4 compatibility issues?

Additional Context

The error seems to indicate a problem with array indexing during concurrent processing. The assertion `srcIndex < srcSelectDimSize` means an index was out of range for the tensor being read, which suggests race conditions or memory corruption on shared model state during parallel execution.

The error occurs with the latest stable versions of PyTorch and torchaudio (2.6.0+cu124), which should theoretically be compatible with this workload. Any guidance on handling concurrent requests or implementing proper request management would be greatly appreciated.
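One request-management approach under consideration: a dedicated worker thread with a request queue, which serializes inference while letting callers submit from anywhere. A sketch under the same assumption that `infer` stands in for the actual XTTS call:

```python
import queue
import threading

def infer(text: str) -> str:
    # Stand-in for the real XTTS-v2 inference call.
    return f"audio:{text}"

# Each queued item is (text, reply_queue); None is a shutdown sentinel.
requests: queue.Queue = queue.Queue()

def worker() -> None:
    # Single consumer: all inference runs on one thread, in arrival order,
    # so the model is never touched by two requests at once.
    while True:
        item = requests.get()
        if item is None:
            break
        text, reply = item
        reply.put(infer(text))

t = threading.Thread(target=worker, daemon=True)
t.start()

def synthesize(text: str) -> str:
    reply: queue.Queue = queue.Queue(maxsize=1)
    requests.put((text, reply))
    return reply.get()  # block until the worker finishes this request

outputs = [synthesize(f"req {i}") for i in range(3)]
requests.put(None)
t.join()
print(outputs)
```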
