Sorry, Triton vs. CUDA: what is the difference?

by Goldenblood56 - opened

As a reminder, I use Oobabooga when possible and llama.cpp when I have to.

I don't really need an in-depth answer, but I tried to Google it and nothing made sense to me. Is the Triton model better for Oobabooga?
I even read some of your other discussions, including things like "true sequential" + "groupsize 128" and "act order".

Is act order an improvement to how the USER and AI talk? Like where it will stop less, or answer its own questions less, or something? Not that I have these issues at present.
Here are my current arguments. If I run the Triton model that has these new functions, should I change my arguments?
"call python server.py --auto-devices --chat --model reeducator_vicuna-13b-cocktail --wbits 4 --groupsize 128"

Personally, I'm not familiar with whether there are any benefits to using one over the other, since I don't use GPU inference myself at this point... Maybe @TheYuriLover knows about it, since he requested the Triton conversion? Or someone who knows more might fill in here.

Thanks, I'm just happy I'm not the only one. I will try asking ChatGPT, because Google itself did not help me. Too technical. lol And then maybe reach out to @TheYuriLover or hope that he sees this.

The thing is that only Triton makes groupsize 128 + act order work on ooba's webui; CUDA either can't do it or is too slow for it.
Having act_order is an advantage because it gives better quantization quality: the perplexity is closer to f16 if you add act_order.
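
In case it helps to see what groupsize is doing, here is a rough sketch in plain Python. It is not the GPTQ code: it only does simple round-to-nearest 4-bit quantization, and the function names and the made-up weight row are my own assumptions for illustration. The point is that giving every 128 weights their own scale keeps a couple of outliers from ruining the precision of the whole row. act_order (the order in which GPTQ processes columns) is not shown here.

```python
import random

def quantize_groupwise(weights, bits=4, groupsize=128):
    """Round-to-nearest quantization with one scale/offset per group.

    Illustration only: real GPTQ also compensates for rounding error
    using second-order information, which is not done here.
    """
    levels = 2 ** bits - 1  # 15 steps for 4 bits
    out = []
    for start in range(0, len(weights), groupsize):
        group = weights[start:start + groupsize]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for a flat group
        for w in group:
            q = round((w - lo) / scale)  # integer code in [0, levels]
            out.append(lo + q * scale)   # value after dequantization
    return out

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
# Hypothetical weight row: mostly small values plus two outliers.
row = [random.gauss(0.0, 0.02) for _ in range(1024)]
row[5] = 0.5
row[700] = -0.4

# One scale for the whole row vs. one scale per 128 weights.
err_whole_row = mean_squared_error(row, quantize_groupwise(row, groupsize=len(row)))
err_group_128 = mean_squared_error(row, quantize_groupwise(row, groupsize=128))
print("whole row:", err_whole_row)
print("groupsize 128:", err_group_128)
```

The error with groupsize 128 should come out smaller, because only the two groups that contain an outlier have to stretch their 16 levels over a wide range.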

Look at the README: https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda

Thanks, Yuri. I copied and pasted our conversation here into ChatGPT, including what you said, and then asked a few questions. I think I kind of get it, between you and ChatGPT.

I have one last question: should I add --act-order and --true-sequential to my arguments?

This is what ChatGPT said about CUDA vs. Triton. Not sure how much of that is true, and there may be more to it. But I think I at least get it. It's kind of like OpenGL vs. Vulkan, etc. Pros and cons to each, I guess.
"Yes, CUDA and Triton are both software frameworks used for GPU-accelerated deep learning inference.

CUDA is a software platform developed by NVIDIA that enables developers to use NVIDIA GPUs to accelerate compute-intensive applications. It includes libraries, tools, and APIs for developing and deploying deep learning models on NVIDIA GPUs.

Triton, on the other hand, is a deep learning inference server developed by NVIDIA that provides a flexible and efficient way to deploy and serve deep learning models. It supports multiple deep learning frameworks and allows users to deploy models on any GPU or CPU in their data center or cloud environment.

Both frameworks have their own advantages and disadvantages, and the choice of which to use may depend on factors such as the specific use case, the available hardware, and the expertise of the development team."

I have one last question: should I add --act-order and --true-sequential to my arguments?

You don't need to do anything. act_order and true_sequential are already baked into the Triton-quantized model, so just run it normally and you're good to go.
