How to speed up inference?

#21
by merlinarer - opened

Apart from int8, are there any plans to speed up inference, for example with FasterTransformer?

I don't think FasterTransformer is an easy route... maybe TorchScript or PyTorch 2.0 would work.

BigCode org

The easiest way to do this may be to use the inference server:

https://github.com/bigcode-project/starcoder#text-generation-inference
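For illustration, a minimal sketch of querying a locally running text-generation-inference server from Python. The port, prompt, and generation parameters here are assumptions, not values from this thread, and the gated bigcode/starcoder weights may require accepting the model license and passing an HF token to the container.

```python
# Start the server first, e.g. (assuming Docker and a CUDA GPU):
#   docker run --gpus all -p 8080:80 \
#     ghcr.io/huggingface/text-generation-inference:latest \
#     --model-id bigcode/starcoder
import requests

# POST a prompt to the server's /generate endpoint.
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "def hello_world():",
        "parameters": {"max_new_tokens": 30, "temperature": 0.2},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```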

You could try: https://huggingface.co/michaelfeil/ct2fast-starcoder/blob/main/README.md
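A minimal loading sketch for that checkpoint, assuming the ctranslate2, transformers, and huggingface_hub packages are installed. The linked README may offer its own helper; this uses the plain CTranslate2 generator API instead, and access to the gated bigcode/starcoder tokenizer may require accepting the model license.

```python
import ctranslate2
from huggingface_hub import snapshot_download
from transformers import AutoTokenizer

# Download the pre-converted CTranslate2 weights and the original tokenizer.
model_dir = snapshot_download("michaelfeil/ct2fast-starcoder")
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")

# int8 on GPU; CTranslate2 falls back to a supported compute type if needed.
generator = ctranslate2.Generator(model_dir, device="cuda", compute_type="int8")

prompt = "def fibonacci(n):"
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
results = generator.generate_batch([tokens], max_length=30, sampling_topk=1)
print(tokenizer.decode(results[0].sequences_ids[0]))
```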

Amazing, how much of a speedup does ct2fast-starcoder give compared with the original starcoder?


Did not have time to check for StarCoder. For SantaCoder:
Task: "def hello" -> generate 30 tokens
- transformers pipeline in float16, CUDA: ~1300 ms per inference
- CTranslate2 in int8, CUDA: ~315 ms per inference

I assume that for StarCoder the weights are bigger, so maybe a 1.5-2.5x speedup.
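A rough timing sketch of the kind of comparison above, using bigcode/santacoder. The prompt, 30-token budget, dtype, and compute type follow the numbers quoted, but the converted model path is a hypothetical local directory produced beforehand with `ct2-transformers-converter`.

```python
import time
import torch
import ctranslate2
from transformers import AutoTokenizer, pipeline

prompt, new_tokens = "def hello", 30

# Baseline: transformers pipeline in float16 on GPU.
pipe = pipeline(
    "text-generation",
    model="bigcode/santacoder",
    torch_dtype=torch.float16,
    device=0,
    trust_remote_code=True,
)
t0 = time.perf_counter()
pipe(prompt, max_new_tokens=new_tokens, do_sample=False)
print(f"transformers float16: {(time.perf_counter() - t0) * 1000:.0f} ms")

# CTranslate2 int8 path; "santacoder-ct2" is a hypothetical directory created with:
#   ct2-transformers-converter --model bigcode/santacoder \
#     --output_dir santacoder-ct2 --quantization int8 --trust_remote_code
tokenizer = AutoTokenizer.from_pretrained("bigcode/santacoder")
generator = ctranslate2.Generator("santacoder-ct2", device="cuda", compute_type="int8")
tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))
t0 = time.perf_counter()
generator.generate_batch([tokens], max_length=new_tokens, sampling_topk=1)
print(f"ctranslate2 int8: {(time.perf_counter() - t0) * 1000:.0f} ms")
```

Note that the first call to each backend includes warm-up and weight-loading overhead, so per-inference latency should be averaged over several runs after a warm-up call.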


This works like a charm, 100 times faster than the original starchat and starcoder. I tried with 8, 12, and 16 GB GPUs but they failed; a GPU with at least 24 GB of memory works.
