Speed of the hosted inference API for interactive playground

#107
by pai4451 - opened

I want to know why the hosted inference API for BLOOM in the interactive playground on Hugging Face is so fast. On my hardware, and as many other people have reported in the inference benchmarks, inference with Hugging Face Accelerate is slow. The throughput on 8x A100 with the Hugging Face framework in this link is about four tokens per second at batch size 1 (roughly 230 ms per token), yet the Hugging Face interactive playground feels much faster than that.

Any tips for making inference as fast as the Hugging Face hosted API? Is the hosted inference API running a quantized version of BLOOM (for example, an int8 version), or is the runtime powered by a different framework such as Microsoft DeepSpeed?

BigScience Workshop org

There might be some secret sauce, cc @narsil @nouamane and others =)

BigScience Workshop org

That custom server solution is still being worked on and will be released when ready.

Meanwhile, you can use the DeepSpeed-Inference fused-custom-kernel solution, which is on par with it or even faster depending on the setup.

Please see the benchmarks and demo scripts here:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/scripts/bloom-inference-scripts

An alternative server solution by one of the external contributors is being developed here:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/tree/main/scripts/bloom-inference-server
(this is not the code used by the HF-hosted API you're referring to, but it should be about the same speed)

BigScience Workshop org
edited Sep 16, 2022

As mentioned by @stas , we're looking to port our hacky solutions back into cleaner, open-sourceable code. It might take time: some of this is not easy to integrate into our libs, or requires bigger changes than the original hacks, which already took time to write.

If you're willing to run your own thing, we're more than willing to help.

1/ The first thing is to run the model in TP (tensor parallelism). DeepSpeed does that, but we don't use DeepSpeed (because we were seeing too many random crashes at runtime due to some unknown issues in the kernels).
You need to hack the modeling code to handle a ProcessGroup from torch.distributed, and rewrite the Linear layers to be column-parallel and row-parallel in both the attention and the MLP (there are other ways to solve this, but this is what we went for); a rough sketch is shown below.
That should lower your latency to around 90 ms per token (when using the past key/values), and less if you use only 8x A100 (80GB); we're using 16x A100 (40GB). (Sorry, I'm counting in ms per token because tokens/s is an inverted measure, and it's harder to reason about performance improvements with inverted numbers, just like with fps.)
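
To make that a bit more concrete, here is a minimal sketch of what column-parallel and row-parallel Linear layers can look like with a torch.distributed ProcessGroup. This is illustrative only (the class names and wiring are mine, not the actual BLOOM modeling code), and it assumes one process per GPU with torch.distributed already initialized:

```python
# Minimal sketch of tensor-parallel Linear layers (illustrative, not the
# actual BLOOM modeling code). Assumes torch.distributed is initialized
# with one process per GPU and `group` is the tensor-parallel ProcessGroup.
import torch
import torch.nn as nn
import torch.distributed as dist


class ColumnParallelLinear(nn.Module):
    """Splits the output dimension across ranks: each rank computes its own slice of Y."""

    def __init__(self, in_features, out_features, group):
        super().__init__()
        self.group = group
        world_size = dist.get_world_size(group)
        assert out_features % world_size == 0
        self.linear = nn.Linear(in_features, out_features // world_size)

    def forward(self, x):
        # Each rank produces its shard of the output; no communication needed here.
        return self.linear(x)


class RowParallelLinear(nn.Module):
    """Splits the input dimension across ranks and all-reduces the partial sums."""

    def __init__(self, in_features, out_features, group):
        super().__init__()
        self.group = group
        world_size = dist.get_world_size(group)
        assert in_features % world_size == 0
        self.linear = nn.Linear(in_features // world_size, out_features, bias=False)
        # Bias is added once, after the all-reduce, so it isn't summed world_size times.
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        # x is the shard produced by the preceding column-parallel layer.
        out = self.linear(x)
        dist.all_reduce(out, group=self.group)  # sum the partial results across ranks
        return out + self.bias
```

The usual pattern is that the QKV projection in attention and the first MLP linear become column-parallel, while the attention output projection and the second MLP linear become row-parallel, so each sub-layer only needs a single all-reduce.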

2/ An easy optimization: add torch.jit.script around the gelu. That's a ~10% speed increase, pretty much for free.
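
For reference, the change is roughly the following (a sketch; the exact gelu formulation and function name in the BLOOM modeling file may differ slightly):

```python
import torch


@torch.jit.script  # lets TorchScript fuse the elementwise ops into fewer kernel launches
def gelu_fused(x: torch.Tensor) -> torch.Tensor:
    # tanh approximation of GELU, as commonly used in BLOOM-style models
    return x * 0.5 * (1.0 + torch.tanh(0.79788456 * x * (1.0 + 0.044715 * x * x)))
```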

3/ Write a custom kernel for the attention + softmax (this is more involved); that's another ~10% speed increase.
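
Writing the actual fused CUDA kernel is beyond the scope of a forum reply, but the sequence it replaces is roughly the scale + mask + softmax step of attention, something like this (a stand-in sketch; the real kernel fuses these into a single launch rather than several separate elementwise/reduction kernels):

```python
import torch


@torch.jit.script
def masked_softmax(scores: torch.Tensor, mask: torch.Tensor, scale: float) -> torch.Tensor:
    # A custom attention kernel fuses this scale + mask + softmax sequence.
    scores = scores * scale
    scores = scores.masked_fill(mask, float("-inf"))
    return scores.softmax(dim=-1)
```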

For all of these first three steps, you can check out https://github.com/huggingface/transformers/tree/thomas/add_custom_kernels which is public, but no documentation comes with it :).

4/ Once you get all of this working correctly, you need to figure out a serving option. The biggest challenge is serving different requests with different parameters in the same batch: some might require 10 tokens, some 200; some might use sampling, others greedy decoding, each with different sampling parameters. To serve all use cases while still batching (batching is essentially free, since 8x A100 is far more compute than necessary even for BLOOM at small batch sizes), you need to drop the simple code and basically rewrite generate using the low-level LogitsProcessor and StoppingCriteria logic. Then you can apply that generation logic differently to each item in the batch (the trickier part is handling the past key/values correctly); a sketch of the idea follows below.

This adds some overhead (around 5%), but it provides much better latency guarantees for API users who might be using the API in different ways at the same time. Depending on your use case you might not need that. For instance, we use Redis in pub/sub mode so that a simple webserver can distribute the requests over the n processes handling the various GPUs (and send the generated content back).
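
As a rough illustration of that idea, here is a simplified decode loop where each request in the batch carries its own logits processors and token budget. This is only a sketch under my own assumptions (a small BLOOM checkpoint for brevity, no proper StoppingCriteria, and finished requests simply keep riding along in the batch), not the actual server code:

```python
# Simplified sketch: one batch, but per-request generation parameters.
# Not the production server code; names and structure are illustrative.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    LogitsProcessorList,
    TemperatureLogitsWarper,
    TopPLogitsWarper,
)

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
tokenizer.padding_side = "left"  # left-pad so the past key/values line up across the batch
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m").eval()

# Each request brings its own processors (sampling vs greedy) and token budget.
requests = [
    {"prompt": "Hello, my name is", "max_new_tokens": 10,
     "processors": LogitsProcessorList([TemperatureLogitsWarper(0.7), TopPLogitsWarper(0.9)])},
    {"prompt": "The capital of France is", "max_new_tokens": 5,
     "processors": LogitsProcessorList([])},  # empty list -> greedy decoding
]

inputs = tokenizer([r["prompt"] for r in requests], return_tensors="pt", padding=True)
input_ids, attention_mask = inputs.input_ids, inputs.attention_mask
past = None
generated = [0] * len(requests)
done = [False] * len(requests)

with torch.no_grad():
    while not all(done):
        out = model(
            input_ids=input_ids if past is None else input_ids[:, -1:],
            attention_mask=attention_mask,
            past_key_values=past,
            use_cache=True,
        )
        past = out.past_key_values
        next_tokens = []
        for i, req in enumerate(requests):
            # Apply this request's own processors to its row of logits only.
            logits = req["processors"](input_ids[i : i + 1], out.logits[i : i + 1, -1, :])
            if len(req["processors"]) > 0:
                token = torch.multinomial(logits.softmax(dim=-1), num_samples=1)[0]
            else:
                token = logits.argmax(dim=-1)
            next_tokens.append(token)
            generated[i] += 1
            done[i] = generated[i] >= req["max_new_tokens"]
        # Finished requests keep generating here; a real server would evict them from the batch.
        input_ids = torch.cat([input_ids, torch.stack(next_tokens)], dim=-1)
        attention_mask = torch.cat(
            [attention_mask, torch.ones(len(requests), 1, dtype=attention_mask.dtype)], dim=-1
        )

print(tokenizer.batch_decode(input_ids, skip_special_tokens=True))
```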

Having explained all the steps we went through, you might understand better why this isn't already in our libs: some of this stuff requires being much more involved than what transformers aims to be (simple to use and simple to hack). So we're really trying to figure out a way to keep the simplicity where it should be, while enabling others to use all those techniques should they want to.

Also, I'm not even mentioning the custom weights we use to speed up loading times from ~10+ minutes to 45 s (which helps tremendously when iterating on such a large model).
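
The idea there is to pre-shard the weights per tensor-parallel rank once, so that at startup each process only reads its own slice straight onto its GPU. A rough sketch (using safetensors here purely for illustration, since it comes up later in this thread; it is not necessarily the exact format or code we used):

```python
# Rough sketch: pre-shard column-parallel weights per TP rank so every process
# only loads its own file at startup. Illustrative only; helpers are made up.
import torch
from safetensors.torch import save_file, load_file


def shard_and_save(state_dict, world_size, column_parallel_keys):
    """One-off preprocessing: split the listed weights along dim 0, one file per rank."""
    for rank in range(world_size):
        shard = {}
        for name, tensor in state_dict.items():
            if name in column_parallel_keys:
                # (row-parallel weights would be split along dim 1; omitted for brevity)
                shard[name] = tensor.chunk(world_size, dim=0)[rank].contiguous()
            else:
                shard[name] = tensor.contiguous()
        save_file(shard, f"model_tp_rank_{rank}.safetensors")


def load_rank_shard(rank, device="cuda:0"):
    """At startup, each process reads only its own shard, straight to its device."""
    return load_file(f"model_tp_rank_{rank}.safetensors", device=device)
```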

@Narsil Hey, thank you for sharing so much detail; I will try some of these on my own. I'm also excited to wait for the open-source solution someday.

| 16xA100(40)
@Narsil thanks for sharing. I would like to know whether the "16xA100(40)" you mentioned are all in a single node. We have 16x A6000 (48GB), but unfortunately they are spread across 2 nodes, which prevents us from using "transformers + accelerate". Is your machine natively equipped with 16 GPUs, or did you adopt some kind of virtual machine technology to group the 16 GPUs together? Thanks in advance.

BigScience Workshop org

FYI, we finally reorganized the demo scripts; you can read about them here: https://huggingface.co/blog/bloom-inference-pytorch-scripts

https://github.com/huggingface/transformers-bloom-inference is now the central place gathering the various fast solutions for BLOOM inference.

BigScience Workshop org

| Is your machine natively equipped with 16 GPUs
We're using a GCP node for now (it was the most convenient at the time of our work). GCP provides 16x A100 (40GB), which is enough to run BLOOM.
Both Azure and AWS provide 8x A100 (80GB) nodes, which are also enough, but getting access may be slightly more inconvenient.

I made a big mistake: BLOOM is running in TP (tensor parallelism), not PP (pipeline parallelism, which is what accelerate does). Sorry for the bad typo. :)

@Narsil We noticed an excellent blog post from you (https://huggingface.co/blog/bloom-inference-optimization).
We have now implemented the whole thing with an architecture of "bloom-176b + deepspeed + ds-mii + torchserve on A6000 x 16". Besides unpredictable crashes, we also ran into trouble because ds-mii does not support multi-node (we did some hacking to make it work across nodes, but failover after a process crash is still unresolved).
From your branch we learned how to replace DeepSpeed tensor parallelism (https://github.com/huggingface/transformers/tree/thomas/dirty_bloom_tp), which is very inspiring, and from the "Webserver part" section we also noticed there are ways to replace DS-MII ("...we opted to communicate raw strings over a Redis pub/sub to distribute the requests to all processes at once..."), but unfortunately no source code link seems to be shared in that blog post.
Would you kindly share the source for the webserver part as well? It would help us a lot. Thanks in advance.

BigScience Workshop org

The simpler version:
https://github.com/huggingface/transformers_bloom_parallel/

The more complex version, which has some cooler features:
https://github.com/huggingface/text-generation-inference

@Narsil Sorry to bother you, but do you have a prebuilt wheel file for "safetensors"? I ran into an SSL problem while running "python setup.py develop" for it, and it is difficult to work around. I would also appreciate it if you could let me know of any alternative way to skip this part.

BigScience Workshop org

pip install safetensors should work on most platforms.

Please report it here if not: https://github.com/huggingface/safetensors/issues

Yes, "pip install" works, thank you very much. @Narsil

@Narsil first, thanks for generously sharing all of this source code. We have installed the text-generation-inference solution on our K8s cluster and explored some of the features, but there is one point I am wondering about.
We observed that a "streaming response mode" is provided in the Hugging Face hosted BLOOM inference API (https://huggingface.co/bigscience/bloom), but I can't see a related implementation inside "text-generation-inference/router".
Would you kindly give me some hint about this?

BigScience Workshop org

| streaming response mode

Not sure what you are referring to?

It means "output word by word"

BigScience Workshop org

Oh, the widget simply displays the output as if it were progressive; the actual request is made in bulk.

We considered adding streaming responses, but so far we haven't added them to the public demo, and it's unlikely to come soon (too much engineering for the benefit at the moment).
