The best way to deploy UMT5 variants into production with low-latency inference?
#4 opened by Respair
This is such a neat model, but I don't see it supported by most frameworks, since it was trained with a different sampling method.
Can you recommend any way to deploy this model (by which I mean the model we fine-tune on a downstream task) into production, and possibly a trivial way to convert it to ONNX? Optimum doesn't support it just yet. Preferably something that runs on GPUs.
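For context, the closest thing I can think of without Optimum is a manual `torch.onnx.export` of the encoder alone. This is a rough sketch, not something I'm sure works end to end, and it assumes `UMT5ForConditionalGeneration` from `transformers`; `my-umt5-finetuned` is just a placeholder for the downstream checkpoint:

```python
# Rough sketch only: hand-export the UMT5 encoder to ONNX.
# "my-umt5-finetuned" is a hypothetical fine-tuned checkpoint name.
import torch
from transformers import AutoTokenizer, UMT5ForConditionalGeneration

model_id = "my-umt5-finetuned"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = UMT5ForConditionalGeneration.from_pretrained(model_id).eval()
model.config.return_dict = False  # tuple outputs trace more cleanly

inputs = tokenizer("a sample input", return_tensors="pt")

# Export the encoder alone; the decoder (with its KV cache) would need
# a separate export plus a hand-rolled generation loop on the ONNX side.
torch.onnx.export(
    model.get_encoder(),
    (inputs["input_ids"], inputs["attention_mask"]),
    "umt5_encoder.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["last_hidden_state"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "seq"},
        "attention_mask": {0: "batch", 1: "seq"},
        "last_hidden_state": {0: "batch", 1: "seq"},
    },
    opset_version=17,
)
```

Even if that works, low-latency generation would still need the decoder exported with its past key values and a custom loop on onnxruntime-gpu's CUDAExecutionProvider, which is exactly the plumbing Optimum normally provides. Hence the question.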
Man, I really wish there was a vLLM for seq2seq models like this. Their potential is so underrated. If this tiny voice of mine can be heard by the big folks at Google, please build a framework that makes it easier to serve seq2seq models!