Deployment?

#24 by huggingface9837

I've been playing around with trying to fine-tune various versions of Llama, with the ultimate goal of embedding the model in a binary I ship with my software to users. I'm struggling on one point (aside from the general mess that is ML packaging & versioning): how can I deploy this in a generic way, with all dependencies included, whilst keeping memory use and inference time as low as possible? I don't want to ship some gargantuan Python environment or a Docker build or anything crazy like that.

Does anyone know how I might do that with TinyLlama? How do you make sure the user has the right accelerators / drivers / whatever so that it's actually using the GPU? Is this feasible, or are we just too early in the LLM space to expect something like this? If it is feasible, does anyone know of any good resources for how I might do this? If it matters, I'd primarily like to ship to just Windows right now.

I'm looking for something similar. What I found is using the huggingface/candle WASM module and serving the downloaded/bundled model.

Thanks, I'll take a look!

Probably one of the best ways right now is through llama.cpp or any of its many binding libraries. It's small and efficient, and you'll just need the model in GGUF format, like TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF.
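
For anyone who lands here later, this is roughly what that route looks like through the llama-cpp-python binding (just one of the many bindings; the C API and other language bindings follow the same shape). Treat it as a minimal sketch rather than a drop-in solution: the filename is assumed to be the Q4_K_M file from that GGUF repo downloaded next to your app, and `n_gpu_layers` only has an effect if the wheel was built with GPU support (CUDA, Metal, etc.); otherwise it just runs on the CPU.

```python
# Minimal sketch: load a quantized TinyLlama GGUF with llama-cpp-python and run one chat turn.
from llama_cpp import Llama

llm = Llama(
    model_path="tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf",  # assumed local path to the 4-bit file from the GGUF repo
    n_ctx=2048,       # context window
    n_gpu_layers=-1,  # offload all layers to the GPU if the binding was built with GPU support
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```

If the goal is a single self-contained Windows binary with no Python runtime, the same idea carries over to llama.cpp's C/C++ API or its bindings in compiled languages (Rust, Go, C#, etc.), with the GGUF file shipped alongside the executable.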
