Question: do I need to host two instances?

#1
by hiepxanh - opened

I would like to test and implement it. Right now the issue is that most inference and API platforms only support a model for either embedding or text generation. So, as you mentioned, do I have to host two APIs using the same single model? I mean, do I need to build something new to run it, or can I use an inference engine like llama.cpp?

GritLM org

Not sure. If the inference platform lets you access the final hidden states of your text generation model, then you can host only a text generation endpoint and either use it for generation or take the final hidden states and average them across the sequence length to get the embedding.
I think something like llama.cpp should work.
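
For reference, here is a minimal sketch of that idea with Hugging Face transformers: load one model and reuse it for both generation and embedding by mean-pooling the final hidden states over non-padding tokens. This is a simplification, not the official GritLM code; it assumes the GritLM/GritLM-7B checkpoint and ignores GritLM's embedding instruction formatting.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "GritLM/GritLM-7B"  # assumed checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # needed for batched padding
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def embed(texts):
    # Run a normal forward pass and mean-pool the last hidden layer,
    # masking out padding tokens (simplified pooling, no instruction handling).
    batch = tokenizer(texts, padding=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**batch, output_hidden_states=True)
    hidden = out.hidden_states[-1]                # (batch, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

def generate(prompt, max_new_tokens=64):
    # Same model instance used as a plain text generation endpoint.
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

So a single hosted model can serve both purposes, as long as the serving stack exposes the hidden states (or you wrap it yourself as above).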
