GCP system to host llama2 70B Chat model

#40
by Hammad-Ahmad - opened

@TheBloke can you please help me figure out which machine type is best for hosting the Llama 2 70B Chat model on GCP for production? It needs to handle multiple concurrent requests, maintain multiple user chat sessions at the same time, and load-balance effectively.

Which quantized model is best, offering the maximum context length, highest throughput, and minimum response time? Thanks in advance!
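For context, one common way to serve a quantized 70B model with batched concurrent requests is Hugging Face's Text Generation Inference server. This is only a sketch: the model repo name, shard count, and token limits below are assumptions you would tune for your actual GPU setup on GCP (e.g. an A2/A3 instance with multiple A100s), not a definitive production recipe.

```shell
# Sketch: launch TGI with a GPTQ-quantized Llama 2 70B Chat model.
# Assumes a GCP GPU VM with Docker and the NVIDIA container toolkit installed;
# model-id, shard count, and limits are illustrative, not prescriptive.
docker run --gpus all -p 8080:80 \
  -v /mnt/models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TheBloke/Llama-2-70B-chat-GPTQ \
  --quantize gptq \
  --num-shard 2 \
  --max-input-length 3072 \
  --max-total-tokens 4096 \
  --max-concurrent-requests 128
```

TGI handles continuous batching internally, so multiple chat sessions share the GPUs efficiently; for load balancing across several such VMs, a GCP HTTP(S) load balancer in front of a managed instance group is the usual pattern.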
