System configuration for deploying a web app like HuggingChat

#250 by Nirmalt13

I would like to know the cloud configuration required to deploy the llama2-70b-chat model, which is used by HuggingChat.

nsarrazin (Hugging Chat org)

If you want to deploy Llama 2 on your own infrastructure, you can try using text-generation-inference (TGI).
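
For reference, TGI is usually launched from its official Docker image (ghcr.io/huggingface/text-generation-inference) and then queried over HTTP. Below is a minimal sketch of querying a running TGI server with the huggingface_hub client, reusing the sampling parameters from the config further down; the local URL and shard count are assumptions, not values from this thread:

    from huggingface_hub import InferenceClient

    # Assumes TGI is already serving the model, e.g. started with something like:
    #   docker run --gpus all --shm-size 1g -p 8080:80 \
    #     ghcr.io/huggingface/text-generation-inference \
    #     --model-id meta-llama/Llama-2-70b-chat-hf --num-shard 8
    client = InferenceClient("http://127.0.0.1:8080")

    output = client.text_generation(
        "<s>[INST] How do I make a delicious lemon cheesecake? [/INST] ",
        max_new_tokens=1024,
        temperature=0.1,
        top_p=0.95,
        repetition_penalty=1.2,
    )
    print(output)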

If you don't have access to enough local compute to run it yourself, you can deploy it with AWS SageMaker, for example, and follow the guide on setting it up with chat-ui.
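
If you go the SageMaker route, here is a minimal sketch using the sagemaker Python SDK's Hugging Face LLM container (which runs TGI under the hood). The instance type, GPU count, and env values are assumptions for a 70B deployment, not settings from this thread, and you will need an execution role plus a token with access to the gated Llama 2 weights:

    import sagemaker
    from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

    # Resolve the Hugging Face LLM (TGI) container image for SageMaker
    image_uri = get_huggingface_llm_image_uri("huggingface")

    model = HuggingFaceModel(
        image_uri=image_uri,
        role=sagemaker.get_execution_role(),  # assumes a SageMaker execution role
        env={
            "HF_MODEL_ID": "meta-llama/Llama-2-70b-chat-hf",
            "SM_NUM_GPUS": "8",  # shard the model across all GPUs on the instance
            "HUGGING_FACE_HUB_TOKEN": "<your-hf-token>",  # Llama 2 weights are gated
        },
    )

    # ml.p4d.24xlarge (8x A100 40GB) is one instance type that can host the 70B model
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type="ml.p4d.24xlarge",
        container_startup_health_check_timeout=600,  # large models take a while to load
    )

    print(predictor.predict({"inputs": "Hello"}))

Once the endpoint is up, you can point chat-ui at it as described in the guide.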

Regarding the chat-ui parameters for Llama 2, here is what we use:

    "userMessageToken": "",
    "userMessageEndToken": " [/INST] ",
    "assistantMessageToken": "",
    "assistantMessageEndToken": " </s><s>[INST] ",
    "preprompt": "<s>[INST] <<SYS>>\n\n<</SYS>>\n\n",
    "promptExamples": [
      {
        "title": "Write an email from bullet list",
        "prompt": "As a restaurant owner, write a professional email to the supplier to get these products every week: \n\n- Wine (x10)\n- Eggs (x24)\n- Bread (x12)"
      }, {
        "title": "Code a snake game",
        "prompt": "Code a basic snake game in python, give explanations for each step."
      }, {
        "title": "Assist in a task",
        "prompt": "How do I make a delicious lemon cheesecake?"
      }
    ],
    "parameters": {
      "temperature": 0.1,
      "top_p": 0.95,
      "repetition_penalty": 1.2,
      "top_k": 50,
      "truncate": 1000,
      "max_new_tokens": 1024
    },

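To make the token settings above concrete, here is a sketch of how they concatenate into the Llama 2 chat format; the function and variable names are hypothetical, not chat-ui internals:

    # Token strings copied from the config above (with an empty system prompt).
    PREPROMPT = "<s>[INST] <<SYS>>\n\n<</SYS>>\n\n"
    USER_MESSAGE_END = " [/INST] "
    ASSISTANT_MESSAGE_END = " </s><s>[INST] "

    def build_prompt(turns):
        """turns: list of (user_message, assistant_reply or None) pairs."""
        prompt = PREPROMPT
        for user, assistant in turns:
            prompt += user + USER_MESSAGE_END
            if assistant is not None:
                prompt += assistant + ASSISTANT_MESSAGE_END
        return prompt

    # "<s>[INST] <<SYS>>\n\n<</SYS>>\n\nHi! [/INST] Hello! </s><s>[INST] Tell me a joke. [/INST] "
    print(build_prompt([("Hi!", "Hello!"), ("Tell me a joke.", None)]))
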
You can find docs about the rest of the parameters here.

I hope that's enough to get started, let me know if you need anything else.

Hey @nsarrazin, thanks for your reply.
I also wanted to know the NCCL/GPU configuration you used to run text-generation-inference. Also, how many concurrent requests can it handle?
