How do I run it?

by Yuuru

Sorry, which loader works currently? I've tried a few with no success, but it could be me doing something wrong.

With oobabooga and the ExLlama_HF loader I got the following error: KeyError: 'model.embed_tokens.weight'
Other GPTQ models load without issue; for example, TheBloke/Synthia-34B-v1.2-GPTQ loads without error (using 21 GB of VRAM).

I heard that ExLlama added support for Yi recently, so you might just need to update ExLlama.

(4-bit versions only of course, not 3-bit or 8-bit)

Or the Transformers loader should work. Not AutoGPTQ yet.
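If you go the Transformers route, here's a minimal sketch of what loading should look like (assuming a recent transformers with the auto-gptq and optimum packages installed, which Transformers relies on for GPTQ checkpoints; whether trust_remote_code is needed may depend on the revision):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Yi-34B-GPTQ"

# GPTQ checkpoints load through Transformers when auto-gptq and optimum are installed
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",        # spread layers across available GPUs
    trust_remote_code=True,   # the original Yi repos shipped custom modeling code
)

inputs = tokenizer("what is vue.js", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))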

Feel free to try this new project to serve the model locally: https://github.com/vectorch-ai/ScaleLLM
1: Start the model inference server

docker run -it --gpus=all --net=host --shm-size=1g \
  -v $HOME/.cache/huggingface/hub:/models \
  -e HF_MODEL_ID=TheBloke/Yi-34B-GPTQ \
  -e DEVICE=auto \
  docker.io/vectorchai/scalellm:latest --logtostderr

2: Start the REST API server

docker run -it --net=host \
  docker.io/vectorchai/scalellm-gateway:latest --logtostderr

You will get the following running services:

  • ScaleLLM gRPC server on port 8888: localhost:8888
  • ScaleLLM HTTP server for monitoring on port 9999: localhost:9999
  • ScaleLLM REST API server on port 8080: localhost:8080

Then send requests:

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheBloke/Yi-34B-GPTQ",
    "prompt": "what is vue.js",
    "max_tokens": 32,
    "temperature": 0.7
  }'

@Yuuru

You can also run inference in textgen with the ExLlamaV2 loader:

https://huggingface.co/01-ai/Yi-34B/discussions/22#654fb707380ee26b49b3b180
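For reference, a rough standalone sketch using ExLlamaV2's basic generator API (the model directory is a placeholder for wherever you downloaded the GPTQ weights; sampler values just mirror the curl example above):

from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "/path/to/Yi-34B-GPTQ"  # placeholder: local snapshot of the repo
config.prepare()

model = ExLlamaV2(config)
model.load()  # or pass a gpu_split list, e.g. model.load([16, 24]), for multi-GPU
tokenizer = ExLlamaV2Tokenizer(config)
cache = ExLlamaV2Cache(model)

generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.7

print(generator.generate_simple("what is vue.js", settings, 32))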
