How do I consume the successfully running pod with your one-click UI template through an API?

#18
by gr8ston - opened

Hey Bloke, great work! You and your team have been doing an incredible job with this space.

I have successfully run your template to load guanaco-65B in the pod and was able to use the UI to generate text from prompts. However, I want to consume the pod's loaded model from my local Python code through an API so I can build my own application on top of it. I can see that "API" is enabled in the interface options; however, I can't find any documentation on how to consume it through an API.

Thanks in advance.


Much easier and probably cheaper (depending on usage) to deploy on modal.com, btw. I got it running there no problem. Just repurpose some of their example code to get started.

My team? ;)

I have two templates. They run identical software, but the one called "Local LLMs One-Click UI" doesn't expose the API, and the "One-Click UI and API" does. So make sure you're using the latter.

The API one exposes two ports, 5000 and 5005. 5000 is an HTTPS port; 5005 is a TCP port for use with websockets.

There are two example scripts included in text-generation-webui: a streaming one and a non-streaming one.

I believe the streaming example (which returns text word-by-word, like ChatGPT) uses port 5005 with websockets, and the non-streaming one uses port 5000 with HTTPS.

To connect to the HTTPS port:

  • Click the CONNECT button in Runpod
  • Right-click where it says "HTTP 5000", and copy the link address
  • Now connect to that address on port 443 from your API client code
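For example, something like this from Python should work for the blocking API. This is just a rough sketch: I'm assuming the /api/v1/generate endpoint that the webui's API extension exposes, so check the bundled non-streaming example script for the exact parameters, and swap in the address you just copied.

import requests

# Placeholder: paste the address copied from "HTTP 5000" in Runpod here.
URL = "https://your-pod-id-5000.proxy.runpod.net/api/v1/generate"

payload = {
    "prompt": "Tell me about alpacas.",
    "max_new_tokens": 200,
    "temperature": 0.7,
}

# The blocking endpoint returns the whole completion in one JSON response.
response = requests.post(URL, json=payload, timeout=300)
print(response.json()["results"][0]["text"])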

To connect to the web sockets port:

  • Click the CONNECT button in Runpod
  • Click on the "TCP ports" tab (or similar name; I forget the exact wording)
  • Find the entry that says Internal: 5005 and make a note of its external port
  • Then connect to the pod's external IP (it will be shown on that tab, or under CONNECT) on that external port. There's a screenshot in the README demonstrating this, although the UI has changed slightly since I took it; I need to update that.
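And for streaming, roughly this with the websockets library. Again a sketch, assuming the /api/v1/stream endpoint used by the bundled streaming example; HOST and PORT are placeholders for the external IP and the external port mapped to 5005.

import asyncio
import json
import websockets

HOST = "1.2.3.4"   # placeholder: the pod's external IP
PORT = 12345       # placeholder: the external port mapped to internal 5005

async def stream(prompt: str):
    async with websockets.connect(f"ws://{HOST}:{PORT}/api/v1/stream") as ws:
        # Send the generation request as JSON, then print tokens as they arrive.
        await ws.send(json.dumps({"prompt": prompt, "max_new_tokens": 200}))
        while True:
            message = json.loads(await ws.recv())
            if message.get("event") == "text_stream":
                print(message["text"], end="", flush=True)
            elif message.get("event") == "stream_end":
                break

asyncio.run(stream("Tell me about alpacas."))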


I agree that serverless could be a good solution. I've not investigated it myself yet, but my friend Wing Lian has done great work creating Docker containers that deploy GPU-accelerated GGML models on Runpod serverless: https://github.com/OpenAccess-AI-Collective/servereless-runpod-ggml

I've not tried Modal yet (I plan to; they gave me some free credits to try it out), but my first impression was that it looked much more expensive than Runpod on account of the huge RAM pricing: $0.04 per GB per hour. That basically kills PyTorch inference, which has to load the full model into RAM as well as onto the GPU; a 30B GPTQ 4-bit model would use at least 24GB of RAM, which is another $0.96/hr on top of the GPU cost (itself already higher than Runpod's). For GGML it might work better, given that a GPU-accelerated GGML model only uses a few gigs of RAM once fully loaded onto the GPU. But it's still going to end up more expensive than Runpod.

I've not tried Serverless on either yet, but that was my impression just looking at their pricing. I guess if Modal ends up being more reliable or faster then that could balance out. But I was surprised at the cost.

Thanks for the quick and detailed response. :-)

Let me try it out and get back to you. Looking forward to seeing something from you on the serverless front soon.


Nice share! I was going to paste you the modal.com code, but when running it I hit an error I have yet to resolve. Maybe it was the 33B model I ran. Anyway, here's at least the Falcon example they provide, which I've been adapting for other models:

# ---
# integration-test: false
# ---
# # Run Falcon-40B with AutoGPTQ

# In this example, we run a quantized 4-bit version of Falcon-40B, the first open-source large language
# model of its size, using HuggingFace's [transformers](https://huggingface.co/docs/transformers/index)
# library and [AutoGPTQ](https://github.com/PanQiWei/AutoGPTQ).
#
# Due to the current limitations of the library, the inference speed is a little under 1 token/second and the
# cold start time on Modal is around 25s.
#
# For faster inference at the expense of a slower cold start, check out
# [Running Falcon-40B with `bitsandbytes` quantization](/docs/guide/ex/falcon_bitsandbytes). You can also
# run a smaller, 7-billion-parameter model with the [OpenLLaMa example](/docs/guide/ex/openllama).
#
# ## Setup
#
# First we import the components we need from `modal`.

from modal import Image, Stub, gpu, method, web_endpoint

# ## Define a container image
#
# To take advantage of Modal's blazing fast cold-start times, we download model weights
# into a folder inside our container image. These weights come from a quantized model
# found on Huggingface.
IMAGE_MODEL_DIR = "/model"


def download_model():
    from huggingface_hub import snapshot_download

    model_name = "TheBloke/falcon-40b-instruct-GPTQ"
    snapshot_download(model_name, local_dir=IMAGE_MODEL_DIR)


# Now, we define our image. We'll use the `debian-slim` base image, and install the dependencies we need
# using [`pip_install`](/docs/reference/modal.Image#pip_install). At the end, we'll use
# [`run_function`](/docs/guide/custom-container#running-a-function-as-a-build-step-beta) to run the
# function defined above as part of the image build.

image = (
    Image.debian_slim(python_version="3.10")
    .apt_install("git")
    .pip_install(
        "huggingface_hub==0.14.1",
        "transformers @ git+https://github.com/huggingface/transformers.git@f49a3453caa6fe606bb31c571423f72264152fce",
        "auto-gptq @ git+https://github.com/PanQiWei/AutoGPTQ.git@b5db750c00e5f3f195382068433a3408ec3e8f3c",
        "einops==0.6.1",
    )
    .run_function(download_model)
)

# Let's instantiate and name our [Stub](/docs/guide/apps).
stub = Stub(name="example-falcon-gptq", image=image)


# ## The model class
#
# Next, we write the model code. We want Modal to load the model into memory just once every time a container starts up,
# so we use [class syntax](/docs/guide/lifecycle-functions) and the `__enter__` method.
#
# Within the [@stub.cls](/docs/reference/modal.Stub#cls) decorator, we use the [gpu parameter](/docs/guide/gpu)
# to specify that we want to run our function on an [A100 GPU](/pricing). We also allow each call 10 minutes to complete,
# and request the runner to stay live for 5 minutes after its last request.
#
# The rest is just using the `transformers` library to run the model. Refer to the
# [documentation](https://huggingface.co/docs/transformers/v4.29.1/en/main_classes/text_generation#transformers.GenerationMixin.generate)
# for more parameters and tuning.
#
# Note that we need to create a separate thread to call the `generate` function because we need to
# yield the text back from the streamer in the main thread. This is an idiosyncrasy with streaming in `transformers`.


@stub.cls(gpu=gpu.A100(), timeout=60 * 10, container_idle_timeout=60 * 5)
class Falcon40BGPTQ:
    def __enter__(self):
        from transformers import AutoTokenizer
        from auto_gptq import AutoGPTQForCausalLM

        self.tokenizer = AutoTokenizer.from_pretrained(
            IMAGE_MODEL_DIR, use_fast=True
        )
        print("Loaded tokenizer.")

        self.model = AutoGPTQForCausalLM.from_quantized(
            IMAGE_MODEL_DIR,
            trust_remote_code=True,
            use_safetensors=True,
            device_map="auto",
            use_triton=False,
            strict=False,
        )
        print("Loaded model.")

    

@method

	()
    def generate(self, prompt: str):
        from threading import Thread
        from transformers import TextIteratorStreamer

        inputs = self.tokenizer(prompt, return_tensors="pt")
        streamer = TextIteratorStreamer(
            self.tokenizer, skip_special_tokens=True
        )
        generation_kwargs = dict(
            inputs=inputs.input_ids.cuda(),
            attention_mask=inputs.attention_mask,
            temperature=0.1,
            max_new_tokens=512,
            streamer=streamer,
        )

        # Run generation on separate thread to enable response streaming.
        thread = Thread(target=self.model.generate, kwargs=generation_kwargs)
        thread.start()
        for new_text in streamer:
            yield new_text

        thread.join()


# ## Run the model
# We define a [`local_entrypoint`](/docs/guide/apps#entrypoints-for-ephemeral-apps) to call our remote function
# sequentially for a list of inputs. You can run this locally with `modal run -q falcon_gptq.py`. The `-q` flag
# enables streaming to work in the terminal output.
prompt_template = (
    "A chat between a curious human user and an artificial intelligence assistant. The assistant give a helpful, detailed, and accurate answer to the user's question."
    "\n\nUser:\n{}\n\nAssistant:\n"
)




@stub.local_entrypoint()
def cli():
    question = "What are the main differences between Python and JavaScript programming languages?"
    model = Falcon40BGPTQ()
    for text in model.generate.call(prompt_template.format(question)):
        print(text, end="", flush=True)


# ## Serve the model
# Finally, we can serve the model from a web endpoint with `modal deploy falcon_gptq.py`. If
# you visit the resulting URL with a `question` parameter, you can watch the model stream
# back a response.
# You can try our deployment [here](https://modal-labs--example-falcon-gptq-get.modal.run/?question=Why%20are%20manhole%20covers%20round?).


@stub.function(timeout=600)
@web_endpoint()
def get(question: str):
    from fastapi.responses import StreamingResponse
    from itertools import chain

    model = Falcon40BGPTQ()
    return StreamingResponse(
        chain(
            ("Loading model. This usually takes around 20s ...\n\n"),
            model.generate.call(prompt_template.format(question)),
        ),
        media_type="text/event-stream",
    )
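
If you do get it deployed, consuming the web endpoint from Python should look roughly like this (placeholder URL; use whatever modal deploy prints for your workspace):

import requests

# Placeholder URL: substitute the one `modal deploy falcon_gptq.py` prints.
URL = "https://your-workspace--example-falcon-gptq-get.modal.run/"

# The endpoint streams text back, so read the response incrementally.
with requests.get(URL, params={"question": "Why are manhole covers round?"}, stream=True) as r:
    for chunk in r.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)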

@TheBloke - I did successfully get the API up and running and am able to consume it in my VS Code project. However, I keep getting a CORS error when I attach my app to a secured site. I understand what CORS is and why it happens: my x.com is trying to load data from y.proxy.runpod.net, hence the error. This usually gets fixed on the server, but with Runpod and your API template I am not sure how to fix it. I tried many of the ways I'm aware of but nothing worked out. Any help is appreciated.
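For reference, the kind of workaround I've been considering is putting a small CORS-enabled proxy of my own in front of the pod, roughly like this FastAPI sketch (the pod URL and endpoint are placeholders from my setup; I don't know if this is the right approach):

import requests
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

# Placeholder: the pod's HTTP 5000 proxy address.
POD_URL = "https://your-pod-id-5000.proxy.runpod.net"

app = FastAPI()
# Add the CORS headers that the browser is complaining about.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://x.com"],  # the site that needs to call the API
    allow_methods=["*"],
    allow_headers=["*"],
)

@app.post("/api/v1/generate")
def generate(body: dict):
    # Forward the request to the pod and relay its JSON response.
    r = requests.post(f"{POD_URL}/api/v1/generate", json=body, timeout=300)
    return r.json()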

Sorry, I'm really not sure about web development questions.

If you learn of any change I can make to the template that would help, then let me know and I can look into adding it. But I have no idea how to fix CORS in general.

@TheBloke, I am trying to do the same. Basically, I am working on a RAG (Retrieval-Augmented Generation) project and I need to use the model instance from my VS Code project. I am new to the Linux part. Once the model is downloaded into the workspace with wget, how can I create a variable that references the model object?
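To illustrate what I mean, I'm hoping for something along these lines, mirroring the AutoGPTQ loading code above (hypothetical path; I'm not sure this is the right way):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Hypothetical path: wherever wget put the model files in the workspace.
MODEL_DIR = "/workspace/guanaco-65B-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    MODEL_DIR,
    use_safetensors=True,
    device_map="auto",
)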

I genuinely appreciate your help and I look up to you.

@gr8ston, if you know how to do this, please let me know.
