Has anybody tried the inference endpoints of this model?

#9 opened by adityasharma2695

Hi

I want to know if anyone has tried the inference endpoint for this model. I am currently not on a paid account, so I would like to know whether this model can be accessed via an API using the inference endpoints.

Thanks!
Aditya

@adityasharma2695 I would also love to access the model via the free Inference API. I just got a Pro account, but it doesn't change anything: the model is still not available, and you get this message when you try to access it via the free Inference API:

The model HuggingFaceH4/starchat-alpha is too large to be loaded automatically (31GB > 10GB). For commercial use please use PRO spaces (https://huggingface.co/spaces) or Inference Endpoints (https://huggingface.co/inference-endpoints)

As mentioned in https://huggingface.co/spaces/HuggingFaceH4/starchat-playground/discussions/3#6469fc4e96cfe72aef76aaeb, I think there is currently no way to use the model for free via the Inference API. But I might be wrong 🙈
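For reference, this is roughly what a call to the free Inference API looks like (just a sketch: the token is a placeholder and I'm assuming the standard text-generation payload format):

import requests

# Hosted (free) Inference API endpoint for the model
API_URL = "https://api-inference.huggingface.co/models/HuggingFaceH4/starchat-alpha"
headers = {"Authorization": "Bearer hf_xxx"}  # placeholder token

payload = {"inputs": "def print_hello_world():"}
response = requests.post(API_URL, headers=headers, json=payload)
print(response.status_code, response.json())
# For this model the API just returns the "too large to be loaded automatically" error quoted above.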

@NERDDISCO
Thanks for the reply and it's great that you got a paid account as well.
Does it let you deploy the model and use it via an API? Just asking in case you noticed anything like that.

@adityasharma2695 yes, you can deploy your own Inference Endpoint and then use it via its API. But that has nothing to do with the Pro account; you can also do this without one.
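Once the endpoint is deployed, calling it looks roughly like this (the endpoint URL and token below are placeholders, and I'm assuming the default text-generation payload format):

import requests

# URL shown on the Inference Endpoint overview page after deployment (placeholder here)
ENDPOINT_URL = "https://YOUR-ENDPOINT.endpoints.huggingface.cloud"
headers = {"Authorization": "Bearer hf_xxx"}  # placeholder token

payload = {
    "inputs": "def print_hello_world():",
    "parameters": {"max_new_tokens": 256, "temperature": 0.2},
}
response = requests.post(ENDPOINT_URL, headers=headers, json=payload)
print(response.json())  # typically a list like [{"generated_text": "..."}]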

What I'm trying to do right now is to use quantisation to get a version of the model that runs on slower hardware. I will ping you once I get this working.
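The rough idea is to load the weights in 8-bit via bitsandbytes, something like this (just a sketch, assuming bitsandbytes and accelerate are installed and a GPU is available):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "HuggingFaceH4/starchat-alpha"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# load_in_8bit quantises the linear layers to int8, roughly halving the memory
# footprint compared to float16; device_map="auto" spreads layers across GPU/CPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,
    device_map="auto",
)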

Can anyone guide me on how to deploy this on Vertex AI?

I am getting an error like: ERROR 2023-05-31T07:38:18.376283832Z [resource.labels.taskName: workerpool0-0] File "main.py", line 8, in
{
"insertId": "1hjwmhtfm1rso4",
"jsonPayload": {
"message": " File "main.py", line 8, in \n",
"attrs": {
"tag": "workerpool0-0"
},
"levelname": "ERROR"
},
"resource": {
"type": "ml_job",
"labels": {
"job_id": "3056473000426602496",
"task_name": "workerpool0-0",
"project_id": "api-appexecutable-com"
}
},
"timestamp": "2023-05-31T07:38:18.376283832Z",
"severity": "ERROR",
"labels": {
"ml.googleapis.com/tpu_worker_id": "",
"compute.googleapis.com/resource_name": "cmle-training-18349684525105594269",
"ml.googleapis.com/trial_type": "",
"ml.googleapis.com/job_id/log_area": "root",
"ml.googleapis.com/trial_id": "",
"compute.googleapis.com/resource_id": "1504603656255391738",
"compute.googleapis.com/zone": "us-west1-b"
},
"logName": "projects/api-appexecutable-com/logs/workerpool0-0",
"receiveTimestamp": "2023-05-31T07:38:52.295831776Z"
}

My main.py is:

from flask import Flask, request, jsonify
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

app = Flask(__name__)

# Load the tokenizer and the 8-bit quantised model once at startup
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/starchat-alpha")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/starchat-alpha",
                                             load_in_8bit=True,
                                             device_map='auto',
                                             torch_dtype=torch.float16)

@app.route('/generate', methods=['POST'])
def generate():
    data = request.json
    input_prompt = data['prompt']
    # Build the StarChat dialogue prompt
    system_prompt = "<|system|>\nBelow is a conversation between a human user and a helpful AI coding assistant.<|end|>\n"
    user_prompt = f"<|user|>\n{input_prompt}<|end|>\n"
    assistant_prompt = "<|assistant|>"
    full_prompt = system_prompt + user_prompt + assistant_prompt
    inputs = tokenizer.encode(full_prompt, return_tensors="pt").to('cuda')
    outputs = model.generate(inputs,
                             eos_token_id=0,
                             pad_token_id=0,
                             max_length=256,
                             early_stopping=True)
    output = tokenizer.decode(outputs[0])
    # Strip the prompt and cut the reply at the end-of-turn token
    output = output[len(full_prompt):]
    if "<|end|>" in output:
        cutoff = output.find("<|end|>")
        output = output[:cutoff]
    return jsonify({'response': output})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
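For reference, calling this Flask route locally would look roughly like this (just a sketch, assuming the server is reachable on port 5000 as configured above):

import requests

# Send a prompt to the /generate route defined above
resp = requests.post(
    "http://localhost:5000/generate",
    json={"prompt": "Write a Python function that checks if a number is prime"},
)
print(resp.json()["response"])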

@ravineshraj I think you should open a new discussion, as this thread is about the hosting options on Hugging Face, not Vertex AI 🙏
