More logging + timeout if model is not loaded + url not needed
Hey @merve 🤗
Suggesting a couple of changes in how to use `huggingface_hub`. Feel free to ignore if you prefer the way it was:
1. Enable more logging using `logging.set_verbosity_info()` => this will help in case you need to debug something not working, typically if the endpoint is not available.
2. Add `timeout=60` to the client => by default, the `InferenceClient` will retry indefinitely until the model is ready. You might want to set a timeout here instead (see the sketch after this list).
3. When using the Inference API, there is no need to paste the full URL. Setting `model="meta-llama/Llama-2-7b-chat-hf"` is enough (pasting the full URL is not wrong, btw).
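
For reference, a minimal sketch of how (1) and (2) behave together. The `text_generation` call and the `InferenceTimeoutError` handling are illustrative, not part of this PR:

```python
import os

from huggingface_hub import InferenceClient, InferenceTimeoutError, logging

# verbose logs from huggingface_hub, e.g. endpoint availability checks
logging.set_verbosity_info()

client = InferenceClient(
    model="meta-llama/Llama-2-7b-chat-hf",  # model id is enough, no full URL needed
    timeout=60,  # stop waiting after 60s instead of retrying forever
    token=os.getenv("TOKEN"),
)

try:
    print(client.text_generation("Hello!", max_new_tokens=20))
except InferenceTimeoutError:
    # raised when the model is still not loaded/available after `timeout` seconds
    print("Model not ready within 60s, try again later.")
```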
app.py
CHANGED
```diff
@@ -1,12 +1,15 @@
 import gradio as gr
 from huggingface_hub import InferenceClient
+from huggingface_hub import logging
 import os
 
+logging.set_verbosity_info()
+
 token = os.getenv("TOKEN")
 endpoint = os.getenv("ENDPOINT")
 
 # initialize InferenceClient
-client = InferenceClient(model="
+client = InferenceClient(model="meta-llama/Llama-2-7b-chat-hf", timeout=60, token=token)
 
 # query client using streaming mode
 def inference(message, history):
```
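
The hunk cuts off at the function signature; a hedged sketch of what the streaming handler might look like (the actual body is not shown in this diff, and the generation parameters below are assumptions):

```python
# Hypothetical reconstruction -- not part of this PR's diff.
def inference(message, history):
    partial = ""
    # stream=True yields the generated text chunk by chunk
    for token in client.text_generation(message, max_new_tokens=512, stream=True):
        partial += token
        yield partial  # Gradio renders each partial response as it streams in

gr.ChatInterface(inference).launch()
```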