A: Inference Endpoints are currently available on AWS in us-east-1 (N. Virginia) & eu-west-1 (Ireland) and on Azure in eastus (Virginia). If you need to deploy in a different region, please let us know.
A: No, you cannot access the instance hosting your Endpoint. If you need more information or insights about the machine your Endpoint is running on, please contact us.
A: No. When creating a Private Endpoint (a Hugging Face Inference Endpoint linked to your VPC via AWS/Azure PrivateLink), you only see the elastic network interface (ENI) created in your VPC through which the Endpoint is reachable.
A: It depends on the Task. The supported Tasks use the transformers or sentence-transformers pipelines under the hood. If your Task's pipeline supports batching, e.g. Zero-Shot Classification, then batch inference is supported. In any case, you can always create your own inference handler and implement batching there.
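As a rough sketch, a custom handler is a `handler.py` exposing an `EndpointHandler` class with an `__init__(path)`/`__call__(data)` interface. The batching logic below is illustrative, and the stub predictor stands in for a real model load so the sketch stays self-contained:

```python
# Hypothetical handler.py sketch for a custom inference handler with batching.
from typing import Any, Dict, List


class EndpointHandler:
    def __init__(self, path: str = ""):
        # In a real handler you would load your model from `path`, e.g.
        #   self.pipeline = pipeline("zero-shot-classification", model=path)
        # Here a stub predictor keeps the sketch runnable without a model.
        self.predict = lambda texts: [{"label": "stub", "score": 1.0} for _ in texts]

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        inputs = data.get("inputs", [])
        # Accept a single string or a list of strings, then run one batched call
        # instead of one call per input.
        batch = [inputs] if isinstance(inputs, str) else list(inputs)
        return self.predict(batch)
```

A request payload of `{"inputs": ["text a", "text b"]}` would then be processed in a single forward pass rather than two.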
A: Endpoints are scaled automatically for you; the only information you need to provide is a min replica target and a max replica target. The system then scales your Endpoint based on load. Scaling to zero is currently not supported.
A: Yes, your Endpoint always stays available with at least the number of min replicas defined in the Advanced configuration.
A: Yes, you can deploy any repository from the Hugging Face Hub and if your task/model/framework is not supported out of the box, you can create your own inference handler and then deploy your model to an Endpoint.
A: The Endpoints are billed based on the compute hours of your Running Endpoints, and the associated instance types. We may add usage costs for load balancers and Private Links in the future.
A: Yes, data is encrypted during transit with TLS/SSL.
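As a minimal sketch of what this means in practice, requests to an Endpoint go to an `https://` URL, so any standard HTTP client negotiates TLS and the payload is encrypted in transit. The URL and token below are placeholders, not a real endpoint:

```python
# Sketch: building an authenticated HTTPS request to an Endpoint.
import json
import urllib.request


def build_request(endpoint_url: str, token: str, payload: dict) -> urllib.request.Request:
    # The https:// scheme means urllib negotiates TLS before sending the body.
    return urllib.request.Request(
        endpoint_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_request(
    "https://example.endpoints.huggingface.cloud",  # placeholder URL
    "hf_xxx",  # placeholder token
    {"inputs": "Hello"},
)
```

Sending the request (e.g. with `urllib.request.urlopen(req)`) is omitted here since the URL is a placeholder.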
A: There are several ways to reduce the latency of your Endpoint. One is to deploy your Endpoint in a region close to your application to reduce the network overhead. Another is to optimize your model using Hugging Face Optimum before creating your Endpoint. If you need help or have more questions about reducing latency, please contact us.
A: You can currently monitor your Endpoint through the 🤗 Inference Endpoints web application, where you have access to the Logs of your Endpoints as well as a metrics dashboard. If you need programmatic access or more information, please contact us.
A: Please contact us if you feel your model would do better on a different instance type than what is listed.
A: You can invalidate existing personal tokens and create new ones in your settings here: https://huggingface.co/settings/tokens. For organization tokens, go to the organization settings.
A: If your Endpoint uses a CPU accelerator, a new replica is added once the average CPU utilization across all replicas reaches 80%.
For GPU accelerators, a new replica is added once the average GPU utilization across all replicas, averaged over a 2-minute window, reaches 80%. An Endpoint can scale up at most once every 3 minutes.
Note that if we do not have enough resources available for your new replica, a new VM needs to be created, which generally takes 1 to 5 minutes.
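The scaling rule described above can be sketched as follows. The function names and structure are assumptions for illustration, not the actual autoscaler implementation:

```python
# Illustrative sketch: scale up when average utilization over the
# measurement window reaches 80%, adding at most one replica per event
# and never exceeding the configured max replica target.

SCALE_UP_THRESHOLD = 0.80


def should_scale_up(utilization_samples: list) -> bool:
    """True when average utilization across replicas hits the 80% threshold."""
    if not utilization_samples:
        return False
    return sum(utilization_samples) / len(utilization_samples) >= SCALE_UP_THRESHOLD


def next_replica_count(current: int, max_replicas: int, samples: list) -> int:
    # One replica is added per scaling event, capped at the max replica target.
    if should_scale_up(samples) and current < max_replicas:
        return current + 1
    return current
```

For example, two replicas at 90% and 75% utilization average to 82.5%, which crosses the threshold and triggers one additional replica.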