Multi-Model GPU Inference with Hugging Face Inference Endpoints

Multi-model Inference Endpoints let you deploy multiple models on the same infrastructure for scalable, cost-effective inference. A multi-model Inference Endpoint loads a list of models into memory, either CPU or GPU, and selects among them dynamically at inference time.

The following diagram shows how multi-model inference endpoints look.

[Diagram: multi-model Inference Endpoint]

This repository includes a sample custom EndpointHandler implementation for multi-model inference. The handler loads 5 different models:

  • DistilBERT model for sentiment-analysis
  • Marian model for translation
  • BART model for summarization
  • BERT model for token-classification
  • BERT model for text-classification
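
The loading and dispatch pattern described above can be sketched roughly as follows. This is a minimal illustration, not the handler shipped in this repository: only "facebook/bart-large-cnn" is named in this README, so the second model ID, the task names, and the optional "parameters" key are assumptions, and the pipeline_factory and device arguments are hypothetical additions that make the sketch easy to exercise without downloading any models.

```python
from typing import Any, Callable, Dict, Optional

# Illustrative model list. Only "facebook/bart-large-cnn" appears in this README;
# the other model ID and the task names are assumptions for this sketch.
MULTI_MODEL_LIST = [
    {"model_id": "distilbert-base-uncased-finetuned-sst-2-english", "task": "text-classification"},
    {"model_id": "facebook/bart-large-cnn", "task": "summarization"},
]


class EndpointHandler:
    def __init__(
        self,
        path: str = "",
        pipeline_factory: Optional[Callable[..., Any]] = None,  # hypothetical hook for testing
        device: int = -1,  # -1 runs on CPU; pass 0 to use the first GPU
    ):
        if pipeline_factory is None:
            # Imported lazily so the sketch can be tried without transformers installed.
            from transformers import pipeline
            pipeline_factory = pipeline
        # Load every model once at startup and keep it in memory, keyed by model ID.
        self.multi_model: Dict[str, Any] = {}
        for entry in MULTI_MODEL_LIST:
            self.multi_model[entry["model_id"]] = pipeline_factory(
                entry["task"], model=entry["model_id"], device=device
            )

    def __call__(self, data: Dict[str, Any]) -> Any:
        inputs = data.get("inputs")
        model_id = data.get("model_id")
        parameters = data.get("parameters") or {}  # assumed optional key
        # Pick the pipeline for this request, rejecting unknown model IDs.
        if model_id not in self.multi_model:
            raise ValueError(
                f"model_id {model_id!r} is not available; choose one of {sorted(self.multi_model)}"
            )
        return self.multi_model[model_id](inputs, **parameters)
```

Because every model stays resident in memory, the combined size of the models has to fit on the chosen instance; the request's model_id only decides which already-loaded pipeline runs.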

If you want to learn more about multi-model Inference Endpoints, check out https://www.philschmid.de/multi-model-inference-endpoints

Use with Inference Endpoints

Hugging Face Inference Endpoints can be used with an HTTP client in any language. We will use Python and the requests library to send our requests (make sure you have it installed: pip install requests).


Send requests with Python

import requests as r

ENDPOINT_URL = "" # URL of your endpoint
HF_TOKEN = "" # token of the account that deployed the endpoint

# define model and payload
model_id = "facebook/bart-large-cnn"
text = "The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building, and the tallest structure in Paris. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world, a title it held for 41 years until the Chrysler Building in New York City was finished in 1930. It was the first structure to reach a height of 300 metres. Due to the addition of a broadcasting aerial at the top of the tower in 1957, it is now taller than the Chrysler Building by 5.2 metres (17 ft). Excluding transmitters, the Eiffel Tower is the second tallest free-standing structure in France after the Millau Viaduct."
request_body = {"inputs": text, "model_id": model_id}

# HTTP headers for authorization
headers = {
    "Authorization": f"Bearer {HF_TOKEN}",
    "Content-Type": "application/json"
}

# send request and fail early on an HTTP error status
response = r.post(ENDPOINT_URL, headers=headers, json=request_body)
response.raise_for_status()
prediction = response.json()

# [{'summary_text': 'The tower is 324 metres (1,063 ft) tall, about the same height as an 81-storey building. Its base is square, measuring 125 metres (410 ft) on each side. During its construction, the Eiffel Tower surpassed the Washington Monument to become the tallest man-made structure in the world.'}]
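
Since the model is chosen per request through the model_id field, it can help to wrap the request in a small helper so the same endpoint can be queried for different models. The following is a sketch under assumptions: build_request_body and query are hypothetical helper names, and the optional "parameters" key is assumed to be forwarded by the handler rather than confirmed by this README.

```python
from typing import Any, Dict, Optional


def build_request_body(
    model_id: str, inputs: str, parameters: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Build the JSON payload; "model_id" tells the handler which loaded model to run."""
    body: Dict[str, Any] = {"inputs": inputs, "model_id": model_id}
    if parameters:
        body["parameters"] = parameters  # assumed optional key
    return body


def query(
    endpoint_url: str,
    token: str,
    model_id: str,
    inputs: str,
    parameters: Optional[Dict[str, Any]] = None,
    timeout: float = 30.0,
) -> Any:
    """Send one request to the endpoint and return the decoded JSON response."""
    # Imported here so the payload helper above can be used without requests installed.
    import requests

    headers = {
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    response = requests.post(
        endpoint_url,
        headers=headers,
        json=build_request_body(model_id, inputs, parameters),
        timeout=timeout,
    )
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    return response.json()
```

With this helper, switching models is a matter of changing one argument, for example query(ENDPOINT_URL, HF_TOKEN, "facebook/bart-large-cnn", text) for summarization, while the endpoint URL and token stay the same.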