Inference Endpoints

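Inference Endpoints, managed by Hugging Face, let you deploy models easily and securely. They are built on top of models hosted on the Hub. This page is a reference for the integration between huggingface_hub and Inference Endpoints; see the official documentation for more detailed information.

To learn how to manage Inference Endpoints programmatically with huggingface_hub, check out the related guide.

Inference Endpoints are accessible through an API. The endpoints are documented with Swagger, and the InferenceEndpoint class is a simple wrapper built on top of this API.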

Methods

The following Inference Endpoint features are implemented in HfApi:

get_inference_endpoint() and list_inference_endpoints() to get information about your Inference Endpoints.
create_inference_endpoint(), update_inference_endpoint() and delete_inference_endpoint() to deploy and manage Inference Endpoints.
pause_inference_endpoint() and resume_inference_endpoint() to pause and resume an Inference Endpoint.
scale_to_zero_inference_endpoint() to scale an Inference Endpoint down to 0 replicas.
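
As an illustration of the listing method above, here is a minimal sketch of a helper that collects the name and status of every visible endpoint. The helper name summarize_endpoints is hypothetical, and the api argument is assumed to behave like an HfApi instance; only list_inference_endpoints() and the name and status attributes are used.

```python
from typing import Any, List, Optional, Tuple


def summarize_endpoints(api: Any, namespace: Optional[str] = None) -> List[Tuple[str, Any]]:
    """Return (name, status) pairs, sorted by name.

    `api` is assumed to expose `list_inference_endpoints(namespace=...)`
    the way `huggingface_hub.HfApi` does. This helper is a hypothetical
    convenience wrapper, not part of the library.
    """
    endpoints = api.list_inference_endpoints(namespace=namespace)
    # Each item is expected to carry the `name` and `status` attributes
    # documented for InferenceEndpoint.
    return sorted((ep.name, ep.status) for ep in endpoints)
```

With the real library installed and a token saved locally, the call would presumably look like summarize_endpoints(HfApi()).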

InferenceEndpoint

The main dataclass is InferenceEndpoint. It holds information about a deployed InferenceEndpoint, including its configuration and current state. Once deployed, you can run inference on the endpoint using the InferenceEndpoint.client and InferenceEndpoint.async_client properties, which return an InferenceClient and an AsyncInferenceClient object respectively.

class huggingface_hub.InferenceEndpoint

( namespace: str, raw: Dict, _token: Union, _api: HfApi )

Parameters

name (str) — The unique name of the Inference Endpoint.
namespace (str) — The namespace where the Inference Endpoint is located.
repository (str) — The name of the model repository deployed on this Inference Endpoint.
status (InferenceEndpointStatus) — The current status of the Inference Endpoint.
url (str, optional) — The URL of the Inference Endpoint, if available. Only a deployed Inference Endpoint will have a URL.
framework (str) — The machine learning framework used for the model.
revision (str) — The specific model revision deployed on the Inference Endpoint.
task (str) — The task associated with the deployed model.
created_at (datetime.datetime) — The timestamp when the Inference Endpoint was created.
updated_at (datetime.datetime) — The timestamp of the last update of the Inference Endpoint.
type (InferenceEndpointType) — The type of the Inference Endpoint (public, protected, private).
raw (Dict) — The raw dictionary data returned from the API.
token (str or bool, optional) — Authentication token for the Inference Endpoint, if set when requesting the API. Will default to the locally saved token if not provided. Pass token=False if you don’t want to send your token to the server.

Contains information about a deployed Inference Endpoint.

Example:

>>> from huggingface_hub import get_inference_endpoint
>>> endpoint = get_inference_endpoint("my-text-to-image")
>>> endpoint
InferenceEndpoint(name='my-text-to-image', ...)

# Get status
>>> endpoint.status
'running'
>>> endpoint.url
'https://my-text-to-image.region.vendor.endpoints.huggingface.cloud'

# Run inference
>>> endpoint.client.text_to_image(...)

# Pause endpoint to save $$$
>>> endpoint.pause()

# ...
# Resume and wait for deployment
>>> endpoint.resume()
>>> endpoint.wait()
>>> endpoint.client.text_to_image(...)

from_raw

( raw: Dict, namespace: str, token: Union = None, api: Optional = None )

Initialize object from raw dictionary.

client

( ) → InferenceClient

Returns

InferenceClient

an inference client pointing to the deployed endpoint.

Raises

InferenceEndpointError

InferenceEndpointError — If the Inference Endpoint is not yet deployed.

Returns a client to make predictions on this Inference Endpoint.
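
Because client raises when the endpoint is not yet deployed, it can help to wait for deployment first. The following is a minimal sketch under that assumption; client_when_ready is a hypothetical helper name, the 300-second default is an arbitrary choice, and endpoint is assumed to behave like an InferenceEndpoint (only wait() and client are used).

```python
from typing import Any, Optional


def client_when_ready(endpoint: Any, timeout: Optional[int] = 300) -> Any:
    """Block until `endpoint` is deployed, then return its inference client.

    Hypothetical helper: `endpoint` is assumed to expose `wait(timeout=...)`
    and a `client` attribute, as documented for InferenceEndpoint.
    """
    # wait() refreshes the object in place; errors it raises (failure or
    # timeout) are deliberately left to propagate to the caller.
    endpoint.wait(timeout=timeout)
    return endpoint.client
```

The same pattern should apply to async_client, since it raises under the same condition.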

async_client

( ) → AsyncInferenceClient

Returns

AsyncInferenceClient

an asyncio-compatible inference client pointing to the deployed endpoint.

Raises

InferenceEndpointError

InferenceEndpointError — If the Inference Endpoint is not yet deployed.

Returns a client to make predictions on this Inference Endpoint.

delete

( )

Delete the Inference Endpoint.

This operation is not reversible. If you don’t want to be charged for an Inference Endpoint, it is preferable to pause it with InferenceEndpoint.pause() or scale it to zero with InferenceEndpoint.scale_to_zero().

This is an alias for HfApi.delete_inference_endpoint().

fetch

( ) → InferenceEndpoint

Returns

InferenceEndpoint

the same Inference Endpoint, mutated in place with the latest data.

Fetch latest information about the Inference Endpoint.

pause

( ) → InferenceEndpoint

Returns

InferenceEndpoint

the same Inference Endpoint, mutated in place with the latest data.

Pause the Inference Endpoint.

A paused Inference Endpoint will not be charged. It can be resumed at any time using InferenceEndpoint.resume(). This is different from scaling the Inference Endpoint to zero with InferenceEndpoint.scale_to_zero(), in which case the endpoint is restarted automatically when a request is made to it.

This is an alias for HfApi.pause_inference_endpoint(). The current object is mutated in place with the latest data from the server.

resume

( running_ok: bool = True ) → InferenceEndpoint

Parameters

running_ok (bool, optional) — If True, the method will not raise an error if the Inference Endpoint is already running. Defaults to True.

Returns

InferenceEndpoint

the same Inference Endpoint, mutated in place with the latest data.

Resume the Inference Endpoint.

This is an alias for HfApi.resume_inference_endpoint(). The current object is mutated in place with the latest data from the server.

scale_to_zero

( ) → InferenceEndpoint

Returns

InferenceEndpoint

the same Inference Endpoint, mutated in place with the latest data.

Scale Inference Endpoint to zero.

An Inference Endpoint scaled to zero will not be charged. It will be resumed on the next request to it, with a cold start delay. This is different from pausing the Inference Endpoint with InferenceEndpoint.pause(), which requires a manual resume with InferenceEndpoint.resume().

This is an alias for HfApi.scale_to_zero_inference_endpoint(). The current object is mutated in place with the latest data from the server.
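
The choice between pausing and scaling to zero comes down to whether the endpoint should wake up by itself on the next request. A minimal sketch of that decision follows; stop_billing and its wake_on_request flag are hypothetical names, and endpoint is assumed to behave like an InferenceEndpoint.

```python
from typing import Any


def stop_billing(endpoint: Any, wake_on_request: bool) -> Any:
    """Stop charges for `endpoint` without deleting it.

    Hypothetical helper built on the two documented methods:
    - scale_to_zero(): the endpoint restarts on the next request,
      at the cost of a cold start delay.
    - pause(): the endpoint stays down until resume() is called.
    """
    if wake_on_request:
        return endpoint.scale_to_zero()
    return endpoint.pause()
```

Both branches return the endpoint object itself, so the result can be inspected right away, for example by reading its status.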

update

( accelerator: Optional = None, instance_size: Optional = None, instance_type: Optional = None, min_replica: Optional = None, max_replica: Optional = None, repository: Optional = None, framework: Optional = None, revision: Optional = None, task: Optional = None ) → InferenceEndpoint

Parameters

accelerator (str, optional) — The hardware accelerator to be used for inference (e.g. "cpu").
instance_size (str, optional) — The size or type of the instance to be used for hosting the model (e.g. "x4").
instance_type (str, optional) — The cloud instance type where the Inference Endpoint will be deployed (e.g. "intel-icl").
min_replica (int, optional) — The minimum number of replicas (instances) to keep running for the Inference Endpoint.
max_replica (int, optional) — The maximum number of replicas (instances) to scale to for the Inference Endpoint.
repository (str, optional) — The name of the model repository associated with the Inference Endpoint (e.g. "gpt2").
framework (str, optional) — The machine learning framework used for the model (e.g. "custom").
revision (str, optional) — The specific model revision to deploy on the Inference Endpoint (e.g. "6c0e6080953db56375760c0471a8c5f2929baf11").
task (str, optional) — The task on which to deploy the model (e.g. "text-classification").

Returns

InferenceEndpoint

the same Inference Endpoint, mutated in place with the latest data.

Update the Inference Endpoint.

This method allows the update of either the compute configuration, the deployed model, or both. All arguments are optional but at least one must be provided.

This is an alias for HfApi.update_inference_endpoint(). The current object is mutated in place with the latest data from the server.
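
As an example of updating only the compute configuration, the sketch below changes the replica range and leaves the deployed model untouched. The helper name set_replica_range and its local validation are assumptions added for illustration; the library or the server may apply different or additional checks.

```python
from typing import Any


def set_replica_range(endpoint: Any, min_replica: int, max_replica: int) -> Any:
    """Update only the autoscaling bounds of `endpoint`.

    Hypothetical helper: the sanity check below is an assumption made
    for this sketch, not a rule taken from the library.
    """
    if min_replica < 0 or max_replica < min_replica:
        raise ValueError(
            f"invalid replica range: min={min_replica}, max={max_replica}"
        )
    # Only the two replica arguments are passed; the other update()
    # arguments keep their default of None.
    return endpoint.update(min_replica=min_replica, max_replica=max_replica)
```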

wait

( timeout: Optional = None, refresh_every: int = 5 ) → InferenceEndpoint

Parameters

timeout (int, optional) — The maximum time to wait for the Inference Endpoint to be deployed, in seconds. If None, will wait indefinitely.
refresh_every (int, optional) — The time to wait between each fetch of the Inference Endpoint status, in seconds. Defaults to 5s.

Returns

InferenceEndpoint

the same Inference Endpoint, mutated in place with the latest data.

Raises

InferenceEndpointError or InferenceEndpointTimeoutError

InferenceEndpointError — If the Inference Endpoint ended up in a failed state.
InferenceEndpointTimeoutError — If the Inference Endpoint is not deployed after timeout seconds.

Wait for the Inference Endpoint to be deployed.

Information from the server will be fetched every refresh_every seconds (5 by default). If the Inference Endpoint is not deployed after timeout seconds, an InferenceEndpointTimeoutError will be raised. The InferenceEndpoint will be mutated in place with the latest data.
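
The waiting behaviour described here can be pictured as a simple polling loop. The sketch below is illustrative only and is not the library's implementation: the exception classes and the function name are hypothetical stand-ins, the "failed" status string is an assumption ("running" is the value shown in the example above), and the sleep and clock parameters exist only to make the loop testable.

```python
import time
from typing import Any, Callable, Optional


class EndpointFailed(Exception):
    """Hypothetical stand-in for InferenceEndpointError."""


class EndpointTimeout(Exception):
    """Hypothetical stand-in for InferenceEndpointTimeoutError."""


def wait_until_running(
    endpoint: Any,
    timeout: Optional[float] = None,
    refresh_every: float = 5,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Any:
    """Poll `endpoint.fetch()` until the endpoint reports it is running."""
    deadline = None if timeout is None else clock() + timeout
    while True:
        endpoint.fetch()  # refreshes `endpoint` in place
        if endpoint.status == "running":
            return endpoint
        if endpoint.status == "failed":  # assumed status value
            raise EndpointFailed("endpoint ended up in a failed state")
        if deadline is not None and clock() >= deadline:
            raise EndpointTimeout(f"not deployed after {timeout} seconds")
        sleep(refresh_every)
```

In practice, calling endpoint.wait(timeout=...) directly is the simpler option; the loop only spells out the roles of timeout and refresh_every.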

InferenceEndpointStatus

class huggingface_hub.InferenceEndpointStatus

( value, names = None, module = None, qualname = None, type = None, start = 1 )

An enumeration.

InferenceEndpointType

class huggingface_hub.InferenceEndpointType

( value, names = None, module = None, qualname = None, type = None, start = 1 )

An enumeration.

InferenceEndpointError

class huggingface_hub.InferenceEndpointError

( )

Generic exception when dealing with Inference Endpoints.
