Deploy transformer models accelerated to 1 ms latency

Infinity is the containerized solution for delivering Transformers accuracy at 1ms latency.

Get instant access to test inference latency on your own data

Ultra-Fast Inference in Your Own Infrastructure

Achieve 1ms latency for BERT-like models

Plug and Predict

Infinity comes as a single container and can be deployed in any production environment. It can easily be scaled to thousands of requests per second using orchestration services like Kubernetes.

Unmatched Performance

Infinity achieves unmatched performance for state-of-the-art transformer models: 1ms latency for BERT-like models on GPU, and 4ms on CPU.

Enterprise Ready

Infinity meets the highest security requirements and can be integrated into your system, including air-gapped environments. You control your models, your data, and the traffic.

How Does It Work?

Up and running in 15 minutes

Infinity Container

Infinity Container is a hardware-optimized inference solution delivered as a container. Each Infinity Container is built to run optimally on a specific target hardware architecture and exposes an HTTP API to run inference. Currently supported tasks are document embedding, re-ranking, and sequence classification.
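
Since the container exposes an HTTP API, testing latency on your own data can be as simple as timing a series of POST requests. The sketch below is illustrative only: the URL, route, port, and the `{"inputs": ...}` payload shape are assumptions made for this example, not the documented Infinity API, so check the product documentation for the actual endpoint and schema.

```python
import json
import statistics
import time
import urllib.request

# Hypothetical endpoint: host, port, and route are placeholders for this
# sketch and are NOT taken from the Infinity documentation.
INFINITY_URL = "http://localhost:8080/predict"


def build_request(text, url=INFINITY_URL):
    """Build a JSON POST request carrying one input text.

    The {"inputs": ...} payload shape is an assumption for illustration.
    """
    body = json.dumps({"inputs": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=5.0):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def median_latency_ms(texts, sender=send):
    """Return the median round-trip time in milliseconds over all texts.

    `sender` is injectable so the timing loop can be exercised without
    a running container.
    """
    timings = []
    for text in texts:
        request = build_request(text)
        start = time.perf_counter()
        sender(request)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

Calling `median_latency_ms(your_texts)` against a running container would report client-side round-trip time, which includes network and serialization overhead on top of model inference, so expect it to be higher than the inference latency alone.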


Infinity Multiverse

Infinity Multiverse is a model optimization service delivered as a container, so you can optimize your models within your own environment for compatible target inference hardware. Supported architectures are BERT, BERT-Large, DistilBERT, RoBERTa, RoBERTa-Large, DistilRoBERTa, and MiniLM.


Customer Success Stories

Learn how leading AI teams use Infinity to increase agility, lower costs, and accelerate their Transformer pipelines

Infinity for Ecommerce


One of the world's largest e-commerce companies

Machine Learning Tasks

Feature extraction and ranking tasks

Inference Speed

2.2 ms per request, 10 times faster than before

Hugging Face helped us solve one of our major challenges: scalable and high-performing transformer models stable enough for production. We reached about 2.2 ms per request for feature extraction and 2.4 ms per ranking tasks on one GPU. 10 times faster than our results in the months before! It’s huge fun to cooperate with the people from Hugging Face!
Jens Dorn
Business Intelligence, Otto

Infinity for Conversations


World-leading providers of outsourced business solutions

Machine Learning Task

Call transcript classification

Inference Speed

4ms per request to classify conversations

At Moneypenny we look for practical ways to leverage the latest advances in AI, making our next conversation better than the last. With Infinity, we were able to automate call transcript classification easily, predicting the topic of a call with a high level of accuracy and in just four milliseconds per call! Infinity turned this model into an optimized inference solution, ready to deploy on our infrastructure. The whole process was extremely simple.
Pete Hanlon
CTO, Moneypenny

About Hugging Face

We are the creators of Transformers, the leading open source library for data scientists and machine learning engineers to explore state-of-the-art models and build machine learning features. We are on a mission to democratize AI, one commit at a time!

Request a Free Trial for 🤗 Infinity!

Transformers latency down to 1ms? 🤯

Get instant access to test inference latency on your own data.