Deploy transformer models at 1 ms latency
Infinity is the containerized solution for delivering Transformer accuracy at 1 ms latency.
Get instant access to test inference latency on your own data
Ultra-Fast Inference in Your Own Infrastructure
Achieve 1 ms latency for BERT-like models
Plug and Predict
- Infinity comes as a single container and can be deployed in any production environment. It can easily be scaled to thousands of requests per second using orchestration services like Kubernetes.
- Infinity achieves unmatched performance for state-of-the-art transformer models: 1 ms latency for BERT-like models on GPU, and 4 ms on CPU.
- Infinity meets the highest security requirements and can be integrated into your system, including air-gapped environments. You control your models, your data, and the traffic.
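To put the scaling claim in concrete terms, here is a rough back-of-the-envelope sketch of how many container replicas a given request rate might need. It rests on assumptions that are not from Infinity itself: each replica handles one request at a time (no batching or concurrency), the 1 ms GPU and 4 ms CPU figures hold under load, and you want to run at about 70% utilization. The function name `replicas_needed` and the 5,000 requests per second target are purely illustrative.

```python
import math


def replicas_needed(target_rps: float, latency_ms: float, utilization: float = 0.7) -> int:
    """Estimate the replica count for a target request rate.

    Assumes each replica serves requests strictly one after another,
    so its peak throughput is 1000 / latency_ms requests per second.
    `utilization` is the fraction of that peak we plan to use (an assumption).
    """
    per_replica_rps = 1000.0 / latency_ms
    return math.ceil(target_rps / (per_replica_rps * utilization))


# Illustrative target of 5,000 requests per second (hypothetical workload).
gpu_replicas = replicas_needed(5000, latency_ms=1.0)  # 1 ms per request on GPU
cpu_replicas = replicas_needed(5000, latency_ms=4.0)  # 4 ms per request on CPU
print(gpu_replicas, cpu_replicas)
```

Real capacity depends on batching, sequence length, and hardware, so treat the result as a starting point for load testing rather than a sizing guarantee.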
How Does It Work?
Up and running in 15 minutes
Infinity Container is a hardware-optimized inference solution delivered as a container. It is built specifically to run optimally on a target hardware architecture and exposes an HTTP API to run inference. The currently supported tasks are document embedding, re-ranking, and sequence classification.
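As an illustration of what calling such an HTTP API and measuring its latency might look like, here is a minimal Python sketch using only the standard library. The endpoint URL, the `/predict` path, and the `{"inputs": [...]}` payload shape are assumptions made for this example; this page does not specify the actual Infinity API schema. The helper names are hypothetical too.

```python
import json
import statistics
import time
import urllib.request
from typing import Callable, List

# Hypothetical endpoint: the real host, port, and path are not given on this page.
ENDPOINT = "http://localhost:8080/predict"


def build_request(texts: List[str], endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build a JSON POST request carrying the texts to run inference on.

    The {"inputs": [...]} body is an assumed schema, not the documented one.
    """
    body = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 5.0) -> bytes:
    """Send the request and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def median_latency_ms(call: Callable[[], object], runs: int = 100) -> float:
    """Run `call` `runs` times and return the median wall-clock latency in ms."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Example usage against a running container (requires a reachable endpoint):
#   request = build_request(["How fast is this model on my own data?"])
#   print(median_latency_ms(lambda: send(request)))
```

Measured this way, the latency includes HTTP and network overhead on top of model inference, so run the client close to the container when comparing against the quoted figures.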
Infinity Multiverse is a model optimization service delivered as a container, so you can optimize your models within your own environment for compatible target inference hardware. Supported architectures are BERT, BERT-Large, DistilBERT, RoBERTa, RoBERTa-large, DistilRoBERTa, and MiniLM.
Customer Success Stories
Learn how leading AI teams use Infinity to increase agility, lower costs, and accelerate their Transformer pipelines
Infinity for Ecommerce
One of the world's largest e-commerce companies
Feature extraction and ranking tasks
2.2 ms per request - 10 times faster than before
Hugging Face helped us solve one of our major challenges: scalable and high-performing transformer models stable enough for production. We reached about 2.2 ms per request for feature extraction and 2.4 ms per ranking task on one GPU. 10 times faster than our results in the months before! It’s huge fun to cooperate with the people from Hugging Face!
Infinity for Conversations
World-leading providers of outsourced business solutions
Call transcript classification
4 ms per request to classify conversations
At Moneypenny we look for practical ways to leverage the latest advances in AI, making our next conversation better than the last. With Infinity, we were able to automate call transcript classification easily, predicting the topic of a call with a high level of accuracy and in just four milliseconds per call! Infinity turned this model into an optimized inference solution, ready to deploy on our infrastructure. The whole process was extremely simple.
About Hugging Face
We are the creators of Transformers, the leading open-source library for data scientists and machine learning engineers to explore state-of-the-art models and build machine learning features. We are on a mission to democratize AI, one commit at a time!