AI Total Cost of Ownership Calculator: Evaluate the cost of in-house AI deployment vs AI APIs

Community Article Published September 20, 2023

As enterprises increasingly embrace AI, thanks in no small part to groundbreaking technologies like ChatGPT, it is becoming evident that Large Language Models (LLMs) have a broad range of applications that can significantly transform industries. The question is shifting from "What value will it bring?" to "How can we implement it now?" Should you deploy a model internally or use a SaaS offering? How can you assess the financial and operational implications of each option?

We introduce an AI Total Cost of Ownership (TCO) calculator to make it easier to compare the costs of both options. The demo is built with Gradio and is available on our Hugging Face Space.

The LLM “Build vs. Buy” Dilemma

SaaS solutions such as the OpenAI GPT3.5 and GPT4 APIs, Cohere, or Anyscale are often the go-to choice for integrating AI models. Because they fully manage the infrastructure complexities and provide easy-to-use APIs, they are extremely easy for developers to integrate.

However, open-source models such as Meta’s Llama 2 can be strong alternatives. Because they can be deployed internally, they avoid the privacy issues that come with relying on external SaaS AI providers. In theory, they are “free” in the sense that there is no fee for using them; however, they carry large hidden costs in the infrastructure and labor required to operate them.

Deploying an open-source model can involve a large initial cost that can be amortized through heavy usage over time, but knowing exactly when to choose one option over the other can be complex. That is why we introduce our AI Total Cost of Ownership (TCO) Calculator to aid in this decision-making process. The tool supports an in-depth analysis by evaluating the following:

  • the cost per request (as in what it costs to provide the service once to process a user’s input and generate an answer or perform a task)
  • the labor cost (as in what it costs to have an engineer deploy and supervise the running of the model)
  • and the total set-up TCO (total cost of getting the service up and running)

Our calculator is an open-source project, welcoming contributions from anyone in the AI community!

Using the TCO calculator: a real-life Banking Chatbot example

Imagine you're an AI project manager in a bank seeking to provide financial advice to customers through a Banking Chatbot. You want to provide a cost analysis on whether the bank should implement an on-premise solution or choose a cloud-based alternative. Here is how you do it.

  1. Select the use case that best fits your scenario


You can optionally customize the number of input and output tokens per request. Standard values are preset based on the selected use case.


  2. Choose two AI service options to compare


In our example, let’s compare OpenAI GPT4 and Llama2 70B.

To learn more about the parameters or customize them, click the information box.

Here, you can customize your labor cost.

  3. Click “Compute & Compare”


You’ll then receive a panel of information and results.


The table and the bar graph compare the cost per request, the labor cost, and the average latency. The last graph plots the TCO (in $) as a function of the number of requests your service receives per month.

In the Banking Chatbot example, the break-even point for in-house deployment is around 750,000 requests per month.

Assuming each client interacts with the chatbot 5 times per month, and each conversation involves about 5 requests, the break-even point is when 30,000 clients use the service. Beyond this point, the OpenAI GPT4 SaaS service costs more than the open-source Llama2 70B solution.
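The conversion from requests to clients is simple arithmetic. A minimal Python sketch using only the figures quoted above (a break-even of 750,000 requests per month, 5 conversations per client per month, about 5 requests per conversation):

```python
# Break-even volume read off the TCO graph for the Banking Chatbot example
BREAK_EVEN_REQUESTS_PER_MONTH = 750_000

# Usage assumptions from the example
conversations_per_client_per_month = 5
requests_per_conversation = 5

# Each client generates 5 * 5 = 25 requests per month
requests_per_client_per_month = conversations_per_client_per_month * requests_per_conversation
break_even_clients = BREAK_EVEN_REQUESTS_PER_MONTH / requests_per_client_per_month

print(f"{requests_per_client_per_month} requests per client per month")
print(f"Break-even at about {break_even_clients:,.0f} clients")
```

Changing either usage assumption moves the client count proportionally, while the break-even request volume itself stays the same.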

The economic viability of each option hinges on anticipated request volume.
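To make this dependence on volume concrete, here is a minimal sketch of the comparison, assuming the TCO model described later in this article: an infrastructure cost proportional to the request volume plus a fixed monthly labor cost, which is zero for SaaS. The class and function names are hypothetical, and the per-request costs are placeholders for illustration, not the calculator's values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ServiceOption:
    name: str
    cost_per_request: float    # $ per request
    monthly_labor_cost: float  # $ per month (0 for a SaaS API)

    def monthly_tco(self, requests_per_month: float) -> float:
        """Monthly TCO: infrastructure cost scales with volume, labor is fixed."""
        return self.cost_per_request * requests_per_month + self.monthly_labor_cost


def break_even_requests(saas: ServiceOption, self_hosted: ServiceOption) -> Optional[float]:
    """Monthly request volume above which self-hosting becomes cheaper.

    Returns None when self-hosting never pays off, i.e. when its cost per
    request is not lower than the SaaS cost per request.
    """
    savings_per_request = saas.cost_per_request - self_hosted.cost_per_request
    if savings_per_request <= 0:
        return None
    extra_fixed_cost = self_hosted.monthly_labor_cost - saas.monthly_labor_cost
    return extra_fixed_cost / savings_per_request


# Hypothetical numbers, for illustration only
saas = ServiceOption("SaaS API", cost_per_request=0.02, monthly_labor_cost=0.0)
self_hosted = ServiceOption("Self-hosted", cost_per_request=0.01, monthly_labor_cost=5_000.0)

n = break_even_requests(saas, self_hosted)
print(f"Break-even at {n:,.0f} requests/month")  # → Break-even at 500,000 requests/month
```

Below the break-even volume, the fixed labor cost dominates and the SaaS API is cheaper; above it, the lower per-request cost of self-hosting wins.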

Computations explained

This part of the article breaks down the calculations behind our cost modeling. Read this if you’re interested in the technical aspects of the calculator.

Assumptions

The AI TCO Calculator focuses only on deployment and running costs. Additional hardware maintenance costs and the costs of substantially re-engaging the workforce are not taken into account.

The total AI TCO is the sum of two main expenses: an infrastructure cost (hardware and software set-ups) and a labor cost (work done by engineers).

  • Infrastructure cost: the cost per request of the product.
  • Labor cost: an approximation of the average monthly AI engineer workload required to deploy and run the model.

You should, however, adjust the labor estimate to your team (for example, how you source your engineers and how available your AI engineers are).

For the models initially included in the calculator, we tried to choose ones with comparable performance on existing benchmarks, even though their results still vary.

TCO computation formula

The following formula represents the total AI set-up TCO for a month based on the monthly number of requests for the service:

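A minimal sketch of this formula, assuming (as in the assumptions above) that the infrastructure cost is the cost per request C_request multiplied by the monthly request volume N, and that C_labor is a fixed monthly labor cost (zero for SaaS). Setting the TCO of a SaaS API equal to that of a self-hosted model gives the break-even volume N*:

```latex
\mathrm{TCO}(N) = N \times C_{\text{request}} + C_{\text{labor}}

% Break-even between a SaaS API (C_labor = 0) and a self-hosted model
N^{*} = \frac{C_{\text{labor}}^{\text{self}}}{C_{\text{request}}^{\text{SaaS}} - C_{\text{request}}^{\text{self}}}
```

The break-even only exists when the self-hosted cost per request is lower than the SaaS one; otherwise the SaaS option is cheaper at every volume.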

Below is the formula used to compute the cost per request:

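Consistent with the description that follows, and assuming prices are quoted per 1,000 tokens, the cost per request can be sketched as follows, with T_in and T_out the input and output tokens per request and P_in and P_out the prices per 1k input and output tokens:

```latex
C_{\text{request}} = \frac{T_{\text{in}}}{1000} \times P_{\text{in}} + \frac{T_{\text{out}}}{1000} \times P_{\text{out}}
```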

The costs per 1000 input and output tokens depend on the infrastructure of the service you selected. The values are either set by the service provider or determined based on benchmark test results for a particular model.

Input and output tokens are contingent on the use case and can be adjusted, as explained in the example above.

For instance, consider evaluating the cost of a request for the Banking Chatbot using OpenAI GPT3.5 Turbo. OpenAI GPT3.5’s pricing is $0.0015 per 1k input tokens and $0.002 per 1k output tokens. Assuming 300 input tokens and 300 output tokens:

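The same computation in Python, as a minimal sketch: the helper name is ours, and the prices and token counts are the GPT3.5 Turbo figures quoted above.

```python
def cost_per_request(input_tokens: int, output_tokens: int,
                     price_per_1k_input: float, price_per_1k_output: float) -> float:
    """Cost in $ of one request, given prices per 1k input and output tokens."""
    return (input_tokens / 1000) * price_per_1k_input + (output_tokens / 1000) * price_per_1k_output


# GPT3.5 Turbo: $0.0015 per 1k input tokens, $0.002 per 1k output tokens
# Banking Chatbot: 300 input tokens and 300 output tokens per request
cost = cost_per_request(300, 300, 0.0015, 0.002)
print(f"${cost:.5f} per request")  # → $0.00105 per request
```

That is 0.3 × $0.0015 + 0.3 × $0.002 = $0.00045 + $0.0006, or about $1.05 per 1,000 requests.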

First approach: Deploy yourself

Infrastructure cost

Deploying an AI service with an open-source model demands a specific infrastructure.

Running large-scale models like Llama2 70B or Falcon 40B efficiently requires powerful virtual machines (VMs), often equipped with top-end GPUs. Azure, for instance, rents VMs with 40GB A100 GPUs for $27.197 per hour.

Based on this, the cost per token is determined by the formula:

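A minimal sketch of this formula, assuming the VM is billed at an hourly price P_hour (in $) and processes R tokens per second when running at full load:

```latex
C_{\text{token}} = \frac{P_{\text{hour}}}{3600 \times R}
```

Multiplying by 1,000 gives a cost per 1k tokens that can be plugged into the cost-per-request formula.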

The number of tokens processed per second depends on the percentage of time the GPUs are fully utilized, which varies over the day. For instance, during a morning demand peak, when users are most active, the GPUs are fully utilized and can take full advantage of large batch sizes. At 2 am, on the contrary, the GPUs may be underutilized but still cost the same, which greatly increases the cost per token.


Renting a GPU and using only a fraction of its capacity increases costs per task performed, making each request more expensive.
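The effect of utilization can be sketched in a few lines of Python. This assumes, as a simplification, that the fraction of billed time the GPUs spend at full throughput acts as a plain divisor on the number of tokens produced per hour; the calculator's exact model may differ. The $27.197 hourly price is the Azure figure quoted above, while the throughput is a hypothetical placeholder.

```python
SECONDS_PER_HOUR = 3600


def cost_per_1k_tokens(vm_price_per_hour: float, tokens_per_second: float,
                       utilization: float) -> float:
    """$ per 1k tokens for a rented VM.

    utilization is the fraction of billed time during which the GPUs run at
    full throughput (1.0 = fully maxed out). The VM is paid for every hour
    regardless, so lower utilization spreads the same price over fewer tokens.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR * utilization
    return vm_price_per_hour / tokens_per_hour * 1000


VM_PRICE_PER_HOUR = 27.197   # Azure price quoted above
TOKENS_PER_SECOND = 1_000    # hypothetical full-load throughput

for u in (1.0, 0.5, 0.1):
    price = cost_per_1k_tokens(VM_PRICE_PER_HOUR, TOKENS_PER_SECOND, u)
    print(f"utilization {u:.0%}: ${price:.4f} per 1k tokens")
```

Under this model, halving utilization doubles the cost per token, and running at 10% utilization makes every token ten times more expensive.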

For the “Deploy yourself” service with Llama2 70B, we took the input and output costs per token from existing benchmark tests. These tests were run on two 80GB A100 GPUs, and the costs were computed with the same formula as above, with the maxed-out percentage (the share of time the GPUs are fully utilized) added to our cost model.

Labor cost

Setting up this service requires hands-on work from one or two engineers specialized in AI. We estimated their labor cost at $5,000/month: one third of an AI engineer’s $180,000 annual cost, averaged per month ($180,000 / 3 / 12 = $5,000). This expense may vary with the service’s scale and the deploying company’s team composition.

Second approach: SaaS

Infrastructure cost

With SaaS, the service provider handles all infrastructure aspects, charging a usage fee. You can check pricing details on the websites of companies like OpenAI or Cohere, for example.

You can still select a few parameters, such as the context size or fine-tuning options, which affect the product’s pricing.

Labor cost

A SaaS solution is already operational and requires no extra effort for deployment, so there won't be any associated labor costs.

Limitations of our cost modeling

Our simplified approach omits accuracy from the calculations. A more comprehensive cost model would account for the cost per request relative to accuracy. For instance, while OpenAI’s GPT4 excels in accuracy, it might not be the most cost-effective choice in the calculator.

We didn’t consider the fine-tuning needs of specific use cases, which can affect overall AI service costs. Fine-tuning adds technical, compute-intensive work up front, which increases both infrastructure and labor costs. For instance, getting Llama2 70B as accurate as GPT4 would require extensive fine-tuning.

Lastly, privacy, a crucial criterion, isn’t addressed. We may later add privacy-focused models to the calculator. Note that SaaS solutions inherently offer less privacy, since you have to trust the provider with your data.

Contribute with your own AI model service

If you want to add your AI model service to the AI TCO Calculator’s choices, you can follow our “how to contribute” tutorial.

Please note that you must know the values for your service's costs per input and output tokens before you start.

Conclusion

Choosing the right AI deployment solution can be more complex than expected. As we have seen, AI APIs are quite competitive once we factor in all the costs of deploying in-house LLMs, such as the cost per request and the cost of labor.

Many factors have to be taken into account when computing the actual cost, and we hope that with this calculator, it will be easier for you to evaluate each option.