Why does local inference differ from the API?

#38
by davidefiocco - opened

I am computing Jina v2 embeddings via the transformers Python libraries and via the API (see https://jina.ai/embeddings/).

With transformers, I can run code along the lines of the model card:

from transformers import AutoModel

sentences = ['How is the weather today?']

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
embeddings_1 = model.encode(sentences)

or

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en')
embeddings_2 = model.encode(sentences)

and the resulting embeddings_1 and embeddings_2 match.
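To check that they match, here is a quick sketch using numpy (the tolerance is an arbitrary choice):

import numpy as np

# both encode() calls return numpy arrays, so an elementwise comparison works directly
print(np.allclose(embeddings_1, embeddings_2, atol=1e-6))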

However, if I use the Jina API, e.g. via

import requests

url = 'https://api.jina.ai/v1/embeddings'

headers = {
  'Content-Type': 'application/json',
  'Authorization': 'Bearer jina_123456...' # visit https://jina.ai/embeddings/ for an API key
}

data = {
  'input': sentences,
  'model': 'jina-embeddings-v2-base-en' # note that the model name matches
}

response = requests.post(url, headers=headers, json=data)
embeddings_3 = response.json()["data"][0]["embedding"]

embeddings_3 differs from the other two arrays by a small amount, around 2e-4 in absolute value on average. I see this discrepancy with both CPU and GPU runtimes.

What am I doing wrong? I also posted this very question on https://stackoverflow.com/questions/77875253/why-does-local-inference-differ-from-the-api-when-computing-jina-embeddings

Jina AI org

Good catch. In our API we run the model forward pass in half precision (fp16) for higher cost-efficiency. I believe this is the source of the inconsistency.
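For intuition, here is a minimal sketch (independent of our stack) of the rounding error fp16 introduces:

import numpy as np

# fp16 keeps ~11 bits of mantissa, i.e. a relative rounding error of up to ~4.9e-4
print(np.finfo(np.float16).eps)  # 0.000977

# rounding a unit-norm 768-dim vector to fp16 perturbs each component by ~1e-5 on average;
# applying the same rounding at every layer of the forward pass compounds the error,
# so an end-to-end difference around 2e-4 is consistent with half precision
x = np.random.randn(768).astype(np.float32)
x /= np.linalg.norm(x)
print(np.abs(x - x.astype(np.float16).astype(np.float32)).mean())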

Hi @numb3r3 , thanks for your prompt reply, that would explain it!

Just to be 100% sure, is there a way to prove this? E.g., is there code I can run with the available model weights to match the API results exactly?

Having half-precision models running locally would be neat, as we'd get

  1. a local equivalent of the Jina API
  2. the same performance/efficiency gains that you are enjoying at Jina

Hopefully it's not a cheeky ask :)

Jina AI org

You can easily enjoy the fp16 optimization by running:

from transformers import AutoModel

sentences = ['How is the weather today?']

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True,  dtype=torch.float16)
embeddings_1 = model.encode(sentences)

By the way, if you are interested in a private deployment of our API with all related optimizations such as fp16 support: we also offer Jina AI through AWS SageMaker, which allows for optimized, on-premises deployment of our 8k embedding models. You can find more details here: https://jina.ai/news/jina-ai-8k-embedding-models-hit-aws-marketplace-for-on-prem-deployment/

Hey, thanks again (and sorry for cross-posting on Stack Overflow; I'll make sure the conclusion of this discussion ends up there as well).

I tried this following your advice:

from transformers import AutoModel
import torch

sentences = ['How is the weather today?']

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, token="hf_my_authorized_token", torch_dtype=torch.float16)
embeddings_1 = model.encode(sentences)

i.e. slightly modifying the from_pretrained call because dtype (as in your original response) is not an expected kwarg for JinaBertModel.__init__().

However, the snippet above throws a RuntimeError: "LayerNormKernelImpl" not implemented for 'Half', which is the same error I get with

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, token="hf_my_authorized_token")
model.half()
embeddings_1 = model.encode(sentences)

using transformers==4.35.2. So this approach doesn't seem to work for me :/
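(For what it's worth, the error seems to come from running the fp16 model on CPU: PyTorch implements half-precision LayerNorm only on CUDA. A minimal sketch that reproduces the same RuntimeError without the model:)

import torch

# PyTorch has no fp16 LayerNorm kernel on CPU, so this raises
# RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'
layer_norm = torch.nn.LayerNorm(8).half()
layer_norm(torch.randn(8, dtype=torch.float16))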

Thanks a ton also for the AWS suggestion, it's not the cloud provider I am using atm but it's good to know!

Jina AI org
edited Jan 31

Hello,

To use fp16 precision with CUDA for this model, can you try the following snippet?

from transformers import AutoModel
import torch

sentences = ['How is the weather today?']

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, attn_implementation='torch', torch_dtype=torch.float16, device_map='cuda')
embeddings_1 = model.encode(sentences)
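To confirm the model actually ends up in fp16 on the GPU, a quick check (a sketch using plain PyTorch):

# should print torch.float16 cuda:0
print(next(model.parameters()).dtype, next(model.parameters()).device)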

Let me know if this works for you or if there's anything else I can assist with!

Hi @ziniuyu , thank you!

Indeed, your snippet runs without problems (it just requires pip install accelerate), and it's great to be able to run at half precision!

Still,

from transformers import AutoModel
import torch
sentences = ['How is the weather today?']
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, attn_implementation='torch', torch_dtype=torch.float16, device_map='cuda')
embeddings_1 = model.encode(sentences)

will give me embeddings that do not match what I get from the same model via https://api.jina.ai/v1/embeddings (you can try that yourself). The difference is again on the order of 2e-4: at half precision I get embeddings that are slightly different from full precision, but the numbers are still not the same as the API's. So I couldn't verify whether @numb3r3 's explanation for the discrepancy is the right one, and I wouldn't close this yet.
Is the API using a slightly different checkpoint/version maybe?

Jina AI org

@davidefiocco I don't think so; we're using the same model, at least for jina-v2.

Hi @bwang0911 , thanks again for your answer. Here's a reproducible example:

from transformers import AutoModel
import torch
import requests
import numpy as np

sentences = ['How is the weather today?']

url = 'https://api.jina.ai/v1/embeddings'

headers = {
  'Content-Type': 'application/json',
  'Authorization': 'Bearer jina_123456...' # change it here
}

data = {
  'input': sentences,
  'model': 'jina-embeddings-v2-base-en' # note that the model name matches that of the HF model below
}

response = requests.post(url, headers=headers, json=data)
embeddings_api = response.json()["data"][0]["embedding"]

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True, attn_implementation='torch', torch_dtype=torch.float16, device_map='cuda')
embeddings_hf = model.encode(sentences)

print(np.abs(embeddings_api - embeddings_hf).mean())

The output of the average absolute value of the difference is around 0.0003, close but still not as close as I would have expected.
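For completeness, the angular difference between the two vectors can be checked as well; if the cosine similarity is ≈ 1, the fp16 discrepancy should be negligible for similarity search (a quick sketch reusing the arrays above):

# cosine similarity between the API embedding and the local fp16 embedding
a = np.asarray(embeddings_api)
b = embeddings_hf[0]
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))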

from sentence_transformers import SentenceTransformer

sentences = 'This framework generates embeddings for each sentence'

model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en')
embeddings = model.encode(sentences)

print(embeddings)

[Screenshot: a warning message printed when loading the model]

Why is this message being displayed? Do I need to train the model to use it, or is it already trained to convert the text into vectors?

Jina AI org

@shivpatel117

# use the latest sentence-transformers: pip install -U sentence-transformers
model = SentenceTransformer('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

It works. @bwang0911 Appreciate the help!
