Potential quality loss on short input? (quantized onnx version)

#10
by do-me - opened

Great models!

I just quantized the base and small models for ONNX usage, e.g. in transformers.js:

I use these models in SemanticFinder for semantic similarity; to the right you can see the similarity score.

While the small one yields the expected good results, i.e. food-related chunks are most similar...

[screenshot: small model similarity results]

... the base one for some reason returns completely random, unrelated results.

[screenshot: base model similarity results]

I just used the standard ONNX conversion script from the transformers.js repo with default settings. Is there maybe some setting that needs tweaking?
I also noticed that if I use longer chunks (e.g. 400 instead of 40 chars), the model behaves as expected again.

In any case it would be cool if you could add the onnx versions to this repo too :)

Hrmmm, that is weird. Maybe the model was converted incorrectly?
Can you try the ONNX weights we uploaded in fp32?

If it still shows this weird behavior for you, maybe it is the int8 quantization.
Though I can't think of a reason why quantization would make the model worse on short sequences (when it does fine on longer ones).
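
(For reference, a quick way to check whether int8 is the culprit is to embed the same short strings with both the fp32 and the quantized ONNX files and compare the cosine similarities directly. The sketch below is just an illustration: it assumes the exported graph takes input_ids/attention_mask and returns last_hidden_state, and the file paths are placeholders.)

import numpy as np
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)

def embed(session, texts):
    enc = tokenizer(texts, padding=True, return_tensors='np')
    # only feed the inputs the graph actually declares
    feed = {k: v for k, v in enc.items() if k in {i.name for i in session.get_inputs()}}
    hidden = session.run(['last_hidden_state'], feed)[0]        # (batch, seq, dim)
    mask = enc['attention_mask'][..., None]                     # (batch, seq, 1)
    return (hidden * mask).sum(axis=1) / mask.sum(axis=1)       # mean pooling

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

texts = ['I love pasta', 'What is the weather like today?']
for path in ['model_fp32.onnx', 'model_quantized.onnx']:        # placeholder paths
    emb = embed(InferenceSession(path), texts)
    print(path, cos_sim(emb[0], emb[1]))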

Your fp32 weights work just fine on short sequences, as expected. You can test it yourself if you like with do-me/test:
[screenshot: fp32 test results]

However, taking a closer look at the quantized versions again, even on longer sequences the results seem a bit arbitrary (though better overall). I'm pretty sure something went wrong with the conversion. I will head over to transformers.js and ask the experts there. See the issue here.

Hi all - creator of Transformers.js here. I just wanted to ask the authors how they converted the weights to ONNX. When I try to do it with optimum, I get the following error:

torch.onnx.errors.UnsupportedOperatorError: Exporting the operator 'aten::scaled_dot_product_attention' to ONNX opset version 11 is not supported. Please feel free to request support or submit a pull request on PyTorch GitHub: https://github.com/pytorch/pytorch/issues.

However, inspecting your model in Netron suggests it's also using opset 11.

I will upload a Transformers.js-compatible version using your model (w/ quantization) shortly, but I would love to be able to convert it myself too!

And the quantized versions seem to perform relatively well too:

// npm i @xenova/transformers
import { pipeline, cos_sim } from '@xenova/transformers';

// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/jina-embeddings-v2-base-en',
    // { quantized: false } // <-- uncomment to use the unquantized versions
);

// Generate embeddings
const output = await extractor(
    ['How is the weather today?', 'What is the current weather like today?'],
    { pooling: 'mean' }
);

// Compute cosine similarity
console.log(cos_sim(output[0].data, output[1].data)); 
// quantized (0.9022937687830741) vs unquantized (0.9341313949712492)

Works like a charm now; I just indexed the whole Bible, once with 200-word chunks and once with 4000-character chunks - both work great, also with short search queries! :)
https://twitter.com/DomeGIS/status/1717999491427495958

Glad you like the model!

We made 2 modifications to fix 2 issues with the ONNX export:

  1. scaled_dot_product_attention (SDPA)
    As you have observed, ONNX opset 11 does not contain scaled_dot_product_attention. For our opset-11 export, we disabled the scaled_dot_product_attention path by setting attn_implementation=None in the model config. This causes the tracer to go through the standard Python implementation of the SDPA operation, which can be represented in opset 11.

  2. Protobuf serialization limit (dynamic ALiBi tensor allocation)
    Once you solve issue 1, you will encounter a second issue:

[libprotobuf ERROR /Users/runner/work/pytorch/pytorch/pytorch/third_party/protobuf/src/google/protobuf/message_lite.cc:457] onnx_torch.ModelProto exceeded maximum protobuf size of 2GB: 3768614404

ONNX uses protobuf, which has a serialized file size limit of 2 GB. The limit is hit because of the temporary ALiBi buffer we allocate at model creation (to save some malloc operations and to prevent memory fragmentation).
To reduce the file size, we now allocate the ALiBi tensor on the fly instead.

The required code changes are in this PR.
With these changes, we are able to export the model using the optimum exporter:

pip install 'optimum[exporters]'
optimum-cli export onnx --model jina-embeddings-v2-base-en-airgap --task feature-extraction --trust-remote-code --opset 11 jina_onnx
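
For anyone who wants to reproduce the export without optimum-cli, a rough torch.onnx.export equivalent is sketched below. The attn_implementation=None trick is the one described in point 1; the input/output names, dynamic axes, and return_dict handling are assumptions, not the exact settings used for the published weights.

import torch
from transformers import AutoConfig, AutoModel, AutoTokenizer

model_id = 'jinaai/jina-embeddings-v2-base-en'
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.attn_implementation = None   # avoid aten::scaled_dot_product_attention (see point 1)
config.return_dict = False          # return tuples, which the ONNX exporter handles better

model = AutoModel.from_pretrained(model_id, config=config, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
dummy = tokenizer('hello world', return_tensors='pt')

torch.onnx.export(
    model,
    (dummy['input_ids'], dummy['attention_mask']),
    'model.onnx',
    opset_version=11,
    input_names=['input_ids', 'attention_mask'],
    output_names=['last_hidden_state'],
    dynamic_axes={
        'input_ids': {0: 'batch', 1: 'sequence'},
        'attention_mask': {0: 'batch', 1: 'sequence'},
        'last_hidden_state': {0: 'batch', 1: 'sequence'},
    },
)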

scaled_dot_product_attention was actually added in ONNX opset 14. I have exported the models with scaled_dot_product_attention using opset 14 in a PR.
Can you help test whether it provides any speed improvement?

Small: https://huggingface.co/jinaai/jina-embeddings-v2-small-en/resolve/e79175ccfa7251b6bd7eb5bbbeaabe93c44b417f/model.onnx
Base: https://huggingface.co/jinaai/jina-embeddings-v2-base-en/resolve/d78362089ac649604c57978b78184611f2e72772/model.onnx
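
A simple way to measure any speed difference is to time session.run on both exports with identical inputs. A rough sketch (the local file names are placeholders for the opset-11 and opset-14 models):

import time
from onnxruntime import InferenceSession
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
batch = tokenizer(['What is the current weather like today?'] * 32,
                  padding='max_length', max_length=128, return_tensors='np')

for path in ['model_opset11.onnx', 'model_opset14.onnx']:   # placeholder file names
    session = InferenceSession(path)
    feed = {k: v for k, v in batch.items() if k in {i.name for i in session.get_inputs()}}
    session.run(None, feed)                                  # warm-up
    start = time.perf_counter()
    for _ in range(20):
        session.run(None, feed)
    print(f'{path}: {(time.perf_counter() - start) / 20 * 1000:.1f} ms per batch')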

Thank you for looking into this. Sure, we can help with testing!
@Xenova, if I understand correctly, the models should work even better with these modifications, so could you quantize them again?

Meanwhile, I added your unquantized versions in case you want to try them. Both Xenova's first quantized versions and your current unquantized ones, @Jackmin108, are in my test repos below:
[screenshot: test repo comparison]

Hope this helps in some way. From what I see, the new unquantized models are now even faster for indexing than the quantized ones!
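
(Side note: if the new exports need to be re-quantized, dynamic int8 quantization with onnxruntime is essentially a one-liner. The paths below are placeholders, and the official Transformers.js conversion script may use different quantization settings.)

from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input='model.onnx',             # fp32 export
    model_output='model_quantized.onnx',  # int8 output
    weight_type=QuantType.QInt8,          # quantize weights to signed int8
)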

@Jackmin108 can you provide an example of using these onnx weights you've contributed to compute embeddings, please? :)


Hi @nateraw, please check the following code:

from transformers import AutoTokenizer
from onnxruntime import InferenceSession

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-en', trust_remote_code=True)
session = InferenceSession('YOUR-FOLDER/jina_embeddings_v2_small_onnx_w_mean_pooling/model.onnx')

def embed_onnx(text):
    # ONNX Runtime expects NumPy arrays as input
    inputs = tokenizer(text, return_tensors="np")
    outputs = session.run(output_names=["last_hidden_state"], input_feed=dict(inputs))
    return outputs

print(embed_onnx('hello world'))
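
If the export does not already bake the pooling into the graph, a small follow-up (reusing the tokenizer and session from above) can turn last_hidden_state into one vector per input via attention-mask-aware mean pooling. The helper below is just a sketch, mirroring the pooling: 'mean' option from the Transformers.js snippet earlier:

import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    # mask out padding tokens, then average over the sequence dimension
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq, 1)
    return (last_hidden_state * mask).sum(axis=1) / mask.sum(axis=1)  # (batch, dim)

enc = tokenizer('hello world', return_tensors='np')
last_hidden_state = session.run(['last_hidden_state'], dict(enc))[0]
print(mean_pool(last_hidden_state, enc['attention_mask']).shape)      # (1, hidden_size)
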
bwang0911 changed discussion status to closed
