---
tags:
- FlagEmbedding
- Embedding
- Hybrid Retrieval
- ONNX
- Optimum
- ONNXRuntime
- Multilingual
license: mit
base_model: BAAI/bge-m3
---
# Model Card for philipchung/bge-m3-onnx
This is `BAAI/bge-m3` converted to ONNX format for inference with Optimum ONNX Runtime and CPU acceleration. The model outputs all three BGE-M3 embedding types (dense, sparse, ColBERT).

No ONNX optimizations are applied to this model. If you want an optimized model, use the export script included in this repo to generate an optimized version of the ONNX model.

Some of the code is adapted from `aapot/bge-m3-onnx`. The model in this repo inherits from `PreTrainedModel`, so the ONNX model can be downloaded from the Hugging Face Hub and loaded directly with the `from_pretrained()` method.
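For reference, Optimum also exposes graph optimization through `ORTOptimizer`. The sketch below is an assumption about how that could be applied to this checkpoint, not the repo's export script; whether `ORTOptimizer` accepts an `ORTModelForCustomTasks` instance is untested here, and the `bge-m3-onnx-optimized` output directory is illustrative.

```python
# Hedged sketch: applying ONNX Runtime graph optimizations via Optimum.
# Assumes ORTOptimizer accepts this custom-task model; the repo's export
# script remains the supported way to produce an optimized model.
from optimum.onnxruntime import ORTModelForCustomTasks, ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

model = ORTModelForCustomTasks.from_pretrained("philipchung/bge-m3-onnx")
optimizer = ORTOptimizer.from_pretrained(model)
# optimization_level=1 applies basic, model-agnostic graph optimizations.
config = OptimizationConfig(optimization_level=1)
optimizer.optimize(save_dir="bge-m3-onnx-optimized", optimization_config=config)
```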
## How to Use
```python
from collections import defaultdict

import numpy as np
from optimum.onnxruntime import ORTModelForCustomTasks
from transformers import AutoTokenizer

# Download ONNX model from the Hugging Face Hub
onnx_model = ORTModelForCustomTasks.from_pretrained("philipchung/bge-m3-onnx")
tokenizer = AutoTokenizer.from_pretrained("philipchung/bge-m3-onnx")

# Inference forward pass
sentences = ["First test sentence.", "Second test sentence."]
inputs = tokenizer(
    sentences,
    padding="longest",
    return_tensors="np",
)
outputs = onnx_model.forward(**inputs)
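# NOTE: `outputs` is a dict with one entry per embedding type: "dense_vecs"
# of shape (batch, 1024), "sparse_vecs" with per-token weights of shape
# (batch, seq_len, 1), and "colbert_vecs" of shape (batch, seq_len, 1024).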

def process_token_weights(
    token_weights: np.ndarray, input_ids: list
) -> defaultdict[str, float]:
    """Convert sparse token weights into a dictionary of token indices and corresponding weights.

    Function is adapted from the original FlagEmbedding.bge_m3.BGEM3FlagModel, where
    _process_token_weights() is defined within the encode() method.
    """
    # convert to dict
    result = defaultdict(int)
    # Special tokens carry no lexical weight and are excluded
    unused_tokens = {
        tokenizer.cls_token_id,
        tokenizer.eos_token_id,
        tokenizer.pad_token_id,
        tokenizer.unk_token_id,
    }
    for w, idx in zip(token_weights, input_ids, strict=False):
        if idx not in unused_tokens and w > 0:
            idx = str(idx)
            # Keep the maximum weight if a token appears more than once
            if w > result[idx]:
                result[idx] = w
    return result

# Each sentence yields a dict[str, list[float] | dict[str, float] | list[list[float]]]
# holding its dense, sparse, and colbert embeddings.
embeddings_list = []
for input_ids, dense_vec, sparse_vec, colbert_vec in zip(
    inputs["input_ids"],
    outputs["dense_vecs"],
    outputs["sparse_vecs"],
    outputs["colbert_vecs"],
    strict=False,
):
    # Convert token weights into dictionary of token indices and corresponding weights
    token_weights = sparse_vec.astype(float).squeeze(-1)
    sparse_embeddings = process_token_weights(
        token_weights,
        input_ids.tolist(),
    )
    multivector_embedding = {
        "dense": dense_vec.astype(float).tolist(),  # (1024,)
        "sparse": dict(sparse_embeddings),  # dict[token_index, weight]
        "colbert": colbert_vec.astype(float).tolist(),  # (token_len, 1024)
    }
    embeddings_list.append(multivector_embedding)
```
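Once `embeddings_list` is built, each of the three representations can produce a similarity score. The sketch below (not part of the original card) scores the two test sentences against each other; it follows the usual conventions for these embedding types (cosine for dense, weight overlap on shared tokens for sparse, MaxSim for ColBERT) and reuses `tokenizer` from above, but treat it as an approximation of FlagEmbedding's own scoring rather than a drop-in replacement.

```python
# Hedged sketch: turning the multi-vector outputs into similarity scores.
# Assumes `embeddings_list` and `tokenizer` from the snippet above.
query, passage = embeddings_list[0], embeddings_list[1]

# Dense: cosine similarity between the two 1024-dim sentence vectors.
q_dense = np.asarray(query["dense"])
p_dense = np.asarray(passage["dense"])
dense_score = q_dense @ p_dense / (np.linalg.norm(q_dense) * np.linalg.norm(p_dense))

# Sparse: sum the products of weights for tokens both sentences share.
shared = query["sparse"].keys() & passage["sparse"].keys()
sparse_score = sum(query["sparse"][t] * passage["sparse"][t] for t in shared)

# ColBERT: MaxSim late interaction -- best-matching passage token per
# query token, averaged over query tokens.
q_colbert = np.asarray(query["colbert"])  # (q_len, 1024)
p_colbert = np.asarray(passage["colbert"])  # (p_len, 1024)
colbert_score = (q_colbert @ p_colbert.T).max(axis=1).mean()

# Sparse token indices are vocabulary ids; map them back to readable tokens.
readable_sparse = {
    tokenizer.convert_ids_to_tokens(int(idx)): round(w, 4)
    for idx, w in query["sparse"].items()
}
print(dense_score, sparse_score, colbert_score, readable_sparse)
```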