all-MiniLM-L6-v2: Hailo-10H HEF (experimental)

⚠️ Experimental / early-stage research artifact. This HEF is a quantized port of sentence-transformers/all-MiniLM-L6-v2 to the Hailo-10H accelerator.

Update (2026-05-30): we now have a real retrieval benchmark. Despite a modest ~0.72 cosine vs FP32, on MTEB/BEIR SciFact (300 queries, real qrels) the INT8 HEF drops nDCG@10 0.717 → 0.664 (−7.4%) and Recall@10 0.843 → 0.786, i.e. it keeps ~93% of FP32 retrieval quality. So the cosine number badly understates usability: this is a usable retrieval embedding (not "broken"), though not FP32-equivalent. We then swept the DFC accuracy levers to close that ~7% gap (QAT, compression, in-domain calibration, 16-bit, larger calibration) and none meaningfully helped (QAT and in-domain calibration hurt); the gap is intrinsic to per-tensor INT8 here. Details are below and in the research log.

This repo contains a HEF for the MiniLM/BERT-class embedding encoder on Hailo-10H (target architecture mercury) using DFC 5.3.0, compiled with a no-attention-mask cut following the RuVector recipe.

TL;DR

HEF: minilm-l6-ruvector.hef (11.21 MiB)
Target: Hailo-10H (hailo10h / mercury)
DFC: 5.3.0
Source model: sentence-transformers/all-MiniLM-L6-v2 (Apache-2.0, 22.7M params, 6 BERT layers, hidden=384)
Sequence length: 128 (static)
Quantization: INT8 (with ew_add* raised to a16_w16)
Cut topology: Single-input, no attention mask. Cut start_node=/embeddings/Add_1 (post-embedding-sum), end_node=last_hidden_state. The host computes embeddings + LayerNorm in FP32 (cheap, int64 Gather) and mean-pools the encoder output with the real attention_mask post-NPU.
Cosine vs FP32: ~0.72 mean cosine on emulator (SDK_QUANTIZED vs SDK_FP_OPTIMIZED); a poor proxy for retrieval (see below)
Retrieval vs FP32: SciFact nDCG@10 0.664 vs 0.717 (−7.4%); Recall@10 0.786 vs 0.843; INT8 keeps ~93% of FP32
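
The percentages in the table follow directly from the raw scores. As a quick sanity check, the sketch below recomputes the relative change and the retained fraction from the SciFact figures quoted above; the helper name relative_change is ours and is not part of the repo.

```python
def relative_change(fp32: float, int8: float) -> float:
    """Signed relative change of the INT8 score with respect to FP32."""
    return (int8 - fp32) / fp32


# SciFact figures quoted in the TL;DR: (FP32 score, INT8 score).
scores = {
    "nDCG@10": (0.717, 0.664),
    "Recall@10": (0.843, 0.786),
}

for name, (fp32, int8) in scores.items():
    change = relative_change(fp32, int8)
    retained = int8 / fp32
    print(f"{name}: change {change:+.1%}, retained {retained:.1%}")
```

Both metrics land at roughly 93% retained, which is where the "~93% of FP32" summary comes from.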

What this is

  • A reproducible compile recipe for a BERT-class encoder on Hailo-10H.
  • A worked example of the RuVector recipe (Keras-serializable monkey-patch + multiproc_policy=disabled + drop-mask-input ONNX surgery) adapted from Hailo-8 / DFC 3.x to Hailo-10H / DFC 5.x. Original recipe: RuVector, compile-encoder-hef.py (MIT-licensed, Copyright (c) 2025 rUv).
  • The HEF runs on real Hailo-10H hardware via HailoRT.

What this is not

  • A drop-in FP32 replacement. We measured retrieval on SciFact (above): INT8 keeps ~93% of FP32 nDCG@10 (−7.4%). That's usable for many retrieval use cases but not FP32-equivalent: a ~7% nDCG@10 / ~7% Recall@10 hit on a hard benchmark. Measure on your own data; SciFact is one task, and the cap-2000 corpus inflates absolute scores (the −7% gap is the reliable figure). The earlier "expect it to degrade, measure first" caveat is now quantified rather than hypothetical.
  • A mask-aware compile. The encoder runs full attention over padded positions; the host applies the real attention_mask in the mean-pool step downstream. A mask-aware variant exists in our research log but has the same or worse cosine number; the mask input is functionally inert at the DFC cut topology we have available. See the source repo's PLAN.md for the 18-iteration negative result.
  • Tuned with task-relevant calibration. The calib NPZ was derived from a 50-sentence corpus via make_bert_assets.py. We tested in-domain (SciFact-corpus) calibration and it hurt (−16% nDCG; the distribution is too narrow). Diverse general-domain calibration (≈256 WikiText paragraphs) was marginally best. Calibration diversity matters more than domain-match here.
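
For readers who want to measure on their own data, here is a minimal sketch of nDCG@10 for a single query. It assumes linear gain (identical to the exponential-gain variant when labels are binary) and a plain dict for the relevance judgments; it is not the evaluation harness used for the numbers above, and the document ids are made up.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query.

    ranked_doc_ids: doc ids ordered by descending similarity.
    qrels: dict mapping doc id -> graded relevance (missing ids count as 0).
    """
    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids]
    ideal = sorted(qrels.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical toy query with one relevant document, ranked first vs. third.
qrels = {"d7": 1}
best = ndcg_at_k(["d7", "d2", "d9"], qrels)
worse = ndcg_at_k(["d2", "d9", "d7"], qrels)
print(best, worse)
```

Averaging this value over all queries, once with FP32 embeddings and once with HEF embeddings, gives the pair of numbers needed for the relative gap.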

Known limits

  • Cosine ceiling ~0.72 vs FP32: real, but it maps to only a ~7% retrieval drop, and it's hard to close. We swept every DFC 5.3 accuracy lever on the SciFact metric (Kaggle, 30 GB RAM, so QAT was not RAM-blocked): QAT (optimization_level=2, compression 0) made it worse (−31% nDCG, overfits the small calib); in-domain calibration was worse (−16%, too narrow); compression was worse; 16-bit (a16_w16 on matmul/conv/softmax) doesn't fit Hailo-10H (AccelerasUnsupportedError); per-channel weights are not a DFC knob (conv is already per-channel; A8W4 group-wise QuaROT/GPTQ ships only in Hailo's genai LLM path). The only lever that did not backfire was larger diverse calibration: 256 WikiText samples nudged nDCG 0.664→0.670 (gap −7.4%→−6.5%, near noise), while 1024 overshot to −8.8%. Conclusion: the ~7% gap is intrinsic to per-tensor INT8 of this model on this stack. Full sweep + numbers: PLAN.md "Session 5".
  • Mask is dropped from the NPU graph. Padded positions contribute attention noise that the host-side mean-pool then averages over the unpadded tokens. For sequences with high padding fraction (e.g. seq_len=128 with 20-token actual input), the relative attention noise grows.
  • Sequence length is static at 128. For shorter inputs, pad with the tokenizer's [PAD] token; the attention noise above applies.
  • Quantized to INT8 with selective a16_w16. ew_add* layers raised to 16-bit activations + 16-bit weights to keep residual paths numerically stable. Other layers are INT8.
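
To give a feel for why a single per-tensor scale costs accuracy and why 16-bit helps on residual paths, the toy sketch below quantizes and dequantizes a synthetic activation tensor at 8 and 16 bits and compares cosine against the original. This is plain symmetric rounding on made-up data with a few injected outlier channels; it does not reproduce the DFC quantizer or this model's activations.

```python
import numpy as np


def fake_quantize(x, bits):
    """Symmetric per-tensor quantize-dequantize: one scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax) * scale


def cosine(a, b):
    """Cosine similarity between two arrays, flattened."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(0)
# Synthetic activations (not taken from the model): mostly small values plus
# a few large outlier channels that stretch the single per-tensor scale.
x = rng.normal(0.0, 1.0, size=(128, 384))
x[:, :4] *= 50.0

cos8 = cosine(x, fake_quantize(x, 8))
cos16 = cosine(x, fake_quantize(x, 16))
print(f"8-bit cosine:  {cos8:.6f}")
print(f"16-bit cosine: {cos16:.6f}")
```

The outlier channels force a coarse step size at 8 bits, so the many small values are rounded heavily; at 16 bits the same scale leaves them almost untouched.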

How to deploy on Hailo-10H

Prereqs:

  • Hailo-10H device (Pi 5 + Hailo-10H, AI Hat+, or M.2 carrier)
  • HailoRT ≥ 5.0 installed
  • Python 3.10+ with hailo_platform (the runtime, separate from DFC)

Minimal inference loop (Python, runtime only, no DFC):

import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer
from hailo_platform import (HEF, VDevice, FormatType, HailoStreamInterface,
                             InputVStreamParams, OutputVStreamParams,
                             InferVStreams, ConfigureParams)

MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# HF BertEmbeddings module: word + position + token_type lookup, then LayerNorm.
# eval() turns dropout into a no-op; float() keeps everything in FP32.
embeddings = AutoModel.from_pretrained(MODEL_ID).embeddings.float().eval()

# Tokenize + FP32 embed + LayerNorm on host (small cost)
def host_prep(text, tokenizer, embeddings):
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=128, return_tensors="pt")
    with torch.no_grad():
        h0 = embeddings(input_ids=enc["input_ids"],
                        token_type_ids=enc.get("token_type_ids"))  # [1, 128, 384]
    return h0.numpy().astype(np.float32), enc["attention_mask"].numpy()

# Configure and run the HEF
hef = HEF("minilm-l6-ruvector.hef")
with VDevice() as dev:
    cfg_params = ConfigureParams.create_from_hef(hef, interface=HailoStreamInterface.PCIe)
    network_group = dev.configure(hef, cfg_params)[0]
    in_params  = InputVStreamParams.make(network_group, format_type=FormatType.FLOAT32)
    out_params = OutputVStreamParams.make(network_group, format_type=FormatType.FLOAT32)
    with network_group.activate(network_group.create_params()):
        with InferVStreams(network_group, in_params, out_params) as pipe:
            x, mask = host_prep("Hello world", tokenizer, embeddings)
            # HEF expects [1, 1, 128, 384] NCHW; reshape if needed
            out = pipe.infer({list(in_params)[0]: x[:, None, :, :]})
last_hidden = list(out.values())[0]              # [1, 1, 128, 384] NCHW
# Mean-pool with real mask, then L2-normalize
last_hidden = last_hidden[:, 0]                  # [1, 128, 384]
m = mask.astype(np.float32)[..., None]
pooled = (last_hidden * m).sum(axis=1) / np.clip(m.sum(axis=1), 1e-9, None)
embedding = pooled / np.clip(np.linalg.norm(pooled, axis=-1, keepdims=True), 1e-12, None)

For the FP32 host preprocessing (embedding lookup + LayerNorm), keep using the HuggingFace transformers model β€” copy out just bert.embeddings.{word,position,token_type}_embeddings + bert.embeddings.LayerNorm, frozen to FP32.
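
If you would rather not depend on PyTorch at inference time, the host-side step can be expressed in NumPy once the three embedding tables and the LayerNorm parameters have been exported. The sketch below is one way to write it, under these assumptions: the table and parameter names are ours, token type is 0 everywhere (single-sentence input), eps=1e-12 is the usual BERT LayerNorm default, and random stand-in weights with a reduced vocabulary are used so the example runs without downloading the checkpoint.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-12):
    """LayerNorm over the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta


def host_embed(input_ids, word_emb, pos_emb, type_emb, gamma, beta):
    """Word + position + token-type lookup, then LayerNorm, all in FP32.

    input_ids: integer array of shape [batch, seq].
    """
    seq_len = input_ids.shape[1]
    summed = (word_emb[input_ids]            # [batch, seq, hidden]
              + pos_emb[:seq_len][None]      # [1, seq, hidden]
              + type_emb[0][None, None])     # [1, 1, hidden], token type 0
    return layer_norm(summed, gamma, beta).astype(np.float32)


# Random stand-ins; real tables would come from the HF checkpoint.
rng = np.random.default_rng(0)
vocab, max_pos, hidden = 1000, 512, 384
word_emb = rng.normal(size=(vocab, hidden)).astype(np.float32)
pos_emb = rng.normal(size=(max_pos, hidden)).astype(np.float32)
type_emb = rng.normal(size=(2, hidden)).astype(np.float32)
gamma = np.ones(hidden, dtype=np.float32)
beta = np.zeros(hidden, dtype=np.float32)

ids = rng.integers(0, vocab, size=(1, 128))
h0 = host_embed(ids, word_emb, pos_emb, type_emb, gamma, beta)
print(h0.shape, h0.dtype)
```

The result has the [1, 128, 384] shape that the inference loop above reshapes before feeding the HEF.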

How to recompile from scratch

Requires:

  • Linux x86_64, glibc ≥ 2.35 (Ubuntu 22.04 tested)
  • Python 3.10 (DFC wheel ABI)
  • 16 GB+ RAM
  • DFC 5.3.0 wheel from Hailo Developer Zone (gated download, not redistributable)

# 0. Set up Python 3.10 venv with DFC
python3.10 -m venv dfcvenv
. dfcvenv/bin/activate
pip install hailo_dataflow_compiler-5.3.0-py3-none-linux_x86_64.whl
pip install numpy onnx onnxruntime onnxsim transformers sentence-transformers

# 1. Build calibration + eval NPZs from MS-MARCO-style prompts
python recipe/make_bert_assets.py \
    --out-dir work/ --seq 128 --calib-n 50 --eval-n 16

# 2. Compile (uses the included source ONNX, no mask)
python recipe/compile_minilm_ruvector.py \
    --onnx source/minilm-l6-encoder-only-seq128.onnx \
    --calib work/bert-calib-seq128.npz \
    --hef-out minilm-l6-ruvector.hef \
    --har-out minilm-l6-ruvector.har \
    --hw-arch hailo10h \
    --opt-level 0 --compression-level 0

# 3. Verify cosine on INT8 emulator (no hardware needed, ~15 min wall)
python recipe/eval_ruvector_int8.py \
    --har minilm-l6-ruvector.har \
    --eval work/bert-eval-seq128.npz

Expected wall time on the bringup VPS (7.7 GB RAM, 4 vCPU): ~17 min for the compile end-to-end at opt-level=0.
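
Step 3 reports a mean cosine between FP32 and quantized emulator outputs. As a rough illustration of that metric (not the code in eval_ruvector_int8.py, whose exact flattening and pooling we do not reproduce here), the sketch below computes a mean per-sample cosine between two output arrays, using synthetic data in place of real emulator outputs.

```python
import numpy as np


def mean_cosine(ref, test, eps=1e-12):
    """Mean per-sample cosine similarity between two arrays of equal shape.

    Each sample (first axis) is flattened before the cosine is taken.
    """
    ref = ref.reshape(len(ref), -1)
    test = test.reshape(len(test), -1)
    num = (ref * test).sum(axis=1)
    den = np.linalg.norm(ref, axis=1) * np.linalg.norm(test, axis=1)
    return float((num / np.maximum(den, eps)).mean())


# Synthetic stand-ins for 16 eval samples (not real emulator outputs).
rng = np.random.default_rng(0)
fp32_out = rng.normal(size=(16, 384))
noisy_out = fp32_out + rng.normal(scale=1.0, size=fp32_out.shape)

same = mean_cosine(fp32_out, fp32_out)
noisy = mean_cosine(fp32_out, noisy_out)
print(f"identical: {same:.4f}, with unit-variance noise: {noisy:.4f}")
```

Noise with the same variance as the signal already pulls the cosine to about 0.7, which helps put the ~0.72 figure in context: a low cosine does not by itself say how much the ranking of documents changes.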

Files in this repo

minilm-l6-ruvector.hef                     ← compiled HEF (11.21 MiB)
source/
  minilm-l6-encoder-only-seq128.onnx       ← post-no-mask-surgery ONNX (40.77 MiB)
recipe/
  minilm_l6_nomask.alls                    ← DFC alls script
  compile_minilm_ruvector.py               ← compile driver (RuVector monkey-patch)
  make_bert_assets.py                      ← calib/eval NPZ generator
  eval_ruvector_int8.py                    ← INT8 emulator eval (cosine vs FP32)

The compile recipe (minilm_l6_nomask.alls) in full:

model_optimization_config(calibration, batch_size=16, calibset_size=50)
model_optimization_config(globals, multiproc_policy=disabled)
pre_quantization_optimization(ew_add_fusing, policy=disabled)
model_optimization_flavor(optimization_level=0, compression_level=0)
pre_quantization_optimization(matmul_correction, layers={matmul*}, correction_type=zp_comp_block)
quantization_param({ew_add*}, precision_mode=a16_w16)
quantization_param({conv*}, precision_mode=a16_w16)
pre_quantization_optimization(layer_norm_decomposition, equalization=disabled, bit_decomposition_mode=uniform_precision)
allocator_param(spatial_defuse_legacy=True)

(Note: the version used for this specific HEF was the inline alls in compile_minilm_ruvector.py, which omits the conv* a16_w16 line because that variant failed on Hailo-10H with "Unsupported layers for the target mercury: precision_change11". The recipe/minilm_l6_nomask.alls shipped here documents the Hailo-8 reference variant.)
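
The note above names one difference between the reference script and the Hailo-10H variant: the conv* a16_w16 line is dropped. A small sketch of deriving such a variant from the reference text is shown below. The helper drop_commands is hypothetical, and the inline alls in compile_minilm_ruvector.py may differ in other ways that this does not capture.

```python
REFERENCE_ALLS = """\
model_optimization_config(calibration, batch_size=16, calibset_size=50)
model_optimization_config(globals, multiproc_policy=disabled)
pre_quantization_optimization(ew_add_fusing, policy=disabled)
model_optimization_flavor(optimization_level=0, compression_level=0)
pre_quantization_optimization(matmul_correction, layers={matmul*}, correction_type=zp_comp_block)
quantization_param({ew_add*}, precision_mode=a16_w16)
quantization_param({conv*}, precision_mode=a16_w16)
pre_quantization_optimization(layer_norm_decomposition, equalization=disabled, bit_decomposition_mode=uniform_precision)
allocator_param(spatial_defuse_legacy=True)
"""


def drop_commands(alls_text: str, needle: str) -> str:
    """Return the script with every line containing `needle` removed."""
    kept = [line for line in alls_text.splitlines() if needle not in line]
    return "\n".join(kept) + "\n"


# Remove only the conv* 16-bit override named in the note.
hailo10h_alls = drop_commands(REFERENCE_ALLS, "quantization_param({conv*}")
print(hailo10h_alls)
```

The ew_add* override stays in place, matching the "INT8 with ew_add* raised to a16_w16" description of the shipped HEF.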

Attribution + licensing

  • Source model: sentence-transformers/all-MiniLM-L6-v2 (Apache-2.0). Originally from Microsoft (Wang et al., 2020, MiniLM paper) and fine-tuned by sentence-transformers (Reimers & Gurevych, 2019).
  • Compile recipe: adapted from RuVector, compile-encoder-hef.py (MIT, Copyright (c) 2025 rUv). The Keras-serializable monkey-patch + multiproc_policy=disabled pattern is the key insight; we adapted it from Hailo-8/DFC 3.x to Hailo-10H/DFC 5.x.
  • Hailo DFC: compilation requires the Hailo Dataflow Compiler under Hailo's Developer Zone EULA. This repo does NOT redistribute the DFC.
  • This HEF: distributed under Apache-2.0 (inherits source model license). The included Python recipe scripts are licensed under Apache-2.0; see individual file headers for original source attribution.

See also

  • cstr/Kokoro-82M-encoder-hailo10h: sister HEF for the Kokoro-82M ALBERT phoneme encoder, same recipe family, same cosine plateau. Uses the matching tools/make_kokoro_encoder_only.py + tools/replace_pow3_with_mul.py ONNX surgery for ALBERT-specific Pow(x, 3.0) and embedding-dim projection.
  • CrispHailo: full bringup log + PLAN.md with 4-session research history, 18 failed mask-aware iterations, the recipe ceiling analysis, and the next-greenfield target list.

Citation

If you use this HEF or the recipe, please cite the chain:

@misc{minilm-l6-hailo10h,
  title  = {all-MiniLM-L6-v2 Hailo-10H HEF (experimental)},
  author = {CrispHailo project},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/cstr/all-MiniLM-L6-v2-hailo10h}}
}

@inproceedings{wang2020minilm,
  title  = {MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers},
  author = {Wang, Wenhui and Wei, Furu and Dong, Li and Bao, Hangbo and Yang, Nan and Zhou, Ming},
  booktitle = {NeurIPS},
  year   = {2020}
}

@inproceedings{reimers2019sbert,
  title  = {Sentence-{BERT}: Sentence Embeddings using {S}iamese {BERT}-Networks},
  author = {Reimers, Nils and Gurevych, Iryna},
  booktitle = {EMNLP-IJCNLP},
  year   = {2019}
}

Bringup log dates: research sessions 1–5, 2026-05-22 → 2026-05-30. Compile wall: ~17 min on a 7.7 GB RAM VPS at optimization_level=0. Emulator eval wall: ~15 min for 16 samples on Kaggle CPU (DFC SDK_QUANTIZED is ~22× slower than SDK_FP_OPTIMIZED). Session 5 added a real MTEB/BEIR SciFact retrieval benchmark + a full quant-lever sweep (see PLAN.md).
