MiniLM-L12 Fine-Tuned INT8 ONNX Embedding Model

Overview

This repository contains a fine-tuned sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model exported to INT8 ONNX format for efficient semantic search and Retrieval-Augmented Generation (RAG) applications on resource-constrained devices.

Features

  • Fine-tuned for semantic similarity and retrieval tasks
  • Exported to INT8 ONNX for fast inference
  • Suitable for local RAG pipelines
  • Optimized for mobile and edge deployment
  • Supports multilingual text processing

Base Model

  • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

Files

  • model-int8.onnx
  • config.json
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json

Intended Use

This model can be used for:

  • Semantic search
  • Question-answer retrieval
  • Embedding generation
  • Local AI assistants
  • Mobile RAG systems
  • Vector database indexing

License

Apache-2.0

language:

  • en

license: apache-2.0

library_name: onnxruntime

tags:

  • sentence-transformers
  • embeddings
  • onnx
  • int8
  • retrieval
  • semantic-search
  • rag
  • mobile
  • quantization

pipeline_tag: feature-extraction

MiniLM-L12 Fine-Tuned INT8 ONNX Embedding Model

Model Description

This repository contains a fine-tuned MiniLM-L12 sentence embedding model exported to INT8 ONNX format for efficient deployment in mobile and edge environments.

The model is optimized for semantic similarity, dense retrieval, and Retrieval-Augmented Generation (RAG) applications where low latency and reduced memory usage are important.

Intended Use

This model is suitable for:

  • Semantic search
  • Dense document retrieval
  • Retrieval-Augmented Generation (RAG)
  • FAQ matching
  • Question-answer retrieval
  • Mobile AI assistants
  • Offline embedding generation
  • Local vector search
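
As an illustration of the local vector search use case, the following minimal sketch ranks a handful of stored embeddings against a query embedding by cosine similarity. The function name `top_k` and the three-dimensional toy vectors are hypothetical; real embeddings would come from the ONNX model, and a larger index would normally live in a vector database rather than a NumPy array.

```python
import numpy as np


def top_k(query_vec: np.ndarray, doc_matrix: np.ndarray, k: int = 3):
    """Return (row index, cosine score) pairs for the k most similar rows, best first."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = d @ q                       # cosine similarity of unit vectors
    order = np.argsort(-scores)[:k]      # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Toy stand-ins for document and query embeddings (not real model output).
docs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]])
query = np.array([0.9, 0.1, 0.0])

results = top_k(query, docs, k=2)
print(results)
```

A brute-force scan like this is usually adequate for the small on-device indexes typical of mobile RAG; approximate nearest-neighbour indexes only start to pay off at larger scales.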

Model Architecture

  • Base Model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
  • Framework: ONNX Runtime
  • Precision: INT8 Quantized
  • Embedding Type: Dense sentence embeddings

Fine-Tuning

The model was fine-tuned on a custom dataset of query–answer pairs to improve retrieval quality for local RAG scenarios.

The training objective was to produce embeddings where semantically related queries and answers are positioned close together in vector space.
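
The card does not name the loss function, so the following is only one plausible reading of that objective: an in-batch contrastive loss in which answer i is the positive for query i and every other answer in the batch acts as a negative. Sentence Transformers ships a loss of this shape as MultipleNegativesRankingLoss, but whether it was used here is an assumption. The function name and the `scale` value are illustrative.

```python
import numpy as np


def in_batch_contrastive_loss(query_emb: np.ndarray, answer_emb: np.ndarray,
                              scale: float = 20.0) -> float:
    """Cross-entropy over scaled cosine similarities; row i's positive is column i."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    a = answer_emb / np.linalg.norm(answer_emb, axis=1, keepdims=True)
    logits = scale * (q @ a.T)                               # (batch, batch)
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


# Toy embeddings (hypothetical): matched pairs versus deliberately swapped pairs.
aligned = np.array([[1.0, 0.0], [0.0, 1.0]])
swapped = aligned[::-1]

loss_matched = in_batch_contrastive_loss(aligned, aligned)
loss_swapped = in_batch_contrastive_loss(aligned, swapped)
print(loss_matched, loss_swapped)
```

The loss is near zero when each query sits closest to its own answer and grows when a different answer is closer, which is the behaviour the objective describes.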

Evaluation Summary

Performance was evaluated using retrieval metrics on a held-out test dataset.

Metric                        Fine-Tuned Score
Recall@1                      96.39%
Recall@5                      99.51%
Mean Reciprocal Rank (MRR)    97.69%
Accuracy                      96.39%
F1 Score                      98.16%

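On the held-out test set, a correct answer was ranked first for 96.39% of queries and within the top five for 99.51%, which indicates strong retrieval performance for semantic search and local RAG pipelines on similar data.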
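
For readers who want to run a comparable evaluation on their own data, here is a minimal sketch of how Recall@k and MRR are commonly computed. It assumes each query has exactly one correct answer, which the card does not state; the IDs below are made up and are not taken from the actual test set.

```python
from typing import Sequence


def recall_at_k(ranked_ids: Sequence[Sequence[str]], gold_ids: Sequence[str], k: int) -> float:
    """Fraction of queries whose gold answer appears among the top-k results."""
    hits = sum(1 for ranked, gold in zip(ranked_ids, gold_ids) if gold in ranked[:k])
    return hits / len(gold_ids)


def mean_reciprocal_rank(ranked_ids: Sequence[Sequence[str]], gold_ids: Sequence[str]) -> float:
    """Average of 1/rank of the gold answer; a missing answer contributes 0."""
    total = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        if gold in ranked:
            total += 1.0 / (list(ranked).index(gold) + 1)
    return total / len(gold_ids)


# Hypothetical retrieval results: one ranked list of answer IDs per query.
ranked = [["a1", "a7", "a3"], ["a9", "a2", "a4"], ["a5", "a6", "a3"]]
gold = ["a1", "a2", "a8"]

print(recall_at_k(ranked, gold, 1))
print(recall_at_k(ranked, gold, 3))
print(mean_reciprocal_rank(ranked, gold))
```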

Files Included

  • model-int8.onnx
  • config.json
  • tokenizer.json
  • tokenizer_config.json
  • special_tokens_map.json

Usage Notes

This model is intended to be executed with ONNX Runtime.

Typical workflow:

  1. Tokenize the input text using the accompanying tokenizer.
  2. Run inference with the ONNX model.
  3. Apply the same pooling strategy used during training (the base model uses mean pooling over token embeddings, weighted by the attention mask).
  4. Normalize embeddings before similarity search if required by your retrieval pipeline.
  5. Compare embeddings using cosine similarity.
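
The workflow above can be sketched as follows. This is a sketch under stated assumptions, not a verified recipe for this export: it assumes the ONNX graph takes inputs named `input_ids`, `attention_mask` and (optionally) `token_type_ids`, that its first output holds per-token embeddings of shape (batch, sequence, dim), and that mean pooling is the right strategy, as it is for the base model. The helper names are hypothetical.

```python
import numpy as np


def mean_pool(token_embeddings: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(np.float32)    # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)          # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)          # avoid division by zero
    return summed / counts


def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def embed(texts, model_path="model-int8.onnx", tokenizer_path="tokenizer.json"):
    """Steps 1-4: tokenize, run ONNX inference, pool, normalize."""
    # Imported lazily so the helpers above work without these packages installed.
    import onnxruntime as ort
    from tokenizers import Tokenizer

    tokenizer = Tokenizer.from_file(tokenizer_path)
    session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
    wanted = {i.name for i in session.get_inputs()}  # inputs this export declares

    rows = []
    for text in texts:  # one text per run, so no padding token has to be assumed
        enc = tokenizer.encode(text)
        feeds = {
            "input_ids": np.array([enc.ids], dtype=np.int64),
            "attention_mask": np.array([enc.attention_mask], dtype=np.int64),
            "token_type_ids": np.array([enc.type_ids], dtype=np.int64),
        }
        outputs = session.run(None, {k: v for k, v in feeds.items() if k in wanted})
        # Assumption: outputs[0] is the token-level hidden state, not a pooled vector.
        rows.append(mean_pool(outputs[0], feeds["attention_mask"]))
    return l2_normalize(np.vstack(rows))


# Usage (requires the model files from this repository):
# vecs = embed(["How do I reset my password?", "Steps to reset a password"])
# print(float(vecs[0] @ vecs[1]))  # step 5: cosine similarity of unit vectors
```

If the export already applies pooling inside the graph, the first output will be two-dimensional and `mean_pool` should be skipped; inspecting `session.get_outputs()` is a quick way to check.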

Mobile Deployment

The INT8 quantized model is designed for resource-constrained environments and can be integrated into:

  • Android applications
  • iOS applications
  • Flutter applications
  • React Native applications
  • Edge AI deployments

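Storing weights as 8-bit integers rather than 32-bit floats reduces the size of the quantized tensors to roughly a quarter, which lowers memory usage and generally improves CPU inference speed while maintaining strong retrieval quality.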
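
To make the memory and precision trade-off concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization applied to a random toy weight matrix. It illustrates the general idea only; the card does not say which quantization scheme or tool produced `model-int8.onnx`, and the matrix size and function names are arbitrary.

```python
import numpy as np


def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: one float scale maps weights to int8."""
    scale = float(np.max(np.abs(weights))) / 127.0   # assumes a non-zero tensor
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from int8 values and their scale."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.05, size=(384, 384)).astype(np.float32)  # toy weight matrix

q, scale = quantize_int8(w)
max_error = float(np.max(np.abs(w - dequantize(q, scale))))

print(w.nbytes, q.nbytes)   # 4 bytes versus 1 byte per weight
print(max_error, scale / 2)  # rounding error stays within half a quantization step
```

The storage drops by a factor of four, while each weight moves by at most half a quantization step, which is why retrieval quality is often largely preserved.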

Limitations

  • Retrieval quality depends on the similarity between deployment data and the fine-tuning dataset.
  • This model generates embeddings and is not intended to generate natural language responses.
  • Performance may vary across domains and languages not represented during fine-tuning.

Citation

If you use this model in academic work or publications, please cite the original MiniLM architecture and the Sentence Transformers framework in addition to referencing this repository.

Acknowledgements

  • Sentence Transformers
  • MiniLM
  • ONNX Runtime
  • Hugging Face