MiniLM-L12 Fine-Tuned INT8 ONNX Embedding Model

Overview

This repository contains a fine-tuned sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model exported to INT8 ONNX format for efficient semantic search and Retrieval-Augmented Generation (RAG) applications on resource-constrained devices.

Features

Fine-tuned for semantic similarity and retrieval tasks
Exported to INT8 ONNX for fast inference
Suitable for local RAG pipelines
Optimized for mobile and edge deployment
Supports multilingual text processing

Base Model

sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

Files

model-int8.onnx
config.json
tokenizer.json
tokenizer_config.json
special_tokens_map.json

Intended Use

This model can be used for:

Semantic search
Question-answer retrieval
Embedding generation
Local AI assistants
Mobile RAG systems
Vector database indexing

License

Apache-2.0

---
language:
- en
license: apache-2.0
library_name: onnxruntime
tags:
- sentence-transformers
- embeddings
- onnx
- int8
- retrieval
- semantic-search
- rag
- mobile
- quantization
pipeline_tag: feature-extraction
---

Model Description

This repository contains a fine-tuned MiniLM-L12 sentence embedding model exported to INT8 ONNX format for efficient deployment in mobile and edge environments.

The model is optimized for semantic similarity, dense retrieval, and Retrieval-Augmented Generation (RAG) applications where low latency and reduced memory usage are important.

Intended Use

This model is suitable for:

Semantic search
Dense document retrieval
Retrieval-Augmented Generation (RAG)
FAQ matching
Question-answer retrieval
Mobile AI assistants
Offline embedding generation
Local vector search
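
For small local collections, vector search can be as simple as scoring every stored embedding against the query. The sketch below is a minimal illustration, not part of this repository: the `top_k` helper, the document ids, and the 3-dimensional toy vectors are all hypothetical stand-ins for real model embeddings.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=3):
    """Return the k (doc_id, score) pairs most similar to query_vec.

    `index` maps a document id to its stored embedding.
    """
    scored = ((doc_id, cosine(query_vec, vec)) for doc_id, vec in index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy vectors (hypothetical); real embeddings come from the ONNX model.
index = {
    "faq-refunds": [0.9, 0.1, 0.0],
    "faq-shipping": [0.1, 0.9, 0.0],
    "faq-login": [0.0, 0.2, 0.9],
}
results = top_k([0.8, 0.2, 0.1], index, k=2)
print(results)
```

A brute-force scan like this is usually adequate for a few thousand documents; larger collections would typically move to an approximate nearest-neighbor index.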

Model Architecture

Base Model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Framework: ONNX Runtime
Precision: INT8 Quantized
Embedding Type: Dense sentence embeddings
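
This card does not state how the INT8 file was produced. One common route, shown here purely as an assumption, is ONNX Runtime dynamic quantization applied to an FP32 ONNX export. The input filename `model.onnx` and both helper functions are hypothetical.

```python
from pathlib import Path


def int8_output_path(fp32_path):
    """Derive an output name such as model-int8.onnx from model.onnx."""
    p = Path(fp32_path)
    return p.with_name(f"{p.stem}-int8{p.suffix}")


def quantize_to_int8(fp32_path):
    """Quantize the weights of an FP32 ONNX model to INT8 (dynamic quantization)."""
    # Imported lazily so the path helper works without onnxruntime installed.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    out_path = int8_output_path(fp32_path)
    quantize_dynamic(str(fp32_path), str(out_path), weight_type=QuantType.QInt8)
    return out_path


if __name__ == "__main__":
    # Assumes a hypothetical FP32 export named model.onnx in the working directory.
    print(quantize_to_int8("model.onnx"))
```

Dynamic quantization needs no calibration data, which makes it a frequent choice for transformer encoders. Whether it matches the procedure used for this repository is not confirmed by the source.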

Fine-Tuning

The model was fine-tuned on a custom dataset of query–answer pairs to improve retrieval quality for local RAG scenarios.

The training objective was to produce embeddings where semantically related queries and answers are positioned close together in vector space.

Evaluation Summary

Performance was evaluated using retrieval metrics on a held-out test dataset.

Metric	Fine-Tuned Score
Recall@1	96.39%
Recall@5	99.51%
Mean Reciprocal Rank (MRR)	97.69%
Accuracy	96.39%
F1 Score	98.16%

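On the held-out test set, the correct answer ranked first for 96.39% of queries and within the top five for 99.51%, which indicates strong retrieval performance for semantic search and local RAG pipelines on data similar to the fine-tuning set.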
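
To make the retrieval metrics concrete, here is a minimal sketch of how Recall@k and MRR can be computed from ranked results. It assumes each query has exactly one relevant answer; under that assumption Recall@1 coincides with top-1 accuracy, which is consistent with the identical values reported above. The function names and toy ids are illustrative and do not come from the evaluation code behind this card.

```python
def recall_at_k(ranked_ids, relevant_id, k):
    """1.0 if the relevant answer appears in the top k results, else 0.0."""
    return 1.0 if relevant_id in ranked_ids[:k] else 0.0


def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the relevant answer, or 0.0 if it was not retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0


def evaluate(runs, ks=(1, 5)):
    """Average the metrics over `runs`, a list of (ranked_ids, relevant_id)."""
    n = len(runs)
    report = {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n
        for k in ks
    }
    report["mrr"] = sum(reciprocal_rank(r, rel) for r, rel in runs) / n
    return report


# Toy results for four queries (hypothetical ids).
runs = [
    (["a1", "a7", "a3"], "a1"),        # relevant answer at rank 1
    (["a4", "a2", "a9"], "a2"),        # rank 2
    (["a5", "a6", "a8"], "a3"),        # not retrieved
    (["a0", "a4", "a5", "a6"], "a6"),  # rank 4
]
report = evaluate(runs)
print(report)
```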

Usage Notes

This model is intended to be executed with ONNX Runtime.

Typical workflow:

Tokenize the input text using the accompanying tokenizer.
Run inference with the ONNX model.
Apply the same pooling strategy used during training. The base model uses mean pooling over token embeddings, weighted by the attention mask.
Normalize embeddings before similarity search if required by your retrieval pipeline.
Compare embeddings using cosine similarity.
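
The workflow above can be sketched in Python roughly as follows. This is a minimal sketch under several assumptions that the card does not confirm: the fine-tuned model keeps the base model's mean pooling, the first ONNX output holds per-token embeddings of shape (batch, sequence, hidden), and any graph input the tokenizer does not produce is `token_type_ids`. The `embed` function and its `model_dir` parameter are illustrative names.

```python
import os

import numpy as np


def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)
    return summed / counts


def l2_normalize(vectors):
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)


def embed(texts, model_dir="."):
    """Tokenize, run the ONNX model, pool, and normalize."""
    # Imported lazily so the pooling helpers work without these packages.
    import onnxruntime as ort
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    session = ort.InferenceSession(
        os.path.join(model_dir, "model-int8.onnx"),
        providers=["CPUExecutionProvider"],
    )
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors="np")

    # Feed only the inputs the graph declares. Inputs the tokenizer omits are
    # zero-filled, on the assumption that they are token_type_ids.
    feed = {}
    for inp in session.get_inputs():
        if inp.name in enc:
            feed[inp.name] = enc[inp.name].astype(np.int64)
        else:
            feed[inp.name] = np.zeros_like(enc["input_ids"], dtype=np.int64)

    # Assumption: the first output is the per-token embedding tensor.
    token_embeddings = session.run(None, feed)[0]
    return l2_normalize(mean_pool(token_embeddings, enc["attention_mask"]))
```

With normalized outputs, the similarity of a query and an answer is a plain dot product, for example `q, a = embed(["How do I reset my password?", "Open Settings and choose Reset."])` followed by `float(q @ a)`.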

Mobile Deployment

The INT8 quantized model is designed for resource-constrained environments and can be integrated into:

Android applications
iOS applications
Flutter applications
React Native applications
Edge AI deployments

INT8 weights take roughly a quarter of the storage of FP32 weights, which lowers memory usage and typically improves CPU inference speed while maintaining strong retrieval quality.
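
To illustrate the trade-off between precision and memory, the sketch below quantizes a handful of float weights to 8-bit integers and measures the round-trip error. It assumes symmetric per-tensor scaling, a simplified scheme that may differ from the one used to produce `model-int8.onnx`. The weight values are made up.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(v / scale) for v in values], scale


def dequantize(q_values, scale):
    """Recover approximate float values from the quantized integers."""
    return [q * scale for q in q_values]


weights = [0.42, -1.27, 0.05, 0.88, -0.31]  # made-up example weights
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Rounding moves each value by at most half a quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

fp32_bytes = 4 * len(weights)  # 32-bit floats
int8_bytes = 1 * len(weights)  # 8-bit integers (plus one shared scale)
print(q, max_error, fp32_bytes, int8_bytes)
```

Each weight shrinks from four bytes to one, at the cost of an error bounded by half the scale. How much that error affects retrieval depends on the model and data, so it is worth checking on a representative query set.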

Limitations

Retrieval quality depends on the similarity between deployment data and the fine-tuning dataset.
This model generates embeddings and is not intended to generate natural language responses.
Performance may vary across domains and languages not represented during fine-tuning.

Citation

If you use this model in academic work or publications, please cite the original MiniLM architecture and the Sentence Transformers framework in addition to referencing this repository.

Acknowledgements

Sentence Transformers
MiniLM
ONNX Runtime
Hugging Face
