MiniLM-L12 Fine-Tuned INT8 ONNX Embedding Model
Overview
This repository contains a fine-tuned sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 model exported to INT8 ONNX format for efficient semantic search and Retrieval-Augmented Generation (RAG) applications on resource-constrained devices.
Features
- Fine-tuned for semantic similarity and retrieval tasks
- Exported to INT8 ONNX for fast inference
- Suitable for local RAG pipelines
- Optimized for mobile and edge deployment
- Supports multilingual text processing
Base Model
sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
Files
- model-int8.onnx
- config.json
- tokenizer.json
- tokenizer_config.json
- special_tokens_map.json
Intended Use
This model can be used for:
- Semantic search
- Question-answer retrieval
- Embedding generation
- Local AI assistants
- Mobile RAG systems
- Vector database indexing
License
Apache-2.0
language:
- en
license: apache-2.0
library_name: onnxruntime
tags:
- sentence-transformers
- embeddings
- onnx
- int8
- retrieval
- semantic-search
- rag
- mobile
- quantization
pipeline_tag: feature-extraction
MiniLM-L12 Fine-Tuned INT8 ONNX Embedding Model
Model Description
This repository contains a fine-tuned MiniLM-L12 sentence embedding model exported to INT8 ONNX format for efficient deployment in mobile and edge environments.
The model is optimized for semantic similarity, dense retrieval, and Retrieval-Augmented Generation (RAG) applications where low latency and reduced memory usage are important.
Intended Use
This model is suitable for:
- Semantic search
- Dense document retrieval
- Retrieval-Augmented Generation (RAG)
- FAQ matching
- Question-answer retrieval
- Mobile AI assistants
- Offline embedding generation
- Local vector search
Model Architecture
- Base Model: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
- Framework: ONNX Runtime
- Precision: INT8 Quantized
- Embedding Type: Dense sentence embeddings
Fine-Tuning
The model was fine-tuned on a custom dataset of query–answer pairs to improve retrieval quality for local RAG scenarios.
The training objective was to produce embeddings where semantically related queries and answers are positioned close together in vector space.
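The card does not name the loss function used for fine-tuning. As an illustration only, the sketch below shows one common way to express this kind of objective: an in-batch contrastive loss in which each query's paired answer is the positive and the other answers in the batch serve as negatives. The function name, the `scale` parameter, and the toy vectors are all hypothetical and are not taken from this repository.

```python
import numpy as np

def in_batch_contrastive_loss(query_emb, answer_emb, scale=20.0):
    """Mean cross-entropy where answer i is the positive for query i."""
    # L2-normalize so that dot products equal cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    a = answer_emb / np.linalg.norm(answer_emb, axis=1, keepdims=True)
    logits = scale * (q @ a.T)                             # (batch, batch) similarities
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability of each query's true answer.
    return float(-np.mean(np.diag(log_probs)))

# Toy check (made-up vectors): aligned pairs should score a lower loss
# than the same answers assigned to the wrong queries.
queries = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned = np.array([[0.9, 0.1], [0.1, 0.9]])
swapped = aligned[::-1]
loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_swapped = in_batch_contrastive_loss(queries, swapped)
```

Minimizing a loss of this shape pulls each query toward its own answer and pushes it away from the other answers in the batch, which matches the stated goal of placing related pairs close together in vector space.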
Evaluation Summary
Performance was evaluated using retrieval metrics on a held-out test dataset.
| Metric | Fine-Tuned Score |
|---|---|
| Recall@1 | 96.39% |
| Recall@5 | 99.51% |
| Mean Reciprocal Rank (MRR) | 97.69% |
| Accuracy | 96.39% |
| F1 Score | 98.16% |
These results indicate strong retrieval performance on the held-out test set, which supports use in semantic search and local RAG pipelines on similar data.
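For readers who want to reproduce metrics of this kind on their own data, here is a minimal sketch of how Recall@k and MRR are typically computed from the rank of the correct answer for each query. It assumes exactly one correct answer per query and 1-based ranks; the ranks below are invented for illustration and are not the evaluation data behind the table.

```python
def recall_at_k(ranks, k):
    """Fraction of queries whose correct answer appears in the top k (1-based ranks)."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank of the correct answer over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks of the correct answer for five queries.
ranks = [1, 1, 2, 1, 6]
r1 = recall_at_k(ranks, 1)          # 3 of 5 queries rank the answer first
r5 = recall_at_k(ranks, 5)          # 4 of 5 queries rank it in the top 5
mrr = mean_reciprocal_rank(ranks)   # (1 + 1 + 1/2 + 1 + 1/6) / 5
```

Under the one-answer-per-query assumption, Recall@1 is the same quantity as top-1 accuracy, which would be consistent with the two rows in the table sharing a value.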
Files Included
- model-int8.onnx
- config.json
- tokenizer.json
- tokenizer_config.json
- special_tokens_map.json
Usage Notes
This model is intended to be executed with ONNX Runtime.
Typical workflow:
- Tokenize the input text using the accompanying tokenizer.
- Run inference with the ONNX model.
- Apply the same pooling strategy used during training.
- Normalize embeddings before similarity search if required by your retrieval pipeline.
- Compare embeddings using cosine similarity.
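The workflow above can be sketched roughly as follows. Several details are assumptions rather than facts from this card: mean pooling is used because that is what the base model uses, but the card does not state the pooling applied during fine-tuning; the first ONNX output is assumed to contain per-token embeddings; `max_length=128` is a placeholder; and `model_dir` is a hypothetical local folder holding the files listed above. The `embed` function needs `onnxruntime` and `transformers` installed, while the pooling and similarity helpers need only NumPy.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sentence, ignoring padding positions."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

def l2_normalize(vectors):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def cosine_similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    return l2_normalize(a) @ l2_normalize(b).T

def embed(texts, model_dir, max_length=128):
    """Tokenize, run the ONNX model, pool, and normalize."""
    import os
    import onnxruntime as ort
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    session = ort.InferenceSession(
        os.path.join(model_dir, "model-int8.onnx"),
        providers=["CPUExecutionProvider"],
    )
    enc = dict(tokenizer(texts, padding=True, truncation=True,
                         max_length=max_length, return_tensors="np"))
    feed = {}
    for inp in session.get_inputs():
        if inp.name in enc:
            feed[inp.name] = enc[inp.name].astype(np.int64)
        elif inp.name == "token_type_ids":
            # Assumption: single-segment input, so segment ids are all zero.
            feed[inp.name] = np.zeros_like(enc["input_ids"], dtype=np.int64)
    # Assumption: the first output is per-token embeddings (batch, seq, hidden).
    token_embeddings = session.run(None, feed)[0]
    return l2_normalize(mean_pool(token_embeddings, enc["attention_mask"]))
```

A call such as `cosine_similarity_matrix(embed(queries, model_dir), embed(documents, model_dir))` would then give one row of scores per query. If the exported graph already includes pooling, the first output will be two-dimensional and the `mean_pool` step should be skipped.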
Mobile Deployment
The INT8 quantized model is designed for resource-constrained environments and can be integrated into:
- Android applications
- iOS applications
- Flutter applications
- React Native applications
- Edge AI deployments
The reduced precision helps lower memory usage and improve inference speed while maintaining strong retrieval quality.
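To make the memory claim concrete, here is a small sketch of symmetric per-tensor INT8 quantization applied to a toy float32 matrix. The card does not say which quantization scheme or tooling produced `model-int8.onnx`, so this is an illustration of the general idea rather than the method used here; the function names and the random matrix are hypothetical.

```python
import numpy as np

def quantize_int8(weights):
    """Symmetric per-tensor quantization: float32 values -> int8 codes plus one scale."""
    max_abs = float(np.abs(weights).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    """Recover approximate float32 values from int8 codes."""
    return q.astype(np.float32) * scale

# Toy matrix standing in for a weight tensor (not real model weights).
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.05, size=(384, 384)).astype(np.float32)

q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

size_ratio = weights.nbytes / q.nbytes                  # 4 bytes vs 1 byte per value
max_error = float(np.abs(weights - restored).max())     # at most about scale / 2
```

Each stored value shrinks from four bytes to one, and the rounding error per value is bounded by roughly half the scale. How much that error affects retrieval quality depends on the model and data, so it is worth re-checking retrieval metrics on the quantized model.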
Limitations
- Retrieval quality depends on the similarity between deployment data and the fine-tuning dataset.
- This model generates embeddings and is not intended to generate natural language responses.
- Performance may vary across domains and languages not represented during fine-tuning.
Citation
If you use this model in academic work or publications, please cite the original MiniLM architecture and the Sentence Transformers framework in addition to referencing this repository.
Acknowledgements
- Sentence Transformers
- MiniLM
- ONNX Runtime
- Hugging Face