DeBERTa-v3-small ONNX with Attention Weights (INT8)
This repository provides a quantized INT8 ONNX export of microsoft/deberta-v3-small. This version is specifically modified to expose raw attention weights from DeBERTa's unique Disentangled Attention mechanism, enabling advanced semantic analysis and context pruning.
Key Features
- Architecture: DeBERTa v3 with Disentangled Attention and ELECTRA-style pre-training.
- Modification: Custom ONNX graph export that includes attentions.{0..11} as additional outputs.
- Accuracy: Maintains high semantic fidelity while providing raw weight access.
- Optimization: INT8 quantization (AVX-512 VNNI optimized) for efficient CPU-based inference.
Usage with WAMP
This model is a supported component of the Weighted Attention Message Pruner (WAMP-proxy). WAMP uses the exported attention weights to score message relevance in long conversational contexts.
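To make the idea of attention-based relevance scoring concrete, here is a minimal sketch. It is not WAMP's actual algorithm, which this card does not describe. It assumes each attention output has the conventional shape (batch, heads, seq, seq); the function names `token_scores` and `message_scores` and the token spans are hypothetical and used for illustration only.

```python
import numpy as np

def token_scores(attentions):
    """Average attention each token receives, over layers, heads, and query positions.

    `attentions` is a list of arrays, one per layer, each assumed to have
    shape (batch, heads, seq, seq) with the last axis indexing attended-to tokens.
    """
    stacked = np.stack(attentions)        # (layers, batch, heads, seq, seq)
    return stacked.mean(axis=(0, 2, 3))   # (batch, seq)

def message_scores(scores, spans):
    """Mean token score for each message, given (start, end) token spans (end exclusive)."""
    return [float(scores[start:end].mean()) for start, end in spans]

# Synthetic example: 2 layers, batch of 1, 2 heads, 4 tokens,
# with every query attending only to token 0.
attn = np.zeros((1, 2, 4, 4))
attn[..., 0] = 1.0
scores = token_scores([attn, attn])
per_message = message_scores(scores[0], [(0, 2), (2, 4)])
```

Averaging over heads and layers is only one possible aggregation. Weighting later layers more heavily, or using only attention from a query message's tokens, are equally plausible choices that depend on the pruning goal.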
Quick Inference (ONNX Runtime)
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load model and tokenizer
session = ort.InferenceSession("model_quantized.onnx")
tokenizer = AutoTokenizer.from_pretrained("naranor/DeBERTa-v3-small-ONNX-Attentions")

# Prepare input
inputs = tokenizer("Your text here", return_tensors="np")
onnx_inputs = {
    "input_ids": inputs["input_ids"],
    "attention_mask": inputs["attention_mask"],
}

# Run with attentions (passing None requests every graph output)
outputs = session.run(None, onnx_inputs)
# Last hidden state is outputs[0], attentions are in subsequent indices
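Relying on positional indices can be fragile, so the sketch below pairs each output with its name instead. It assumes the attention outputs follow the `attentions.N` naming mentioned under Key Features. The helper `collect_attentions` is hypothetical, and the exact names should be confirmed against `session.get_outputs()` for the file you download.

```python
import re

def collect_attentions(output_names, outputs, prefix="attentions."):
    """Return the attention outputs ordered by layer index.

    `output_names` and `outputs` are parallel sequences, for example
    [o.name for o in session.get_outputs()] and session.run(None, onnx_inputs).
    The "attentions.N" naming is an assumption taken from this card.
    """
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    layers = []
    for name, value in zip(output_names, outputs):
        match = pattern.match(name)
        if match:
            layers.append((int(match.group(1)), value))
    # Sort numerically so that layer 10 follows layer 9 rather than layer 1.
    layers.sort(key=lambda item: item[0])
    return [value for _, value in layers]
```

With a loaded session this would be called as `collect_attentions([o.name for o in session.get_outputs()], outputs)`. An empty result means the output names differ from the assumed pattern.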
Attribution & Original Work
- Original Model: DeBERTa-v3-small by Microsoft.
- Exported by: naranor using the WAMP Universal Exporter.
License
This model is licensed under the MIT License. You are free to use, modify, and distribute this model for any purpose, including commercial applications, provided you include the original license notice.