DeBERTa-v3-small ONNX with Attention Weights (INT8)

This repository provides a quantized INT8 ONNX export of microsoft/deberta-v3-small. The export is modified to expose the raw attention weights of DeBERTa's disentangled attention mechanism as additional graph outputs, for uses such as semantic analysis and context pruning.

Key Features

Architecture: DeBERTa v3 with disentangled attention and ELECTRA-style pre-training.
Modification: Custom ONNX graph export that includes attentions.{0..11} as additional outputs.
Accuracy: Maintains high semantic fidelity while providing raw weight access.
Optimization: INT8 quantization (AVX-512 VNNI optimized) for efficient CPU-based inference.
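Because the attention weights arrive as extra graph outputs, it can be safer to select them by name than by position. The helper below is a minimal sketch: the function name pick_attention_outputs is hypothetical, and the "attentions." prefix is taken from the naming described above, so check it against the names your session actually reports.

```python
import re


def pick_attention_outputs(names, arrays, prefix="attentions."):
    """Return the attention arrays ordered by layer index.

    names  -- output names in graph order
    arrays -- the list returned by session.run(None, ...), same order
    prefix -- assumed naming scheme for the attention outputs
    """
    pattern = re.compile(re.escape(prefix) + r"(\d+)$")
    indexed = []
    for name, array in zip(names, arrays):
        match = pattern.match(name)
        if match:
            indexed.append((int(match.group(1)), array))
    # Sort numerically so that layer 10 follows layer 9, not layer 1.
    indexed.sort(key=lambda pair: pair[0])
    return [array for _, array in indexed]
```

With ONNX Runtime, the names can be read from the session with [o.name for o in session.get_outputs()] and passed in together with the result of session.run.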

Usage with WAMP

This model is a supported component of the Weighted Attention Message Pruner (WAMP-proxy). WAMP uses these attention weights to score message relevance in long conversational contexts.
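To illustrate the general idea of attention-based relevance scoring, the sketch below ranks messages by the average attention their tokens receive. This is not WAMP's actual algorithm, which is not described here. The function name score_messages, the [seq, seq] input layout, and the span format are all assumptions made for the example.

```python
import numpy as np


def score_messages(attention, spans):
    """Score each message by the mean attention its tokens receive.

    attention -- array of shape [seq, seq]; row i holds the weights that
                 token i assigns to every token (assumed to be already
                 averaged over layers and heads)
    spans     -- list of (start, end) token ranges, end exclusive,
                 one per message
    """
    attention = np.asarray(attention, dtype=np.float64)
    # Column-wise mean: how much attention each token receives on average.
    received = attention.mean(axis=0)
    scores = np.array([received[start:end].mean() for start, end in spans])
    # Normalize so the scores sum to 1 and are comparable across inputs.
    return scores / scores.sum()
```

A pruner built on this idea could drop the lowest-scoring messages until the context fits a token budget. Other aggregation choices (for example, using only the last layers or only the first token's row) are equally plausible.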

Quick Inference (ONNX Runtime)

import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

# Load model and tokenizer
session = ort.InferenceSession("model_quantized.onnx")
tokenizer = AutoTokenizer.from_pretrained("naranor/DeBERTa-v3-small-ONNX-Attentions")

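# Prepare input; cast to int64, the integer type ONNX transformer exports typically expect
inputs = tokenizer("Your text here", return_tensors="np")
onnx_inputs = {
    "input_ids": inputs["input_ids"].astype(np.int64),
    "attention_mask": inputs["attention_mask"].astype(np.int64),
}

# Run with attentions (passing None requests every graph output)
outputs = session.run(None, onnx_inputs)
# outputs[0] is the last hidden state; the attention weights follow in the remaining entries
last_hidden_state, attentions = outputs[0], outputs[1:]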
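Once the per-layer attention arrays are available, a common next step is to collapse them into a single matrix per input. The following is a minimal sketch under two assumptions that should be checked against the actual export: each attention output has shape [batch, heads, seq, seq], and every layer has the same number of heads. The function name mean_attention is hypothetical.

```python
import numpy as np


def mean_attention(attentions, attention_mask):
    """Average attention over layers and heads, zeroing padded positions.

    attentions     -- list of arrays, each assumed [batch, heads, seq, seq]
    attention_mask -- array [batch, seq], 1 for real tokens, 0 for padding
    """
    # Stack to [layers, batch, heads, seq, seq].
    stacked = np.stack(attentions, axis=0).astype(np.float32)
    # Mean over the layer and head axes leaves [batch, seq, seq].
    averaged = stacked.mean(axis=(0, 2))
    mask = np.asarray(attention_mask, dtype=np.float32)
    # Keep entry (i, j) only if both query i and key j are real tokens.
    pair_mask = mask[:, :, None] * mask[:, None, :]
    return averaged * pair_mask
```

With the snippet above, this could be called as mean_attention(outputs[1:], inputs["attention_mask"]), assuming every output after the first is an attention tensor.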

Attribution & Original Work

Original Model: DeBERTa-v3-small by Microsoft.
Exported by: naranor using the WAMP Universal Exporter.

License

This model is licensed under the MIT License. You are free to use, modify, and distribute this model for any purpose, including commercial applications, provided you include the original license notice.

License details: MIT License
