CyberSentinel-9B-bnb-4bit
Pre-quantized bitsandbytes 4-bit build (NF4, double quantization, bf16 compute) of lkjiop8/CyberSentinel-9B.
Download size is ~5 GB; runtime VRAM is ~6-8 GB at 8K-16K context, so it fits comfortably on a 12 GB GPU.
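As a rough sanity check on the ~5 GB figure, the sketch below estimates the size of NF4 weights with double quantization. It assumes about 9e9 parameters (inferred from the model name, not stated above), a block size of 64 for the 4-bit weights, and 8-bit scales with one fp32 constant per 256 scales, which is how double quantization is usually described. The function names are illustrative only.

```python
# Back-of-envelope size estimate for NF4 weights with double quantization.
# Assumptions (not from the model card): ~9e9 parameters, every weight stored
# in 4-bit. Real checkpoints keep some layers (e.g. embeddings, norms) in
# higher precision, so the actual download is somewhat larger.


def nf4_bits_per_param(block_size: int = 64, dq_block_size: int = 256) -> float:
    """Effective bits per parameter for NF4 with double quantization."""
    weight_bits = 4.0
    # One 8-bit quantized scale per block of `block_size` weights.
    first_level = 8.0 / block_size
    # One fp32 constant per `dq_block_size` first-level scales.
    second_level = 32.0 / (block_size * dq_block_size)
    return weight_bits + first_level + second_level


def quantized_size_gb(n_params: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * nf4_bits_per_param() / 8 / 1e9


if __name__ == "__main__":
    print(f"{nf4_bits_per_param():.3f} bits per parameter")
    print(f"{quantized_size_gb(9e9):.2f} GB for 9e9 parameters")
```

Under these assumptions the quantized weights alone come to a little under 5 GB, with the unquantized layers accounting for the remainder.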
Install
pip install -U torch --index-url https://download.pytorch.org/whl/cu121
pip install -U "transformers>=4.46" accelerate "bitsandbytes>=0.43" sentencepiece
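To confirm the environment matches the minimum versions in the pip command above, a small check along these lines may help. It is only a sketch: `MINIMUMS`, `version_tuple`, `meets_minimum`, and `check_install` are hypothetical names, and the version parsing is deliberately simple (leading integers only), not a full PEP 440 comparison.

```python
import re
from importlib import metadata

# Minimum versions copied from the pip command above.
MINIMUMS = {"transformers": "4.46", "bitsandbytes": "0.43"}


def version_tuple(version: str) -> tuple:
    """Turn '4.46.1' or '2.5.1+cu121' into a tuple of leading integers."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at suffixes such as 'dev0'
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str) -> bool:
    """Compare two versions after padding them to the same length."""
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_install() -> dict:
    """Report 'ok', 'missing', or 'too old (<version>)' for each package."""
    report = {}
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = "missing"
            continue
        ok = meets_minimum(installed, minimum)
        report[name] = "ok" if ok else f"too old ({installed})"
    return report


if __name__ == "__main__":
    for package, status in check_install().items():
        print(f"{package}: {status}")
```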
Load (already 4-bit, no runtime quantization)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

REPO = "lkjiop8/CyberSentinel-9B-bnb-4bit"

# The quantization config is stored in the repo, so no BitsAndBytesConfig is needed here.
tok = AutoTokenizer.from_pretrained(REPO, trust_remote_code=True)
mdl = AutoModelForCausalLM.from_pretrained(
    REPO, device_map="auto", trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
mdl.eval()

msgs = [
    {"role": "system", "content": "You are a red-team security assistant."},
    {"role": "user", "content": "Found suspected SQLi on 10.10.50.23, give full plan."},
]
# return_dict=True also returns the attention mask, which generate() expects.
inputs = tok.apply_chat_template(
    msgs, add_generation_prompt=True, return_tensors="pt", return_dict=True
).to(mdl.device)
out = mdl.generate(**inputs, max_new_tokens=2048, temperature=0.6, top_p=0.95, top_k=20, repetition_penalty=1.05, do_sample=True)
# Decode only the newly generated tokens, not the prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
Recommended sampling
- temperature=0.6, top_p=0.95, top_k=20, repetition_penalty=1.05
- max_new_tokens 2048-4096
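One way to keep these settings in a single place, and to stream long outputs instead of waiting for the full reply, is sketched below. `SAMPLING` and `stream_reply` are hypothetical names introduced for illustration; `TextStreamer` is the streaming helper from transformers, imported lazily so the settings dict can be reused without loading the library.

```python
# Sampling settings from the list above, collected in one place.
SAMPLING = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.95,
    "top_k": 20,
    "repetition_penalty": 1.05,
}


def stream_reply(model, tokenizer, messages, max_new_tokens=2048):
    """Generate a reply and print tokens to stdout as they are produced."""
    # Imported here so SAMPLING stays usable without transformers installed.
    from transformers import TextStreamer

    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True,
        return_tensors="pt", return_dict=True,
    ).to(model.device)
    # skip_prompt=True prints only the newly generated text.
    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    return model.generate(
        **inputs, streamer=streamer, max_new_tokens=max_new_tokens, **SAMPLING
    )
```

With the model and tokenizer loaded as in the Load section, a call would look like `stream_reply(mdl, tok, msgs, max_new_tokens=4096)`.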
VRAM on a 12 GB GPU
| Context (tokens) | Total VRAM |
|---|---|
| 8192 | ~6.8 GB |
| 16384 | ~8.2 GB |
| 32768 | ~10.7 GB |
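For context lengths between the measured rows, a linear interpolation gives a rough estimate. This is a sketch only: it assumes VRAM grows roughly linearly with context between the measured points, which the table suggests but does not guarantee, and `estimate_vram_gb` is a hypothetical helper name.

```python
# Measured points from the table above: (context tokens, total VRAM in GB).
MEASURED = [(8192, 6.8), (16384, 8.2), (32768, 10.7)]


def estimate_vram_gb(context_tokens: int) -> float:
    """Linearly interpolate total VRAM between the measured context sizes."""
    points = sorted(MEASURED)
    if not points[0][0] <= context_tokens <= points[-1][0]:
        raise ValueError("context is outside the measured range")
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= context_tokens <= x1:
            return y0 + (y1 - y0) * (context_tokens - x0) / (x1 - x0)
    raise AssertionError("unreachable: range was checked above")


if __name__ == "__main__":
    for ctx in (8192, 12288, 24576):
        print(f"{ctx} tokens: ~{estimate_vram_gb(ctx):.2f} GB")
```

The helper refuses to extrapolate outside 8192 to 32768 tokens, since nothing above indicates how memory behaves beyond the measured range.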
Note
Based on the Qwen3-Next hybrid linear-attention architecture, which llama.cpp and Ollama do not support. For deployment, use this 4-bit HF checkpoint or vLLM.