VectraYX-Nano v14 (Experimental)

⚠️ Experimental release. v14 is the first nano checkpoint that emits tool-call syntax at a non-trivial rate (B4 = 0.16, against a floor of 0.000 for v2/v4/v6/v10). It is trained on top of v10's Chinchilla-optimal pretrain (~894 M processed tokens) with an SFT mixture rebalanced toward a denser curated tool corpus. The B5 conversational gate holds at 0.70 (v10 scored 0.80), and B1 (CVE keyword recall) recovers to 0.337. For production use, prefer the v7 headline release at jsantillana/vectrayx-nano.

VectraYX-Nano v14

A 42M-parameter Spanish-first language model for cybersecurity, optimized for Latin America, with native tool-call output.

Params 41.95 M
Architecture Decoder-only Transformer · 8 layers · 8 heads (2 KV) · RoPE · SwiGLU · QK-Norm · tied embeddings
Context 1,024 tokens
Tokenizer SentencePiece BPE 16,384 vocab (special tokens for chat + cyber: <|user|>, <|assistant|>, <|cve|>, <|tool_call|>, <|/tool_call|>, etc.)
Languages Spanish (primary), Portuguese, English (technical terms)
Pretrain tokens ~894 M processed tokens (≈ 21 tok/param, Chinchilla-optimal); inherits v10 pretrain
SFT v14 recipe: 6 epochs over the curated tool_sft_mini_v1.jsonl (2,801 ex) + sft_conversational.jsonl + oasst1_es.jsonl. Excludes the uncurated tooluse_dataset.jsonl (v1–v6 corpus) which had diluted v13. Tool-exposure-per-example ≈ 1.53× (vs v13's 0.38×).
Hardware 1× NVIDIA A10G (SageMaker ml.g5.xlarge) · BF16 · ~30 min SFT-only on top of v10 phase-3 ckpt
License Apache 2.0

Benchmarks

The B1–B5 evaluation suite tests Spanish cybersecurity knowledge and chat register at the nano scale (results are in bench_v14.json in this repo).

Benchmark | v14 | v10 (previous experimental) | v2 paper headline (N=4) | Notes
B1 CVE Q&A (keyword) | 0.337 | 0.307 | 0.226 ± 0.065 | Best nano result on B1
B2 Classification (f1_macro) | 0.205 | 0.200 | 0.196 ± 0.014 | Capacity-bound at 42 M
B3 Commands (tool_match) | 0.029 | 0.000 | 0.029 ± 0.000 | Recovered to v2 baseline
B4 Tool-use | 0.160 | 0.000 | 0.230 ± 0.052 (v7) | First nano > 0 without LoRA; v7 with 4-seed mean reaches 0.23
B5 Conversational gate | 0.700 | 0.800 | 0.775 ± 0.043 | Slight regression vs v10 (SFT mix favored tools)

Single-seed (seed=42). For multi-seed B1–B5 with confidence intervals see the paper §8 Tables 7–8.

Quick start (HuggingFace transformers)

from transformers import AutoModelForCausalLM
import sentencepiece as spm
import torch

# Load model (custom_code; requires trust_remote_code)
model = AutoModelForCausalLM.from_pretrained(
    "jsantillana/vectrayx-nano-experimental",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
).eval()

# Tokenizer is SentencePiece (no HF tokenizer wrapper yet)
sp = spm.SentencePieceProcessor()
sp.load("tokenizer.model")  # download alongside the repo

# Chat format expected by the model
prompt = "<|user|>¿Qué es un ataque de phishing?<|end|><|assistant|>"
ids = torch.tensor([sp.encode(prompt)])
out = model.generate_simple(ids, max_new_tokens=200, temperature=0.7, top_k=40)
print(sp.decode(out[0].tolist()))
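
The chat format used above can be wrapped in a small helper so prompts are assembled the same way every time. This is a minimal sketch: `build_prompt` is a hypothetical name, not part of the released code, and the token layout is inferred from the example prompt strings in this card.

```python
from typing import Optional

# Special tokens as they appear in this card's example prompts.
SYSTEM, USER, ASSISTANT, END = "<|system|>", "<|user|>", "<|assistant|>", "<|end|>"


def build_prompt(user_msg: str, system_msg: Optional[str] = None) -> str:
    """Assemble a single-turn prompt that ends with an open assistant turn."""
    parts = []
    if system_msg is not None:
        parts.append(f"{SYSTEM}{system_msg}{END}")
    parts.append(f"{USER}{user_msg}{END}")
    parts.append(ASSISTANT)  # left open so the model continues from here
    return "".join(parts)


prompt = build_prompt("¿Qué es un ataque de phishing?")
```

The resulting string can be passed to `sp.encode` exactly as in the quick start. Multi-turn layouts are not covered here because the card only shows single-turn prompts.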

Tool-call output format

v14 emits structured tool calls when the system prompt advertises tools. The wire format is:

<|tool_call|>{"name": "<tool_name>", "arguments": {<args>}}<|/tool_call|>

Example prompt:

SYSTEM = """Eres VectraYX-Nano. Tienes acceso a estas herramientas:
[
  {"name": "search_cve", "description": "Look up a CVE by ID", "parameters": {"cve_id": "string"}},
  {"name": "nmap_scan", "description": "Run nmap against a target", "parameters": {"target": "string", "ports": "string"}}
]
Cuando necesites una herramienta emite <|tool_call|>{...}<|/tool_call|>."""

prompt = f"<|system|>{SYSTEM}<|end|><|user|>Busca el CVE-2024-1234<|end|><|assistant|>"

Empirical B4 score: 0.16 — the model emits the bracketed format reliably, though argument selection is approximate at 42 M params (better at larger scales; see the Pro 3B / Analyst 7B paper rows).
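
To consume the wire format on the client side, the envelope can be located with a regular expression and its payload parsed as JSON. The sketch below is one possible approach, not the project's own parser. `extract_tool_calls` is a hypothetical helper, and it assumes the decoded text contains the `<|tool_call|>` and `<|/tool_call|>` tokens verbatim.

```python
import json
import re

# Envelope tokens from the wire format described in this section.
TOOL_CALL_RE = re.compile(r"<\|tool_call\|>(.*?)<\|/tool_call\|>", re.DOTALL)


def extract_tool_calls(text: str) -> list:
    """Return every well-formed tool call found in generated text."""
    calls = []
    for payload in TOOL_CALL_RE.findall(text):
        try:
            obj = json.loads(payload)
        except json.JSONDecodeError:
            continue  # malformed JSON inside the envelope: skip it
        # Keep only objects with the two expected fields.
        if (isinstance(obj, dict)
                and isinstance(obj.get("name"), str)
                and isinstance(obj.get("arguments"), dict)):
            calls.append(obj)
    return calls


sample = ('Voy a buscarlo. <|tool_call|>{"name": "search_cve", '
          '"arguments": {"cve_id": "CVE-2024-1234"}}<|/tool_call|>')
calls = extract_tool_calls(sample)
```

Silently skipping malformed payloads is a design choice suited to a model whose arguments are approximate. A stricter client might log or surface those failures instead.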

Quick start (Ollama / llama.cpp)

⚠️ GGUF / Ollama support is currently broken. VectraYX-Nano uses QK-Norm (per-head-dim RMSNorm applied before RoPE) which matches the Qwen3 architecture on paper, but llama.cpp's Qwen3 implementation has subtle differences (likely in build_qkv tensor layout or attention scale) that produce garbage output when loading our GGUF. Switching to arch=llama drops QK-Norm and degrades output to "mostly coherent then diverges". A clean fix requires either:

  1. Adding a vectrayx arch to llama.cpp upstream (~6–10 h C++ work + PR review), or
  2. Re-training v14 without QK-Norm so the model becomes natively arch=llama compatible.

Both options are tracked but out of scope for this experimental release. For now, use the HuggingFace transformers path above; PyTorch inference works correctly. Track the issue here if you want an update.
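
For readers unfamiliar with QK-Norm, the ordering described above (RMSNorm over each head's dimension, then RoPE) can be illustrated with a toy pure-Python sketch. This is not the model's implementation. The learned RMSNorm gain is omitted, and `eps`, the RoPE `base`, and the adjacent-pair rotation convention are assumptions. It also does not reproduce or diagnose the llama.cpp mismatch.

```python
import math


def rms_norm(vec, eps=1e-6):
    """Scale a vector to unit root-mean-square (learned gain omitted)."""
    mean_sq = sum(x * x for x in vec) / len(vec)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [x * inv_rms for x in vec]


def rope(vec, pos, base=10000.0):
    """Rotate adjacent pairs (x[2i], x[2i+1]) by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)  # pair index i/2 gives exponent -2(i/2)/d
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out


def qk_norm_then_rope(head_vec, pos):
    """Normalize one query or key head, then apply the rotary embedding."""
    return rope(rms_norm(head_vec), pos)


q_head = [0.5, -2.0, 3.0, 1.5]  # one query head, toy head_dim = 4
q_out = qk_norm_then_rope(q_head, pos=7)
```

Because a rotation preserves vector length, the unit-RMS scale set by the normalization survives RoPE. Dropping QK-Norm (as with arch=llama) therefore changes the magnitude of every attention logit.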

Intended use

  • Designed for: defensive security education, cyber-incident triage assistance, CVE summarization in Spanish, FAQ for SOC analysts in LATAM, embedded chat in DevSecOps tooling, tool-call dispatch in MCP-aware agents.
  • Out of scope: factual Q&A about events post-2024, code generation beyond shell snippets, long-context reasoning (>1 k tokens), English chat.

Known limitations

  • Tool-call arguments are approximate. v14's B4=0.16 means the model emits the <|tool_call|>...<|/tool_call|> envelope correctly, but it can hallucinate argument values or pick the wrong tool name. Treat outputs as suggestions, not authoritative dispatch. Validate against your tool registry before execution.
  • Capacity-bound at 42 M params. B2 classification stays at the harness floor (0.20). For higher-fidelity tool use see the larger-tier checkpoints in the paper (Base 260M, Pro 3B, Analyst 7B).
  • No safety RLHF — the model can be steered to produce harmful security-related content. Run behind a safety filter for production.
  • Hallucinates LATAM institutional facts (DIVINDAT founding date, INDECOPI regulations, ANPD/LGPD article numbers, etc.). A LATAM-specific corpus was experimented with in v16 (full SFT — showed catastrophic forgetting) and v17 (LoRA — showed insufficient knowledge internalization at 3 K examples); neither is released. Robust LATAM factuality requires either a substantially larger LATAM corpus or training a larger base model with LATAM in pretrain (Base 260M v2 work in progress).
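
The first limitation above recommends validating tool calls against a registry before execution. The sketch below shows one way to do that. `TOOL_REGISTRY` and `validate_tool_call` are hypothetical names, the registry mirrors the two tools from the example system prompt in this card, and every parameter is assumed to be required.

```python
# Hypothetical registry: tool name -> {argument name: expected Python type}.
TOOL_REGISTRY = {
    "search_cve": {"cve_id": str},
    "nmap_scan": {"target": str, "ports": str},
}


def validate_tool_call(call: dict, registry: dict = TOOL_REGISTRY) -> list:
    """Return a list of problems; an empty list means the call may be dispatched."""
    name = call.get("name")
    if not isinstance(name, str) or name not in registry:
        return [f"unknown tool: {name!r}"]
    args = call.get("arguments")
    if not isinstance(args, dict):
        return ["arguments must be a JSON object"]
    schema = registry[name]
    problems = []
    for key in sorted(schema.keys() - args.keys()):
        problems.append(f"missing argument: {key}")
    for key in sorted(args.keys() - schema.keys()):
        problems.append(f"unexpected argument: {key}")
    for key in sorted(schema.keys() & args.keys()):
        if not isinstance(args[key], schema[key]):
            problems.append(f"wrong type for {key}")
    return problems


ok = validate_tool_call(
    {"name": "search_cve", "arguments": {"cve_id": "CVE-2024-1234"}}
)
```

A check like this only confirms that a call is well-formed. Whether an argument value such as a scan target is safe to act on remains a separate policy decision.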

Training recipe

v14 = v10 pretrain checkpoint + clean SFT with curated tool corpus.

Stage | Mix | Source | Purpose
v10 P1 | 100 % OpenSubtitles-ES | Helsinki-NLP/open_subtitles | Spanish chat register
v10 P2 | corpus_nano tech (NVD, Wiki-cyber, blogs, papers, malware, exploits) | corpus_nano.tar.gz | Cybersecurity domain
v10 P3 | glaive_fc_v2 + code_alpaca_bash + codefeedback_bash + exploitdb + github_repos | HuggingFace + corpus_nano | Function-calling + bash
v14 SFT | sft_conversational + oasst1_es + tool_sft_mini_v1 (curated, 2,801 ex) | local + curated | Tool-format + conv (6 ep)

Pretrain budget: ~894 M processed tokens (≈ 21 tok/param @ 42 M = Chinchilla-optimal). v14 SFT runs ~30 min on top of v10's P3 checkpoint.
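
The tokens-per-parameter figure can be checked with a few lines of arithmetic. The token and parameter counts below come from this card. The 20 tokens/param reference is the commonly quoted Chinchilla rule of thumb, included here as an assumption and not as a number taken from the paper.

```python
PRETRAIN_TOKENS = 894e6   # ~894 M processed tokens (from this card)
PARAMS = 41.95e6          # 41.95 M parameters (from this card)

tokens_per_param = PRETRAIN_TOKENS / PARAMS
print(f"{tokens_per_param:.1f} tokens per parameter")

# Assumed reference point: the widely cited ~20 tokens/param heuristic.
CHINCHILLA_RULE_OF_THUMB = 20
ratio = tokens_per_param / CHINCHILLA_RULE_OF_THUMB
print(f"{ratio:.2f}x the rule-of-thumb budget")
```

The ratio works out to roughly 21.3 tokens per parameter, consistent with the ≈ 21 tok/param stated above.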

Citation

@misc{santillana2026vectrayx,
  author = {Santillana, Juan},
  title = {VectraYX-Nano: a 42M-parameter Spanish-first cybersecurity language model with native tool use},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/jsantillana/vectrayx-nano-experimental},
}

Authors

Juan Santillana — DevOps engineer at Globant.

See also

  • Paper (in preparation): VectraYX paper with full ablations, corpus details, Chinchilla analysis, B1–B5 multi-seed results.
  • Headline release: jsantillana/vectrayx-nano — v2/v4/v5/v6/v7 multi-seed checkpoints + LoRA adapters.
  • Code: github.com/vectrayx/vectrayx-paper (training scripts, eval suite, prep pipeline).