stratabert-tiny-smoke

Model Summary

This is a StrataBERT diagnostic checkpoint from run_001. Claim status: diagnostic_only. It is not a release-quality checkpoint and must not be used for public quality or efficiency claims.

Architecture

tokens -> embeddings -> [global attention / bidirectional SSM / local attention]* -> mask-aware pooling -> task head

Architecture class: StrataBertForSequenceClassification. Layer types: ['global_attention', 'ssm', 'local_attention']. Hidden size: 48. Max positions: 128.
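The mask-aware pooling step in the pipeline above can be illustrated with a small sketch. The card does not say which pooling operator StrataBERT uses, so this example assumes a masked mean. `masked_mean_pool` is a hypothetical helper, not a function from the repository. It uses plain Python lists so it runs without torch; a real implementation would operate on batched tensors.

```python
from typing import List


def masked_mean_pool(hidden: List[List[float]], mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1 (hypothetical helper).

    Assumption: pooling is a masked mean. Padding positions (mask == 0)
    contribute nothing, so the result does not depend on padding length.
    """
    if len(hidden) != len(mask):
        raise ValueError("hidden and mask must have the same length")
    kept = [vec for vec, m in zip(hidden, mask) if m]
    if not kept:
        # All positions are padding: return a zero vector of the right width.
        return [0.0] * (len(hidden[0]) if hidden else 0)
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Two real tokens followed by one padding token with arbitrary values.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
print(masked_mean_pool(tokens, attention_mask))
```

Because the padded position is excluded before averaging, appending more padding leaves the pooled vector unchanged.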

Parameter Count

Total parameters: 2,498,404 (about 2.5M).
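A rough breakdown helps put this total in context. The sketch below assumes a single untied token-embedding table of shape (vocab size, hidden size), using the tokenizer vocabulary size listed under Reproducibility. The card does not confirm this layout, so the split is an estimate rather than a measured figure.

```python
VOCAB_SIZE = 50368       # tokenizer vocab size reported under Reproducibility
HIDDEN_SIZE = 48         # hidden size reported under Architecture
TOTAL_PARAMS = 2498404   # total reported in this section

# Assumption: one untied token-embedding table of shape (vocab, hidden).
embedding_params = VOCAB_SIZE * HIDDEN_SIZE
remaining_params = TOTAL_PARAMS - embedding_params
embedding_share = embedding_params / TOTAL_PARAMS

print(f"embedding parameters: {embedding_params:,}")
print(f"all other parameters: {remaining_params:,}")
print(f"embedding share of total: {embedding_share:.1%}")
```

Under that assumption, nearly all parameters sit in the embedding table, and the mixing layers, pooling, and task head together account for only a small remainder.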

Training Data

Data artifacts:

  • train_index: data/eval_frozen/run_001/ag_news_train_index_sample64.json
  • eval_index: data/eval_frozen/run_001/ag_news_eval_index_sample200.json

Raw text is not embedded in this card or the frozen eval indices.

Objective Mix

  • task: 1.0

Teacher Models

No teacher model is used for this checkpoint.

Licenses

Project code license: MIT. Dataset audit summary:

  • ag_news_v001: restricted_noncommercial_unclear; No standard permissive license is declared.
  • arxiv_classification_v001: needs_review_full_text_rights; Selected HF repo does not declare a data license.
  • bc5cdr_v001: needs_review_bc5cdr_tner_mirror; No source-license research entry is present; manifest note: Canonical bigbio/bc5cdr script is disabled by current datasets versions; executable manifest uses TNER BC5CDR converted parquet.
  • conll2003_v001: restricted_avoid_publication_claims; Highest-risk MVP dataset because the source text is Reuters copyrighted newswire.
  • eurlex57k_v001: needs_review_lexglue_eurlex; No source-license research entry is present; manifest note: HF datasets metadata inspected with datasets.load_dataset_builder('coastalcph/lex_glue', 'eurlex') on 2026-06-10.
  • hyperpartisan_news_v001: needs_review_hyperpartisan_mirror; No source-license research entry is present; manifest note: HF parquet metadata inspected on 2026-06-10 via jonathanli/hyperpartisan-longformer-split.
  • imdb_v001: restricted_noncommercial_unclear; HF license tag is other rather than a permissive license.
  • openpii_1m_v001: approved_cc_by_4_0_attribution_required; No source-license research entry is present; manifest note: HF datasets metadata inspected with datasets.load_dataset_builder('ai4privacy/pii-masking-openpii-1m', 'default') on 2026-06-10.
  • patent_classification_v001: needs_review_mirror_license; The selected ccdv sample repo does not declare its own license.
  • pubmed_200k_rct_v001: needs_review_pubmed_rct_mirror; No source-license research entry is present; manifest note: HF parquet metadata inspected on 2026-06-10.
  • scicite_v001: needs_review_allenai_scicite; No source-license research entry is present; manifest note: Legacy dataset script is disabled by current datasets versions; executable manifest uses HF converted parquet files.
  • twenty_newsgroups_v001: needs_review_dataset_card_blank; No source-license research entry is present; manifest note: HF parquet metadata inspected on 2026-06-10 via refs/convert/parquet.
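The audit statuses above can be grouped mechanically when deciding which datasets block a public claim. The sketch below copies the status strings from the list. Grouping by the leading `approved`, `needs_review`, or `restricted` prefix is an assumption made for illustration, and `group_by_verdict` is a hypothetical helper, not part of the project code.

```python
from collections import defaultdict
from typing import Dict, List

# Status strings copied from the dataset audit summary above.
AUDIT_STATUS = {
    "ag_news_v001": "restricted_noncommercial_unclear",
    "arxiv_classification_v001": "needs_review_full_text_rights",
    "bc5cdr_v001": "needs_review_bc5cdr_tner_mirror",
    "conll2003_v001": "restricted_avoid_publication_claims",
    "eurlex57k_v001": "needs_review_lexglue_eurlex",
    "hyperpartisan_news_v001": "needs_review_hyperpartisan_mirror",
    "imdb_v001": "restricted_noncommercial_unclear",
    "openpii_1m_v001": "approved_cc_by_4_0_attribution_required",
    "patent_classification_v001": "needs_review_mirror_license",
    "pubmed_200k_rct_v001": "needs_review_pubmed_rct_mirror",
    "scicite_v001": "needs_review_allenai_scicite",
    "twenty_newsgroups_v001": "needs_review_dataset_card_blank",
}

VERDICT_PREFIXES = ("approved", "needs_review", "restricted")


def group_by_verdict(statuses: Dict[str, str]) -> Dict[str, List[str]]:
    """Group dataset ids by the leading verdict prefix of their status."""
    groups = defaultdict(list)
    for name, status in statuses.items():
        # Assumption: the verdict is encoded as the status-string prefix.
        verdict = next((p for p in VERDICT_PREFIXES if status.startswith(p)), "unknown")
        groups[verdict].append(name)
    return dict(groups)


groups = group_by_verdict(AUDIT_STATUS)
for verdict, names in sorted(groups.items()):
    print(f"{verdict}: {len(names)} dataset(s)")
```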

Intended Uses

  • Local smoke testing of StrataBERT checkpoint loading, evaluation scripts, and metadata plumbing.
  • Reproducibility checks for run_001 diagnostic artifacts.

Out-of-Scope Uses

  • Public benchmark claims.
  • Production classification or token-classification deployment.
  • Commercial reuse of dataset-derived behavior without legal review of the relevant datasets.

Evaluation

metric       value
accuracy     0.26
macro_f1     0.10317460317460318
weighted_f1  0.10730158730158731
loss         1.3858718490600586

Evaluation artifact: checkpoints/run_001/tiny_ag_news_smoke.
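A short sanity check shows how close these metrics are to an untrained baseline. AG News has four classes, so a uniform prediction has cross-entropy ln 4. The sketch also tests one hypothesis: that the model predicts a single class for every example, and that classes with no predictions score an F1 of 0. This is an interpretation of the reported numbers, not something the card states.

```python
import math

NUM_CLASSES = 4   # AG News has four topic classes
ACCURACY = 0.26   # reported accuracy
REPORTED_LOSS = 1.3858718490600586

# Cross-entropy of a uniform prediction over four classes.
chance_loss = math.log(NUM_CLASSES)

# Hypothesis: every example is assigned the same class.
# Precision for that class then equals accuracy, and its recall is 1.
precision = ACCURACY
recall = 1.0
f1_predicted = 2 * precision * recall / (precision + recall)

# The other three classes are assumed to score F1 = 0.
macro_f1 = f1_predicted / NUM_CLASSES
# The predicted class holds a 0.26 share of the examples, which is its weight.
weighted_f1 = ACCURACY * f1_predicted

print(f"uniform-prediction loss: {chance_loss:.6f} (reported {REPORTED_LOSS:.6f})")
print(f"single-class macro F1: {macro_f1:.6f}")
print(f"single-class weighted F1: {weighted_f1:.6f}")
```

Under this hypothesis the derived macro and weighted F1 agree with the reported values, which is consistent with a checkpoint that has not yet learned to separate the classes.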

Length-Bucketed Results

bucket  support  accuracy
0_512   200      0.26

Latency and Memory

item                         value
device                       cpu
batch size                   2
sequence length              128
p50 latency (ms)             10.763351499917917
p95 latency (ms)             12.447670099209063
latency 95% CI (ms)          0.6102587742635365
examples/sec                 180.17026675821398
tokens/sec                   23061.79414505139
OOM status                   not_oom
max batch under memory cap   2

Memory measurements are not release-grade in this diagnostic card unless explicitly listed above.
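The throughput figures can be cross-checked against each other. The sketch below assumes that tokens/sec is examples/sec multiplied by the sequence length, and that examples/sec is the batch size divided by the mean per-batch latency. The card does not document how these values were computed, so both relations are assumptions.

```python
BATCH_SIZE = 2
SEQ_LEN = 128
EXAMPLES_PER_SEC = 180.17026675821398
TOKENS_PER_SEC = 23061.79414505139
P50_MS = 10.763351499917917
P95_MS = 12.447670099209063

# Assumption: every example is padded or truncated to SEQ_LEN tokens.
derived_tokens_per_sec = EXAMPLES_PER_SEC * SEQ_LEN

# Assumption: examples/sec = batch size / mean per-batch latency.
implied_mean_latency_ms = BATCH_SIZE / EXAMPLES_PER_SEC * 1000.0

print(f"derived tokens/sec: {derived_tokens_per_sec:.2f}")
print(f"implied mean batch latency: {implied_mean_latency_ms:.2f} ms")
```

The derived tokens/sec agrees with the reported value, and the implied mean latency falls between the reported p50 and p95, so the table is internally consistent under these assumptions.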

Hardware and Software

  • Training/eval torch: 2.12.0+cu130
  • CUDA available during checkpoint creation: False
  • Latency environment: {'cuda': '13.0', 'cuda_available': False, 'platform': 'Linux-6.14.0-37-generic-x86_64-with-glibc2.41', 'python': '3.12.13', 'torch': '2.12.0+cu130'}
  • Vast AI: None

Known Limitations

  • Random or tiny diagnostic training only; no release-quality pretraining.
  • Mandatory ModernBERT, Ettin, DeBERTa-v3, Longformer, BigBird, and embedding baselines are still pending.
  • Long-context 2k/4k/8k claims are unsupported by this card.
  • Dataset license caveats remain unresolved for public claims.

Ethical and Privacy Considerations

This checkpoint is diagnostic and should not be deployed. Dataset provenance and privacy review are incomplete for release use, and token-classification public claims require a publication-safe dataset replacement or legal approval.

Reproducibility

  • Training command: scripts/finetune_classification.py --train-index data/eval_frozen/run_001/ag_news_train_index_sample64.json --train-split train --eval-index data/eval_frozen/run_001/ag_news_eval_index_sample200.json --eval-split test --max-train-examples 32 --max-eval-examples 64 --batch-size 8 --epochs 1 --max-length 96 --lr 5e-4 --seed 1337 --output runs/run_001/eval_reports/stratabert_tiny_ag_news_finetune_smoke.json --checkpoint-dir checkpoints/run_001/tiny_ag_news_smoke
  • Tokenizer: {'source': 'answerdotai/ModernBERT-base', 'vocab_size': 50368}
  • Seed: 1337
  • Checkpoint path: checkpoints/run_001/tiny_ag_news_smoke/model.safetensors
  • Evaluation reports: data/eval_frozen/run_001/ag_news_eval_index_sample200.json
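The flags in the training command above imply a very short run. The sketch below parses a subset of the flags, copied verbatim, and estimates the number of optimizer steps. It assumes one optimizer step per batch, no gradient accumulation, and that a final partial batch is kept. `parse_flags` is a hypothetical helper, not part of scripts/finetune_classification.py.

```python
import math
import shlex
from typing import Dict

# Subset of the flags from the training command above, copied verbatim.
TRAIN_ARGS = (
    "--max-train-examples 32 --max-eval-examples 64 --batch-size 8 "
    "--epochs 1 --max-length 96 --lr 5e-4 --seed 1337"
)


def parse_flags(arg_string: str) -> Dict[str, str]:
    """Parse '--name value' pairs into a dict (hypothetical helper)."""
    tokens = shlex.split(arg_string)
    return {tokens[i].lstrip("-"): tokens[i + 1] for i in range(0, len(tokens), 2)}


flags = parse_flags(TRAIN_ARGS)
train_examples = int(flags["max-train-examples"])
batch_size = int(flags["batch-size"])
epochs = int(flags["epochs"])
max_length = int(flags["max-length"])

# Assumption: one optimizer step per batch, no gradient accumulation.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

# Upper bound on training tokens, reached only if every example fills max_length.
max_tokens_seen = train_examples * max_length * epochs

print(f"optimizer steps: {total_steps}")
print(f"training tokens (upper bound): {max_tokens_seen}")
```

A run of this size only exercises the training and checkpointing code paths, which matches the diagnostic_only claim status.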

Citation

Use CITATION.cff from this repository. Title: StrataBERT: A Padding-Safe SSM-Attention Encoder for Efficient Long-Document Classification.

Exact Git Commit

Commit: no_commit_yet. Dirty worktree at checkpoint creation: True.

Safetensors: model size 2.5M params, tensor type F32.