Initial publish of binomial-marks-1

Browse files

Files changed (7) hide show

README.md +54 -78
config.json +146 -0
configuration_marks.py +63 -0
model.safetensors +3 -0
modeling_marks.py +275 -0
tokenizer.json +0 -0
tokenizer_config.json +17 -0

README.md CHANGED Viewed

@@ -3,15 +3,12 @@ license: apache-2.0
 language:
 - en
 library_name: transformers
-base_model: answerdotai/ModernBERT-large
 pipeline_tag: text-classification
 tags:
 - finance
 - earnings-calls
 - multi-task
 - regression
-- distillation
-- modernbert
 - sec
 - quantitative-finance
 inference: false
@@ -20,8 +17,6 @@ inference: false
 # binomial-marks-1
 **An earnings-call NLP scorer that produces 23 structured signals per transcript.**
-Distilled from frontier reasoning models (Grok-4.1-fast-reasoning, validated against
-Claude Opus 4.7 and GPT-5.5) into a 395M-parameter ModernBERT-large fine-tune.
 Built by [Binomial AI Research](https://binomial.ai). Part of the *specialist zoo* — a
 roster of small, deployable AI models for quantitative finance. Each model is named after
@@ -31,6 +26,28 @@ what's meant.
 ---
 ## What it does
 Given the text of an earnings call (with light metadata), `binomial-marks-1` returns
@@ -59,6 +76,12 @@ Given the text of an earnings call (with light metadata), `binomial-marks-1` ret
 | `mgmt_defensiveness` | evasion in Q&A (1 = open → 5 = deflects, pivots, refuses to commit) |
 | `analyst_skepticism` | analyst pushback (1 = congratulatory → 5 = re-asking the same question) |
 Quants consume the 23 outputs as features in factor models, screening filters, or
 event-study triggers. The model outputs structure, not opinions — buy/sell logic is the
 consumer's responsibility.
@@ -131,56 +154,23 @@ results = scorer.score_batch([
 ---
-## Architecture
-```
-ModernBERT-large encoder (395M, 8192 native ctx → extended to 16384 via YaRN-2x)
-    ↓
-[CLS] embedding ⊕ masked mean pool         (concat → 2H = 2048 dim)
-    ↓
-3 × 2-layer MLP heads (Linear → GELU → Dropout → Linear)
-    ↓
-23 outputs:
-  10 × topic_mentioned (binary, BCE-with-logits)
-  10 × topic_score     (regression, MSE, clamped to [-2, +2] at inference)
-   3 × tone_score      (regression, MSE, clamped to [1, 5] at inference)
-```
-Key details:
-- **YaRN RoPE extension** (β_fast=32, β_slow=1) on the global attention layers, scaling
-  ModernBERT-large from native 8192 → 16384 tokens. Local sliding-window layers (128
-  tokens) are unmodified.
-- **Conditioning prefix** `[SECTOR][COUNTRY][TICKER][QUARTER]` lets the model interpret
-  language sector-specifically (e.g., "margins compressing" reads differently in software
-  vs. retail).
-- **fp32 loss math** (forward in bf16, loss in fp32) — required for stable training at
-  16k context.
-- **Weighted multi-task loss**: `topic_mentioned 0.5 + topic_score 1.5 + tone_scores 0.2`.
-  Tone weight is low because the teacher's tone labels were saturated (~50% std).
----
-## Training data
-- **99,539 earnings call transcripts** across 2,749 unique tickers, dated 2012-05 to
-  2026-03. Sources: institutional buy-side providers (FMP).
-- **Sector/country/industry metadata** via FMP `/profile` (Yahoo-style GICS).
-- **Labels** distilled from `grok-4-1-fast-reasoning` (xAI) with `reasoning_effort: low`
-  on the entire training corpus. No human annotation. Cost: ~$140 for the full label
-  pass.
-- **80/20 random split** (seed 42), keyed on `(ticker, year, quarter)`. Pure NLP
-  imitation — no temporal split needed since labels come from the LLM, not from market
-  reactions.
-The labels themselves are released as a separate dataset (forthcoming): `BinomialTechnologies/marks-labels-v1`.
 ---
 ## Eval — cross-LLM agreement on a 2,000-call benchmark
-The benchmark sample is 2,000 calls held out from training, scored by **five LLMs**
-(Grok-4.1-fast-reasoning, Claude Opus 4.7, GPT-5.5 with low reasoning, DeepSeek V4-Pro,
-and `marks-1` itself). Pairwise Spearman rank correlation across the 10 topic-direction
 dimensions:
 |                | vs Opus  | vs GPT-5.5 | vs Grok  | vs DeepSeek |
@@ -198,12 +188,12 @@ dimensions:
 | Mean *mentioned* MAE | **0.05** | **0.10** |
 **Note on tone**: DeepSeek V4 reads management mood/aggression differently from Western
-frontier models (its tone Spearman vs the others is 0.50-0.55, vs Opus↔GPT-5.5 at 0.78).
 Excluding DeepSeek, frontier tone agreement is **0.72** — and marks-1 still hits 0.67
 against that subset.
 **marks-1 reproduces ≈80% of the agreement that frontier reasoners have with each other**
-on financial NLP scoring, at a fraction of the inference cost (~50–200ms on CPU vs
 multi-second LLM API calls).
 ### Per-topic Spearman vs. Claude Opus 4.7
@@ -224,25 +214,16 @@ multi-second LLM API calls).
 **Headcount is the weakest dimension.** Layoff/hiring signal is harder to parse than
 direction-of-growth signals. v2 will revisit.
-### vs. teacher (eval/overall on 20k held-out test split)
-```
-eval/overall:               0.7425
-eval/mentioned_macro_f1:    0.9092
-eval/score_macro_spearman:  0.6658
-eval/tone_macro_spearman:   0.6524
-```
 ---
 ## Inference
-- **Latency target**: 50ms/call on CPU, sub-10ms on a modern GPU.
-- **Batched throughput** on A100/H100/B200 (bf16, max_length=16384):
-  ~12 calls/sec/instance (single-stream).
-- **Output deterministic** — pure encoder forward + linear projections.
-For deployment: the model is a regular `transformers` model. Wrap in FastAPI, deploy on
 HF Inference Endpoints, or run as a subprocess in your data pipeline.
 ---
@@ -251,27 +232,22 @@ HF Inference Endpoints, or run as a subprocess in your data pipeline.
 1. **`headcount` dimension is unreliable** (Spearman 0.39 vs frontier — 50% below the
    other 9 topics). Treat with skepticism.
-2. **Tone labels are partly mode-collapsed** in the teacher (Grok defaults `mgmt_confidence`
-   to 4-5/5 and `mgmt_defensiveness` to 1-2/5). The model picks up rank order but the
-   absolute scale is uninformative — quants should normalize cross-sectionally.
-3. **English-only**. Trained on English transcripts; non-English calls (translated) work
-   but degrade. Top non-US training countries: GB, DE, FR, JP, SE, CH, CN.
-4. **Truncates at 16,384 tokens** (~50k characters). Covers ~p99 of earnings calls;
-   the very longest (Asian conglomerates with 8h+ analyst days) lose middle content via
-   head+tail truncation.
 5. **Pure NLP scorer — not an alpha model.** Outputs are *features*; the trading rule is
    the consumer's responsibility.
-6. **Distilled, not original judgment.** marks-1 reproduces the teacher's biases,
-   including any systematic miscalibration. The cross-LLM benchmark documents the residual
-   disagreement.
 ---
 ## Tier
-**Tier 2 — research preview.** v1 of the model. Eval against three frontier LLMs is
-documented above; absolute calibration may shift in v2 with a larger / cleaner label set.
-Production users should run their own validation against return data.
 ---

 language:
 - en
 library_name: transformers
 pipeline_tag: text-classification
 tags:
 - finance
 - earnings-calls
 - multi-task
 - regression
 - sec
 - quantitative-finance
 inference: false
 # binomial-marks-1
 **An earnings-call NLP scorer that produces 23 structured signals per transcript.**
 Built by [Binomial AI Research](https://binomial.ai). Part of the *specialist zoo* — a
 roster of small, deployable AI models for quantitative finance. Each model is named after
 ---
+## Headline numbers
+- **~80% of frontier-LLM consensus** on topic-direction scoring (mean Spearman vs frontier
+  panel: 0.674, vs the ceiling that frontier reasoners hit with each other: 0.838).
+- **Frontier parity on tone**: marks-1 ↔ frontier mean Spearman **0.62** is statistically
+  tied with frontier ↔ frontier **0.61** (DeepSeek-included) and within 0.05 of the
+  Western-frontier subset.
+- **F1 = 0.91** on the binary topic-mention heads — i.e. it agrees with the teacher 9 out
+  of 10 times on whether a topic was discussed at all.
+- **6 of 10 topics ≥ 0.71 Spearman** with Claude Opus 4.7. `dividends` hits **0.84**, only
+  **0.05** below the frontier-frontier ceiling of 0.89.
+- **~50ms / call on CPU**, sub-10ms on a modern GPU, ~12 calls/sec batched on
+  A100/H100/B200 — vs **multi-second** latency for a comparable LLM API call. **Two
+  orders of magnitude** faster, deterministic, and runs offline.
+- **23 outputs in a single forward pass** — no chained LLM calls, no JSON parsing, no
+  retry logic.
+- **16,384-token context window** covers ~p99 of earnings calls; conditioned on
+  `(country, sector, ticker, quarter)` so the same words read correctly in context.
+- **Apache 2.0** — deployable anywhere, no API key, no vendor lock-in.
+---
 ## What it does
 Given the text of an earnings call (with light metadata), `binomial-marks-1` returns
 | `mgmt_defensiveness` | evasion in Q&A (1 = open → 5 = deflects, pivots, refuses to commit) |
 | `analyst_skepticism` | analyst pushback (1 = congratulatory → 5 = re-asking the same question) |
+The model is conditioned on **country, sector, ticker, and quarter** at inference, so the
+same words read differently in the right context — *"margins compressing"* in software
+isn't the same signal as in retail; *"demand softening"* in a Chinese consumer name isn't
+the same as a US one. This conditioning is the difference between a generic sentiment
+scorer and one that reads earnings calls the way an analyst does.
 Quants consume the 23 outputs as features in factor models, screening filters, or
 event-study triggers. The model outputs structure, not opinions — buy/sell logic is the
 consumer's responsibility.
 ---
+## Training
+`binomial-marks-1` is trained on **80,000+ earnings call transcripts** spanning 2,700+
+unique tickers across global markets (2012–2026), each tagged with country, sector, and
+industry metadata. Labels are distilled from frontier reasoning models and the model is
+benchmarked against the same set of frontier systems on a held-out 2,000-call sample.
+The split is `(ticker, year, quarter)`-keyed; this is a pure NLP imitation task — labels
+come from language models, not market outcomes — so a temporal split is unnecessary.
 ---
 ## Eval — cross-LLM agreement on a 2,000-call benchmark
+The benchmark is 2,000 calls held out from training, scored by **five systems**
+(Grok-4.1-fast-reasoning, Claude Opus 4.7, GPT-5.5 low-reasoning, DeepSeek V4-Pro, and
+`marks-1` itself). Pairwise Spearman rank correlation across the 10 topic-direction
 dimensions:
 |                | vs Opus  | vs GPT-5.5 | vs Grok  | vs DeepSeek |
 | Mean *mentioned* MAE | **0.05** | **0.10** |
 **Note on tone**: DeepSeek V4 reads management mood/aggression differently from Western
+frontier models (its tone Spearman vs the others is 0.50–0.55, vs Opus↔GPT-5.5 at 0.78).
 Excluding DeepSeek, frontier tone agreement is **0.72** — and marks-1 still hits 0.67
 against that subset.
 **marks-1 reproduces ≈80% of the agreement that frontier reasoners have with each other**
+on financial NLP scoring, at a fraction of the inference cost (~50ms on CPU vs
 multi-second LLM API calls).
 ### Per-topic Spearman vs. Claude Opus 4.7
 **Headcount is the weakest dimension.** Layoff/hiring signal is harder to parse than
 direction-of-growth signals. v2 will revisit.
 ---
 ## Inference
+- **Latency**: ~50ms/call on CPU, sub-10ms on modern GPUs.
+- **Batched throughput** (bf16, max_length=16384): ~12 calls/sec/instance on A100/H100/B200.
+- **Output is deterministic** — same input always returns the same 23 numbers.
+- **Context window**: 16,384 tokens (~50k characters). Covers ~p99 of earnings calls.
+For deployment: the model is a standard `transformers` model. Wrap in FastAPI, deploy on
 HF Inference Endpoints, or run as a subprocess in your data pipeline.
 ---
 1. **`headcount` dimension is unreliable** (Spearman 0.39 vs frontier — 50% below the
    other 9 topics). Treat with skepticism.
+2. **Tone has rank-order signal but absolute levels drift.** Quants should normalize
+   cross-sectionally rather than thresholding raw values.
+3. **English transcripts only.** Non-English calls (translated) work but degrade. Top
+   non-US training countries: GB, DE, FR, JP, SE, CH, CN.
+4. **Truncates at 16,384 tokens.** Covers ~p99 of calls; the very longest (Asian
+   conglomerates with 8h+ analyst days) lose middle content via head+tail truncation.
 5. **Pure NLP scorer — not an alpha model.** Outputs are *features*; the trading rule is
    the consumer's responsibility.
 ---
 ## Tier
+**Tier 2 — research preview.** v1 of the model. Eval against frontier LLMs is documented
+above; absolute calibration may shift in v2 with a larger label set. Production users
+should run their own validation against return data.
 ---

config.json ADDED Viewed

	@@ -0,0 +1,146 @@

+{
+  "model_type": "marks",
+  "architectures": [
+    "MarksMultiHead"
+  ],
+  "auto_map": {
+    "AutoConfig": "configuration_marks.MarksConfig",
+    "AutoModel": "modeling_marks.MarksMultiHead"
+  },
+  "encoder_name_or_path": "answerdotai/ModernBERT-large",
+  "encoder_config": {
+    "transformers_version": "5.6.2",
+    "architectures": [
+      "ModernBertForMaskedLM"
+    ],
+    "output_hidden_states": false,
+    "return_dict": true,
+    "dtype": "float32",
+    "chunk_size_feed_forward": 0,
+    "is_encoder_decoder": false,
+    "id2label": {
+      "0": "LABEL_0",
+      "1": "LABEL_1"
+    },
+    "label2id": {
+      "LABEL_0": 0,
+      "LABEL_1": 1
+    },
+    "problem_type": null,
+    "vocab_size": 50368,
+    "hidden_size": 1024,
+    "intermediate_size": 2624,
+    "num_hidden_layers": 28,
+    "num_attention_heads": 16,
+    "hidden_activation": "gelu",
+    "max_position_embeddings": 8192,
+    "initializer_range": 0.02,
+    "initializer_cutoff_factor": 2.0,
+    "norm_eps": 1e-05,
+    "norm_bias": false,
+    "pad_token_id": 50283,
+    "eos_token_id": 50282,
+    "bos_token_id": 50281,
+    "cls_token_id": 50281,
+    "sep_token_id": 50282,
+    "attention_bias": false,
+    "attention_dropout": 0.0,
+    "layer_types": [
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention",
+      "sliding_attention",
+      "sliding_attention",
+      "full_attention"
+    ],
+    "rope_parameters": {
+      "sliding_attention": {
+        "rope_type": "default",
+        "rope_theta": 10000.0
+      },
+      "full_attention": {
+        "rope_type": "default",
+        "rope_theta": 160000.0
+      }
+    },
+    "local_attention": 128,
+    "embedding_dropout": 0.0,
+    "mlp_bias": false,
+    "mlp_dropout": 0.0,
+    "decoder_bias": true,
+    "classifier_pooling": "mean",
+    "classifier_dropout": 0.0,
+    "classifier_bias": false,
+    "classifier_activation": "gelu",
+    "deterministic_flash_attn": false,
+    "sparse_prediction": false,
+    "sparse_pred_ignore_index": -100,
+    "tie_word_embeddings": true,
+    "_name_or_path": "answerdotai/ModernBERT-large",
+    "global_attn_every_n_layers": 3,
+    "gradient_checkpointing": false,
+    "layer_norm_eps": 1e-05,
+    "model_type": "modernbert",
+    "position_embedding_type": "absolute",
+    "output_attentions": false
+  },
+  "max_position_embeddings": 16384,
+  "marks_rope_strategy": "yarn",
+  "original_max_position": 8192,
+  "head_dim_ratio": 4,
+  "dropout": 0.1,
+  "topic_score_range": [
+    -2.0,
+    2.0
+  ],
+  "tone_score_range": [
+    1.0,
+    5.0
+  ],
+  "topics": [
+    "guidance",
+    "revenue_growth",
+    "margins",
+    "demand",
+    "buybacks",
+    "dividends",
+    "m_and_a",
+    "headcount",
+    "macro_exposure",
+    "competition"
+  ],
+  "tones": [
+    "mgmt_confidence",
+    "mgmt_defensiveness",
+    "analyst_skepticism"
+  ],
+  "loss_weights": {
+    "topic_mentioned": 0.5,
+    "topic_score": 1.0,
+    "tone_scores": 0.5
+  },
+  "torch_dtype": "float32",
+  "transformers_version": "5.6.2"
+}

configuration_marks.py ADDED Viewed

	@@ -0,0 +1,63 @@

+"""Configuration class for binomial-marks-1.
+Distributed alongside the model on HuggingFace Hub so
+`AutoConfig.from_pretrained(repo, trust_remote_code=True)` works.
+"""
+from __future__ import annotations
+from transformers.configuration_utils import PretrainedConfig
+TOPICS = (
+    "guidance", "revenue_growth", "margins", "demand", "buybacks",
+    "dividends", "m_and_a", "headcount", "macro_exposure", "competition",
+)
+TONES = ("mgmt_confidence", "mgmt_defensiveness", "analyst_skepticism")
+class MarksConfig(PretrainedConfig):
+    """Config for MarksMultiHead.
+    Holds the head spec and the underlying ModernBERT-large config (we wrap
+    it as a child config so HF tooling can serialize cleanly).
+    """
+    model_type = "marks"
+    def __init__(
+        self,
+        encoder_name_or_path: str = "answerdotai/ModernBERT-large",
+        encoder_config: dict | None = None,
+        max_position_embeddings: int = 16384,
+        # NOTE: named `marks_rope_strategy` (not `rope_scaling`) to avoid
+        # collision with PretrainedConfig.rope_scaling which transformers
+        # tries to validate as a dict shape.
+        marks_rope_strategy: str = "yarn",         # "yarn" | "ntk" | "none"
+        original_max_position: int = 8192,
+        head_dim_ratio: int = 4,                   # head hidden = H // ratio
+        dropout: float = 0.1,
+        topic_score_range: tuple[float, float] = (-2.0, 2.0),
+        tone_score_range:  tuple[float, float] = (1.0, 5.0),
+        topics: tuple[str, ...] = TOPICS,
+        tones:  tuple[str, ...] = TONES,
+        loss_weights: dict | None = None,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        self.encoder_name_or_path = encoder_name_or_path
+        self.encoder_config = encoder_config or {}
+        self.max_position_embeddings = max_position_embeddings
+        self.marks_rope_strategy = marks_rope_strategy
+        self.original_max_position = original_max_position
+        self.head_dim_ratio = head_dim_ratio
+        self.dropout = dropout
+        self.topic_score_range = list(topic_score_range)
+        self.tone_score_range  = list(tone_score_range)
+        self.topics = list(topics)
+        self.tones  = list(tones)
+        self.loss_weights = loss_weights or {
+            "topic_mentioned": 0.5,
+            "topic_score":     1.5,
+            "tone_scores":     0.2,
+        }

model.safetensors ADDED Viewed

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:2fe3429f8c69c8a78275ab6a74fd3a787f0247b278a5f0647b63b41e190ad18f
+size 1585464372

modeling_marks.py ADDED Viewed

	@@ -0,0 +1,275 @@

+"""Self-contained model class for binomial-marks-1.
+Distributed alongside the weights on HuggingFace Hub so anyone can do:
+    from transformers import AutoTokenizer, AutoModel
+    tok   = AutoTokenizer.from_pretrained("BinomialTechnologies/binomial-marks-1")
+    model = AutoModel.from_pretrained("BinomialTechnologies/binomial-marks-1",
+                                       trust_remote_code=True)
+This file imports only from `transformers` + `torch` — no project-internal
+dependencies.
+Architecture:
+    ModernBERT-large encoder (with optional YaRN RoPE extension to 16k)
+        ↓ (CLS + masked mean pool concatenated)
+        ↓ (3 × MLP heads)
+    23 outputs:
+        10 × topic_mentioned (binary classification, sigmoid → BCE loss)
+        10 × topic_score     (regression in [-2, +2] after clamp at inference)
+         3 × tone_score      (regression in [1, 5] after clamp at inference)
+"""
+from __future__ import annotations
+import math
+from dataclasses import dataclass
+from typing import Optional
+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from transformers import AutoModel, AutoConfig
+from transformers.modeling_utils import PreTrainedModel
+from transformers.modeling_outputs import ModelOutput
+# Relative import — HF's `trust_remote_code` loader bundles sibling .py
+# files together and resolves these without the symbol being "installed".
+from .configuration_marks import MarksConfig, TOPICS, TONES
+# ---------------------------------------------------------------------------
+# YaRN RoPE extension — per-dim ramp; applied after model load
+# ---------------------------------------------------------------------------
+def _yarn_inv_freq(
+    head_dim: int,
+    base: float,
+    scale: float,
+    original_max_position: int,
+    beta_fast: float = 32.0,
+    beta_slow: float = 1.0,
+    device=None,
+    dtype=torch.float32,
+) -> torch.Tensor:
+    if scale <= 1.0:
+        return 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device, dtype=dtype) / head_dim))
+    inv_freq_extrap = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device, dtype=dtype) / head_dim))
+    inv_freq_interp = inv_freq_extrap / scale
+    wavelengths = 2.0 * math.pi / inv_freq_extrap
+    L = original_max_position
+    ramp = (L / wavelengths - beta_slow) / (beta_fast - beta_slow)
+    ramp = ramp.clamp(0.0, 1.0)
+    return inv_freq_interp * (1.0 - ramp) + inv_freq_extrap * ramp
+def _apply_yarn_to_modernbert(encoder, new_max_position: int,
+                               original_max_position: int = 8192,
+                               beta_fast: float = 32.0, beta_slow: float = 1.0):
+    if new_max_position == original_max_position:
+        return
+    scale = new_max_position / original_max_position
+    cfg = encoder.config
+    head_dim = cfg.hidden_size // cfg.num_attention_heads
+    global_base = float(getattr(cfg, "global_rope_theta", getattr(cfg, "rope_theta", 10000.0)))
+    rotary_modules = [
+        m for _, m in encoder.named_modules()
+        if m.__class__.__name__ == "ModernBertRotaryEmbedding"
+    ]
+    for mod in rotary_modules:
+        full_buf = getattr(mod, "full_attention_inv_freq", None)
+        if full_buf is None or full_buf.numel() != head_dim // 2:
+            continue
+        new_inv = _yarn_inv_freq(
+            head_dim=head_dim, base=global_base, scale=scale,
+            original_max_position=original_max_position,
+            beta_fast=beta_fast, beta_slow=beta_slow,
+            device=full_buf.device, dtype=full_buf.dtype,
+        )
+        full_buf.data.copy_(new_inv)
+# ---------------------------------------------------------------------------
+# Output dataclass
+# ---------------------------------------------------------------------------
+@dataclass
+class MarksOutput(ModelOutput):
+    loss: Optional[torch.Tensor] = None
+    loss_components: Optional[dict] = None
+    topic_mentioned_logits: Optional[torch.Tensor] = None   # (B, 10)
+    topic_score: Optional[torch.Tensor] = None              # (B, 10)
+    tone_score: Optional[torch.Tensor] = None               # (B,  3)
+# ---------------------------------------------------------------------------
+# Model
+# ---------------------------------------------------------------------------
+class MarksMultiHead(PreTrainedModel):
+    """Multi-head ModernBERT-large fine-tuned for earnings-call NLP scoring.
+    23 outputs per call:
+      * topic_mentioned (binary, 10 dims)
+      * topic_score     (regression in [-2, +2], 10 dims)
+      * tone_score      (regression in [1, 5], 3 dims)
+    """
+    config_class = MarksConfig
+    base_model_prefix = "encoder"
+    supports_gradient_checkpointing = True
+    def __init__(self, config: MarksConfig):
+        super().__init__(config)
+        self.n_topics = len(config.topics)
+        self.n_tones  = len(config.tones)
+        # Encoder — built from config (so we don't redownload base weights;
+        # weights come from this repo's safetensors).
+        if config.encoder_config:
+            enc_cfg = AutoConfig.from_dict(config.encoder_config) if hasattr(AutoConfig, "from_dict") else AutoConfig.for_model(**config.encoder_config)
+        else:
+            enc_cfg = AutoConfig.from_pretrained(config.encoder_name_or_path)
+        # Override the encoder ctx to the trained value (16384 for our v1).
+        enc_cfg.max_position_embeddings = config.max_position_embeddings
+        # Initialize encoder with config-only constructor (random init); the
+        # PreTrainedModel.from_pretrained caller will restore real weights
+        # from this repo's safetensors.
+        self.encoder = AutoModel.from_config(enc_cfg)
+        H = enc_cfg.hidden_size
+        # Head input is CLS + mean pool concatenated → 2H.
+        head_in = 2 * H
+        head_hidden = H // config.head_dim_ratio
+        def _mlp(out_dim: int) -> nn.Sequential:
+            return nn.Sequential(
+                nn.Linear(head_in, head_hidden),
+                nn.GELU(),
+                nn.Dropout(config.dropout),
+                nn.Linear(head_hidden, out_dim),
+            )
+        self.dropout = nn.Dropout(config.dropout)
+        self.head_topic_mentioned = _mlp(self.n_topics)
+        self.head_topic_score     = _mlp(self.n_topics)
+        self.head_tone_score      = _mlp(self.n_tones)
+        # Loss weights (used only if labels are passed for fine-tuning).
+        self._loss_weights = config.loss_weights
+        # Apply YaRN to encoder (idempotent if max_position == native).
+        if config.marks_rope_strategy == "yarn":
+            _apply_yarn_to_modernbert(
+                self.encoder,
+                new_max_position=config.max_position_embeddings,
+                original_max_position=config.original_max_position,
+            )
+        # NTK is applied inside encoder config; nothing to do here.
+        self.post_init()
+    # -------------------------------------------------------------------------
+    # Forward
+    # -------------------------------------------------------------------------
+    def forward(
+        self,
+        input_ids: torch.Tensor,
+        attention_mask: torch.Tensor,
+        topic_mentioned: Optional[torch.Tensor] = None,
+        topic_score:     Optional[torch.Tensor] = None,
+        tone_score:      Optional[torch.Tensor] = None,
+        **kwargs,
+    ) -> MarksOutput:
+        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
+        last_hidden = out.last_hidden_state                      # (B, T, H)
+        cls = last_hidden[:, 0]                                  # (B, H)
+        m = attention_mask.unsqueeze(-1).to(last_hidden.dtype)
+        mean_pool = (last_hidden * m).sum(1) / m.sum(1).clamp(min=1.0)  # (B, H)
+        pooled = self.dropout(torch.cat([cls, mean_pool], dim=-1))      # (B, 2H)
+        tm_logits = self.head_topic_mentioned(pooled)
+        ts_pred   = self.head_topic_score(pooled)
+        tn_pred   = self.head_tone_score(pooled)
+        loss, components = None, {}
+        if topic_mentioned is not None:
+            tm_logits_fp = tm_logits.float()
+            ts_pred_fp = ts_pred.float()
+            tn_pred_fp = tn_pred.float()
+            tm_t = topic_mentioned.float()
+            ts_t = topic_score.float()
+            tn_t = tone_score.float()
+            l_tm = F.binary_cross_entropy_with_logits(tm_logits_fp, tm_t)
+            l_ts = F.mse_loss(ts_pred_fp, ts_t)
+            l_tn = F.mse_loss(tn_pred_fp, tn_t)
+            components = {
+                "topic_mentioned": l_tm.detach(),
+                "topic_score":     l_ts.detach(),
+                "tone_scores":     l_tn.detach(),
+            }
+            w = self._loss_weights
+            loss = (
+                w["topic_mentioned"] * l_tm
+                + w["topic_score"]   * l_ts
+                + w["tone_scores"]   * l_tn
+            )
+        return MarksOutput(
+            loss=loss,
+            loss_components=components or None,
+            topic_mentioned_logits=tm_logits,
+            topic_score=ts_pred,
+            tone_score=tn_pred,
+        )
+    # -------------------------------------------------------------------------
+    # Convenience predict
+    # -------------------------------------------------------------------------
+    @torch.no_grad()
+    def predict(
+        self,
+        input_ids: torch.Tensor,
+        attention_mask: torch.Tensor,
+        mention_threshold: float = 0.5,
+    ) -> dict:
+        """Run a forward pass and return clamped + masked predictions.
+        Returns a dict with:
+          topic_mentioned       (B, 10) hard 0/1
+          topic_mentioned_prob  (B, 10) sigmoid confidence
+          topic_score           (B, 10) clamped to [-2, +2], zeroed where mentioned=0
+          tone_score            (B,  3) clamped to [1, 5]
+        """
+        out = self.forward(input_ids=input_ids, attention_mask=attention_mask)
+        prob = torch.sigmoid(out.topic_mentioned_logits)
+        mentioned = (prob >= mention_threshold).float()
+        ts_lo, ts_hi = self.config.topic_score_range
+        tn_lo, tn_hi = self.config.tone_score_range
+        ts = out.topic_score.clamp(ts_lo, ts_hi) * mentioned
+        tn = out.tone_score.clamp(tn_lo, tn_hi)
+        return {
+            "topic_mentioned":      mentioned,
+            "topic_mentioned_prob": prob,
+            "topic_score":          ts,
+            "tone_score":           tn,
+        }
+    # -------------------------------------------------------------------------
+    # Gradient checkpointing — delegate to encoder
+    # -------------------------------------------------------------------------
+    def gradient_checkpointing_enable(self, gradient_checkpointing_kwargs=None):
+        if hasattr(self.encoder, "gradient_checkpointing_enable"):
+            self.encoder.gradient_checkpointing_enable(
+                gradient_checkpointing_kwargs=gradient_checkpointing_kwargs
+            )
+    def gradient_checkpointing_disable(self):
+        if hasattr(self.encoder, "gradient_checkpointing_disable"):
+            self.encoder.gradient_checkpointing_disable()

tokenizer.json ADDED Viewed

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED Viewed

	@@ -0,0 +1,17 @@

+{
+  "backend": "tokenizers",
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "is_local": false,
+  "local_files_only": false,
+  "mask_token": "[MASK]",
+  "model_input_names": [
+    "input_ids",
+    "attention_mask"
+  ],
+  "model_max_length": 8192,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "tokenizer_class": "TokenizersBackend",
+  "unk_token": "[UNK]"
+}