ilayibrahimzadeh commited on
Commit
f7b715f
·
verified ·
1 Parent(s): c2b1993

Initial publish of binomial-marks-1

Browse files
README.md CHANGED
@@ -3,15 +3,12 @@ license: apache-2.0
3
  language:
4
  - en
5
  library_name: transformers
6
- base_model: answerdotai/ModernBERT-large
7
  pipeline_tag: text-classification
8
  tags:
9
  - finance
10
  - earnings-calls
11
  - multi-task
12
  - regression
13
- - distillation
14
- - modernbert
15
  - sec
16
  - quantitative-finance
17
  inference: false
@@ -20,8 +17,6 @@ inference: false
20
  # binomial-marks-1
21
 
22
  **An earnings-call NLP scorer that produces 23 structured signals per transcript.**
23
- Distilled from frontier reasoning models (Grok-4.1-fast-reasoning, validated against
24
- Claude Opus 4.7 and GPT-5.5) into a 395M-parameter ModernBERT-large fine-tune.
25
 
26
  Built by [Binomial AI Research](https://binomial.ai). Part of the *specialist zoo* — a
27
  roster of small, deployable AI models for quantitative finance. Each model is named after
@@ -31,6 +26,28 @@ what's meant.
31
 
32
  ---
33
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
34
  ## What it does
35
 
36
  Given the text of an earnings call (with light metadata), `binomial-marks-1` returns
@@ -59,6 +76,12 @@ Given the text of an earnings call (with light metadata), `binomial-marks-1` ret
59
  | `mgmt_defensiveness` | evasion in Q&A (1 = open → 5 = deflects, pivots, refuses to commit) |
60
  | `analyst_skepticism` | analyst pushback (1 = congratulatory → 5 = re-asking the same question) |
61
 
 
 
 
 
 
 
62
  Quants consume the 23 outputs as features in factor models, screening filters, or
63
  event-study triggers. The model outputs structure, not opinions — buy/sell logic is the
64
  consumer's responsibility.
@@ -131,56 +154,23 @@ results = scorer.score_batch([
131
 
132
  ---
133
 
134
- ## Architecture
135
 
136
- ```
137
- ModernBERT-large encoder (395M, 8192 native ctx extended to 16384 via YaRN-2x)
138
-
139
- [CLS] embedding masked mean pool (concat 2H = 2048 dim)
140
-
141
- 3 × 2-layer MLP heads (Linear → GELU → Dropout → Linear)
142
-
143
- 23 outputs:
144
- 10 × topic_mentioned (binary, BCE-with-logits)
145
- 10 × topic_score (regression, MSE, clamped to [-2, +2] at inference)
146
- 3 × tone_score (regression, MSE, clamped to [1, 5] at inference)
147
- ```
148
 
149
- Key details:
150
- - **YaRN RoPE extension** (β_fast=32, β_slow=1) on the global attention layers, scaling
151
- ModernBERT-large from native 8192 → 16384 tokens. Local sliding-window layers (128
152
- tokens) are unmodified.
153
- - **Conditioning prefix** `[SECTOR][COUNTRY][TICKER][QUARTER]` lets the model interpret
154
- language sector-specifically (e.g., "margins compressing" reads differently in software
155
- vs. retail).
156
- - **fp32 loss math** (forward in bf16, loss in fp32) — required for stable training at
157
- 16k context.
158
- - **Weighted multi-task loss**: `topic_mentioned 0.5 + topic_score 1.5 + tone_scores 0.2`.
159
- Tone weight is low because the teacher's tone labels were saturated (~50% std).
160
-
161
- ---
162
-
163
- ## Training data
164
-
165
- - **99,539 earnings call transcripts** across 2,749 unique tickers, dated 2012-05 to
166
- 2026-03. Sources: institutional buy-side providers (FMP).
167
- - **Sector/country/industry metadata** via FMP `/profile` (Yahoo-style GICS).
168
- - **Labels** distilled from `grok-4-1-fast-reasoning` (xAI) with `reasoning_effort: low`
169
- on the entire training corpus. No human annotation. Cost: ~$140 for the full label
170
- pass.
171
- - **80/20 random split** (seed 42), keyed on `(ticker, year, quarter)`. Pure NLP
172
- imitation — no temporal split needed since labels come from the LLM, not from market
173
- reactions.
174
-
175
- The labels themselves are released as a separate dataset (forthcoming): `BinomialTechnologies/marks-labels-v1`.
176
 
177
  ---
178
 
179
  ## Eval — cross-LLM agreement on a 2,000-call benchmark
180
 
181
- The benchmark sample is 2,000 calls held out from training, scored by **five LLMs**
182
- (Grok-4.1-fast-reasoning, Claude Opus 4.7, GPT-5.5 with low reasoning, DeepSeek V4-Pro,
183
- and `marks-1` itself). Pairwise Spearman rank correlation across the 10 topic-direction
184
  dimensions:
185
 
186
  | | vs Opus | vs GPT-5.5 | vs Grok | vs DeepSeek |
@@ -198,12 +188,12 @@ dimensions:
198
  | Mean *mentioned* MAE | **0.05** | **0.10** |
199
 
200
  **Note on tone**: DeepSeek V4 reads management mood/aggression differently from Western
201
- frontier models (its tone Spearman vs the others is 0.50-0.55, vs Opus↔GPT-5.5 at 0.78).
202
  Excluding DeepSeek, frontier tone agreement is **0.72** — and marks-1 still hits 0.67
203
  against that subset.
204
 
205
  **marks-1 reproduces ≈80% of the agreement that frontier reasoners have with each other**
206
- on financial NLP scoring, at a fraction of the inference cost (~50–200ms on CPU vs
207
  multi-second LLM API calls).
208
 
209
  ### Per-topic Spearman vs. Claude Opus 4.7
@@ -224,25 +214,16 @@ multi-second LLM API calls).
224
  **Headcount is the weakest dimension.** Layoff/hiring signal is harder to parse than
225
  direction-of-growth signals. v2 will revisit.
226
 
227
- ### vs. teacher (eval/overall on 20k held-out test split)
228
-
229
- ```
230
- eval/overall: 0.7425
231
- eval/mentioned_macro_f1: 0.9092
232
- eval/score_macro_spearman: 0.6658
233
- eval/tone_macro_spearman: 0.6524
234
- ```
235
-
236
  ---
237
 
238
  ## Inference
239
 
240
- - **Latency target**: 50ms/call on CPU, sub-10ms on a modern GPU.
241
- - **Batched throughput** on A100/H100/B200 (bf16, max_length=16384):
242
- ~12 calls/sec/instance (single-stream).
243
- - **Output deterministic** pure encoder forward + linear projections.
244
 
245
- For deployment: the model is a regular `transformers` model. Wrap in FastAPI, deploy on
246
  HF Inference Endpoints, or run as a subprocess in your data pipeline.
247
 
248
  ---
@@ -251,27 +232,22 @@ HF Inference Endpoints, or run as a subprocess in your data pipeline.
251
 
252
  1. **`headcount` dimension is unreliable** (Spearman 0.39 vs frontier — 50% below the
253
  other 9 topics). Treat with skepticism.
254
- 2. **Tone labels are partly mode-collapsed** in the teacher (Grok defaults `mgmt_confidence`
255
- to 4-5/5 and `mgmt_defensiveness` to 1-2/5). The model picks up rank order but the
256
- absolute scale is uninformative quants should normalize cross-sectionally.
257
- 3. **English-only**. Trained on English transcripts; non-English calls (translated) work
258
- but degrade. Top non-US training countries: GB, DE, FR, JP, SE, CH, CN.
259
- 4. **Truncates at 16,384 tokens** (~50k characters). Covers ~p99 of earnings calls;
260
- the very longest (Asian conglomerates with 8h+ analyst days) lose middle content via
261
- head+tail truncation.
262
  5. **Pure NLP scorer — not an alpha model.** Outputs are *features*; the trading rule is
263
  the consumer's responsibility.
264
- 6. **Distilled, not original judgment.** marks-1 reproduces the teacher's biases,
265
- including any systematic miscalibration. The cross-LLM benchmark documents the residual
266
- disagreement.
267
 
268
  ---
269
 
270
  ## Tier
271
 
272
- **Tier 2 — research preview.** v1 of the model. Eval against three frontier LLMs is
273
- documented above; absolute calibration may shift in v2 with a larger / cleaner label set.
274
- Production users should run their own validation against return data.
275
 
276
  ---
277
 
 
3
  language:
4
  - en
5
  library_name: transformers
 
6
  pipeline_tag: text-classification
7
  tags:
8
  - finance
9
  - earnings-calls
10
  - multi-task
11
  - regression
 
 
12
  - sec
13
  - quantitative-finance
14
  inference: false
 
17
  # binomial-marks-1
18
 
19
  **An earnings-call NLP scorer that produces 23 structured signals per transcript.**
 
 
20
 
21
  Built by [Binomial AI Research](https://binomial.ai). Part of the *specialist zoo* — a
22
  roster of small, deployable AI models for quantitative finance. Each model is named after
 
26
 
27
  ---
28
 
29
+ ## Headline numbers
30
+
31
+ - **~80% of frontier-LLM consensus** on topic-direction scoring (mean Spearman vs frontier
32
+ panel: 0.674, vs the ceiling that frontier reasoners hit with each other: 0.838).
33
+ - **Frontier parity on tone**: marks-1 ↔ frontier mean Spearman **0.62** is statistically
34
+ tied with frontier ↔ frontier **0.61** (DeepSeek-included) and within 0.05 of the
35
+ Western-frontier subset.
36
+ - **F1 = 0.91** on the binary topic-mention heads — i.e. it agrees with the teacher 9 out
37
+ of 10 times on whether a topic was discussed at all.
38
+ - **6 of 10 topics ≥ 0.71 Spearman** with Claude Opus 4.7. `dividends` hits **0.84**, only
39
+ **0.05** below the frontier-frontier ceiling of 0.89.
40
+ - **~50ms / call on CPU**, sub-10ms on a modern GPU, ~12 calls/sec batched on
41
+ A100/H100/B200 — vs **multi-second** latency for a comparable LLM API call. **Two
42
+ orders of magnitude** faster, deterministic, and runs offline.
43
+ - **23 outputs in a single forward pass** — no chained LLM calls, no JSON parsing, no
44
+ retry logic.
45
+ - **16,384-token context window** covers ~p99 of earnings calls; conditioned on
46
+ `(country, sector, ticker, quarter)` so the same words read correctly in context.
47
+ - **Apache 2.0** — deployable anywhere, no API key, no vendor lock-in.
48
+
49
+ ---
50
+
51
  ## What it does
52
 
53
  Given the text of an earnings call (with light metadata), `binomial-marks-1` returns
 
76
  | `mgmt_defensiveness` | evasion in Q&A (1 = open → 5 = deflects, pivots, refuses to commit) |
77
  | `analyst_skepticism` | analyst pushback (1 = congratulatory → 5 = re-asking the same question) |
78
 
79
+ The model is conditioned on **country, sector, ticker, and quarter** at inference, so the
80
+ same words read differently in the right context — *"margins compressing"* in software
81
+ isn't the same signal as in retail; *"demand softening"* in a Chinese consumer name isn't
82
+ the same as a US one. This conditioning is the difference between a generic sentiment
83
+ scorer and one that reads earnings calls the way an analyst does.
84
+
85
  Quants consume the 23 outputs as features in factor models, screening filters, or
86
  event-study triggers. The model outputs structure, not opinions — buy/sell logic is the
87
  consumer's responsibility.
 
154
 
155
  ---
156
 
157
+ ## Training
158
 
159
+ `binomial-marks-1` is trained on **80,000+ earnings call transcripts** spanning 2,700+
160
+ unique tickers across global markets (2012–2026), each tagged with country, sector, and
161
+ industry metadata. Labels are distilled from frontier reasoning models and the model is
162
+ benchmarked against the same set of frontier systems on a held-out 2,000-call sample.
 
 
 
 
 
 
 
 
163
 
164
+ The split is `(ticker, year, quarter)`-keyed; this is a pure NLP imitation task — labels
165
+ come from language models, not market outcomes so a temporal split is unnecessary.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
166
 
167
  ---
168
 
169
  ## Eval — cross-LLM agreement on a 2,000-call benchmark
170
 
171
+ The benchmark is 2,000 calls held out from training, scored by **five systems**
172
+ (Grok-4.1-fast-reasoning, Claude Opus 4.7, GPT-5.5 low-reasoning, DeepSeek V4-Pro, and
173
+ `marks-1` itself). Pairwise Spearman rank correlation across the 10 topic-direction
174
  dimensions:
175
 
176
  | | vs Opus | vs GPT-5.5 | vs Grok | vs DeepSeek |
 
188
  | Mean *mentioned* MAE | **0.05** | **0.10** |
189
 
190
  **Note on tone**: DeepSeek V4 reads management mood/aggression differently from Western
191
+ frontier models (its tone Spearman vs the others is 0.500.55, vs Opus↔GPT-5.5 at 0.78).
192
  Excluding DeepSeek, frontier tone agreement is **0.72** — and marks-1 still hits 0.67
193
  against that subset.
194
 
195
  **marks-1 reproduces ≈80% of the agreement that frontier reasoners have with each other**
196
+ on financial NLP scoring, at a fraction of the inference cost (~50ms on CPU vs
197
  multi-second LLM API calls).
198
 
199
  ### Per-topic Spearman vs. Claude Opus 4.7
 
214
  **Headcount is the weakest dimension.** Layoff/hiring signal is harder to parse than
215
  direction-of-growth signals. v2 will revisit.
216
 
 
 
 
 
 
 
 
 
 
217
  ---
218
 
219
  ## Inference
220
 
221
+ - **Latency**: ~50ms/call on CPU, sub-10ms on modern GPUs.
222
+ - **Batched throughput** (bf16, max_length=16384): ~12 calls/sec/instance on A100/H100/B200.
223
+ - **Output is deterministic** — same input always returns the same 23 numbers.
224
+ - **Context window**: 16,384 tokens (~50k characters). Covers ~p99 of earnings calls.
225
 
226
+ For deployment: the model is a standard `transformers` model. Wrap in FastAPI, deploy on
227
  HF Inference Endpoints, or run as a subprocess in your data pipeline.
228
 
229
  ---
 
232
 
233
  1. **`headcount` dimension is unreliable** (Spearman 0.39 vs frontier — 50% below the
234
  other 9 topics). Treat with skepticism.
235
+ 2. **Tone has rank-order signal but absolute levels drift.** Quants should normalize
236
+ cross-sectionally rather than thresholding raw values.
237
+ 3. **English transcripts only.** Non-English calls (translated) work but degrade. Top
238
+ non-US training countries: GB, DE, FR, JP, SE, CH, CN.
239
+ 4. **Truncates at 16,384 tokens.** Covers ~p99 of calls; the very longest (Asian
240
+ conglomerates with 8h+ analyst days) lose middle content via head+tail truncation.
 
 
241
  5. **Pure NLP scorer — not an alpha model.** Outputs are *features*; the trading rule is
242
  the consumer's responsibility.
 
 
 
243
 
244
  ---
245
 
246
  ## Tier
247
 
248
+ **Tier 2 — research preview.** v1 of the model. Eval against frontier LLMs is documented
249
+ above; absolute calibration may shift in v2 with a larger label set. Production users
250
+ should run their own validation against return data.
251
 
252
  ---
253
 
config.json ADDED
@@ -0,0 +1,146 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "model_type": "marks",
3
+ "architectures": [
4
+ "MarksMultiHead"
5
+ ],
6
+ "auto_map": {
7
+ "AutoConfig": "configuration_marks.MarksConfig",
8
+ "AutoModel": "modeling_marks.MarksMultiHead"
9
+ },
10
+ "encoder_name_or_path": "answerdotai/ModernBERT-large",
11
+ "encoder_config": {
12
+ "transformers_version": "5.6.2",
13
+ "architectures": [
14
+ "ModernBertForMaskedLM"
15
+ ],
16
+ "output_hidden_states": false,
17
+ "return_dict": true,
18
+ "dtype": "float32",
19
+ "chunk_size_feed_forward": 0,
20
+ "is_encoder_decoder": false,
21
+ "id2label": {
22
+ "0": "LABEL_0",
23
+ "1": "LABEL_1"
24
+ },
25
+ "label2id": {
26
+ "LABEL_0": 0,
27
+ "LABEL_1": 1
28
+ },
29
+ "problem_type": null,
30
+ "vocab_size": 50368,
31
+ "hidden_size": 1024,
32
+ "intermediate_size": 2624,
33
+ "num_hidden_layers": 28,
34
+ "num_attention_heads": 16,
35
+ "hidden_activation": "gelu",
36
+ "max_position_embeddings": 8192,
37
+ "initializer_range": 0.02,
38
+ "initializer_cutoff_factor": 2.0,
39
+ "norm_eps": 1e-05,
40
+ "norm_bias": false,
41
+ "pad_token_id": 50283,
42
+ "eos_token_id": 50282,
43
+ "bos_token_id": 50281,
44
+ "cls_token_id": 50281,
45
+ "sep_token_id": 50282,
46
+ "attention_bias": false,
47
+ "attention_dropout": 0.0,
48
+ "layer_types": [
49
+ "full_attention",
50
+ "sliding_attention",
51
+ "sliding_attention",
52
+ "full_attention",
53
+ "sliding_attention",
54
+ "sliding_attention",
55
+ "full_attention",
56
+ "sliding_attention",
57
+ "sliding_attention",
58
+ "full_attention",
59
+ "sliding_attention",
60
+ "sliding_attention",
61
+ "full_attention",
62
+ "sliding_attention",
63
+ "sliding_attention",
64
+ "full_attention",
65
+ "sliding_attention",
66
+ "sliding_attention",
67
+ "full_attention",
68
+ "sliding_attention",
69
+ "sliding_attention",
70
+ "full_attention",
71
+ "sliding_attention",
72
+ "sliding_attention",
73
+ "full_attention",
74
+ "sliding_attention",
75
+ "sliding_attention",
76
+ "full_attention"
77
+ ],
78
+ "rope_parameters": {
79
+ "sliding_attention": {
80
+ "rope_type": "default",
81
+ "rope_theta": 10000.0
82
+ },
83
+ "full_attention": {
84
+ "rope_type": "default",
85
+ "rope_theta": 160000.0
86
+ }
87
+ },
88
+ "local_attention": 128,
89
+ "embedding_dropout": 0.0,
90
+ "mlp_bias": false,
91
+ "mlp_dropout": 0.0,
92
+ "decoder_bias": true,
93
+ "classifier_pooling": "mean",
94
+ "classifier_dropout": 0.0,
95
+ "classifier_bias": false,
96
+ "classifier_activation": "gelu",
97
+ "deterministic_flash_attn": false,
98
+ "sparse_prediction": false,
99
+ "sparse_pred_ignore_index": -100,
100
+ "tie_word_embeddings": true,
101
+ "_name_or_path": "answerdotai/ModernBERT-large",
102
+ "global_attn_every_n_layers": 3,
103
+ "gradient_checkpointing": false,
104
+ "layer_norm_eps": 1e-05,
105
+ "model_type": "modernbert",
106
+ "position_embedding_type": "absolute",
107
+ "output_attentions": false
108
+ },
109
+ "max_position_embeddings": 16384,
110
+ "marks_rope_strategy": "yarn",
111
+ "original_max_position": 8192,
112
+ "head_dim_ratio": 4,
113
+ "dropout": 0.1,
114
+ "topic_score_range": [
115
+ -2.0,
116
+ 2.0
117
+ ],
118
+ "tone_score_range": [
119
+ 1.0,
120
+ 5.0
121
+ ],
122
+ "topics": [
123
+ "guidance",
124
+ "revenue_growth",
125
+ "margins",
126
+ "demand",
127
+ "buybacks",
128
+ "dividends",
129
+ "m_and_a",
130
+ "headcount",
131
+ "macro_exposure",
132
+ "competition"
133
+ ],
134
+ "tones": [
135
+ "mgmt_confidence",
136
+ "mgmt_defensiveness",
137
+ "analyst_skepticism"
138
+ ],
139
+ "loss_weights": {
140
+ "topic_mentioned": 0.5,
141
+ "topic_score": 1.0,
142
+ "tone_scores": 0.5
143
+ },
144
+ "torch_dtype": "float32",
145
+ "transformers_version": "5.6.2"
146
+ }
configuration_marks.py ADDED
@@ -0,0 +1,63 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Configuration class for binomial-marks-1.
2
+
3
+ Distributed alongside the model on HuggingFace Hub so
4
+ `AutoConfig.from_pretrained(repo, trust_remote_code=True)` works.
5
+ """
6
+
7
+ from __future__ import annotations
8
+
9
+ from transformers.configuration_utils import PretrainedConfig
10
+
11
+
12
+ TOPICS = (
13
+ "guidance", "revenue_growth", "margins", "demand", "buybacks",
14
+ "dividends", "m_and_a", "headcount", "macro_exposure", "competition",
15
+ )
16
+ TONES = ("mgmt_confidence", "mgmt_defensiveness", "analyst_skepticism")
17
+
18
+
19
+ class MarksConfig(PretrainedConfig):
20
+ """Config for MarksMultiHead.
21
+
22
+ Holds the head spec and the underlying ModernBERT-large config (we wrap
23
+ it as a child config so HF tooling can serialize cleanly).
24
+ """
25
+
26
+ model_type = "marks"
27
+
28
+ def __init__(
29
+ self,
30
+ encoder_name_or_path: str = "answerdotai/ModernBERT-large",
31
+ encoder_config: dict | None = None,
32
+ max_position_embeddings: int = 16384,
33
+ # NOTE: named `marks_rope_strategy` (not `rope_scaling`) to avoid
34
+ # collision with PretrainedConfig.rope_scaling which transformers
35
+ # tries to validate as a dict shape.
36
+ marks_rope_strategy: str = "yarn", # "yarn" | "ntk" | "none"
37
+ original_max_position: int = 8192,
38
+ head_dim_ratio: int = 4, # head hidden = H // ratio
39
+ dropout: float = 0.1,
40
+ topic_score_range: tuple[float, float] = (-2.0, 2.0),
41
+ tone_score_range: tuple[float, float] = (1.0, 5.0),
42
+ topics: tuple[str, ...] = TOPICS,
43
+ tones: tuple[str, ...] = TONES,
44
+ loss_weights: dict | None = None,
45
+ **kwargs,
46
+ ):
47
+ super().__init__(**kwargs)
48
+ self.encoder_name_or_path = encoder_name_or_path
49
+ self.encoder_config = encoder_config or {}
50
+ self.max_position_embeddings = max_position_embeddings
51
+ self.marks_rope_strategy = marks_rope_strategy
52
+ self.original_max_position = original_max_position
53
+ self.head_dim_ratio = head_dim_ratio
54
+ self.dropout = dropout
55
+ self.topic_score_range = list(topic_score_range)
56
+ self.tone_score_range = list(tone_score_range)
57
+ self.topics = list(topics)
58
+ self.tones = list(tones)
59
+ self.loss_weights = loss_weights or {
60
+ "topic_mentioned": 0.5,
61
+ "topic_score": 1.5,
62
+ "tone_scores": 0.2,
63
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
 
 
 
 
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:2fe3429f8c69c8a78275ab6a74fd3a787f0247b278a5f0647b63b41e190ad18f
3
+ size 1585464372
modeling_marks.py ADDED
@@ -0,0 +1,275 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """Self-contained model class for binomial-marks-1.
2
+
3
+ Distributed alongside the weights on HuggingFace Hub so anyone can do:
4
+
5
+ from transformers import AutoTokenizer, AutoModel
6
+ tok = AutoTokenizer.from_pretrained("BinomialTechnologies/binomial-marks-1")
7
+ model = AutoModel.from_pretrained("BinomialTechnologies/binomial-marks-1",
8
+ trust_remote_code=True)
9
+
10
+ This file imports only from `transformers` + `torch` — no project-internal
11
+ dependencies.
12
+
13
+ Architecture:
14
+ ModernBERT-large encoder (with optional YaRN RoPE extension to 16k)
15
+ ↓ (CLS + masked mean pool concatenated)
16
+ ↓ (3 × MLP heads)
17
+ 23 outputs:
18
+ 10 × topic_mentioned (binary classification, sigmoid → BCE loss)
19
+ 10 × topic_score (regression in [-2, +2] after clamp at inference)
20
+ 3 × tone_score (regression in [1, 5] after clamp at inference)
21
+ """
22
+
23
+ from __future__ import annotations
24
+
25
+ import math
26
+ from dataclasses import dataclass
27
+ from typing import Optional
28
+
29
+ import torch
30
+ import torch.nn as nn
31
+ import torch.nn.functional as F
32
+ from transformers import AutoModel, AutoConfig
33
+ from transformers.modeling_utils import PreTrainedModel
34
+ from transformers.modeling_outputs import ModelOutput
35
+
36
+ # Relative import — HF's `trust_remote_code` loader bundles sibling .py
37
+ # files together and resolves these without the symbol being "installed".
38
+ from .configuration_marks import MarksConfig, TOPICS, TONES
39
+
40
+
41
+ # ---------------------------------------------------------------------------
42
+ # YaRN RoPE extension — per-dim ramp; applied after model load
43
+ # ---------------------------------------------------------------------------
44
+
45
+ def _yarn_inv_freq(
46
+ head_dim: int,
47
+ base: float,
48
+ scale: float,
49
+ original_max_position: int,
50
+ beta_fast: float = 32.0,
51
+ beta_slow: float = 1.0,
52
+ device=None,
53
+ dtype=torch.float32,
54
+ ) -> torch.Tensor:
55
+ if scale <= 1.0:
56
+ return 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device, dtype=dtype) / head_dim))
57
+ inv_freq_extrap = 1.0 / (base ** (torch.arange(0, head_dim, 2, device=device, dtype=dtype) / head_dim))
58
+ inv_freq_interp = inv_freq_extrap / scale
59
+ wavelengths = 2.0 * math.pi / inv_freq_extrap
60
+ L = original_max_position
61
+ ramp = (L / wavelengths - beta_slow) / (beta_fast - beta_slow)
62
+ ramp = ramp.clamp(0.0, 1.0)
63
+ return inv_freq_interp * (1.0 - ramp) + inv_freq_extrap * ramp
64
+
65
+
66
+ def _apply_yarn_to_modernbert(encoder, new_max_position: int,
67
+ original_max_position: int = 8192,
68
+ beta_fast: float = 32.0, beta_slow: float = 1.0):
69
+ if new_max_position == original_max_position:
70
+ return
71
+ scale = new_max_position / original_max_position
72
+ cfg = encoder.config
73
+ head_dim = cfg.hidden_size // cfg.num_attention_heads
74
+ global_base = float(getattr(cfg, "global_rope_theta", getattr(cfg, "rope_theta", 10000.0)))
75
+
76
+ rotary_modules = [
77
+ m for _, m in encoder.named_modules()
78
+ if m.__class__.__name__ == "ModernBertRotaryEmbedding"
79
+ ]
80
+ for mod in rotary_modules:
81
+ full_buf = getattr(mod, "full_attention_inv_freq", None)
82
+ if full_buf is None or full_buf.numel() != head_dim // 2:
83
+ continue
84
+ new_inv = _yarn_inv_freq(
85
+ head_dim=head_dim, base=global_base, scale=scale,
86
+ original_max_position=original_max_position,
87
+ beta_fast=beta_fast, beta_slow=beta_slow,
88
+ device=full_buf.device, dtype=full_buf.dtype,
89
+ )
90
+ full_buf.data.copy_(new_inv)
91
+
92
+
93
+ # ---------------------------------------------------------------------------
94
+ # Output dataclass
95
+ # ---------------------------------------------------------------------------
96
+
97
+ @dataclass
98
+ class MarksOutput(ModelOutput):
99
+ loss: Optional[torch.Tensor] = None
100
+ loss_components: Optional[dict] = None
101
+ topic_mentioned_logits: Optional[torch.Tensor] = None # (B, 10)
102
+ topic_score: Optional[torch.Tensor] = None # (B, 10)
103
+ tone_score: Optional[torch.Tensor] = None # (B, 3)
104
+
105
+
106
+ # ---------------------------------------------------------------------------
107
+ # Model
108
+ # ---------------------------------------------------------------------------
109
+
110
+ class MarksMultiHead(PreTrainedModel):
111
+ """Multi-head ModernBERT-large fine-tuned for earnings-call NLP scoring.
112
+
113
+ 23 outputs per call:
114
+ * topic_mentioned (binary, 10 dims)
115
+ * topic_score (regression in [-2, +2], 10 dims)
116
+ * tone_score (regression in [1, 5], 3 dims)
117
+ """
118
+
119
+ config_class = MarksConfig
120
+ base_model_prefix = "encoder"
121
+ supports_gradient_checkpointing = True
122
+
123
+ def __init__(self, config: MarksConfig):
124
+ super().__init__(config)
125
+ self.n_topics = len(config.topics)
126
+ self.n_tones = len(config.tones)
127
+
128
+ # Encoder — built from config (so we don't redownload base weights;
129
+ # weights come from this repo's safetensors).
130
+ if config.encoder_config:
131
+ enc_cfg = AutoConfig.from_dict(config.encoder_config) if hasattr(AutoConfig, "from_dict") else AutoConfig.for_model(**config.encoder_config)
132
+ else:
133
+ enc_cfg = AutoConfig.from_pretrained(config.encoder_name_or_path)
134
+
135
+ # Override the encoder ctx to the trained value (16384 for our v1).
136
+ enc_cfg.max_position_embeddings = config.max_position_embeddings
137
+
138
+ # Initialize encoder with config-only constructor (random init); the
139
+ # PreTrainedModel.from_pretrained caller will restore real weights
140
+ # from this repo's safetensors.
141
+ self.encoder = AutoModel.from_config(enc_cfg)
142
+ H = enc_cfg.hidden_size
143
+
144
+ # Head input is CLS + mean pool concatenated → 2H.
145
+ head_in = 2 * H
146
+ head_hidden = H // config.head_dim_ratio
147
+
148
+ def _mlp(out_dim: int) -> nn.Sequential:
149
+ return nn.Sequential(
150
+ nn.Linear(head_in, head_hidden),
151
+ nn.GELU(),
152
+ nn.Dropout(config.dropout),
153
+ nn.Linear(head_hidden, out_dim),
154
+ )
155
+
156
+ self.dropout = nn.Dropout(config.dropout)
157
+ self.head_topic_mentioned = _mlp(self.n_topics)
158
+ self.head_topic_score = _mlp(self.n_topics)
159
+ self.head_tone_score = _mlp(self.n_tones)
160
+
161
+ # Loss weights (used only if labels are passed for fine-tuning).
162
+ self._loss_weights = config.loss_weights
163
+
164
+ # Apply YaRN to encoder (idempotent if max_position == native).
165
+ if config.marks_rope_strategy == "yarn":
166
+ _apply_yarn_to_modernbert(
167
+ self.encoder,
168
+ new_max_position=config.max_position_embeddings,
169
+ original_max_position=config.original_max_position,
170
+ )
171
+ # NTK is applied inside encoder config; nothing to do here.
172
+
173
+ self.post_init()
174
+
175
+ # -------------------------------------------------------------------------
176
+ # Forward
177
+ # -------------------------------------------------------------------------
178
+ def forward(
179
+ self,
180
+ input_ids: torch.Tensor,
181
+ attention_mask: torch.Tensor,
182
+ topic_mentioned: Optional[torch.Tensor] = None,
183
+ topic_score: Optional[torch.Tensor] = None,
184
+ tone_score: Optional[torch.Tensor] = None,
185
+ **kwargs,
186
+ ) -> MarksOutput:
187
+
188
+ out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
189
+ last_hidden = out.last_hidden_state # (B, T, H)
190
+
191
+ cls = last_hidden[:, 0] # (B, H)
192
+ m = attention_mask.unsqueeze(-1).to(last_hidden.dtype)
193
+ mean_pool = (last_hidden * m).sum(1) / m.sum(1).clamp(min=1.0) # (B, H)
194
+ pooled = self.dropout(torch.cat([cls, mean_pool], dim=-1)) # (B, 2H)
195
+
196
+ tm_logits = self.head_topic_mentioned(pooled)
197
+ ts_pred = self.head_topic_score(pooled)
198
+ tn_pred = self.head_tone_score(pooled)
199
+
200
+ loss, components = None, {}
201
+ if topic_mentioned is not None:
202
+ tm_logits_fp = tm_logits.float()
203
+ ts_pred_fp = ts_pred.float()
204
+ tn_pred_fp = tn_pred.float()
205
+ tm_t = topic_mentioned.float()
206
+ ts_t = topic_score.float()
207
+ tn_t = tone_score.float()
208
+
209
+ l_tm = F.binary_cross_entropy_with_logits(tm_logits_fp, tm_t)
210
+ l_ts = F.mse_loss(ts_pred_fp, ts_t)
211
+ l_tn = F.mse_loss(tn_pred_fp, tn_t)
212
+ components = {
213
+ "topic_mentioned": l_tm.detach(),
214
+ "topic_score": l_ts.detach(),
215
+ "tone_scores": l_tn.detach(),
216
+ }
217
+ w = self._loss_weights
218
+ loss = (
219
+ w["topic_mentioned"] * l_tm
220
+ + w["topic_score"] * l_ts
221
+ + w["tone_scores"] * l_tn
222
+ )
223
+
224
+ return MarksOutput(
225
+ loss=loss,
226
+ loss_components=components or None,
227
+ topic_mentioned_logits=tm_logits,
228
+ topic_score=ts_pred,
229
+ tone_score=tn_pred,
230
+ )
231
+
232
+ # -------------------------------------------------------------------------
233
+ # Convenience predict
234
+ # -------------------------------------------------------------------------
235
+ @torch.no_grad()
236
+ def predict(
237
+ self,
238
+ input_ids: torch.Tensor,
239
+ attention_mask: torch.Tensor,
240
+ mention_threshold: float = 0.5,
241
+ ) -> dict:
242
+ """Run a forward pass and return clamped + masked predictions.
243
+
244
+ Returns a dict with:
245
+ topic_mentioned (B, 10) hard 0/1
246
+ topic_mentioned_prob (B, 10) sigmoid confidence
247
+ topic_score (B, 10) clamped to [-2, +2], zeroed where mentioned=0
248
+ tone_score (B, 3) clamped to [1, 5]
249
+ """
250
+ out = self.forward(input_ids=input_ids, attention_mask=attention_mask)
251
+ prob = torch.sigmoid(out.topic_mentioned_logits)
252
+ mentioned = (prob >= mention_threshold).float()
253
+ ts_lo, ts_hi = self.config.topic_score_range
254
+ tn_lo, tn_hi = self.config.tone_score_range
255
+ ts = out.topic_score.clamp(ts_lo, ts_hi) * mentioned
256
+ tn = out.tone_score.clamp(tn_lo, tn_hi)
257
+ return {
258
+ "topic_mentioned": mentioned,
259
+ "topic_mentioned_prob": prob,
260
+ "topic_score": ts,
261
+ "tone_score": tn,
262
+ }
263
+
264
+ # -------------------------------------------------------------------------
265
+ # Gradient checkpointing — delegate to encoder
266
+ # -------------------------------------------------------------------------
267
+ def gradient_checkpointing_enable(self, gradient_checkpointing_kwargs=None):
268
+ if hasattr(self.encoder, "gradient_checkpointing_enable"):
269
+ self.encoder.gradient_checkpointing_enable(
270
+ gradient_checkpointing_kwargs=gradient_checkpointing_kwargs
271
+ )
272
+
273
+ def gradient_checkpointing_disable(self):
274
+ if hasattr(self.encoder, "gradient_checkpointing_disable"):
275
+ self.encoder.gradient_checkpointing_disable()
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,17 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ {
2
+ "backend": "tokenizers",
3
+ "clean_up_tokenization_spaces": true,
4
+ "cls_token": "[CLS]",
5
+ "is_local": false,
6
+ "local_files_only": false,
7
+ "mask_token": "[MASK]",
8
+ "model_input_names": [
9
+ "input_ids",
10
+ "attention_mask"
11
+ ],
12
+ "model_max_length": 8192,
13
+ "pad_token": "[PAD]",
14
+ "sep_token": "[SEP]",
15
+ "tokenizer_class": "TokenizersBackend",
16
+ "unk_token": "[UNK]"
17
+ }