rtferraz committed on
Commit f580186 · verified · 1 Parent(s): 6c4ad4d

Update README v0.3.0 — add usage example, update roadmap status, add implementation report link

Files changed (1)
  1. README.md +86 -70
README.md CHANGED
@@ -18,6 +18,46 @@ Text LLM: "The cat sat on the mat" → [The] [cat] [sat] [on] [the] [mat]
18
  domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
19
  ```
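To make the analogy concrete, here is a minimal, purely illustrative sketch of the event-to-token step. The field names, bucket thresholds, and helper function are hypothetical; the library's real API is the `DomainTokenizerBuilder` shown in the Quick Start below.

```python
# Illustrative only: hypothetical fields and thresholds, not the DomainTokenizerBuilder API.
from datetime import datetime

def event_to_tokens(event: dict) -> list[str]:
    """Turn one purchase event into discrete, calendar- and magnitude-aware domain tokens."""
    ts = datetime.fromisoformat(event["timestamp"])
    weekday = ts.strftime("%A")                                   # e.g. "Wednesday"
    daypart = "Afternoon" if 12 <= ts.hour < 18 else "Other"
    amount = event["amount"]                                      # bucket instead of raw digits
    band = "<$50" if amount < 50 else ("$50-100" if amount < 100 else "$100+")
    return [event["category"], f"[{weekday}]", f"[{daypart}]", f"[{band}]"]

tokens = event_to_tokens(
    {"timestamp": "2025-03-12T14:30:00", "amount": 79.9, "category": "[Electronics]"}
)
print(tokens)  # ['[Electronics]', '[Wednesday]', '[Afternoon]', '[$50-100]']
# A user's history is the concatenation of such tokens, which a causal
# Transformer models exactly like ordinary text tokens.
```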
20
 
21
  ## 🏦 Industry Validation: Nubank's nuFormer
22
 
23
  This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:
@@ -38,98 +78,74 @@ This isn't just theory. **Nubank** (100M+ customers, Latin America's largest dig
38
  | Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
39
  | Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
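As a rough sketch of the last table row, a BPE-style vocabulary pass can promote frequently co-occurring field tokens into single composite tokens. The merge rule and threshold below are illustrative assumptions, not the project's actual procedure:

```python
# Hypothetical sketch of BPE-like cross-field merging; not the project's actual algorithm.
from collections import Counter

def merge_frequent_pairs(sequences: list[list[str]], min_count: int = 100) -> list[list[str]]:
    """Fuse adjacent token pairs that co-occur often into one composite token."""
    pair_counts = Counter((a, b) for seq in sequences for a, b in zip(seq, seq[1:]))
    merges = {pair for pair, n in pair_counts.items() if n >= min_count}
    merged = []
    for seq in sequences:
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) in merges:
                out.append("{" + seq[i] + "+" + seq[i + 1] + "}")   # e.g. {Electronics+$50-100}
                i += 2
            else:
                out.append(seq[i])
                i += 1
        merged.append(out)
    return merged
```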
40
 
41
- ## Research Foundation
42
-
43
- This project is grounded in 35+ papers from Google, Google DeepMind, Nubank, Yandex, and the broader research community. The key finding: **any sequential domain data can be tokenized and modeled with the LLM paradigm** — the challenge is *how* to tokenize.
44
-
45
- | Paradigm | Method | Key Paper |
46
- |----------|--------|-----------|
47
- | **Semantic IDs** | RQ-VAE quantization of item embeddings | [TIGER](https://arxiv.org/abs/2305.05065) (Google, 2023) |
48
- | **Action Tokenization** | BPE-like merging of feature patterns | [ActionPiece](https://arxiv.org/abs/2502.13581) (DeepMind, 2025) |
49
- | **Transaction Tokenization** | Special tokens + BPE hybrid | [nuFormer](https://arxiv.org/abs/2507.23267) (Nubank, 2025) |
50
- | **Tabular Tokenization** | Periodic embeddings for numbers | [PLR](https://arxiv.org/abs/2203.05556) (Yandex, 2022) |
51
- | **Universal Tokenization** | All modalities → shared discrete space | [Meta-Transformer](https://arxiv.org/abs/2307.10802) (2023) |
52
-
53
  ## Documentation
54
 
55
  | Document | Description |
56
  |----------|-------------|
57
- | 📄 [`docs/research_report.md`](docs/research_report.md) | **Comprehensive research survey** — 31 papers across 5 paradigms, technical taxonomy, full blueprint |
58
- | 🏦 [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering** — complete pipeline reconstruction, 4 academic pillars, adaptation playbooks |
59
- | 🏗️ [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record** — framework choice (PyTorch+HF), trade-offs vs JAX/Keras, detailed implementation roadmap with code |
60
-
61
- ## Implementation Decision
62
-
63
- After auditing all 6 reference papers and evaluating PyTorch, JAX/Flax NNX, and Keras 3 + JAX:
64
-
65
- **Decision: PyTorch + HuggingFace Transformers** (with JAX as future scaling path)
66
-
67
- Key reasons:
68
- - **5 of 6 reference papers use PyTorch** (including Google DeepMind's ActionPiece)
69
- - **HuggingFace has the only complete custom tokenizer pipeline** (`PreTrainedTokenizerFast` → Trainer → push_to_hub)
70
- - **Production deployment is direct:** ONNX, TGI, vLLM all first-class
71
- - JAX advantages (TPU, XLA) only matter at >1B params on 256+ accelerators — not at our 24M–330M scale
72
-
73
- Full analysis: [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md)
74
 
75
  ## Project Roadmap
76
 
77
  ### Phase 1: Research & Survey ✅
78
- - Literature survey (35+ papers)
79
- - Nubank nuFormer reverse-engineering
80
- - Framework ADR with detailed implementation plan
81
-
82
- ### Phase 2: Core Library (Next ~9 weeks)
83
- - **Weeks 1–3:** Domain tokenizer library (schema per-field tokenizers → HF-compatible composite tokenizer)
84
- - **Weeks 3–5:** GPT-style Transformer with NoPE + PLR embeddings + DCNv2 joint fusion
85
- - **Weeks 5–7:** Pre-training pipeline (CLM on domain sequences via HF Trainer)
86
- - **Weeks 7–9:** Fine-tuning pipeline (nuFormer-style joint fusion)
87
-
88
- ### Phase 3: Domain Demos (Weeks 9–12)
89
- - Finance: fraud detection, credit scoring
90
  - E-commerce: next purchase prediction, customer segmentation
91
 
92
- ### Phase 4: Scale & Optimize (Weeks 12+)
93
  - 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary
94
 
95
  ## Repo Structure
96
 
97
  ```
98
- domainTokenizer/
99
- ├── docs/
100
- │   ├── research_report.md                       # 51KB Full research survey
101
- │   ├── nubank_nuformer_analysis.md              # 29KB — Nubank pipeline analysis
102
- │   └── adr/
103
- │       └── ADR-001-implementation-framework.md  # Framework decision + roadmap
104
- ├── src/                                         # (Phase 2) Core library
105
- │   ├── tokenizers/                              # Schema, field tokenizers, composite builder
106
- │   ├── models/                                  # DomainTransformer, PLR, DCNv2, JointFusion
107
- │   └── training/                                # Data pipeline, pre-training, fine-tuning
108
- ├── examples/                                    # (Phase 3) Domain-specific demos
109
- └── README.md
110
  ```
111
 
112
  ## Key References
113
 
114
- | Paper | Year | What It Does | Link |
115
- |-------|------|-------------|------|
116
- | **nuFormer** (Nubank) | 2025 | Transaction foundation model at production scale | [arXiv](https://arxiv.org/abs/2507.23267) |
117
- | TIGER (Google) | 2023 | Semantic IDs for products via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
118
- | ActionPiece (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
119
- | RecFormer | 2023 | Items as key-value text representations | [arXiv](https://arxiv.org/abs/2305.13731) |
120
- | PLR Embeddings (Yandex) | 2022 | Periodic embeddings for numerical features | [arXiv](https://arxiv.org/abs/2203.05556) |
121
- | DCN V2 (Google) | 2021 | Feature crossing for tabular data | [arXiv](https://arxiv.org/abs/2008.13535) |
122
- | NoPE | 2023 | No positional encoding beats RoPE/ALiBi | [arXiv](https://arxiv.org/abs/2305.19466) |
123
- | KL3M Tokenizers | 2025 | Domain-specific BPE for finance/legal | [arXiv](https://arxiv.org/abs/2503.17247) |
124
- | Banking TF | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
125
- | Nested Learning (HOPE) | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |
126
 
127
  Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)
128
 
129
  ## License
130
 
131
  MIT
132
-
133
- ---
134
-
135
- *domainTokenizer is an early-stage research project exploring the frontier of domain-specific tokenization for small, efficient AI models.*
 
18
  domainTokenizer: Customer purchase history → [HighEndElectronics] [WeekdayCredit] [Accessory+SameDay] → Transformer → next purchase
19
  ```
20
 
21
+ ## Quick Start
22
+
23
+ ```python
24
+ from domain_tokenizer import (
25
+     DomainTokenizerBuilder, DomainTransformerConfig,
26
+     DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
27
+ )
28
+ from domain_tokenizer.schemas import FINANCE_SCHEMA
29
+
30
+ # 1. Build tokenizer from schema (Nubank-style: 97 domain tokens + BPE)
31
+ builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
32
+ builder.fit(all_events) # fit magnitude bins on training data
33
+ hf_tokenizer = builder.build(text_corpus=descriptions, bpe_vocab_size=8000)
34
+
35
+ # 2. Prepare packed training data (100% token utilization, zero padding waste)
36
+ dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=512)
37
+
38
+ # 3. Create model (GPT-style, NoPE, pre-norm — 24M params)
39
+ config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
40
+ model = DomainTransformerForCausalLM(config)
41
+
42
+ # 4. Pre-train with HF Trainer (cosine schedule, CLM objective)
43
+ pretrain_domain_model(
44
+     model, hf_tokenizer, dataset,
45
+     hub_model_id="org/finance-24m",   # auto push to HF Hub
46
+     num_epochs=10, learning_rate=3e-4,
47
+     bf16=True,                        # A100/H100
48
+     report_to="trackio",              # live monitoring
49
+ )
50
+
51
+ # 5. Fine-tune for downstream tasks (nuFormer-style joint fusion)
52
+ from domain_tokenizer import JointFusionModel
53
+ fusion = JointFusionModel(
54
+     transformer_model=model,    # pre-trained, unfrozen
55
+     n_tabular_features=291,     # hand-crafted tabular features
56
+     n_classes=1,                # binary: will user activate product?
57
+ )
58
+ # Train fusion model end-to-end on labeled data...
59
+ ```
60
+
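Step 2 above relies on sequence packing. As a rough mental model, packing concatenates every user's token ids into one stream and slices it into fixed `block_size` chunks so no position is spent on padding. The helper below is a sketch under that assumption, not the actual `prepare_clm_dataset` implementation:

```python
# Sketch of sequence packing (concatenate-and-chunk); prepare_clm_dataset may differ in details.
def pack_sequences(tokenized_users: list[list[int]], block_size: int = 512) -> list[list[int]]:
    stream: list[int] = []
    for ids in tokenized_users:
        stream.extend(ids)                       # one long stream across all users
    n_full = len(stream) // block_size           # only full blocks, so zero padding tokens
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]

blocks = pack_sequences([[1, 2, 3, 4, 5], [6, 7, 8, 9]], block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]] (the trailing partial block is dropped)
```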
61
  ## 🏦 Industry Validation: Nubank's nuFormer
62
 
63
  This isn't just theory. **Nubank** (100M+ customers, Latin America's largest digital bank) built exactly this and published the full recipe:
 
78
  | Timestamp `2025-03-15` | Calendar-unaware text fragments | `[Saturday, Afternoon, 2_days_later]` |
79
  | Cross-field patterns | Lost in flat token stream | Discovered via BPE-like merging: `{Electronics + $50-100}` → composite token |
80
 
81
  ## Documentation
82
 
83
  | Document | Description |
84
  |----------|-------------|
85
+ | 📄 [`docs/research_report.md`](docs/research_report.md) | **Research survey** — 31 papers across 5 paradigms, technical taxonomy, blueprint |
86
+ | 🏦 [`docs/nubank_nuformer_analysis.md`](docs/nubank_nuformer_analysis.md) | **Nubank reverse-engineering** — full pipeline reconstruction, 4 academic pillars |
87
+ | 🏗️ [`docs/adr/ADR-001-implementation-framework.md`](docs/adr/ADR-001-implementation-framework.md) | **Architecture Decision Record** — PyTorch+HF vs JAX/Keras, trade-offs, roadmap |
88
+ | 📊 [`docs/phase2_implementation_report.md`](docs/phase2_implementation_report.md) | **Implementation report** — Phase 2A-2C technical decisions, architecture, 124 tests |
89
 
90
  ## Project Roadmap
91
 
92
  ### Phase 1: Research & Survey ✅
93
+ - Literature survey (35+ papers), Nubank reverse-engineering, framework ADR
94
+
95
+ ### Phase 2: Core Library (v0.3.0 — 124 tests passing)
96
+ - **2A:** Domain tokenizer library — schema, 5 field tokenizers, HF-compatible builder
97
+ - **2B:** Model architecture: DomainTransformerForCausalLM (NoPE GPT), PLR embeddings (see the sketch below), DCNv2 + JointFusion
98
+ - **2C:** Pre-training pipeline: sequence packing, DataCollatorForLanguageModeling, HF Trainer
99
+ - **2D:** Fine-tuning pipeline (next)
100
+
101
+ ### Phase 3: Domain Demos
102
+ - Finance: fraud detection, credit scoring on real data
103
  - E-commerce: next purchase prediction, customer segmentation
104
 
105
+ ### Phase 4: Scale & Optimize
106
  - 330M param models, `torch.compile()`, ONNX export, ActionPiece vocabulary
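Phase 2B's numerical handling rests on PLR embeddings. The module below is a minimal sketch of the Periodic-Linear-ReLU idea from Gorishniy et al. (2022); the dimensions, initialization, and naming are illustrative assumptions rather than the contents of `plr_embeddings.py`.

```python
import torch
import torch.nn as nn

class PLREmbedding(nn.Module):
    """Sketch of a Periodic-Linear-ReLU embedding for one scalar feature.

    Illustrative only: sizes and init differ from the project's plr_embeddings.py.
    """
    def __init__(self, n_frequencies: int = 8, d_embedding: int = 16, sigma: float = 1.0):
        super().__init__()
        # Random frequencies define sin/cos features of the scalar input (the "Periodic" part).
        self.frequencies = nn.Parameter(torch.randn(n_frequencies) * sigma)
        # "Linear" + "ReLU" project the periodic features to the embedding size.
        self.linear = nn.Linear(2 * n_frequencies, d_embedding)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch,) scalar values, e.g. transaction amounts
        angles = 2 * torch.pi * self.frequencies * x.unsqueeze(-1)   # (batch, n_frequencies)
        periodic = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
        return torch.relu(self.linear(periodic))                     # (batch, d_embedding)

emb = PLREmbedding()
print(emb(torch.tensor([12.5, 99.0, 250.0])).shape)  # torch.Size([3, 16])
```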
107
 
108
  ## Repo Structure
109
 
110
  ```
111
+ src/domain_tokenizer/
112
+ ├── __init__.py                  # v0.3.0 — all public exports
113
+ ├── schema.py                    # DomainSchema, FieldSpec, FieldType
114
+ ├── tokenizers/
115
+ │   ├── field_tokenizers.py      # Sign, MagnitudeBucket, Calendar, Categorical, Discrete
116
+ │   └── domain_tokenizer.py      # DomainTokenizerBuilder → HF PreTrainedTokenizerFast
117
+ ├── schemas/
118
+ │   └── predefined.py            # FINANCE_SCHEMA, ECOMMERCE_SCHEMA, HEALTHCARE_SCHEMA
119
+ ├── models/
120
+ │   ├── configuration.py         # DomainTransformerConfig (24M/85M/330M presets)
121
+ │   ├── modeling.py              # DomainTransformerForCausalLM (NoPE, SDPA, weight-tied)
122
+ │   ├── plr_embeddings.py        # PeriodicLinearReLU (Gorishniy et al. 2022)
123
+ │   └── joint_fusion.py          # DCNv2 + JointFusionModel (nuFormer-style)
124
+ └── training/
125
+     ├── data_pipeline.py         # tokenize → pack → HF Dataset
126
+     └── pretrain.py              # pretrain_domain_model (HF Trainer)
127
+ tests/
128
+ ├── test_tokenizer.py            # 72 tests
129
+ ├── test_model.py                # 33 tests
130
+ └── test_training.py             # 19 tests
131
  ```
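`joint_fusion.py` pairs the pre-trained Transformer with a DCNv2 block over the hand-crafted tabular features. Below is a compact sketch of one DCNv2 cross layer (Wang et al., 2021) and how the crossed features could be concatenated with the sequence embedding; layer count, sizes, and pooling are illustrative, not the project's actual configuration.

```python
import torch
import torch.nn as nn

class CrossLayerV2(nn.Module):
    """One DCNv2 cross layer: x_{l+1} = x_0 * (W x_l + b) + x_l (element-wise product)."""
    def __init__(self, d: int):
        super().__init__()
        self.w = nn.Linear(d, d)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.w(xl) + xl   # explicit feature crossing plus residual

# Illustrative fusion: 291 hand-crafted tabular features crossed, then concatenated
# with the Transformer's pooled sequence embedding before a task head.
d_tab, d_seq = 291, 512
cross = nn.ModuleList([CrossLayerV2(d_tab) for _ in range(2)])
tab = torch.randn(4, d_tab)        # batch of tabular feature vectors
seq = torch.randn(4, d_seq)        # pooled Transformer outputs (hypothetical)
x = tab
for layer in cross:
    x = layer(tab, x)
fused = torch.cat([x, seq], dim=-1)   # (4, 803) fed to the classification head
print(fused.shape)
```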
132
 
133
  ## Key References
134
 
135
+ | Paper | Year | Role in domainTokenizer | Link |
136
+ |-------|------|------------------------|------|
137
+ | **nuFormer** (Nubank) | 2025 | Overall architecture blueprint | [arXiv](https://arxiv.org/abs/2507.23267) |
138
+ | **NoPE** | 2023 | No positional encoding in our attention design | [arXiv](https://arxiv.org/abs/2305.19466) |
139
+ | **PLR Embeddings** (Yandex) | 2022 | Numerical feature embeddings | [arXiv](https://arxiv.org/abs/2203.05556) |
140
+ | **DCN V2** (Google) | 2021 | Tabular feature crossing in joint fusion | [arXiv](https://arxiv.org/abs/2008.13535) |
141
+ | **RecFormer** | 2023 | Items-as-text tokenization philosophy | [arXiv](https://arxiv.org/abs/2305.13731) |
142
+ | **TIGER** (Google) | 2023 | Semantic IDs via RQ-VAE | [arXiv](https://arxiv.org/abs/2305.05065) |
143
+ | **ActionPiece** (DeepMind) | 2025 | BPE for user action sequences | [arXiv](https://arxiv.org/abs/2502.13581) |
144
+ | **Banking TF** | 2024 | Transaction tokenizer for French banking | [arXiv](https://arxiv.org/abs/2410.08243) |
145
+ | **Nested Learning (HOPE)** | 2025 | Continual learning via multi-timescale memory | [arXiv](https://arxiv.org/abs/2512.24695) |
 
146
 
147
  Full reference table (35+ papers): [`docs/research_report.md`](docs/research_report.md#10-complete-paper-reference-table)
148
 
149
  ## License
150
 
151
  MIT