Add ADR-002: Dataset selection for Phase 3 demos — research findings, rationale, phased plan
docs/adr/ADR-002-dataset-selection.md
ADDED
# ADR-002: Dataset Selection for Phase 3 Domain Demos

> **Status:** Accepted
> **Date:** April 30, 2026
> **Decision:** Start with `mindweave/bank-transactions-us` for pipeline validation, then scale to Sparkov (finance), REES46 (e-commerce), and Synthea (healthcare)

---
## 1. Context

Phase 2 delivered a complete library (v0.4.0, 139 tests) with tokenizers, models, and training pipelines — all validated on synthetic data generated in test fixtures. Phase 3 requires running the full pipeline on **real public datasets** to produce trained models and benchmark against baselines.

We need datasets for three domains matching our predefined schemas:

| Schema | Required Fields | Minimum Scale for Demo |
|--------|-----------------|------------------------|
| `FINANCE_SCHEMA` | timestamp, signed amount, text description | 100+ users × 10+ events |
| `ECOMMERCE_SCHEMA` | timestamp, price, event type, category, product text | 1,000+ users × 10+ events |
| `HEALTHCARE_SCHEMA` | timestamp, event type, severity, cost, clinical text | 1,000+ patients × 10+ events |
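
For concreteness, here is one illustrative event per schema. Key names are hypothetical, taken from the required-field column above; the authoritative definitions are the schema objects in `domain_tokenizer.schemas`:

```python
# Illustrative events only: key names are assumptions based on the table
# above, not the exact schema definitions.
finance_event = {
    "timestamp": "2024-01-04",
    "amount": -17584.14,  # signed: negative = outflow
    "description": "Payroll - net wages",
}
ecommerce_event = {
    "timestamp": "2019-10-01 10:21:07 UTC",
    "price": 289.52,
    "event_type": "purchase",
    "category": "electronics.smartphone",
    "product_text": "samsung brand smartphone",
}
healthcare_event = {
    "timestamp": "2021-03-15",
    "event_type": "encounter",
    "severity": "MODERATE",
    "cost": 129.16,
    "clinical_text": "Encounter for symptom",
}
```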
The strategy: **start small to validate the pipeline end-to-end, then scale to production-sized datasets.**

---
## 2. Dataset Analysis

### 2.1 Candidates Evaluated

We evaluated 8 datasets across 3 domains. Each was checked for: HuggingFace Hub availability, schema compatibility with our field types, scale (users × events), licensing, and accessibility (instant download vs. gated/external). A quick way to run such checks without downloading a full dataset is sketched below.
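A minimal inspection sketch using the standard `datasets` streaming API (the repo id is the finance candidate evaluated below; any Hub dataset works the same way):

```python
from datasets import load_dataset

# Stream one example to check columns and value formats without a full download.
ds = load_dataset(
    "mindweave/bank-transactions-us", "bank_transactions",
    split="train", streaming=True,
)
first = next(iter(ds))
print(list(first.keys()))  # column names → compare against schema fields
print(first)               # one raw event → eyeball types, date format, sign
```
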
#### Finance Candidates

| Dataset | Source | Users | Events | Schema Fit | Access |
|---------|--------|-------|--------|------------|--------|
| **mindweave/bank-transactions-us** | HF Hub | ~20 accounts | ~400 | ✅ Perfect | Instant |
| **Sparkov CC Fraud** (kartik2112) | Kaggle | ~1,000 | 1.3M | ✅ Excellent | Kaggle account |
| IBM AML Transactions | GitHub | Thousands | 550K–55M | ✅ Good | Direct download |
**mindweave/bank-transactions-us** — inspected in detail:
- Config `bank_transactions`: 11 columns, 0.4 MB Parquet
- `transaction_date` (string, `"2024-01-04"`) → maps to `FINANCE_SCHEMA.timestamp` ✅
- `amount` (float64, signed: `-17584.14` for payroll, `+1413.94` for deposits) → maps to `amount` and `amount_sign` ✅
- `description` (string, `"Payroll - net wages"`, `"Customer payment received"`) → maps to `description` ✅
- `source_module` (string, `"payroll"`, `"sales"`, `"purchases"`) → bonus categorical field ✅
- `transaction_type` (string, `"withdrawal"`, `"deposit"`) → redundant with the sign, but useful for validation
- `bank_account_id` (UUID) → user grouping key ✅
- Linked `bank_accounts` table has company/bank metadata for potential tabular features

**Scale limitation:** ~20 accounts × ~20 transactions each = ~400 total events. This is too small for meaningful pre-training, but **schema-perfect for pipeline validation**: every field maps directly to `FINANCE_SCHEMA` without transformation. The model will overfit immediately, but that's exactly what confirms the pipeline works.
**Sparkov CC Fraud** — the scale-up target:
- ~1,000 cardholders × ~1,300 transactions each = 1.3M total events
- `trans_date_trans_time`, `amt`, `merchant`, `category`, `cc_num`, `is_fraud`
- CC0 license (public domain)
- `is_fraud` provides a natural fine-tuning label (binary classification)
- Requires a Kaggle account for download (free, instant)
#### E-Commerce Candidates

| Dataset | Source | Users | Events | Schema Fit | Access |
|---------|--------|-------|--------|------------|--------|
| **REES46 Behavioral** | HF Hub | Millions | 42M | ✅ Perfect | Instant |
| Amazon Reviews 2023 | HF Hub (gated) | 33M | 571M | ⚠️ No price in reviews | HF token |
**REES46** (`kevykibbz/ecommerce-behavior-data-from-multi-category-store_oct-nov_2019`) — inspected:
- 2 GB Parquet (10 files), fully accessible on HF Hub
- `event_time` (ISO 8601), `event_type` (`"view"`, `"cart"`, `"purchase"`), `product_id`, `category_code` (`"electronics.smartphone"`), `brand`, `price` (float64), `user_id`
- Every `ECOMMERCE_SCHEMA` field maps directly
- For pre-training: filter to `purchase` events for clean transaction sequences
- Scale: can subsample to 10K–100K users for the demo; millions available for production
#### Healthcare Candidates

| Dataset | Source | Patients | Events | Schema Fit | Access |
|---------|--------|----------|--------|------------|--------|
| **Synthea 575K** | HF Hub | 575K | Millions | ✅ Excellent | Instant |
| Synthea Direct | synthea.mitre.org | 100K–1M | Millions | ✅ Same | Direct download |
| MIMIC-IV | PhysioNet | 40K+ ICU | Millions | ✅ Gold standard | 1–2 day DUA |
**Synthea 575K** (`richardyoung/synthea-575k-patients`) — inspected:
- 136 GB total across 18 Parquet files (allergies, conditions, encounters, medications, observations, procedures, etc.)
- Default config shows the allergies table: `START` (date), `PATIENT` (UUID), `DESCRIPTION`, `TYPE`, `CATEGORY`, `SEVERITY1`
- For richer sequences: load `encounters.parquet` (5.1 GB) with `Start`, `DESCRIPTION`, `Base_Cost`, `REASONDESCRIPTION`
- Fully synthetic — no IRB, no access restrictions, MIT/Apache 2.0 license
### 2.2 Schema Mapping Verification

Direct field mapping from `mindweave/bank-transactions-us` to `FINANCE_SCHEMA`:

```
Dataset Column       →  FINANCE_SCHEMA Field     →  Tokenizer
───────────────────────────────────────────────────────────────────────────
amount (sign)        →  amount_sign              →  SignTokenizer (2 tokens)
amount (magnitude)   →  amount                   →  MagnitudeBucketTokenizer (21 bins)
transaction_date     →  timestamp                →  CalendarTokenizer (month/dow/dom/hour)
description          →  description              →  BPE subword tokenizer
───────────────────────────────────────────────────────────────────────────
bank_account_id      →  (user grouping key)      →  group-by for user sequences
source_module        →  (bonus: not in schema)   →  could extend the schema
transaction_type     →  (redundant with sign)    →  validation check
```
**Zero transformation needed.** The `amount` field is already signed (negative = withdrawal, positive = deposit). The `description` field contains natural text suitable for BPE. The `transaction_date` is a standard date string. This is the cleanest possible mapping to our schema.
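
A minimal sanity check of this mapping, using only pandas and the load pattern from Section 4.1 (column names as documented above):

```python
from datasets import load_dataset
import pandas as pd

df = load_dataset(
    "mindweave/bank-transactions-us", "bank_transactions", split="train"
).to_pandas()

# Every mapped column should be present, non-null, and well-typed.
assert df["amount"].notna().all()
assert df["description"].str.len().gt(0).all()
pd.to_datetime(df["transaction_date"], format="%Y-%m-%d")  # raises on a malformed date

# Both signs must occur, otherwise SignTokenizer degenerates to a single token.
pos_frac = (df["amount"] > 0).mean()
print(f"Positive amounts: {pos_frac:.1%}")
assert 0 < pos_frac < 1
```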
---
## 3. Decision

### Phased approach: validate small → scale up

| Phase | Dataset | Purpose | Scale |
|-------|---------|---------|-------|
| **3.0: Pipeline Validation** | `mindweave/bank-transactions-us` | Verify end-to-end: load → tokenize → pack → train → loss decreases | ~400 events, ~20 accounts |
| **3.1: Finance Demo** | Sparkov CC Fraud (Kaggle) | Train 24M model, fine-tune fraud detection, benchmark vs. LightGBM | 1.3M events, 1K users |
| **3.2: E-Commerce Demo** | REES46 (HF Hub) | Train 24M model, next-purchase prediction | 42M events, subsample to 100K users |
| **3.3: Healthcare Demo** | Synthea 575K (HF Hub) | Train 24M model, condition prediction | 575K patients, subsample encounters |
### Rationale

1. **Start with mindweave because it's schema-perfect and instant.** No data cleaning, no field renaming, no Kaggle credentials needed. The pipeline either works or it doesn't — this dataset tells us in minutes.

2. **The model will overfit on 400 events — that's the point.** If loss doesn't decrease on 400 events, the pipeline is broken. If it does, the pipeline works and we can scale with confidence.

3. **Sparkov is the real finance demo.** 1,000 users × 1,300 events is the exact scale where a 24M-parameter model should learn meaningful patterns. The `is_fraud` label enables a direct comparison with LightGBM on the same data.

4. **REES46 is the flagship demo.** Millions of events, real behavioral data, perfect schema fit, instant HF download. This is the dataset that demonstrates domainTokenizer's value proposition most compellingly.

5. **Synthea is the healthcare proof point.** Fully synthetic (no access barriers), massive scale, multiple event types. Validates that the domain tokenizer approach generalizes beyond finance and e-commerce.

---
## 4. Implementation

### 4.1 Phase 3.0: Pipeline Validation with mindweave

**Goal:** Run the complete pipeline end-to-end on real data, verify loss decreases, confirm no bugs.

**Step 1: Load and explore the data**
```python
from datasets import load_dataset
import pandas as pd

# Load bank transactions
ds = load_dataset("mindweave/bank-transactions-us", "bank_transactions", split="train")
df = ds.to_pandas()

# Basic stats
print(f"Total transactions: {len(df)}")
print(f"Unique accounts: {df['bank_account_id'].nunique()}")
print(f"Date range: {df['transaction_date'].min()} to {df['transaction_date'].max()}")
print(f"Amount range: {df['amount'].min():.2f} to {df['amount'].max():.2f}")
print(f"Descriptions: {df['description'].nunique()} unique")
print(f"Source modules: {df['source_module'].value_counts().to_dict()}")
```

**Step 2: Convert to domainTokenizer event format**
```python
from datetime import datetime

def row_to_event(row):
    """Convert a DataFrame row to a FINANCE_SCHEMA event dict."""
    return {
        "amount_sign": row["amount"],  # SignTokenizer reads the sign
        "amount": row["amount"],       # MagnitudeBucketTokenizer reads abs value
        "timestamp": datetime.strptime(row["transaction_date"], "%Y-%m-%d"),
        "description": row["description"],  # BPE tokenizer
    }

# Group by account → list of event sequences
user_sequences = []
for account_id, group in df.sort_values("transaction_date").groupby("bank_account_id"):
    events = [row_to_event(row) for _, row in group.iterrows()]
    user_sequences.append(events)

print(f"Users: {len(user_sequences)}")
print(f"Events per user: {[len(s) for s in user_sequences]}")
```

**Step 3: Build tokenizer, prepare data, train**
```python
from domain_tokenizer import (
    DomainTokenizerBuilder, DomainTransformerConfig,
    DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA

# Build tokenizer
all_events = [e for seq in user_sequences for e in seq]
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)
hf_tokenizer = builder.build(
    text_corpus=[e["description"] for e in all_events],
    bpe_vocab_size=500,  # small vocab for small dataset
)

# Prepare packed dataset
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=128)
print(f"Packed blocks: {len(dataset)} × 128 tokens")

# Create tiny model (for validation, not real training)
config = DomainTransformerConfig(
    vocab_size=hf_tokenizer.vocab_size,
    hidden_size=128, num_hidden_layers=4, num_attention_heads=4,
    intermediate_size=512,
)
model = DomainTransformerForCausalLM(config)
print(f"Model params: {sum(p.numel() for p in model.parameters()):,}")

# Train — expect loss to decrease rapidly (overfitting on small data = pipeline works)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    num_epochs=20,
    per_device_batch_size=4,
    gradient_accumulation_steps=1,
    learning_rate=3e-4,
    warmup_steps=10,
    logging_steps=5,
    save_strategy="no",
    report_to="none",
)
```
**Expected outcome:** Loss should drop from ~6.0 to below 2.0 within 20 epochs on 400 events. If it does, the pipeline is validated. If it doesn't, there's a bug in tokenization, packing, or the model architecture.

**Validation checks after training** (a scripted version of the token checks is sketched below):
- [ ] Loss decreased monotonically (overfitting expected and desired)
- [ ] No NaN/inf in loss or gradients
- [ ] Token distribution is reasonable (no >50% UNK tokens)
- [ ] `builder.tokenize_event()` produces expected token strings for sample events
- [ ] `hf_tokenizer.decode()` on model output produces recognizable token strings
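
A sketch of the token-level checks. `builder.tokenize_event()` is named in the checklist above; everything else (`convert_tokens_to_ids`, `unk_token_id`, `generate`) assumes the tokenizer and model follow standard HuggingFace interfaces, so adjust to the actual API:

```python
# Inspect tokenization of a sample event.
sample_event = user_sequences[0][0]
print("Sample tokens:", builder.tokenize_event(sample_event))

# UNK fraction across the corpus should stay well below 50%.
all_ids = [
    tid
    for seq in user_sequences for e in seq
    for tid in hf_tokenizer.convert_tokens_to_ids(builder.tokenize_event(e))
]
unk_frac = sum(i == hf_tokenizer.unk_token_id for i in all_ids) / len(all_ids)
print(f"UNK fraction: {unk_frac:.1%}")
assert unk_frac < 0.5

# Decode a short model sample and eyeball the token strings.
out = model.generate(max_new_tokens=32, do_sample=True)  # assumes HF generate()
print(hf_tokenizer.decode(out[0]))
```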
### 4.2 Phase 3.1: Finance Demo with Sparkov (After Validation)

```bash
# Download from Kaggle
kaggle datasets download kartik2112/fraud-detection -p data/
unzip data/fraud-detection.zip -d data/sparkov/
```
```python
from datetime import datetime

import pandas as pd

df = pd.read_csv("data/sparkov/fraudTrain.csv")

def sparkov_to_event(row):
    return {
        "amount_sign": row["amt"],  # always positive in Sparkov; sign from context
        "amount": row["amt"],
        "timestamp": datetime.strptime(row["trans_date_trans_time"], "%Y-%m-%d %H:%M:%S"),
        "description": f"{row['merchant']} {row['category']}",
    }

# Group by cardholder
user_sequences = []
labels = []  # for fine-tuning: any fraud in the user's history?
for cc_num, group in df.sort_values("trans_date_trans_time").groupby("cc_num"):
    events = [sparkov_to_event(row) for _, row in group.iterrows()]
    user_sequences.append(events)
    labels.append(int(group["is_fraud"].any()))

# Pre-train 24M model on 1K users × 1.3K events (tokenizer built as in Section 4.1)
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)
# ... pretrain_domain_model(model, ..., bf16=True)  # requires GPU

# Fine-tune for fraud detection
# ... finetune_domain_model(fusion_model, ft_dataset, ...)
```
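
For the LightGBM comparison named in the rationale, a minimal baseline sketch over hand-built per-user aggregates (the feature set here is illustrative, not the benchmark's final design):

```python
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Per-user aggregate features from the same sequences the transformer sees.
amounts = [np.array([e["amount"] for e in seq]) for seq in user_sequences]
X = np.array([[len(a), a.mean(), a.std(), a.max()] for a in amounts])
y = np.array(labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
clf = lgb.LGBMClassifier(n_estimators=200)
clf.fit(X_tr, y_tr)
print("Baseline AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```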
**Hardware:** a10g-large (24 GB VRAM), ~2–3 hours for the 24M model on 1.3M events.

### 4.3 Phase 3.2: E-Commerce Demo with REES46
```python
from datasets import load_dataset

ds = load_dataset(
    "kevykibbz/ecommerce-behavior-data-from-multi-category-store_oct-nov_2019",
    split="train",
)

# Filter to purchases and subsample users
purchases = ds.filter(lambda x: x["event_type"] == "purchase")
# Group by user_id, take the top 100K users by event count (see the sketch below)
# ... build ECOMMERCE_SCHEMA tokenizer, train 24M model
```
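
One way to implement the subsampling comment above, counting purchase events per user with pandas (the 100K cutoff matches the plan in Section 3):

```python
import pandas as pd

# Keep only the 100K most active purchasers.
user_ids = pd.Series(purchases["user_id"])
top_users = set(user_ids.value_counts().head(100_000).index)
demo = purchases.filter(lambda x: x["user_id"] in top_users)
print(f"Demo subset: {len(demo)} purchase events from {len(top_users)} users")
```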
### 4.4 Phase 3.3: Healthcare Demo with Synthea

```python
from huggingface_hub import hf_hub_download
import pandas as pd

encounters = pd.read_parquet(hf_hub_download(
    "richardyoung/synthea-575k-patients",
    "data/encounters.parquet",
    repo_type="dataset",
))

# Group by PATIENT, sort by Start date (see the sketch below)
# Map: Start → timestamp, Base_Cost → amount, DESCRIPTION → description
# ... build HEALTHCARE_SCHEMA tokenizer, train 24M model
```
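
A sketch of the grouping and mapping comments above, reusing the per-row conversion pattern from Section 4.1. Column names follow the listing in Section 2.1; verify casing against the actual Parquet files, and note that severity and event type would come from joining other tables:

```python
def encounter_to_event(row):
    # Only the fields available in encounters.parquet are mapped here.
    return {
        "timestamp": pd.to_datetime(row["Start"]),
        "amount": row["Base_Cost"],
        "description": row["DESCRIPTION"],
    }

patient_sequences = []
for patient_id, group in encounters.sort_values("Start").groupby("PATIENT"):
    patient_sequences.append([encounter_to_event(row) for _, row in group.iterrows()])

print(f"Patients: {len(patient_sequences)}")
```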
---

## 5. Risks and Mitigations
| Risk | Impact | Mitigation |
|------|--------|------------|
| mindweave too small to catch scale bugs | Bugs only surface at 1M+ events | Run Sparkov immediately after validation passes |
| Sparkov has no negative amounts | `SignTokenizer` always produces `[AMT_SIGN_POS]` | Concatenate merchant + category as the description; test the sign tokenizer separately on mindweave (which has signed amounts) |
| REES46 2 GB download is slow | Delays the e-commerce demo | Stream via HF datasets (`streaming=True`) or subsample first |
| Synthea encounters lack numerical values | `MagnitudeBucketTokenizer` underutilized | Use `Base_Cost` for cost binning; join with `observations.parquet` for lab values |
| Model overfits on 400 events | Expected — not a bug | Overfitting on the tiny validation dataset means the pipeline works; move to Sparkov for real training |

---

*This ADR will be updated with results from each phase as demos are completed.*