# ADR-002: Dataset Selection for Phase 3 Domain Demos

- **Status:** Accepted
- **Date:** April 30, 2026
- **Decision:** Start with `mindweave/bank-transactions-us` for pipeline validation, then scale to Sparkov (finance), REES46 (e-commerce), and Synthea (healthcare)


## 1. Context

Phase 2 delivered a complete library (v0.4.0, 139 tests) with tokenizers, models, and training pipelines, all validated on synthetic data generated in test fixtures. Phase 3 requires running the full pipeline on real public datasets to produce trained models and benchmark against baselines.

We need datasets for three domains matching our predefined schemas:

| Schema | Required Fields | Minimum Scale for Demo |
|---|---|---|
| `FINANCE_SCHEMA` | timestamp, signed amount, text description | 100+ users × 10+ events |
| `ECOMMERCE_SCHEMA` | timestamp, price, event type, category, product text | 1,000+ users × 10+ events |
| `HEALTHCARE_SCHEMA` | timestamp, event type, severity, cost, clinical text | 1,000+ patients × 10+ events |
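
For concreteness, a single `FINANCE_SCHEMA` event in the plain-dict form the pipeline consumes (values taken from the mindweave inspection in Section 2.1; the shape matches what `row_to_event` in Section 4.1 produces):

```python
from datetime import datetime

# One FINANCE_SCHEMA event as a plain dict (example values from the
# mindweave dataset inspected below).
event = {
    "timestamp": datetime(2024, 1, 4),
    "amount": -17584.14,       # signed: negative = outflow, positive = inflow
    "amount_sign": -17584.14,  # the SignTokenizer reads only the sign
    "description": "Payroll - net wages",
}
```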

The strategy: start small to validate the pipeline end-to-end, then scale to production-sized datasets.


## 2. Dataset Analysis

### 2.1 Candidates Evaluated

We evaluated 8 datasets across 3 domains. Each was checked for: HuggingFace Hub availability, schema compatibility with our field types, scale (users × events), licensing, and accessibility (instant download vs. gated/external).

#### Finance Candidates

| Dataset | Source | Users | Events | Schema Fit | Access |
|---|---|---|---|---|---|
| mindweave/bank-transactions-us | HF Hub | ~20 accounts | ~400 | ✅ Perfect | Instant |
| Sparkov CC Fraud (kartik2112) | Kaggle | ~1,000 | 1.3M | ✅ Excellent | Kaggle account |
| IBM AML Transactions | GitHub | Thousands | 550K–55M | ✅ Good | Direct download |

**`mindweave/bank-transactions-us`**, inspected in detail:

- Config `bank_transactions`: 11 columns, 0.4 MB Parquet
- `transaction_date` (string, "2024-01-04") → maps to `FINANCE_SCHEMA.timestamp` ✅
- `amount` (float64, signed: -17584.14 for payroll, +1413.94 for deposits) → maps to `amount` and `amount_sign` ✅
- `description` (string, "Payroll - net wages", "Customer payment received") → maps to `description` ✅
- `source_module` (string, "payroll", "sales", "purchases") → bonus categorical field ✅
- `transaction_type` (string, "withdrawal", "deposit") → redundant with the sign, but useful for validation
- `bank_account_id` (UUID) → user grouping key ✅
- Linked `bank_accounts` table has company/bank metadata for potential tabular features

Scale limitation: ~20 accounts × ~20 transactions each = ~400 total. This is too small for meaningful pre-training, but schema-perfect for pipeline validation: every field maps directly to `FINANCE_SCHEMA` without transformation. The model will overfit immediately, but that's exactly what confirms the pipeline works.

**Sparkov CC Fraud**, the scale-up target:

- ~1,000 cardholders × ~1,300 transactions each = 1.3M total events
- Columns: `trans_date_trans_time`, `amt`, `merchant`, `category`, `cc_num`, `is_fraud`
- CC0 license (public domain)
- `is_fraud` provides a natural fine-tuning label (binary classification)
- Requires a Kaggle account for download (free, instant)

#### E-Commerce Candidates

| Dataset | Source | Users | Events | Schema Fit | Access |
|---|---|---|---|---|---|
| REES46 Behavioral | HF Hub | Millions | 42M | ✅ Perfect | Instant |
| Amazon Reviews 2023 | HF Hub (gated) | 33M | 571M | ⚠️ No price in reviews | HF token |

**REES46** (`kevykibbz/ecommerce-behavior-data-from-multi-category-store_oct-nov_2019`), inspected:

- 2 GB Parquet (10 files), fully accessible on HF Hub
- `event_time` (ISO 8601), `event_type` ("view", "cart", "purchase"), `product_id`, `category_code` ("electronics.smartphone"), `brand`, `price` (float64), `user_id`
- Every `ECOMMERCE_SCHEMA` field maps directly
- For pre-training: filter to purchase events for clean transaction sequences
- Scale: can subsample to 10K–100K users for the demo; millions available for production

#### Healthcare Candidates

| Dataset | Source | Patients | Events | Schema Fit | Access |
|---|---|---|---|---|---|
| Synthea 575K | HF Hub | 575K | Millions | ✅ Excellent | Instant |
| Synthea Direct | synthea.mitre.org | 100K–1M | Millions | ✅ Same | Direct download |
| MIMIC-IV | PhysioNet | 40K+ ICU | Millions | ✅ Gold standard | 1–2 day DUA |

**Synthea 575K** (`richardyoung/synthea-575k-patients`), inspected:

- 136 GB total across 18 Parquet files (allergies, conditions, encounters, medications, observations, procedures, etc.)
- Default config shows the allergies table: `START` (date), `PATIENT` (UUID), `DESCRIPTION`, `TYPE`, `CATEGORY`, `SEVERITY1`
- For richer sequences: load `encounters.parquet` (5.1 GB) with `Start`, `DESCRIPTION`, `Base_Cost`, `REASONDESCRIPTION`
- Fully synthetic: no IRB, no access restrictions, MIT/Apache 2.0 license

### 2.2 Schema Mapping Verification

Direct field mapping from `mindweave/bank-transactions-us` to `FINANCE_SCHEMA`:

```text
Dataset Column          →  FINANCE_SCHEMA Field     →  Tokenizer
─────────────────────────────────────────────────────────────────
amount (sign)           →  amount_sign              →  SignTokenizer (2 tokens)
amount (magnitude)      →  amount                   →  MagnitudeBucketTokenizer (21 bins)
transaction_date        →  timestamp                →  CalendarTokenizer (month/dow/dom/hour)
description             →  description              →  BPE subword tokenizer
─────────────────────────────────────────────────────────────────
bank_account_id         →  (user grouping key)      →  group-by for user sequences
source_module           →  (bonus: not in schema)   →  could extend schema
transaction_type        →  (redundant with sign)    →  validation check
```

Zero transformation needed. The `amount` field is already signed (negative = withdrawal, positive = deposit). The `description` field contains natural text suitable for BPE. The `transaction_date` is a standard date string. This is the cleanest possible mapping to our schema.
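
To make the first two rows of the mapping concrete, here is a toy sketch of sign + magnitude tokenization. The log-spaced binning and token strings below are illustrative assumptions, not the library's actual `SignTokenizer`/`MagnitudeBucketTokenizer` implementation:

```python
import math

def toy_amount_tokens(amount: float, n_bins: int = 21) -> list[str]:
    """Toy sign + magnitude tokenization (illustrative only)."""
    sign = "AMT_SIGN_NEG" if amount < 0 else "AMT_SIGN_POS"
    magnitude = abs(amount)
    # Log-spaced bins clamped to [0, n_bins - 1]; the real tokenizer's
    # bin edges may differ.
    bin_idx = 0 if magnitude < 1 else min(int(math.log10(magnitude) * 4), n_bins - 1)
    return [f"[{sign}]", f"[AMT_MAG_{bin_idx:02d}]"]

print(toy_amount_tokens(-17584.14))  # ['[AMT_SIGN_NEG]', '[AMT_MAG_16]']
print(toy_amount_tokens(1413.94))    # ['[AMT_SIGN_POS]', '[AMT_MAG_12]']
```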


## 3. Decision

**Phased approach: validate small → scale up**

| Phase | Dataset | Purpose | Scale |
|---|---|---|---|
| 3.0: Pipeline Validation | mindweave/bank-transactions-us | Verify end-to-end: load → tokenize → pack → train → loss decreases | ~400 events, ~20 accounts |
| 3.1: Finance Demo | Sparkov CC Fraud (Kaggle) | Train 24M model, fine-tune fraud detection, benchmark vs. LightGBM | 1.3M events, 1K users |
| 3.2: E-Commerce Demo | REES46 (HF Hub) | Train 24M model, next-purchase prediction | 42M events, subsample to 100K users |
| 3.3: Healthcare Demo | Synthea 575K (HF Hub) | Train 24M model, condition prediction | 575K patients, subsample encounters |

### Rationale

1. **Start with mindweave** because it's schema-perfect and instant. No data cleaning, no field renaming, no Kaggle credentials needed. The pipeline either works or it doesn't, and this dataset tells us in minutes.

2. **The model will overfit on 400 events; that's the point.** If loss doesn't decrease on 400 events, the pipeline is broken. If it does, the pipeline works and we can scale with confidence.

3. **Sparkov is the real finance demo.** 1,000 users × 1,300 events is the exact scale where a 24M-parameter model should learn meaningful patterns. The `is_fraud` label enables a direct comparison with LightGBM on the same data.

4. **REES46 is the flagship demo.** Millions of events, real behavioral data, perfect schema fit, instant HF download. This is the dataset that demonstrates domainTokenizer's value proposition most compellingly.

5. **Synthea is the healthcare proof point.** Fully synthetic (no access barriers), massive scale, multiple event types. It validates that the domain tokenizer approach generalizes beyond finance and e-commerce.


## 4. Implementation

### 4.1 Phase 3.0: Pipeline Validation with mindweave

**Goal:** Run the complete pipeline end-to-end on real data, verify loss decreases, confirm no bugs.

**Step 1: Load and explore the data**

```python
from datasets import load_dataset
import pandas as pd

# Load bank transactions
ds = load_dataset("mindweave/bank-transactions-us", "bank_transactions", split="train")
df = ds.to_pandas()

# Basic stats
print(f"Total transactions: {len(df)}")
print(f"Unique accounts: {df['bank_account_id'].nunique()}")
print(f"Date range: {df['transaction_date'].min()} to {df['transaction_date'].max()}")
print(f"Amount range: {df['amount'].min():.2f} to {df['amount'].max():.2f}")
print(f"Descriptions: {df['description'].nunique()} unique")
print(f"Source modules: {df['source_module'].value_counts().to_dict()}")
```

**Step 2: Convert to domainTokenizer event format**

```python
from datetime import datetime

def row_to_event(row):
    """Convert a DataFrame row to a FINANCE_SCHEMA event dict."""
    return {
        "amount_sign": row["amount"],          # SignTokenizer reads the sign
        "amount": row["amount"],               # MagnitudeBucketTokenizer reads abs value
        "timestamp": datetime.strptime(row["transaction_date"], "%Y-%m-%d"),
        "description": row["description"],     # BPE tokenizer
    }

# Group by account -> list of event sequences
user_sequences = []
for account_id, group in df.sort_values("transaction_date").groupby("bank_account_id"):
    events = [row_to_event(row) for _, row in group.iterrows()]
    user_sequences.append(events)

print(f"Users: {len(user_sequences)}")
print(f"Events per user: {[len(s) for s in user_sequences]}")
```

**Step 3: Build tokenizer, prepare data, train**

```python
from domain_tokenizer import (
    DomainTokenizerBuilder, DomainTransformerConfig,
    DomainTransformerForCausalLM, prepare_clm_dataset, pretrain_domain_model,
)
from domain_tokenizer.schemas import FINANCE_SCHEMA

# Build tokenizer
all_events = [e for seq in user_sequences for e in seq]
builder = DomainTokenizerBuilder(FINANCE_SCHEMA)
builder.fit(all_events)
hf_tokenizer = builder.build(
    text_corpus=[e["description"] for e in all_events],
    bpe_vocab_size=500,  # small vocab for small dataset
)

# Prepare packed dataset
dataset = prepare_clm_dataset(user_sequences, builder, hf_tokenizer, block_size=128)
print(f"Packed blocks: {len(dataset)} x 128 tokens")

# Create tiny model (for validation, not real training)
config = DomainTransformerConfig(
    vocab_size=hf_tokenizer.vocab_size,
    hidden_size=128, num_hidden_layers=4, num_attention_heads=4,
    intermediate_size=512,
)
model = DomainTransformerForCausalLM(config)
print(f"Model params: {sum(p.numel() for p in model.parameters()):,}")

# Train: expect loss to decrease rapidly (overfitting on small data = pipeline works)
pretrain_domain_model(
    model, hf_tokenizer, dataset,
    num_epochs=20,
    per_device_batch_size=4,
    gradient_accumulation_steps=1,
    learning_rate=3e-4,
    warmup_steps=10,
    logging_steps=5,
    save_strategy="no",
    report_to="none",
)
```

**Expected outcome:** Loss should drop from ~6.0 to <2.0 within 20 epochs on 400 events. If it does, the pipeline is validated. If it doesn't, there's a bug in tokenization, packing, or the model architecture.

Validation checks after training:

- Loss decreased monotonically (overfitting expected and desired)
- No NaN/inf in loss or gradients
- Token distribution is reasonable (no >50% UNK tokens); see the sketch below
- `builder.tokenize_event()` produces expected token strings for sample events
- `hf_tokenizer.decode()` on model output produces recognizable token strings
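
A minimal sketch of the UNK-rate check from the list above, assuming the built tokenizer exposes `unk_token_id` the way standard Hugging Face tokenizers do:

```python
# Count the UNK fraction across all packed blocks (sketch; assumes
# hf_tokenizer.unk_token_id is set, as on standard HF tokenizers).
unk_id = hf_tokenizer.unk_token_id
total_tokens = 0
unk_tokens = 0
for block in dataset:
    ids = block["input_ids"]
    total_tokens += len(ids)
    unk_tokens += sum(1 for t in ids if t == unk_id)
print(f"UNK fraction: {unk_tokens / total_tokens:.1%}")  # want well under 50%
```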

### 4.2 Phase 3.1: Finance Demo with Sparkov (After Validation)

```bash
# Download from Kaggle (free account required)
kaggle datasets download kartik2112/fraud-detection -p data/
unzip data/fraud-detection.zip -d data/sparkov/
```

```python
from datetime import datetime

import pandas as pd

df = pd.read_csv("data/sparkov/fraudTrain.csv")

def sparkov_to_event(row):
    return {
        "amount_sign": row["amt"],  # always positive in Sparkov; sign from context
        "amount": row["amt"],
        "timestamp": datetime.strptime(row["trans_date_trans_time"], "%Y-%m-%d %H:%M:%S"),
        "description": f"{row['merchant']} {row['category']}",
    }

# Group by cardholder
user_sequences = []
labels = []  # for fine-tuning: any fraud in user's history?
for cc_num, group in df.sort_values("trans_date_trans_time").groupby("cc_num"):
    events = [sparkov_to_event(row) for _, row in group.iterrows()]
    user_sequences.append(events)
    labels.append(int(group["is_fraud"].any()))

# Pre-train 24M model on 1K users x 1.3K events
config = DomainTransformerConfig.from_preset("24m", vocab_size=hf_tokenizer.vocab_size)
model = DomainTransformerForCausalLM(config)
# ... pretrain_domain_model(model, ..., bf16=True)  # requires GPU

# Fine-tune for fraud detection
# ... finetune_domain_model(fusion_model, ft_dataset, ...)
```

**Hardware:** a10g-large (24 GB VRAM), ~2–3 hours for the 24M model on 1.3M events.
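
The LightGBM comparison referenced in the phase table isn't specified in detail here. A minimal sketch, assuming simple per-user aggregate features over the same `user_sequences` and `labels` built above (the actual benchmark feature set may differ):

```python
import lightgbm as lgb
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Per-user aggregates as baseline features (assumed, not the ADR's final feature set).
X = np.array([
    [len(seq),                                    # transaction count
     float(np.mean([e["amount"] for e in seq])),  # mean amount
     float(np.max([e["amount"] for e in seq]))]   # max amount
    for seq in user_sequences
])
y = np.array(labels)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
clf = lgb.LGBMClassifier(n_estimators=200)
clf.fit(X_tr, y_tr)
print("Baseline AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```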

### 4.3 Phase 3.2: E-Commerce Demo with REES46

```python
from datasets import load_dataset

ds = load_dataset(
    "kevykibbz/ecommerce-behavior-data-from-multi-category-store_oct-nov_2019",
    split="train",
)

# Filter to purchases and subsample users
purchases = ds.filter(lambda x: x["event_type"] == "purchase")
# Group by user_id, take top 100K users by event count
# ... build ECOMMERCE_SCHEMA tokenizer, train 24M model
```

### 4.4 Phase 3.3: Healthcare Demo with Synthea

```python
from huggingface_hub import hf_hub_download
import pandas as pd

encounters = pd.read_parquet(hf_hub_download(
    "richardyoung/synthea-575k-patients",
    "data/encounters.parquet",
    repo_type="dataset",
))

# Group by PATIENT, sort by Start date
# Map: Start -> timestamp, Base_Cost -> amount, DESCRIPTION -> description
# ... build HEALTHCARE_SCHEMA tokenizer, train 24M model
```
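
A sketch of the grouping and mapping described in the comments above, using the encounters columns listed in Section 2.1 (`Start`, `Base_Cost`, `DESCRIPTION`); the remaining `HEALTHCARE_SCHEMA` fields (event type, severity) would come from joins with the other Synthea tables:

```python
# Build per-patient event sequences from the encounters table (sketch;
# iterrows is slow at this scale and would be vectorized in practice).
patient_sequences = []
for patient_id, group in encounters.sort_values("Start").groupby("PATIENT"):
    events = [
        {
            "timestamp": pd.to_datetime(row["Start"]),
            "amount": float(row["Base_Cost"]),
            "description": str(row["DESCRIPTION"]),
        }
        for _, row in group.iterrows()
    ]
    patient_sequences.append(events)
print(f"Patients: {len(patient_sequences):,}")
```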

## 5. Risks and Mitigations

| Risk | Impact | Mitigation |
|---|---|---|
| mindweave too small to catch scale bugs | Bugs only surface at 1M+ events | Run Sparkov immediately after validation passes |
| Sparkov has no negative amounts | SignTokenizer always produces `[AMT_SIGN_POS]` | Concatenate merchant + category as description; test the sign tokenizer separately on mindweave (which has signed amounts) |
| REES46 2 GB download is slow | Delays the e-commerce demo | Stream via HF `datasets` with `streaming=True` (see the sketch below) or subsample first |
| Synthea encounters lack numerical values | MagnitudeBucketTokenizer underutilized | Use `Base_Cost` for cost binning; join with `observations.parquet` for lab values |
| Model overfits on 400 events | Expected, not a bug | Overfitting on the tiny validation dataset = pipeline works. Move to Sparkov for real training. |
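
For the REES46 download risk, a minimal sketch of the streaming mitigation using the `datasets` library's standard `streaming=True` mode:

```python
from datasets import load_dataset

# Stream rows without downloading the full 2 GB up front.
ds = load_dataset(
    "kevykibbz/ecommerce-behavior-data-from-multi-category-store_oct-nov_2019",
    split="train",
    streaming=True,
)
for i, row in enumerate(ds):
    if row["event_type"] == "purchase":
        ...  # process incrementally
    if i >= 1000:
        break
```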

This ADR will be updated with results from each phase as demos are completed.