DDIA Expert

Expert knowledge base for Designing Data-Intensive Applications reference implementations. Contains 1,405 justified beliefs extracted from working Python implementations of the algorithms and data structures described in Martin Kleppmann's DDIA.

What is this?

This is an External Epistemic Memory (EEM): a model-agnostic knowledge base that any LLM can use through the `reasons` CLI or tool calling. Unlike a LoRA or fine-tune, this knowledge is not baked into model weights. It is external, inspectable, correctable, and works with any model.

Stats

| Metric | Value |
| --- | --- |
| Total beliefs | 1,405 |
| Status | 1,405 IN / 0 OUT |
| Premises (observations) | 1,224 |
| Derived (justified conclusions) | 181 |
| Nogoods (contradictions) | 0 |
| Retraction rate | 0% |
| Max derivation depth | 7 |

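| Topic | Beliefs |
| --- | --- |
| wal | 177 |
| btree | 79 |
| bitcask | 55 |
| lsm | 55 |
| sstable | 49 |
| compaction | 46 |
| range | 42 |
| hash | 40 |
| index | 40 |
| page | 40 |
| event | 36 |
| fsync | 36 |
| scan | 36 |
| recovery | 32 |
| merge | 31 |
| commit | 31 |
| hint | 30 |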

Domain Coverage

Write-Ahead Logging: WAL design tradeoffs, logical vs physical WAL, fsync ordering, batch writes, checkpoint and truncation strategies (133 beliefs)
B-Trees: page management, splits, deletes, free list, crash recovery, COW variants (94 beliefs)
LSM Trees: compaction strategies, memtable flush, WAL integration, read amplification (53 beliefs)
Bitcask / Hash Storage: in-memory index design, compaction, hint files, memory-bound constraints (49 beliefs)
SSTables: sorted string tables, merge strategies, range scans (39 beliefs)
Event Sourcing: event stores, live projections, projection catch-up, state reconstruction (32 beliefs)
Partitioning & Range Queries: range partitioning, range scans, partition-aware routing (29 beliefs)
Bloom Filters: standard and counting Bloom filters, false positive analysis (25 beliefs)
Gossip Protocol: failure detection, protocol correctness, cluster membership (24 beliefs)
Raft Consensus: leader election, log replication, partition handling, safety properties (24 beliefs)
Merkle Trees: diff-based anti-entropy, proof verification, tree construction (20 beliefs)
Two-Phase Commit: blocking windows, coordinator recovery, lock ownership, abort guarantees (18 beliefs)
Multi-Leader Replication: conflict resolution, split-brain detection, topology (18 beliefs)
Consistent Hashing: ring-based partitioning, virtual nodes, rebalancing (14 beliefs)
Hinted Handoff: temporary write forwarding, hint file format, replay (13 beliefs)
Stream Processing: stream joins, windowing, event-time processing (13 beliefs)
Lamport Clocks: logical timestamps, causal ordering, limitations (12 beliefs)
MapReduce: map-side joins, reduce-side joins, shuffle (12 beliefs)
MVCC: multi-version concurrency control, read snapshots, garbage collection (11 beliefs)
Serializable Snapshot Isolation: write skew detection, serialization graph (11 beliefs)
Anti-Entropy / Repair: Merkle-based repair, read repair, background sync (8 beliefs)
CRDTs: OR-Set, tombstone growth, merge semantics (7 beliefs)
Additional topics: unbundled databases, derived data systems, change data capture, snapshot isolation, linearizability, vector clocks, quorum reads/writes, fencing tokens, Avro schema evolution, PBFT (remaining beliefs)

How to Use

Import into a reasons database

reasons init
reasons import-json network.json

Query beliefs

reasons search "write-ahead logging"
reasons explain storage-crash-recovery-has-no-safe-path
reasons show raft-partition-creates-dual-hazard

Use as an MCP tool or CLI

Any LLM agent that can call `reasons search`, `reasons show`, and `reasons explain` can use this knowledge base. The agent does not need to be told it is an expert; the knowledge base speaks for itself.
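As a rough illustration of the tool-calling route, the sketch below wraps the three subcommands shown above in a small Python helper that an agent harness could register as a tool. The helper names `build_reasons_command` and `run_reasons`, the allow-list, and the 30-second timeout are assumptions made for this example; they are not part of the `reasons` CLI, and only the `search`, `show`, and `explain` invocations shown in this README are used.

```python
import shutil
import subprocess

# Only the three read-only subcommands shown in this README are exposed.
ALLOWED_SUBCOMMANDS = ("search", "show", "explain")


def build_reasons_command(subcommand: str, argument: str) -> list[str]:
    """Build the argv list for one `reasons` call (hypothetical helper)."""
    if subcommand not in ALLOWED_SUBCOMMANDS:
        raise ValueError(f"unsupported subcommand: {subcommand!r}")
    if not argument.strip():
        raise ValueError("argument must be non-empty")
    # Passing a list (not a shell string) avoids shell quoting problems.
    return ["reasons", subcommand, argument]


def run_reasons(subcommand: str, argument: str, timeout: float = 30.0) -> str:
    """Run the CLI and return its stdout as text.

    Assumes the `reasons` executable is on PATH and that a database has
    already been created with `reasons init` and `reasons import-json`.
    """
    argv = build_reasons_command(subcommand, argument)
    if shutil.which(argv[0]) is None:
        raise FileNotFoundError("the `reasons` CLI was not found on PATH")
    completed = subprocess.run(
        argv, capture_output=True, text=True, timeout=timeout, check=True
    )
    return completed.stdout
```

With the database imported, a call such as `run_reasons("explain", "raft-partition-creates-dual-hazard")` would hand the CLI's text output back to the agent. Restricting the wrapper to an allow-list keeps the tool read-only, which seems a reasonable default for a knowledge base meant to be queried rather than edited by the agent.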

Key Beliefs

| Node | Summary |
| --- | --- |
| `storage-crash-recovery-has-no-safe-path` | No storage engine has a fully safe crash recovery path: compaction lacks atomicity, WAL replay ignores corruption |
| `end-to-end-correctness-requires-unmet-storage-guarantees` | End-to-end distributed correctness is unachievable: protocol-layer weaknesses combine with storage gaps |
| `raft-partition-creates-dual-hazard` | Network partitions create a compound safety hazard in Raft: isolated leader silently accepts writes |
| `protocol-safety-validated-only-under-synchronous-model` | Distributed protocol safety properties are validated exclusively under synchronous simulation |
| `gossip-failure-detection-governs-cluster-correctness` | Gossip-based failure detection is the single correctness bottleneck for the distributed cluster |
| `hash-index-is-memory-bound-by-design` | Hash index storage is fundamentally memory-bound: every key must reside in RAM |
| `orset-tombstones-grow-monotonically` | OR-Set tombstones only grow, creating unbounded memory pressure |
| `two-wal-designs-in-repo` | The repo contains both logical WAL (keyed operations) and physical WAL (raw page images) with different recovery semantics |
| `derived-system-consistency-requires-flush-and-old-values` | Derived systems require flush ordering and old-value capture for consistency |
| `storage-has-no-self-healing-at-any-layer` | Storage engines degrade monotonically during normal operation with no rebalancing or self-repair |

Sources

Built from exploration of benthomasson/ddia-implementations — Python reference implementations of algorithms from Designing Data-Intensive Applications by Martin Kleppmann, covering storage engines, replication, partitioning, transactions, consensus, and derived data systems.

Files

| File | Description |
| --- | --- |
| `network.json` | Full belief network (machine-readable, portable) |
| `reasons.db` | SQLite database (gitignored, regenerate with `reasons import-json network.json`) |
| `CLAUDE.md` | Agent instructions for using this knowledge base |

Quality

All 1,405 beliefs are IN (none retracted)
1,224 premises grounded in direct code observations
181 derived beliefs justified from premises via SL justifications
0 nogoods — no contradictions detected
Max derivation depth of 7, indicating multi-step reasoning chains
Built and reviewed using the ftl-reasons derive and review-beliefs pipeline
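To make the terms in this list concrete, here is a minimal sketch of how IN/OUT labels and derivation depth can be computed over a justification network. It is a toy model under stated assumptions, not the ftl-reasons implementation: the dictionary layout is invented for the example and is not the `network.json` schema, the node names are hypothetical, justifications carry only in-lists (full SL justifications also have out-lists, which are omitted here), the network is assumed to be acyclic, and depth is defined as 0 for a premise and one more than the deepest antecedent for a derived belief.

```python
# Toy network (hypothetical, NOT the network.json schema): each node maps to a
# list of justifications; each justification is a list of antecedent nodes.
# A node with no justifications is treated as a premise.
TOY_NETWORK = {
    "premise-a": [],
    "premise-b": [],
    "premise-c": [],
    "derived-x": [["premise-a", "premise-b"]],
    "derived-y": [["derived-x", "premise-c"]],
}


def in_nodes(network, retracted=frozenset()):
    """Return the set of nodes labelled IN.

    A premise is IN unless it has been retracted. A derived node is IN when
    at least one of its justifications has every antecedent IN.
    """
    in_set = {n for n, justs in network.items() if not justs and n not in retracted}
    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for node, justs in network.items():
            if node in in_set or not justs:
                continue
            if any(all(a in in_set for a in just) for just in justs):
                in_set.add(node)
                changed = True
    return in_set


def derivation_depth(network, node):
    """Depth 0 for a premise, else 1 + the deepest antecedent (acyclic only)."""
    justs = network[node]
    if not justs:
        return 0
    return 1 + max(derivation_depth(network, a) for just in justs for a in just)


all_in = in_nodes(TOY_NETWORK)
after_retraction = in_nodes(TOY_NETWORK, retracted={"premise-a"})
max_depth = max(derivation_depth(TOY_NETWORK, n) for n in TOY_NETWORK)
```

In this toy network every node starts IN, and the maximum depth is 2. Retracting `premise-a` leaves only `premise-b` and `premise-c` IN, because both derived nodes lose their only support. That propagation is the reason a correctable knowledge base tracks justifications rather than storing conclusions alone.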

Limitations

Focused on the specific Python implementations in ddia-implementations, not on the DDIA book itself
Does not cover Java/Go/Rust implementations of the same algorithms
Testing infrastructure observations may not generalize beyond this codebase
No ATMS or assumption-based beliefs (single-context TMS only)
Crash recovery and safety findings reflect implementation gaps, not protocol-level proofs

Authors

Ben Thomasson (@benthomasson)

License

MIT
