DDIA Expert
Expert knowledge base for Designing Data-Intensive Applications reference implementations. Contains 1,405 justified beliefs extracted from working Python implementations of the algorithms and data structures described in Martin Kleppmann's DDIA.
What is this?
This is an External Epistemic Memory (EEM): a model-agnostic knowledge base that any LLM can use via the reasons CLI or tool calling. Unlike a LoRA or fine-tune, this knowledge is not baked into model weights. It is external, inspectable, correctable, and works with any model.
Stats
| Metric | Value |
|---|---|
| Total beliefs | 1,405 |
| Status | 1,405 IN / 0 OUT |
| Premises (observations) | 1,224 |
| Derived (justified conclusions) | 181 |
| Nogoods (contradictions) | 0 |
| Retraction rate | 0% |
| Max derivation depth | 7 |
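The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below hard-codes the values from the table rather than reading network.json. The variable names are illustrative, and the definition of retraction rate as the share of beliefs that are OUT is an assumption.

```python
# Values copied from the Stats table (variable names are illustrative).
total_beliefs = 1405
premises = 1224
derived = 181
beliefs_in, beliefs_out = 1405, 0

# Every belief is either a premise or a derived conclusion.
assert premises + derived == total_beliefs
# Every belief is either IN or OUT.
assert beliefs_in + beliefs_out == total_beliefs

# Assumption: retraction rate = beliefs currently OUT / total beliefs.
retraction_rate = beliefs_out / total_beliefs
derived_share = derived / total_beliefs

print(f"retraction rate: {retraction_rate:.0%}")
print(f"derived share:   {derived_share:.1%}")
```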
Top Topics
| Topic | Beliefs |
|---|---|
| wal | 177 |
| btree | 79 |
| bitcask | 55 |
| lsm | 55 |
| sstable | 49 |
| compaction | 46 |
| range | 42 |
| hash | 40 |
| index | 40 |
| page | 40 |
| event | 36 |
| fsync | 36 |
| scan | 36 |
| recovery | 32 |
| merge | 31 |
| commit | 31 |
| hint | 30 |
Domain Coverage
- Write-Ahead Logging: WAL design tradeoffs, logical vs physical WAL, fsync ordering, batch writes, checkpoint and truncation strategies (133 beliefs)
- B-Trees: page management, splits, deletes, free list, crash recovery, COW variants (94 beliefs)
- LSM Trees: compaction strategies, memtable flush, WAL integration, read amplification (53 beliefs)
- Bitcask / Hash Storage: in-memory index design, compaction, hint files, memory-bound constraints (49 beliefs)
- SSTables: sorted string tables, merge strategies, range scans (39 beliefs)
- Event Sourcing: event stores, live projections, projection catch-up, state reconstruction (32 beliefs)
- Partitioning & Range Queries: range partitioning, range scans, partition-aware routing (29 beliefs)
- Bloom Filters: standard and counting Bloom filters, false positive analysis (25 beliefs)
- Gossip Protocol: failure detection, protocol correctness, cluster membership (24 beliefs)
- Raft Consensus: leader election, log replication, partition handling, safety properties (24 beliefs)
- Merkle Trees: diff-based anti-entropy, proof verification, tree construction (20 beliefs)
- Two-Phase Commit: blocking windows, coordinator recovery, lock ownership, abort guarantees (18 beliefs)
- Multi-Leader Replication: conflict resolution, split-brain detection, topology (18 beliefs)
- Consistent Hashing: ring-based partitioning, virtual nodes, rebalancing (14 beliefs)
- Hinted Handoff: temporary write forwarding, hint file format, replay (13 beliefs)
- Stream Processing: stream joins, windowing, event-time processing (13 beliefs)
- Lamport Clocks: logical timestamps, causal ordering, limitations (12 beliefs)
- MapReduce: map-side joins, reduce-side joins, shuffle (12 beliefs)
- MVCC: multi-version concurrency control, read snapshots, garbage collection (11 beliefs)
- Serializable Snapshot Isolation: write skew detection, serialization graph (11 beliefs)
- Anti-Entropy / Repair: Merkle-based repair, read repair, background sync (8 beliefs)
- CRDTs: OR-Set, tombstone growth, merge semantics (7 beliefs)
- Additional topics: unbundled databases, derived data systems, change data capture, snapshot isolation, linearizability, vector clocks, quorum reads/writes, fencing tokens, Avro schema evolution, PBFT (remaining beliefs)
How to Use
Import into a reasons database
reasons init
reasons import-json network.json
Query beliefs
reasons search "write-ahead logging"
reasons explain storage-crash-recovery-has-no-safe-path
reasons show raft-partition-creates-dual-hazard
Use as an MCP tool or CLI
Any LLM agent that can call reasons search, reasons show, and reasons explain can use this knowledge base. The agent does not need to be told it is an expert; the knowledge base speaks for itself.
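One way to expose the three commands to an agent is a thin subprocess wrapper. The following is a minimal sketch, not part of this repository: the helper names build_reasons_command and run_reasons are hypothetical, it assumes the reasons executable is on PATH with the database already imported, and it treats the CLI output as opaque text because the output format is not documented here.

```python
import shutil
import subprocess

# The three subcommands named in this README; nothing else is assumed to exist.
ALLOWED_SUBCOMMANDS = ("search", "show", "explain")


def build_reasons_command(subcommand: str, argument: str) -> list[str]:
    """Return the argv list for one reasons CLI call (hypothetical helper)."""
    if subcommand not in ALLOWED_SUBCOMMANDS:
        raise ValueError(f"unsupported subcommand: {subcommand!r}")
    return ["reasons", subcommand, argument]


def run_reasons(subcommand: str, argument: str) -> str:
    """Run the command and return its stdout as text.

    Assumes the reasons executable is on PATH and that the database was
    created beforehand with `reasons import-json network.json`.
    """
    if shutil.which("reasons") is None:
        raise RuntimeError("reasons CLI not found on PATH")
    completed = subprocess.run(
        build_reasons_command(subcommand, argument),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return completed.stdout


if __name__ == "__main__":
    # Only builds the argv list, so this runs without the CLI installed.
    print(build_reasons_command("search", "write-ahead logging"))
```

Passing the query as a separate list element, rather than formatting a shell string, avoids quoting problems when an agent supplies free-form search text.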
Key Beliefs
| Node | Summary |
|---|---|
| storage-crash-recovery-has-no-safe-path | No storage engine has a fully safe crash recovery path: compaction lacks atomicity, WAL replay ignores corruption |
| end-to-end-correctness-requires-unmet-storage-guarantees | End-to-end distributed correctness is unachievable: protocol-layer weaknesses combine with storage gaps |
| raft-partition-creates-dual-hazard | Network partitions create a compound safety hazard in Raft: isolated leader silently accepts writes |
| protocol-safety-validated-only-under-synchronous-model | Distributed protocol safety properties are validated exclusively under synchronous simulation |
| gossip-failure-detection-governs-cluster-correctness | Gossip-based failure detection is the single correctness bottleneck for the distributed cluster |
| hash-index-is-memory-bound-by-design | Hash index storage is fundamentally memory-bound: every key must reside in RAM |
| orset-tombstones-grow-monotonically | OR-Set tombstones only grow, creating unbounded memory pressure |
| two-wal-designs-in-repo | The repo contains both logical WAL (keyed operations) and physical WAL (raw page images) with different recovery semantics |
| derived-system-consistency-requires-flush-and-old-values | Derived systems require flush ordering and old-value capture for consistency |
| storage-has-no-self-healing-at-any-layer | Storage engines degrade monotonically during normal operation with no rebalancing or self-repair |
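To make the hash-index-is-memory-bound-by-design belief concrete, here is a minimal Bitcask-style sketch. It is not code from ddia-implementations: the class TinyHashIndexStore and its methods are hypothetical, and an in-memory io.BytesIO stands in for the append-only data file. The point is that values live in the log while every distinct key lives in a Python dict, so the key set bounds memory use.

```python
import io
from typing import Optional


class TinyHashIndexStore:
    """Append-only log plus an in-memory dict of key -> (offset, length)."""

    def __init__(self) -> None:
        self._log = io.BytesIO()  # stands in for an append-only data file
        self._index: dict[bytes, tuple[int, int]] = {}

    def put(self, key: bytes, value: bytes) -> None:
        offset = self._log.seek(0, io.SEEK_END)  # always append at the end
        self._log.write(value)
        # Newest offset wins; the old value stays in the log until compaction.
        self._index[key] = (offset, len(value))

    def get(self, key: bytes) -> Optional[bytes]:
        entry = self._index.get(key)
        if entry is None:
            return None
        offset, length = entry
        self._log.seek(offset)
        return self._log.read(length)

    def keys_in_memory(self) -> int:
        return len(self._index)

    def log_size(self) -> int:
        return len(self._log.getvalue())


store = TinyHashIndexStore()
store.put(b"a", b"1")
store.put(b"b", b"22")
store.put(b"a", b"333")  # overwrite: the log grows, the index does not

print(store.get(b"a"), store.keys_in_memory(), store.log_size())
```

Overwrites grow the log but not the index, which is why compaction reclaims disk space while the RAM requirement still scales with the number of distinct keys.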
Sources
Built from exploration of benthomasson/ddia-implementations, a set of Python reference implementations of algorithms from Designing Data-Intensive Applications by Martin Kleppmann, covering storage engines, replication, partitioning, transactions, consensus, and derived data systems.
Files
| File | Description |
|---|---|
| network.json | Full belief network (machine-readable, portable) |
| reasons.db | SQLite database (gitignored, regenerate with reasons import-json network.json) |
| CLAUDE.md | Agent instructions for using this knowledge base |
Quality
- All 1,405 beliefs are IN (none retracted)
- 1,224 premises grounded in direct code observations
- 181 derived beliefs justified from premises via SL justifications
- 0 nogoods: no contradictions detected
- Max derivation depth of 7, indicating multi-step reasoning chains
- Built and reviewed using the ftl-reasons derive and review-beliefs pipeline
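The derivation depth figure can be read as the length of the longest chain from a premise to a derived belief. The sketch below shows one way to compute it, under the assumption that premises have depth 0 and a derived belief sits one level above its deepest antecedent. The toy network and node names are invented for illustration and are not taken from network.json, whose schema is not described here.

```python
from typing import Dict, List, Optional

# Hypothetical toy network: node -> antecedents of its justification.
# Premises have an empty antecedent list.
toy_network: Dict[str, List[str]] = {
    "p1": [],
    "p2": [],
    "p3": [],
    "d1": ["p1", "p2"],  # derived directly from two premises
    "d2": ["d1", "p3"],  # builds on a derived belief
    "d3": ["d2"],
}


def derivation_depth(
    node: str,
    antecedents: Dict[str, List[str]],
    cache: Optional[Dict[str, int]] = None,
) -> int:
    """Depth 0 for premises, else 1 + the deepest antecedent (assumes no cycles)."""
    if cache is None:
        cache = {}
    if node in cache:
        return cache[node]
    parents = antecedents.get(node, [])
    if not parents:
        depth = 0
    else:
        depth = 1 + max(derivation_depth(p, antecedents, cache) for p in parents)
    cache[node] = depth
    return depth


max_depth = max(derivation_depth(n, toy_network) for n in toy_network)
print(max_depth)
```

In this toy network the longest chain is p1 → d1 → d2 → d3, so the maximum depth is 3. Under the same assumption, a depth of 7 means some derived belief rests on a chain of seven successive justifications.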
Limitations
- Focused on the specific Python implementations in ddia-implementations, not DDIA the book itself
- Does not cover Java/Go/Rust implementations of the same algorithms
- Testing infrastructure observations may not generalize beyond this codebase
- No ATMS or assumption-based beliefs (single-context TMS only)
- Crash recovery and safety findings reflect implementation gaps, not protocol-level proofs
Authors
- Ben Thomasson (@benthomasson)
License
MIT