DDIA Expert

Expert knowledge base for reference implementations of Designing Data-Intensive Applications (DDIA). It contains 1,405 justified beliefs extracted from working Python implementations of the algorithms and data structures described in Martin Kleppmann's book.

What is this?

This is an External Epistemic Memory (EEM): a model-agnostic knowledge base that any LLM can use via the reasons CLI or tool calling. Unlike a LoRA or fine-tune, this knowledge is not baked into model weights. It is external, inspectable, correctable, and works with any model.

Stats

Metric Value
Total beliefs 1,405
Status 1,405 IN / 0 OUT
Premises (observations) 1,224
Derived (justified conclusions) 181
Nogoods (contradictions) 0
Retraction rate 0%
Max derivation depth 7

Top Topics

Topic Beliefs
wal 177
btree 79
bitcask 55
lsm 55
sstable 49
compaction 46
range 42
hash 40
index 40
page 40
event 36
fsync 36
scan 36
recovery 32
merge 31
commit 31
hint 30

Domain Coverage

  • Write-Ahead Logging: WAL design tradeoffs, logical vs physical WAL, fsync ordering, batch writes, checkpoint and truncation strategies (133 beliefs)
  • B-Trees: page management, splits, deletes, free list, crash recovery, COW variants (94 beliefs)
  • LSM Trees: compaction strategies, memtable flush, WAL integration, read amplification (53 beliefs)
  • Bitcask / Hash Storage: in-memory index design, compaction, hint files, memory-bound constraints (49 beliefs)
  • SSTables: sorted string tables, merge strategies, range scans (39 beliefs)
  • Event Sourcing: event stores, live projections, projection catch-up, state reconstruction (32 beliefs)
  • Partitioning & Range Queries: range partitioning, range scans, partition-aware routing (29 beliefs)
  • Bloom Filters: standard and counting Bloom filters, false positive analysis (25 beliefs)
  • Gossip Protocol: failure detection, protocol correctness, cluster membership (24 beliefs)
  • Raft Consensus: leader election, log replication, partition handling, safety properties (24 beliefs)
  • Merkle Trees: diff-based anti-entropy, proof verification, tree construction (20 beliefs)
  • Two-Phase Commit: blocking windows, coordinator recovery, lock ownership, abort guarantees (18 beliefs)
  • Multi-Leader Replication: conflict resolution, split-brain detection, topology (18 beliefs)
  • Consistent Hashing: ring-based partitioning, virtual nodes, rebalancing (14 beliefs)
  • Hinted Handoff: temporary write forwarding, hint file format, replay (13 beliefs)
  • Stream Processing: stream joins, windowing, event-time processing (13 beliefs)
  • Lamport Clocks: logical timestamps, causal ordering, limitations (12 beliefs)
  • MapReduce: map-side joins, reduce-side joins, shuffle (12 beliefs)
  • MVCC: multi-version concurrency control, read snapshots, garbage collection (11 beliefs)
  • Serializable Snapshot Isolation: write skew detection, serialization graph (11 beliefs)
  • Anti-Entropy / Repair: Merkle-based repair, read repair, background sync (8 beliefs)
  • CRDTs: OR-Set, tombstone growth, merge semantics (7 beliefs)
  • Additional topics: unbundled databases, derived data systems, change data capture, snapshot isolation, linearizability, vector clocks, quorum reads/writes, fencing tokens, Avro schema evolution, PBFT (remaining beliefs)

How to Use

Import into a reasons database

reasons init
reasons import-json network.json
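
Before importing, it can help to look at what network.json contains. The sketch below is a hypothetical helper, not part of the reasons CLI, and it deliberately assumes nothing about the file's schema: it only reports the top-level keys with the type and size of each value.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return a shallow summary of a JSON file without assuming its schema."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, dict):
        # Map each top-level key to the type name and size of its value.
        return {
            key: (
                type(value).__name__,
                len(value) if isinstance(value, (dict, list)) else None,
            )
            for key, value in data.items()
        }
    if isinstance(data, list):
        return {"<list>": ("list", len(data))}
    return {"<scalar>": (type(data).__name__, None)}


if __name__ == "__main__":
    path = Path("network.json")
    if path.exists():
        for key, (kind, size) in summarize_json(path).items():
            suffix = f" ({size} entries)" if size is not None else ""
            print(f"{key}: {kind}{suffix}")
```

Because the helper only reads the file, it is safe to run before or after reasons import-json.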

Query beliefs

reasons search "write-ahead logging"
reasons explain storage-crash-recovery-has-no-safe-path
reasons show raft-partition-creates-dual-hazard

Use as an MCP tool or CLI

Any LLM agent that can call reasons search, reasons show, and reasons explain can use this knowledge base. The agent does not need to be told it is an expert; the knowledge base speaks for itself.
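
One way to expose those three commands to an agent is a thin wrapper around the CLI. The following is a minimal sketch under stated assumptions: the function names build_command and run_reasons are hypothetical, the reasons executable is assumed to be on PATH, and the output is returned as raw text because its format is not specified here.

```python
import shutil
import subprocess

# Only the read-only subcommands named in this README are allowed.
ALLOWED_SUBCOMMANDS = ("search", "show", "explain")


def build_command(subcommand, argument):
    """Build the argv list for one read-only reasons query."""
    if subcommand not in ALLOWED_SUBCOMMANDS:
        raise ValueError(f"unsupported subcommand: {subcommand!r}")
    if not argument or not argument.strip():
        raise ValueError("argument must be a non-empty string")
    # Passing a list (not a shell string) avoids quoting problems
    # when the query contains spaces.
    return ["reasons", subcommand, argument]


def run_reasons(subcommand, argument, timeout=30):
    """Run the query and return the CLI's stdout as unparsed text."""
    if shutil.which("reasons") is None:
        raise RuntimeError("the reasons CLI was not found on PATH")
    result = subprocess.run(
        build_command(subcommand, argument),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout
```

A tool-calling agent could register run_reasons as a single tool with two string parameters, for example run_reasons("search", "write-ahead logging"). Restricting the subcommand to an allow-list keeps the agent from invoking commands that modify the database.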

Key Beliefs

Node Summary
storage-crash-recovery-has-no-safe-path No storage engine has a fully safe crash recovery path: compaction lacks atomicity, WAL replay ignores corruption
end-to-end-correctness-requires-unmet-storage-guarantees End-to-end distributed correctness is unachievable: protocol-layer weaknesses combine with storage gaps
raft-partition-creates-dual-hazard Network partitions create a compound safety hazard in Raft: isolated leader silently accepts writes
protocol-safety-validated-only-under-synchronous-model Distributed protocol safety properties are validated exclusively under synchronous simulation
gossip-failure-detection-governs-cluster-correctness Gossip-based failure detection is the single correctness bottleneck for the distributed cluster
hash-index-is-memory-bound-by-design Hash index storage is fundamentally memory-bound: every key must reside in RAM
orset-tombstones-grow-monotonically OR-Set tombstones only grow, creating unbounded memory pressure
two-wal-designs-in-repo The repo contains both logical WAL (keyed operations) and physical WAL (raw page images) with different recovery semantics
derived-system-consistency-requires-flush-and-old-values Derived systems require flush ordering and old-value capture for consistency
storage-has-no-self-healing-at-any-layer Storage engines degrade monotonically during normal operation with no rebalancing or self-repair
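
To make one of these beliefs concrete, the sketch below illustrates orset-tombstones-grow-monotonically. It is a textbook-style OR-Set written for this README, not the code from ddia-implementations; the class and attribute names are assumptions. Each add creates a unique tag, each remove moves the observed tags into a tombstone set, and nothing ever deletes a tombstone.

```python
import uuid


class ORSet:
    """Minimal observed-remove set with explicit tombstones (illustrative only)."""

    def __init__(self):
        self.adds = set()        # every (element, tag) pair ever added
        self.tombstones = set()  # (element, tag) pairs that were observed and removed

    def add(self, element):
        # A fresh tag makes this add distinguishable from all earlier adds.
        self.adds.add((element, uuid.uuid4().hex))

    def remove(self, element):
        # Only tags this replica has observed are tombstoned.
        self.tombstones |= {pair for pair in self.adds if pair[0] == element}

    def contains(self, element):
        return any(pair[0] == element for pair in self.adds - self.tombstones)

    def merge(self, other):
        # Both sets only grow under merge, which is what makes merge safe
        # to repeat and reorder, and what prevents tombstones from shrinking.
        self.adds |= other.adds
        self.tombstones |= other.tombstones


if __name__ == "__main__":
    s = ORSet()
    for _ in range(100):
        s.add("x")
        s.remove("x")
    print(s.contains("x"), len(s.tombstones))  # prints: False 100
```

The set is logically empty after the loop, yet it still holds one tombstone per removed add. Reclaiming that space needs extra coordination between replicas, which this sketch does not attempt.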

Sources

Built from exploration of benthomasson/ddia-implementations, a set of Python reference implementations of algorithms from Designing Data-Intensive Applications by Martin Kleppmann, covering storage engines, replication, partitioning, transactions, consensus, and derived data systems.

Files

File Description
network.json Full belief network (machine-readable, portable)
reasons.db SQLite database (gitignored, regenerate with reasons import-json network.json)
CLAUDE.md Agent instructions for using this knowledge base

Quality

  • All 1,405 beliefs are IN (none retracted)
  • 1,224 premises grounded in direct code observations
  • 181 derived beliefs justified from premises via SL justifications
  • 0 nogoods: no contradictions detected
  • Max derivation depth of 7, indicating multi-step reasoning chains
  • Built and reviewed using the ftl-reasons derive and review-beliefs pipeline
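
The terms IN, OUT, SL justification, and derivation depth in the list above come from truth maintenance systems. The sketch below shows the general idea only and is not how ftl-reasons is implemented: a derived belief is IN when every belief on its in-list is IN and every belief on its out-list is OUT. The Node class, the label function, and the depth rule (premises at depth 0, a derived belief one level above its deepest in-list antecedent) are assumptions made for illustration, and the network is assumed to be acyclic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    premise: bool = False
    in_list: tuple = ()   # beliefs that must be IN
    out_list: tuple = ()  # beliefs that must be OUT


def label(nodes):
    """Return {name: (is_in, depth)} for an acyclic belief network."""
    by_name = {node.name: node for node in nodes}
    memo = {}

    def visit(name):
        if name in memo:
            return memo[name]
        node = by_name.get(name)
        if node is None or not (node.premise or node.in_list or node.out_list):
            # A belief with no justification at all is OUT.
            memo[name] = (False, 0)
        elif node.premise:
            memo[name] = (True, 0)
        else:
            ins = [visit(n) for n in node.in_list]
            outs = [visit(n) for n in node.out_list]
            valid = all(is_in for is_in, _ in ins) and not any(
                is_in for is_in, _ in outs
            )
            depth = 1 + max((d for _, d in ins), default=0)
            memo[name] = (valid, depth if valid else 0)
        return memo[name]

    return {name: visit(name) for name in by_name}


if __name__ == "__main__":
    network = [
        Node("p1", premise=True),
        Node("p2", premise=True),
        Node("d1", in_list=("p1", "p2")),
        Node("d2", in_list=("d1",), out_list=("x",)),   # "x" is absent, so OUT
        Node("d3", in_list=("d2",), out_list=("p1",)),  # blocked: p1 is IN
    ]
    for name, (is_in, depth) in label(network).items():
        print(name, "IN" if is_in else "OUT", depth)
```

In this toy network d2 is IN at depth 2, while d3 is OUT because a belief on its out-list is IN. A reported maximum depth of 7 would correspond, under the same assumed rule, to a chain of seven justification steps above a premise.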

Limitations

  • Focused on the specific Python implementations in ddia-implementations, not DDIA the book itself
  • Does not cover Java/Go/Rust implementations of the same algorithms
  • Testing infrastructure observations may not generalize beyond this codebase
  • No ATMS or assumption-based beliefs (single-context TMS only)
  • Crash recovery and safety findings reflect implementation gaps, not protocol-level proofs

Authors

License

MIT
