Nomearod Claude Opus 4.6 (1M context) committed 29 days ago

Commit

3241b7c

1 Parent(s): 23de799

docs(k8s): Week 1 step 2 — lock SOURCES.md categories + author QUESTION_PLAN.md

Week 1 step 2 of the v1.1 plan: lock the K8s corpus scope and author
the structural guide for step 5's 25-question golden-set authoring.
Scope deliberately narrower than "commit to 30-40 verified URLs in
one session": per cross-cutting #8 pilot-first discipline, per-URL
resolution and per-page license verification are deferred to step 4
ingestion. A category-level lock plus an explicit step-4 checklist is
the 1-hour scope the plan's step 2 budget anticipates.

SOURCES.md changes:
- Status flipped from "Placeholder" to "Locked at category level".
- 28-page category breakdown table (9 core workloads, 5 networking,
5 config+state, 4 scheduling, 1 access, 2 health/autoscaling,
2 security). 25 questions at ~1/page with 3 pages of headroom for
multi-hop fan-out.
- 8 already-pulled pages documented with best-known URLs + pilot
evidence (k8s_network_policies.md is called out as the pilot_005
flavor-B target so step 4 does not re-ingest it under a new file
name).
- 20 remaining pages listed per category with a step-4 verification
checklist (URL resolution, license confirmation, pull-date record,
rationale re-check against QUESTION_PLAN.md).
- Content license documented: CC BY 4.0 default with per-page
verification discipline (same pattern as the v1.1 plan's
Lynx/HaluBench CC BY-NC handling).
- Post-ingest smoke-query gate added before step 5 authoring.

QUESTION_PLAN.md new file (261 lines):
- Target CRAG distribution (5–6 simple, 3–4 simple_w_condition,
3–4 comparison, 5–6 multi_hop, 3–4 false_premise, 0–3 set/agg/pph).
- Per-type source-page mapping — each CRAG type points to specific
pages from SOURCES.md that support questions of that type. The
mapping is the authoring guide step 5 consults when drafting
specific question texts.
- false_premise split: at least 1 flavor A (pure refusal) + at
least 1 flavor B (documented negative) with pilot_005 called out
as the existing flavor-B reference and three candidate flavor-B
pages listed for expansion (Pod Security Standards, RBAC, more
NetworkPolicy clauses).
- time_sensitive flag placement: 2–3 questions distributed across
≥2 CRAG types, each tied to a specific K8s version state
(HPA v1 vs v2, PSA stable at 1.25, PSP removal at 1.25).
- Difficulty distribution guidance (8–10 easy, 10–12 medium, 4–6
hard).
- Authoring checklist per question — 14 required schema fields with
explicit notes on which are flavor-A-specific, which match the
v1.1 plan's source-attribution methodology, and which may be
retired (is_multi_hop → question_type migration contingent on
harness.py update).
- Pilot-first validation gates BEFORE the 25-question authoring
session: (1) step 4 ingestion verified via smoke queries;
(2) existing 6-question pilot must still pass its gates against
the expanded corpus; (3) 2–3 hand-drafted questions tested
through the pipeline before bulk authoring. Each gate honors the
cross-cutting #8 discipline that caught six issues across four
sessions with zero false positives.

What this commit does NOT contain:
- Specific 25-question texts (step 5 authoring, fresh session).
- Verified kubernetes.io URLs for the 20 remaining pages (step 4).
- Pulled markdown content for the 20 remaining pages (step 4).
- Updates to agent_bench/evaluation/datasets/k8s_golden_pilot.json
(the 6-question pilot stays as-is until step 5 replaces it).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Files changed (2)

data/k8s_docs/QUESTION_PLAN.md +261 -0
data/k8s_docs/SOURCES.md +129 -29

data/k8s_docs/QUESTION_PLAN.md ADDED

	@@ -0,0 +1,261 @@

+# K8s Golden Dataset — Question Plan
+**Status:** Structural guide for Week 1 step 5 authoring (v1.1 plan).
+This document defines the 25-question target distribution, per-type
+source-page mapping, and authoring constraints. It does NOT contain
+the 25 specific question texts — those are authored during step 5 in
+a fresh session, per cross-cutting #8 pilot-first discipline.
+**Upstream contracts:**
+- Taxonomy: CRAG 8-type (Yang et al., NeurIPS 2024) — see DECISIONS.md
+  "K8s golden dataset uses CRAG's 8-type taxonomy as the schema".
+- Source pages: see `SOURCES.md` (28 pages, category-locked; 8 already
+  pulled, 20 to pull at step 4).
+- Schema: see `agent_bench/evaluation/harness.py` `GoldenQuestion`
+  plus the v1.1 plan's methodology #3 source-attribution fields.
+- Flavor A/B for `false_premise`: see DECISIONS.md "False-premise
+  questions come in two flavors".
+---
+## Target distribution (25 questions total)
+| CRAG type | Count | Schema field | Notes |
+|---|---|---|---|
+| `simple` | 5–6 | `question_type: "simple"` | Baseline retrieval: direct lookup in 1 page, 1–2 sentence answer. |
+| `simple_w_condition` | 3–4 | `question_type: "simple_w_condition"` | Answer depends on a condition stated in the question (enforcement level, volume type, Pod phase). |
+| `comparison` | 3–4 | `question_type: "comparison"` | Answer compares two concepts across 2 pages; reranker stress. |
+| `multi_hop` | 5–6 | `question_type: "multi_hop"` | Answer synthesizes 2–4 pages; reranker-stressing by construction. |
+| `false_premise` | 3–4 | `question_type: "false_premise"` | Grounded refusal stress. Flavor A (pure refusal) + flavor B (documented negative). |
+| `set` / `aggregation` / `post_processing_heavy` | 0–3 | respective values | Optional. Include only if natural from corpus content. |
+| **Total** | **25** | | |
+**Orthogonal flag:** `time_sensitive: bool` on 2–3 questions. Does
+NOT replace `question_type` — it's an independent property for
+version-bounded content (feature state, API version migration,
+deprecations).
+---
+## Per-type source-page mapping
+Each row identifies the K8s concept pages a question of that type
+should draw from. Multi-hop and comparison questions list multiple
+pages intentionally.
+### simple (5–6 slots)
+Pool questions where a 1–2 sentence answer lives inside a single page.
+| Candidate source | CRAG slot justification |
+|---|---|
+| `k8s_pods.md` | Pod IP semantics, container sharing, ephemeral containers |
+| `k8s_deployment.md` | What a Deployment is, declarative update mechanic |
+| `k8s_configmap.md` | What a ConfigMap is, immutable field |
+| `k8s_secret.md` | What a Secret is, volume mount modes |
+| RBAC Authorization *(step 4 page)* | RBAC primitive definitions (Role, RoleBinding, ClusterRole) |
+| StatefulSet *(step 4 page)* | StatefulSet identity guarantees |
+| DaemonSet *(step 4 page)* | One-per-node scheduling contract |
+| Namespaces *(step 4 page)* | Namespace scoping for resources |
+**Authoring rule:** Each `simple` question must have exactly one
+expected source page and 1–2 source snippets. KHR target ≥ 0.60 on
+the authored keywords.
+### simple_w_condition (3–4 slots)
+Pool questions where the answer explicitly depends on a condition
+named in the question.
+| Candidate source | Condition that shapes the answer |
+|---|---|
+| `k8s_pod_security_admission.md` | enforcement level: `enforce` / `audit` / `warn` |
+| `k8s_secret.md` | mount mode: environment variable vs file in volume |
+| Liveness/Readiness/Startup Probes *(step 4)* | probe type: liveness vs readiness vs startup |
+| Volumes *(step 4)* | volume type: emptyDir vs configMap vs persistentVolumeClaim |
+| Node-pressure Eviction (`k8s_node_pressure_eviction.md`) | resource under pressure: memory vs disk vs inodes |
+**Authoring rule:** The condition must be named in the question
+stem, not implied. The expected answer must change materially if the
+condition flips. Example: "How is a Secret mounted as a volume
+versus consumed as an environment variable?" is a valid
+`simple_w_condition`; "How is a Secret mounted?" is `simple`.
+### comparison (3–4 slots)
+Pool questions where the answer explicitly compares two K8s concepts
+that span 2 pages.
+| Page pair | Concept compared |
+|---|---|
+| Deployment vs StatefulSet *(step 4)* | stateless vs stateful workload semantics |
+| Deployment vs DaemonSet *(step 4)* | replica-count vs one-per-node scheduling |
+| ConfigMap vs Secret | non-confidential vs confidential data, mount parity |
+| Service vs Ingress *(step 4)* | L4 vs L7 exposure |
+| Taints/Tolerations vs Node Affinity *(step 4)* | opt-out vs opt-in placement |
+| Liveness vs Readiness probes *(step 4)* | restart vs traffic-routing semantics |
+**Authoring rule:** The question must force retrieval from both
+pages. Reranker stress is intentional — questions where BM25 would
+find one side but miss the other are the target. Expected sources:
+2 pages minimum.
+### multi_hop (5–6 slots)
+Pool questions where the answer synthesizes 2–4 pages. These are
+the primary reranker stressors.
+| Page set (example) | Hop path |
+|---|---|
+| Pod + Service + Ingress *(step 4)* | How external traffic reaches a Pod through Service → Ingress |
+| Deployment + ReplicaSet + Pod | How a Deployment rollout changes the underlying ReplicaSet and Pod set |
+| ConfigMap + Deployment | How a ConfigMap update propagates to Pods via env vars or mounted volume |
+| HPA + Deployment + Metrics Server *(partial step 4)* | How HPA reads metrics and scales a Deployment |
+| NetworkPolicy + Pod + Namespace *(partial step 4)* | How NetworkPolicy selectors resolve across namespaces |
+| Job + Pod + Container lifecycle *(partial step 4)* | How a Job's completions and parallelism interact with Pod restart policy |
+**Authoring rule:** Expected sources ≥ 2 pages. The question must
+not be answerable from any single page alone. `source_chunk_ids`
+must list at least one chunk from each expected page; partial
+credit is granted in the evaluator if at least one expected chunk is
+cited (see `agent_bench/evaluation/harness.py`).
+### false_premise (3–4 slots)
+Pool questions whose premise is wrong. Split across two flavors:
+**Flavor A — pure refusal** (at least 1 slot):
+- Premise targets a capability that does not exist in the K8s corpus
+  (not in any pulled page).
+- Example seed: "How do I configure Claude API rate limits in a
+  Kubernetes Deployment?" (wrong domain — Claude API is not a K8s
+  concept)
+- Schema: `category: "out_of_scope"`, `expected_sources: []`,
+  `source_snippets: []`.
+- Evaluator expectation: answer contains refusal phrasing AND cites
+  zero sources.
+**Flavor B — documented negative** (at least 1 slot, ideally 2):
+- Corpus contains an explicit negative statement (e.g.
+  NetworkPolicy "Anything TLS related" limitation at chunk 63 of
+  `k8s_network_policies.md`).
+- Example already in pilot: `k8s_pilot_005` (NetworkPolicy mTLS).
+- Schema: `category: "retrieval"`, `question_type: "false_premise"`,
+  `expected_sources: [<negative-answer page>]`,
+  `source_snippets: [<verbatim negative statement>]`.
+- Evaluator expectation: answer reports the documented negative
+  with citation, does NOT open with "the documentation does not
+  provide instructions" phrasing (per pilot_005 Fix 1 + Fix 2
+  revert analysis).
+**Other flavor-B candidate pages for authoring:**
+- Pod Security Standards — explicit statements about what each
+  profile does NOT permit
+- RBAC Authorization — explicit statements about what RBAC does NOT
+  provide (e.g. no deny rules)
+- NetworkPolicy — additional negative clauses beyond the pilot_005
+  mTLS one
+### set / aggregation / post_processing_heavy (0–3 slots)
+Include only if a K8s page naturally supports the pattern:
+- `set`: "Which Kubernetes resources can expose a Service?" (answer
+  is a set drawn from the Service page). Include 0–1 of this type
+  if a clean example emerges; otherwise leave slot empty.
+- `aggregation`: Unlikely to fit K8s docs (docs describe concepts,
+  not tabular data). Likely leave empty.
+- `post_processing_heavy`: Unlikely to fit K8s docs. Likely leave
+  empty.
+**Default:** Leave 0–3 as **0**. Only author these if a question
+emerges organically during step 5. Do not force-author to hit a
+target count; the plan explicitly says "0–3, included only where
+corpus content naturally supports".
+---
+## `time_sensitive` flag placement (2–3 questions)
+Flag questions whose correct answer depends on K8s version state:
+| Candidate | Why time-sensitive |
+|---|---|
+| HPA API version | `autoscaling/v1` vs `autoscaling/v2` — v2 stable since 1.23 |
+| Pod Security Admission stability | "stable as of v1.25" — feature state in the page |
+| PodSecurityPolicy removal | PSP removed in 1.25; migration path to PSA |
+**Authoring rule:** Set `time_sensitive: true` on exactly 2–3
+questions. Distribute across ≥2 different CRAG types (e.g. one
+`simple`, one `simple_w_condition`) so the flag is not concentrated
+in a single type. Each `time_sensitive` question must cite a
+specific K8s version or feature state in the source snippet,
+otherwise the flag is not load-bearing.
+---
+## Difficulty distribution
+Loose guidance, not a hard constraint:
+- `easy`: 8–10 questions — mostly `simple` and single-page
+  `simple_w_condition`
+- `medium`: 10–12 questions — `comparison`, most `multi_hop`,
+  straightforward `false_premise`
+- `hard`: 4–6 questions — deep `multi_hop`, flavor-B `false_premise`,
+  `time_sensitive` + `multi_hop` combinations
+The pilot's 6-question set is all `easy`/`medium`. Step 5 should add
+the `hard` tier.
+---
+## Authoring checklist (per question)
+For each of the 25 questions, the step 5 author must fill in:
+| Field | Required | Notes |
+|---|---|---|
+| `id` | yes | `k8s_<NNN>` zero-padded (e.g. `k8s_001`) |
+| `question` | yes | Natural-language question in the voice of a recruiter or developer |
+| `expected_answer_keywords` | yes | 3–6 keywords that MUST appear in a correct answer; drives `keyword_hit_rate` |
+| `expected_sources` | yes | List of `.md` filenames from `SOURCES.md`; ≥1 for scoped questions, `[]` for flavor-A false-premise |
+| `category` | yes | `retrieval` / `calculation` / `out_of_scope` |
+| `difficulty` | yes | `easy` / `medium` / `hard` |
+| `requires_calculator` | yes | `false` for all K8s questions (no calc tool use expected) |
+| `reference_answer` | yes | 1–3 sentence answer used by the optional LLM judge |
+| `question_type` | yes | CRAG taxonomy value (exactly one of the 8 canonical strings) |
+| `time_sensitive` | yes | `bool`; `true` on exactly 2–3 questions |
+| `source_chunk_ids` | yes | Content-hashed chunk IDs (stable across reindex); must be `[]` for flavor-A false-premise |
+| `source_snippets` | yes | ~20 words verbatim per chunk; drift-detection field |
+| `source_pages` | yes | Human-readable page anchor (e.g. `"concepts/workloads/pods"`) |
+| `source_sections` | yes | Deepest heading containing the snippet |
+**Deprecation note:** The pilot schema has `is_multi_hop: bool`.
+Step 5 may retire this field in favor of `question_type == "multi_hop"`,
+but only after confirming the evaluator's partial-credit logic
+(`agent_bench/evaluation/harness.py:38`) is updated to read from
+`question_type`. Do NOT remove `is_multi_hop` without the
+corresponding harness update, or existing pilot questions will
+break partial-credit scoring.
+---
+## Pilot-first validation before step 5 authoring
+Before writing the 25 questions, the step 5 author must:
+1. Confirm the 20 new pages from step 4 are ingested and reachable
+   via the pipeline (smoke-query test per `SOURCES.md`'s post-ingest
+   validation).
+2. Re-run `make evaluate` on the existing 6-question pilot dataset
+   against the newly-expanded corpus. The pilot's existing questions
+   must still pass their per-question gates — if adding 20 new
+   pages drops pilot P@5 materially, investigate before adding more
+   questions on top.
+3. Hand-draft 2–3 questions first, run them through the pipeline,
+   and confirm retrieval surfaces the expected chunks. This is the
+   final pilot-first checkpoint before bulk authoring.
+Only after these three checks pass does the step 5 author proceed
+to the full 25-question authoring session.
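
The count ranges in the "Target distribution" table and the `time_sensitive` placement rule in QUESTION_PLAN.md can be checked mechanically once questions are drafted. The following is a minimal sketch, assuming each question is a plain dict carrying the `question_type` and `time_sensitive` fields named in the authoring checklist. `check_distribution` and the constants are hypothetical helpers for illustration; they are not part of `agent_bench/evaluation/harness.py`.

```python
from collections import Counter

# Inclusive count ranges copied from the "Target distribution" table.
TARGET_RANGES = {
    "simple": (5, 6),
    "simple_w_condition": (3, 4),
    "comparison": (3, 4),
    "multi_hop": (5, 6),
    "false_premise": (3, 4),
}
# set / aggregation / post_processing_heavy share a single 0-3 budget.
OPTIONAL_TYPES = {"set", "aggregation", "post_processing_heavy"}
OPTIONAL_RANGE = (0, 3)
TOTAL = 25
TIME_SENSITIVE_RANGE = (2, 3)


def check_distribution(questions):
    """Return a list of violation messages; an empty list means the plan is met."""
    problems = []
    counts = Counter(q["question_type"] for q in questions)

    if len(questions) != TOTAL:
        problems.append(f"expected {TOTAL} questions, got {len(questions)}")

    for qtype, (lo, hi) in TARGET_RANGES.items():
        if not lo <= counts[qtype] <= hi:
            problems.append(f"{qtype}: {counts[qtype]} outside {lo}-{hi}")

    optional = sum(counts[t] for t in OPTIONAL_TYPES)
    if not OPTIONAL_RANGE[0] <= optional <= OPTIONAL_RANGE[1]:
        problems.append(f"optional types: {optional} outside 0-3")

    unknown = set(counts) - set(TARGET_RANGES) - OPTIONAL_TYPES
    if unknown:
        problems.append(f"unknown question_type values: {sorted(unknown)}")

    # time_sensitive is orthogonal: 2-3 questions spread over >= 2 CRAG types.
    flagged = [q for q in questions if q.get("time_sensitive")]
    lo, hi = TIME_SENSITIVE_RANGE
    if not lo <= len(flagged) <= hi:
        problems.append(f"time_sensitive: {len(flagged)} outside {lo}-{hi}")
    if len({q["question_type"] for q in flagged}) < 2:
        problems.append("time_sensitive questions must span at least 2 types")

    return problems
```

A check like this could run before the step 5 dataset is committed; it covers only the counts, not the per-type authoring rules, which still need human review.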

data/k8s_docs/SOURCES.md CHANGED

@@ -1,25 +1,38 @@
 # Kubernetes Corpus Sources
-**Status:** Placeholder — curation scheduled as a separate work session
-outside the multi-corpus refactor.
-**Target:** ~30–40 markdown files from kubernetes.io/docs covering the
-concepts a technical reviewer would naturally type into the demo —
-not comprehensive K8s coverage.
 ## Scope
 **Include:**
-- Core workload concepts: Pod, Deployment, StatefulSet, DaemonSet, Job,
-  CronJob, ReplicaSet
-- Networking: Service, Ingress, NetworkPolicy, EndpointSlice
-- Config + state: ConfigMap, Secret, Volume, PersistentVolume, Namespace
-- Access control: RBAC (Role, RoleBinding, ServiceAccount)
-- Cross-referencing overview pages: "Connecting Applications with
-  Services", "Workload Resources", "Services, Load Balancing, and
-  Networking" — these stress the reranker because relevance spreads
-  across multiple chunks per query
 **Exclude:**
@@ -36,27 +49,114 @@ This corpus targets **recruiter-likely questions**, not coverage. A
 question about etcd raft internals will be correctly refused — the
 refusal mechanism is part of the demo story, not a failure mode.
-Each ingested file below must have:
-- A URL (source of truth, for re-scraping if content drifts)
-- A date pulled (provenance, for audit)
 - A one-line rationale (why this page is in scope)
-| URL | Date pulled | Rationale |
-|-----|------------|-----------|
-| _TBD_ | _TBD_ | _TBD_ |
-See `docs/plans/2026-04-12-multi-corpus-refactor-design.md` section
-"Corpus Curation — Kubernetes" for the full policy.
 ## Ingestion
-Once curated files are in place, run:
 ```bash
 make ingest-k8s
 ```
-This populates `.cache/store_k8s/` with embeddings + BM25 index matching
-the FastAPI corpus's chunker settings (recursive, 512-token chunks,
-64-token overlap).

 # Kubernetes Corpus Sources
+**Status:** Locked at the category level (v1.1 Week 1 step 2). Per-page
+URL verification and pull dates are deferred to step 4 ingestion per
+pilot-first discipline — committing to 25 specific kubernetes.io URLs
+in this session without a verification pass would invert the
+"draft small, validate, then bulk" rule documented in the plan's
+cross-cutting #8.
+**Target:** ~25–30 markdown files from kubernetes.io/docs — enough to
+support 25 golden questions at ~1 question per page with headroom for
+multi-hop questions that draw on 2–4 pages each.
+**Content license:** All kubernetes.io/docs content is licensed under
+[CC BY 4.0](https://git.k8s.io/website/LICENSE). License verification
+happens per page at step 4 pull time; any page whose license terms
+differ from the site default is flagged in the table below and
+reassessed against the honest-evaluation brand's licensing discipline
+(same pattern the v1.1 plan uses for Lynx/HaluBench CC BY-NC).
 ## Scope
 **Include:**
+- Core workload concepts: Pod, Deployment, StatefulSet, DaemonSet,
+  Job, CronJob, ReplicaSet, Init Containers, Pod Lifecycle
+- Networking: Service, Ingress, NetworkPolicy, EndpointSlice, DNS
+- Config + state: ConfigMap, Secret, Volumes, PersistentVolumes,
+  Namespaces
+- Scheduling + resources: Resource Management, Node Assignment,
+  Taints and Tolerations, Node-pressure Eviction
+- Access control: RBAC Authorization
+- Health + autoscaling: Liveness/Readiness/Startup Probes,
+  Horizontal Pod Autoscaling
+- Security: Pod Security Admission, Pod Security Standards
 **Exclude:**
 question about etcd raft internals will be correctly refused — the
 refusal mechanism is part of the demo story, not a failure mode.
+Each ingested page below must have:
+- A canonical kubernetes.io/docs URL (source of truth, for re-scraping
+  if content drifts)
+- A date pulled (provenance, for audit; verified at step 4)
 - A one-line rationale (why this page is in scope)
+- License confirmation (default CC BY 4.0 unless a per-page notice says
+  otherwise)
+## Locked category breakdown
+| Category | Target pages | Rationale |
+|---|---|---|
+| Core workloads | 9 | Pod, Pod Lifecycle, Deployment, ReplicaSet, StatefulSet, DaemonSet, Job, CronJob, Init Containers. The reranker-stressing multi-hop questions will draw on 2–4 of these per question. |
+| Networking | 5 | Service, Ingress, NetworkPolicy, EndpointSlice, DNS for Services and Pods. NetworkPolicy is already validated as the pilot_005 flavor-B false_premise target. |
+| Config + state | 5 | ConfigMap, Secret, Volumes, Persistent Volumes, Namespaces. Supports `simple_w_condition` questions where the answer depends on configuration context (volume type, secret mount mode, namespace scoping). |
+| Scheduling + resources | 4 | Resource Management for Pods and Containers, Assigning Pods to Nodes, Taints and Tolerations, Node-pressure Eviction (already pulled). Good source for `comparison` questions (e.g. taints vs affinity) and `time_sensitive` questions (feature-state-bound scheduler behavior). |
+| Access control | 1 | RBAC Authorization. Single page supports 1–2 `simple` questions about RBAC primitives. Not the reranker-stressing category. |
+| Health + autoscaling | 2 | Liveness/Readiness/Startup Probes, Horizontal Pod Autoscaling. HPA is a `time_sensitive` candidate (autoscaling/v2 stable state). |
+| Security | 2 | Pod Security Admission (already pulled), Pod Security Standards. Pod Security Admission is the `simple_w_condition` stressor where answer depends on enforcement level (enforce / audit / warn). |
+| **Total** | **28** | Supports 25 questions with 3 pages of headroom for multi-hop fan-out. |
+## Already-pulled pages (8 from the pilot corpus)
+These were pulled during the pilot work and are the empirical grounding
+for the threshold calibration at 0.015 and the flavor-B discipline for
+pilot_005. No re-pull required unless content drift is detected at
+step 4 verification.
+| File | Category | Best-known URL | Pilot evidence |
+|---|---|---|---|
+| `k8s_configmap.md` | Config + state | `https://kubernetes.io/docs/concepts/configuration/configmap/` | — |
+| `k8s_deployment.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/deployment/` | — |
+| `k8s_network_policies.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/network-policies/` | **pilot_005 flavor-B target** — contains "Anything TLS related (use a service mesh or ingress controller for this)" at chunk_index 63 |
+| `k8s_node_pressure_eviction.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/` | — |
+| `k8s_pod_security_admission.md` | Security | `https://kubernetes.io/docs/concepts/security/pod-security-admission/` | — |
+| `k8s_pods.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/pods/` | pilot_001 target (Pod IP + localhost communication) |
+| `k8s_replicaset.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/` | — |
+| `k8s_secret.md` | Config + state | `https://kubernetes.io/docs/concepts/configuration/secret/` | — |
+## Pages to pull at step 4 (20 remaining)
+**Core workloads (6 to add):**
+- Pod Lifecycle
+- StatefulSet
+- DaemonSet
+- Job
+- CronJob
+- Init Containers
+**Networking (4 to add):**
+- Service
+- Ingress
+- EndpointSlice
+- DNS for Services and Pods
+**Config + state (3 to add):**
+- Volumes
+- Persistent Volumes
+- Namespaces
+**Scheduling + resources (3 to add):**
+- Resource Management for Pods and Containers
+- Assigning Pods to Nodes
+- Taints and Tolerations
+**Access control (1 to add):**
+- RBAC Authorization
+**Health + autoscaling (2 to add):**
+- Configure Liveness, Readiness and Startup Probes
+- Horizontal Pod Autoscaling
+**Security (1 to add):**
+- Pod Security Standards
+**Step 4 checklist per page:**
+1. Resolve kubernetes.io/docs URL — use the best-known path in the
+   table above as a starting point; confirm the page loads at that
+   path; if redirected, update SOURCES.md with the final URL and
+   a one-line note explaining the redirect.
+2. Confirm CC BY 4.0 licensing (default); flag any exception.
+3. Pull content using the same scraper used for the pilot 8 pages
+   (matching format with inline markdown links and structured
+   headings).
+4. Record the pull date in the "date pulled" column.
+5. Verify the one-line rationale still holds after reading the
+   page — if the page content doesn't support any planned
+   question (see `QUESTION_PLAN.md`), flag for replacement with a
+   reasoned alternative.
 ## Ingestion
+Once all 28 files are in `data/k8s_docs/`, run:
 ```bash
 make ingest-k8s
 ```
+This populates `.cache/store_k8s/` with embeddings + BM25 index
+matching the FastAPI corpus's chunker settings (recursive, 512-token
+chunks, 64-token overlap).
+**Post-ingest validation (pilot-first):** Before authoring the full
+25-question golden set, run 2–3 smoke queries against the ingested
+store (e.g. `"what is a StatefulSet"`, `"how does HPA scale
+replicas"`, `"what happens when a Pod is evicted"`) and confirm that
+the retrieval returns sensible chunks from the expected pages. Any
+query that surfaces irrelevant chunks or hits the refusal gate
+indicates a chunk-boundary or content-coverage issue that should be
+debugged before the golden-set authoring session.
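
The post-ingest smoke-query gate described in SOURCES.md could be scripted along these lines. This is a sketch under stated assumptions: `retrieve` stands in for whatever callable the pipeline exposes and is assumed to return ranked hits as dicts with a `"source"` filename; that interface is hypothetical. The three queries come from SOURCES.md, but `k8s_statefulset.md` and `k8s_hpa.md` are placeholder filenames, since step 4 has not yet named the pulled files.

```python
# (query, expected source file) pairs. Queries are the SOURCES.md examples;
# the first two filenames are placeholders until step 4 names the files.
SMOKE_QUERIES = [
    ("what is a StatefulSet", "k8s_statefulset.md"),
    ("how does HPA scale replicas", "k8s_hpa.md"),
    ("what happens when a Pod is evicted", "k8s_node_pressure_eviction.md"),
]


def run_smoke_queries(retrieve, cases, top_k=5):
    """Run each query and collect the ones whose expected page is missing.

    `retrieve` is a hypothetical callable: query string -> ranked list of
    dicts, each with a "source" key holding the chunk's filename.
    """
    failures = []
    for query, expected in cases:
        sources = [hit["source"] for hit in retrieve(query)[:top_k]]
        if expected not in sources:
            failures.append((query, expected, sources))
    return failures
```

An empty return value would mean every expected page surfaced in the top `top_k` hits; any entry in the list points at a query worth debugging for chunk-boundary or coverage problems before the authoring session.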