docs(k8s): Week 1 step 2 — lock SOURCES.md categories + author QUESTION_PLAN.md
Week 1 step 2 of the v1.1 plan: lock the K8s corpus scope and author
the structural guide for step 5's 25-question golden-set authoring.
Scope deliberately narrower than "commit to 30-40 verified URLs in
one session": per cross-cutting #8 pilot-first discipline, per-URL
resolution and per-page license verification are deferred to step 4
ingestion. A category-level lock plus an explicit step-4 checklist is
the 1-hour scope the plan's step 2 budget anticipates.
SOURCES.md changes:
- Status flipped from "Placeholder" to "Locked at category level".
- 28-page category breakdown table (9 core workloads, 5 networking,
5 config+state, 4 scheduling, 1 access, 2 health/autoscaling,
2 security). 25 questions at ~1/page with 3 pages of headroom for
multi-hop fan-out.
- 8 already-pulled pages documented with best-known URLs + pilot
evidence (k8s_network_policies.md is called out as the pilot_005
flavor-B target so step 4 does not re-ingest it under a new file
name).
- 20 remaining pages listed per category with a step-4 verification
checklist (URL resolution, license confirmation, pull-date record,
rationale re-check against QUESTION_PLAN.md).
- Content license documented: CC BY 4.0 default with per-page
verification discipline (same pattern as the v1.1 plan's
Lynx/HaluBench CC BY-NC handling).
- Post-ingest smoke-query gate added before step 5 authoring.
QUESTION_PLAN.md new file (261 lines):
- Target CRAG distribution (5–6 simple, 3–4 simple_w_condition,
3–4 comparison, 5–6 multi_hop, 3–4 false_premise, 0–3 set/agg/pph).
- Per-type source-page mapping — each CRAG type points to specific
pages from SOURCES.md that support questions of that type. The
mapping is the authoring guide step 5 consults when drafting
specific question texts.
- false_premise split: at least 1 flavor A (pure refusal) + at
least 1 flavor B (documented negative) with pilot_005 called out
as the existing flavor-B reference and three candidate flavor-B
pages listed for expansion (Pod Security Standards, RBAC, more
NetworkPolicy clauses).
- time_sensitive flag placement: 2–3 questions distributed across
≥2 CRAG types, each tied to a specific K8s version state
(HPA v1 vs v2, PSA stable at 1.25, PSP removal at 1.25).
- Difficulty distribution guidance (8–10 easy, 10–12 medium, 4–6
hard).
- Authoring checklist per question — 14 required schema fields with
explicit notes on which are flavor-A-specific, which match the
v1.1 plan's source-attribution methodology, and which may be
retired (is_multi_hop → question_type migration contingent on
harness.py update).
- Pilot-first validation gates BEFORE the 25-question authoring
session: (1) step 4 ingestion verified via smoke queries;
(2) existing 6-question pilot must still pass its gates against
the expanded corpus; (3) 2–3 hand-drafted questions tested
through the pipeline before bulk authoring. Each gate honors the
cross-cutting #8 discipline that caught six issues across four
sessions with zero false positives.
What this commit does NOT contain:
- Specific 25-question texts (step 5 authoring, fresh session).
- Verified kubernetes.io URLs for the 20 remaining pages (step 4).
- Pulled markdown content for the 20 remaining pages (step 4).
- Updates to agent_bench/evaluation/datasets/k8s_golden_pilot.json
(the 6-question pilot stays as-is until step 5 replaces it).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- data/k8s_docs/QUESTION_PLAN.md +261 -0
- data/k8s_docs/SOURCES.md +129 -29
|
@@ -0,0 +1,261 @@
| 1 |
+
# K8s Golden Dataset — Question Plan
|
| 2 |
+
|
| 3 |
+
**Status:** Structural guide for Week 1 step 5 authoring (v1.1 plan).
|
| 4 |
+
This document defines the 25-question target distribution, per-type
|
| 5 |
+
source-page mapping, and authoring constraints. It does NOT contain
|
| 6 |
+
the 25 specific question texts — those are authored during step 5 in
|
| 7 |
+
a fresh session, per cross-cutting #8 pilot-first discipline.
|
| 8 |
+
|
| 9 |
+
**Upstream contracts:**
|
| 10 |
+
- Taxonomy: CRAG 8-type (Yang et al., NeurIPS 2024) — see DECISIONS.md
|
| 11 |
+
"K8s golden dataset uses CRAG's 8-type taxonomy as the schema".
|
| 12 |
+
- Source pages: see `SOURCES.md` (28 pages, category-locked; 8 already
|
| 13 |
+
pulled, 20 to pull at step 4).
|
| 14 |
+
- Schema: see `agent_bench/evaluation/harness.py` `GoldenQuestion`
|
| 15 |
+
plus the v1.1 plan's methodology #3 source-attribution fields.
|
| 16 |
+
- Flavor A/B for `false_premise`: see DECISIONS.md "False-premise
|
| 17 |
+
questions come in two flavors".
|
| 18 |
+
|
| 19 |
+
---
|
| 20 |
+
|
| 21 |
+
## Target distribution (25 questions total)
|
| 22 |
+
|
| 23 |
+
| CRAG type | Count | Schema field | Notes |
|
| 24 |
+
|---|---|---|---|
|
| 25 |
+
| `simple` | 5–6 | `question_type: "simple"` | Baseline retrieval: direct lookup in 1 page, 1–2 sentence answer. |
|
| 26 |
+
| `simple_w_condition` | 3–4 | `question_type: "simple_w_condition"` | Answer depends on a condition stated in the question (enforcement level, volume type, Pod phase). |
|
| 27 |
+
| `comparison` | 3–4 | `question_type: "comparison"` | Answer compares two concepts across 2 pages; reranker stress. |
|
| 28 |
+
| `multi_hop` | 5–6 | `question_type: "multi_hop"` | Answer synthesizes 2–4 pages; reranker-stressing by construction. |
|
| 29 |
+
| `false_premise` | 3–4 | `question_type: "false_premise"` | Grounded refusal stress. Flavor A (pure refusal) + flavor B (documented negative). |
|
| 30 |
+
| `set` / `aggregation` / `post_processing_heavy` | 0–3 | respective values | Optional. Include only if natural from corpus content. |
|
| 31 |
+
| **Total** | **25** | | |
|
| 32 |
+
|
| 33 |
+
**Orthogonal flag:** `time_sensitive: bool` on 2–3 questions. Does
|
| 34 |
+
NOT replace `question_type` — it's an independent property for
|
| 35 |
+
version-bounded content (feature state, API version migration,
|
| 36 |
+
deprecations).
|
| 37 |
+
|
| 38 |
+
---
|
| 39 |
+
|
| 40 |
+
## Per-type source-page mapping
|
| 41 |
+
|
| 42 |
+
Each row identifies the K8s concept pages a question of that type
|
| 43 |
+
should draw from. Multi-hop and comparison questions list multiple
|
| 44 |
+
pages intentionally.
|
| 45 |
+
|
| 46 |
+
### simple (5–6 slots)
|
| 47 |
+
|
| 48 |
+
Pool questions where a 1–2 sentence answer lives inside a single page.
|
| 49 |
+
|
| 50 |
+
| Candidate source | CRAG slot justification |
|
| 51 |
+
|---|---|
|
| 52 |
+
| `k8s_pods.md` | Pod IP semantics, container sharing, ephemeral containers |
|
| 53 |
+
| `k8s_deployment.md` | What a Deployment is, declarative update mechanic |
|
| 54 |
+
| `k8s_configmap.md` | What a ConfigMap is, immutable field |
|
| 55 |
+
| `k8s_secret.md` | What a Secret is, volume mount modes |
|
| 56 |
+
| RBAC Authorization *(step 4 page)* | RBAC primitive definitions (Role, RoleBinding, ClusterRole) |
|
| 57 |
+
| StatefulSet *(step 4 page)* | StatefulSet identity guarantees |
|
| 58 |
+
| DaemonSet *(step 4 page)* | One-per-node scheduling contract |
|
| 59 |
+
| Namespaces *(step 4 page)* | Namespace scoping for resources |
|
| 60 |
+
|
| 61 |
+
**Authoring rule:** Each `simple` question must have exactly one
|
| 62 |
+
expected source page and 1–2 source snippets. Keyword hit rate (KHR) target ≥ 0.60 on
|
| 63 |
+
the authored keywords.
|
| 64 |
+
|
| 65 |
+
### simple_w_condition (3–4 slots)
|
| 66 |
+
|
| 67 |
+
Pool questions where the answer explicitly depends on a condition
|
| 68 |
+
named in the question.
|
| 69 |
+
|
| 70 |
+
| Candidate source | Condition that shapes the answer |
|
| 71 |
+
|---|---|
|
| 72 |
+
| `k8s_pod_security_admission.md` | enforcement level: `enforce` / `audit` / `warn` |
|
| 73 |
+
| `k8s_secret.md` | mount mode: environment variable vs file in volume |
|
| 74 |
+
| Liveness/Readiness/Startup Probes *(step 4)* | probe type: liveness vs readiness vs startup |
|
| 75 |
+
| Volumes *(step 4)* | volume type: emptyDir vs configMap vs persistentVolumeClaim |
|
| 76 |
+
| Node-pressure Eviction (`k8s_node_pressure_eviction.md`) | resource under pressure: memory vs disk vs inodes |
|
| 77 |
+
|
| 78 |
+
**Authoring rule:** The condition must be named in the question
|
| 79 |
+
stem, not implied. The expected answer must change materially if the
|
| 80 |
+
condition flips. Example: "How is a Secret mounted as a volume
|
| 81 |
+
versus consumed as an environment variable?" is a valid
|
| 82 |
+
`simple_w_condition`; "How is a Secret mounted?" is `simple`.
|
| 83 |
+
|
| 84 |
+
### comparison (3–4 slots)
|
| 85 |
+
|
| 86 |
+
Pool questions where the answer explicitly compares two K8s concepts
|
| 87 |
+
that span 2 pages.
|
| 88 |
+
|
| 89 |
+
| Page pair | Concept compared |
|
| 90 |
+
|---|---|
|
| 91 |
+
| Deployment vs StatefulSet *(step 4)* | stateless vs stateful workload semantics |
|
| 92 |
+
| Deployment vs DaemonSet *(step 4)* | replica-count vs one-per-node scheduling |
|
| 93 |
+
| ConfigMap vs Secret | non-confidential vs confidential data, mount parity |
|
| 94 |
+
| Service vs Ingress *(step 4)* | L4 vs L7 exposure |
|
| 95 |
+
| Taints/Tolerations vs Node Affinity *(step 4)* | opt-out vs opt-in placement |
|
| 96 |
+
| Liveness vs Readiness probes *(step 4)* | restart vs traffic-routing semantics |
|
| 97 |
+
|
| 98 |
+
**Authoring rule:** The question must force retrieval from both
|
| 99 |
+
pages. Reranker stress is intentional — questions where BM25 would
|
| 100 |
+
find one side but miss the other are the target. Expected sources:
|
| 101 |
+
2 pages minimum.
|
| 102 |
+
|
| 103 |
+
### multi_hop (5–6 slots)
|
| 104 |
+
|
| 105 |
+
Pool questions where the answer synthesizes 2–4 pages. These are
|
| 106 |
+
the primary reranker stressors.
|
| 107 |
+
|
| 108 |
+
| Page set (example) | Hop path |
|
| 109 |
+
|---|---|
|
| 110 |
+
| Pod + Service + Ingress *(step 4)* | How external traffic reaches a Pod through Ingress → Service |
|
| 111 |
+
| Deployment + ReplicaSet + Pod | How a Deployment rollout changes the underlying ReplicaSet and Pod set |
|
| 112 |
+
| ConfigMap + Deployment | How a ConfigMap update propagates to Pods via env vars or mounted volume |
|
| 113 |
+
| HPA + Deployment + Metrics Server *(partial step 4)* | How HPA reads metrics and scales a Deployment |
|
| 114 |
+
| NetworkPolicy + Pod + Namespace *(partial step 4)* | How NetworkPolicy selectors resolve across namespaces |
|
| 115 |
+
| Job + Pod + Container lifecycle *(partial step 4)* | How a Job's completions and parallelism interact with Pod restart policy |
|
| 116 |
+
|
| 117 |
+
**Authoring rule:** Expected sources ≥ 2 pages. The question must
|
| 118 |
+
not be answerable from any single page alone. `source_chunk_ids`
|
| 119 |
+
must list at least one chunk from each expected page; partial
|
| 120 |
+
credit is granted in the evaluator if at least one expected chunk is
|
| 121 |
+
cited (see `agent_bench/evaluation/harness.py`).
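
The authoring rule above (at least two expected pages, and at least one listed chunk per page) can be linted before a question is committed. This is a sketch under stated assumptions: `check_multi_hop` is a hypothetical helper, and `chunk_to_page` is an assumed lookup from chunk ID to source filename that the author would build from the ingested store. It does not reproduce the evaluator's partial-credit logic.

```python
def check_multi_hop(question, chunk_to_page):
    """Return rule violations for one multi_hop question dict.

    chunk_to_page: hypothetical mapping of chunk ID -> source filename.
    """
    problems = []
    pages = set(question["expected_sources"])
    if len(pages) < 2:
        problems.append("multi_hop needs at least 2 expected pages")
    # Pages that have at least one chunk listed in source_chunk_ids.
    covered = {chunk_to_page.get(cid) for cid in question["source_chunk_ids"]}
    for page in sorted(pages - covered):
        problems.append(f"no source chunk listed for {page}")
    return problems
```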
|
| 122 |
+
|
| 123 |
+
### false_premise (3–4 slots)
|
| 124 |
+
|
| 125 |
+
Pool questions whose premise is wrong. Split across two flavors:
|
| 126 |
+
|
| 127 |
+
**Flavor A — pure refusal** (at least 1 slot):
|
| 128 |
+
- Premise targets a capability that does not exist in the K8s corpus
|
| 129 |
+
(not in any pulled page).
|
| 130 |
+
- Example seed: "How do I configure Claude API rate limits in a
|
| 131 |
+
Kubernetes Deployment?" (wrong domain — Claude API is not a K8s
|
| 132 |
+
concept)
|
| 133 |
+
- Schema: `category: "out_of_scope"`, `expected_sources: []`,
|
| 134 |
+
`source_snippets: []`.
|
| 135 |
+
- Evaluator expectation: answer contains refusal phrasing AND cites
|
| 136 |
+
zero sources.
|
| 137 |
+
|
| 138 |
+
**Flavor B — documented negative** (at least 1 slot, ideally 2):
|
| 139 |
+
- Corpus contains an explicit negative statement (e.g.
|
| 140 |
+
NetworkPolicy "Anything TLS related" limitation at chunk 63 of
|
| 141 |
+
`k8s_network_policies.md`).
|
| 142 |
+
- Example already in pilot: `k8s_pilot_005` (NetworkPolicy mTLS).
|
| 143 |
+
- Schema: `category: "retrieval"`, `question_type: "false_premise"`,
|
| 144 |
+
`expected_sources: [<negative-answer page>]`,
|
| 145 |
+
`source_snippets: [<verbatim negative statement>]`.
|
| 146 |
+
- Evaluator expectation: answer reports the documented negative
|
| 147 |
+
with citation, does NOT open with "the documentation does not
|
| 148 |
+
provide instructions" phrasing (per pilot_005 Fix 1 + Fix 2
|
| 149 |
+
revert analysis).
|
| 150 |
+
|
| 151 |
+
**Other flavor-B candidate pages for authoring:**
|
| 152 |
+
- Pod Security Standards — explicit statements about what each
|
| 153 |
+
profile does NOT permit
|
| 154 |
+
- RBAC Authorization — explicit statements about what RBAC does NOT
|
| 155 |
+
provide (e.g. no deny rules)
|
| 156 |
+
- NetworkPolicy — additional negative clauses beyond the pilot_005
|
| 157 |
+
mTLS one
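
The two flavors differ mainly in whether the evidence fields are empty, so the split can be validated per question. The sketch below is illustrative only: `check_false_premise` is a hypothetical helper, and it assumes flavor-A questions also carry `question_type: "false_premise"`, which the flavor-A schema line does not state explicitly.

```python
def check_false_premise(question):
    """Return rule violations for one false_premise question dict."""
    if question.get("question_type") != "false_premise":
        return []  # the rules below only apply to false_premise questions
    problems = []
    category = question.get("category")
    evidence_fields = ("expected_sources", "source_snippets")
    if category == "out_of_scope":
        # Flavor A (pure refusal): nothing in the corpus supports the premise.
        for field in evidence_fields + ("source_chunk_ids",):
            if question.get(field):
                problems.append(f"flavor A requires empty {field}")
    elif category == "retrieval":
        # Flavor B (documented negative): the corpus states the negative.
        for field in evidence_fields:
            if not question.get(field):
                problems.append(f"flavor B requires non-empty {field}")
    else:
        problems.append(f"unexpected category for false_premise: {category!r}")
    return problems
```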
|
| 158 |
+
|
| 159 |
+
### set / aggregation / post_processing_heavy (0–3 slots)
|
| 160 |
+
|
| 161 |
+
Include only if a K8s page naturally supports the pattern:
|
| 162 |
+
|
| 163 |
+
- `set`: "Which Kubernetes resources can expose a Service?" (answer
|
| 164 |
+
is a set drawn from the Service page). Include 0–1 of this type
|
| 165 |
+
if a clean example emerges; otherwise leave slot empty.
|
| 166 |
+
- `aggregation`: Unlikely to fit K8s docs (docs describe concepts,
|
| 167 |
+
not tabular data). Likely leave empty.
|
| 168 |
+
- `post_processing_heavy`: Unlikely to fit K8s docs. Likely leave
|
| 169 |
+
empty.
|
| 170 |
+
|
| 171 |
+
**Default:** Treat the 0–3 range as **0**. Only author these if a question
|
| 172 |
+
emerges organically during step 5. Do not force-author to hit a
|
| 173 |
+
target count; the plan explicitly says "0–3, included only where
|
| 174 |
+
corpus content naturally supports".
|
| 175 |
+
|
| 176 |
+
---
|
| 177 |
+
|
| 178 |
+
## `time_sensitive` flag placement (2–3 questions)
|
| 179 |
+
|
| 180 |
+
Flag questions whose correct answer depends on K8s version state:
|
| 181 |
+
|
| 182 |
+
| Candidate | Why time-sensitive |
|
| 183 |
+
|---|---|
|
| 184 |
+
| HPA API version | `autoscaling/v1` vs `autoscaling/v2` — v2 stable since 1.23 |
|
| 185 |
+
| Pod Security Admission stability | "stable as of v1.25" — feature state in the page |
|
| 186 |
+
| PodSecurityPolicy removal | PSP removed in 1.25; migration path to PSA |
|
| 187 |
+
|
| 188 |
+
**Authoring rule:** Set `time_sensitive: true` on exactly 2–3
|
| 189 |
+
questions. Distribute across ≥2 different CRAG types (e.g. one
|
| 190 |
+
`simple`, one `simple_w_condition`) so the flag is not concentrated
|
| 191 |
+
in a single type. Each `time_sensitive` question must cite a
|
| 192 |
+
specific K8s version or feature state in the source snippet,
|
| 193 |
+
otherwise the flag is not load-bearing.
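
The three parts of this rule (2–3 flagged questions, at least two CRAG types, a version in the snippet) can be linted together. This is a rough sketch: `check_time_sensitive` is a hypothetical helper, and the regular expression is only a heuristic for strings such as `v1.25` or `autoscaling/v2`, so a snippet that states a feature state in other words would need manual review.

```python
import re

# Heuristic: a Kubernetes release such as "v1.25" or an API version
# such as "autoscaling/v2". Assumption, not a rule from the plan.
VERSION_RE = re.compile(r"\bv?1\.\d{1,2}\b|\bautoscaling/v\d\b")


def check_time_sensitive(questions):
    """Return violations of the time_sensitive authoring rule."""
    flagged = [q for q in questions if q.get("time_sensitive")]
    problems = []
    if not 2 <= len(flagged) <= 3:
        problems.append(f"{len(flagged)} flagged questions, expected 2-3")
    if len({q["question_type"] for q in flagged}) < 2:
        problems.append("flag is concentrated in fewer than 2 CRAG types")
    for q in flagged:
        snippets = q.get("source_snippets", [])
        if not any(VERSION_RE.search(s) for s in snippets):
            problems.append(f"{q['id']}: no version or API version in snippets")
    return problems
```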
|
| 194 |
+
|
| 195 |
+
---
|
| 196 |
+
|
| 197 |
+
## Difficulty distribution
|
| 198 |
+
|
| 199 |
+
Loose guidance, not a hard constraint:
|
| 200 |
+
|
| 201 |
+
- `easy`: 8–10 questions — mostly `simple` and single-page
|
| 202 |
+
`simple_w_condition`
|
| 203 |
+
- `medium`: 10–12 questions — `comparison`, most `multi_hop`,
|
| 204 |
+
straightforward `false_premise`
|
| 205 |
+
- `hard`: 4–6 questions — deep `multi_hop`, flavor-B `false_premise`,
|
| 206 |
+
`time_sensitive` + `multi_hop` combinations
|
| 207 |
+
|
| 208 |
+
The pilot's 6-question set is all `easy`/`medium`. Step 5 should add
|
| 209 |
+
the `hard` tier.
|
| 210 |
+
|
| 211 |
+
---
|
| 212 |
+
|
| 213 |
+
## Authoring checklist (per question)
|
| 214 |
+
|
| 215 |
+
For each of the 25 questions, the step 5 author must fill in:
|
| 216 |
+
|
| 217 |
+
| Field | Required | Notes |
|
| 218 |
+
|---|---|---|
|
| 219 |
+
| `id` | yes | `k8s_<NNN>` zero-padded (e.g. `k8s_001`) |
|
| 220 |
+
| `question` | yes | Natural-language question in the voice of a recruiter or developer |
|
| 221 |
+
| `expected_answer_keywords` | yes | 3–6 keywords that MUST appear in a correct answer; drives `keyword_hit_rate` |
|
| 222 |
+
| `expected_sources` | yes | List of `.md` filenames from `SOURCES.md`; ≥1 for scoped questions, `[]` for flavor-A false-premise |
|
| 223 |
+
| `category` | yes | `retrieval` / `calculation` / `out_of_scope` |
|
| 224 |
+
| `difficulty` | yes | `easy` / `medium` / `hard` |
|
| 225 |
+
| `requires_calculator` | yes | `false` for all K8s questions (no calc tool use expected) |
|
| 226 |
+
| `reference_answer` | yes | 1–3 sentence answer used by the optional LLM judge |
|
| 227 |
+
| `question_type` | yes | CRAG taxonomy value (exactly one of the 8 canonical strings) |
|
| 228 |
+
| `time_sensitive` | yes | `bool`; `true` on exactly 2–3 questions |
|
| 229 |
+
| `source_chunk_ids` | yes | Content-hashed chunk IDs (stable across reindex); must be `[]` for flavor-A false-premise |
|
| 230 |
+
| `source_snippets` | yes | ~20 words verbatim per chunk; drift-detection field |
|
| 231 |
+
| `source_pages` | yes | Human-readable page anchor (e.g. `"concepts/workloads/pods"`) |
|
| 232 |
+
| `source_sections` | yes | Deepest heading containing the snippet |
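
Two of the checks implied by this table are easy to automate: that all 14 fields are present, and that each `source_snippets` entry still appears verbatim in an expected page (the drift-detection role noted above). The sketch below uses hypothetical helpers `missing_fields` and `drifted_snippets`. Collapsing whitespace before comparison is an assumption made here so that line wrapping in the pulled markdown does not count as drift.

```python
REQUIRED_FIELDS = (
    "id", "question", "expected_answer_keywords", "expected_sources",
    "category", "difficulty", "requires_calculator", "reference_answer",
    "question_type", "time_sensitive", "source_chunk_ids",
    "source_snippets", "source_pages", "source_sections",
)


def missing_fields(question):
    """Return the required field names absent from a question dict."""
    return [name for name in REQUIRED_FIELDS if name not in question]


def _normalize(text):
    # Collapse all whitespace runs so wrapped lines still match.
    return " ".join(text.split())


def drifted_snippets(question, page_texts):
    """Return snippets not found verbatim in any expected page.

    page_texts: mapping of source filename -> full markdown text.
    """
    pages = [
        _normalize(page_texts.get(name, ""))
        for name in question["expected_sources"]
    ]
    return [
        snippet
        for snippet in question["source_snippets"]
        if not any(_normalize(snippet) in page for page in pages)
    ]
```

A non-empty result from `drifted_snippets` after a re-pull suggests the page changed upstream and the question should be re-verified.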
|
| 233 |
+
|
| 234 |
+
**Deprecation note:** The pilot schema has `is_multi_hop: bool`.
|
| 235 |
+
Step 5 may retire this field in favor of `question_type == "multi_hop"`,
|
| 236 |
+
but only after confirming the evaluator's partial-credit logic
|
| 237 |
+
(`agent_bench/evaluation/harness.py:38`) is updated to read from
|
| 238 |
+
`question_type`. Do NOT remove `is_multi_hop` without the
|
| 239 |
+
corresponding harness update, or existing pilot questions will
|
| 240 |
+
break partial-credit scoring.
|
| 241 |
+
|
| 242 |
+
---
|
| 243 |
+
|
| 244 |
+
## Pilot-first validation before step 5 authoring
|
| 245 |
+
|
| 246 |
+
Before writing the 25 questions, the step 5 author must:
|
| 247 |
+
|
| 248 |
+
1. Confirm the 20 new pages from step 4 are ingested and reachable
|
| 249 |
+
via the pipeline (smoke-query test per `SOURCES.md`'s post-ingest
|
| 250 |
+
validation).
|
| 251 |
+
2. Re-run `make evaluate` on the existing 6-question pilot dataset
|
| 252 |
+
against the newly-expanded corpus. The pilot's existing questions
|
| 253 |
+
must still pass their per-question gates — if adding 20 new
|
| 254 |
+
pages drops pilot P@5 materially, investigate before adding more
|
| 255 |
+
questions on top.
|
| 256 |
+
3. Hand-draft 2–3 questions first, run them through the pipeline,
|
| 257 |
+
and confirm retrieval surfaces the expected chunks. This is the
|
| 258 |
+
final pilot-first checkpoint before bulk authoring.
|
| 259 |
+
|
| 260 |
+
Only after these three checks pass does the step 5 author proceed
|
| 261 |
+
to the full 25-question authoring session.
|
|
@@ -1,25 +1,38 @@
|
|
| 1 |
# Kubernetes Corpus Sources
|
| 2 |
|
| 3 |
-
**Status:**
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
|
| 10 |
## Scope
|
| 11 |
|
| 12 |
**Include:**
|
| 13 |
|
| 14 |
-
- Core workload concepts: Pod, Deployment, StatefulSet, DaemonSet,
|
| 15 |
-
CronJob, ReplicaSet
|
| 16 |
-
- Networking: Service, Ingress, NetworkPolicy, EndpointSlice
|
| 17 |
-
- Config + state: ConfigMap, Secret,
|
| 18 |
-
|
| 19 |
-
-
|
| 20 |
-
|
| 21 |
-
|
| 22 |
-
|
| 23 |
|
| 24 |
**Exclude:**
|
| 25 |
|
|
@@ -36,27 +49,114 @@ This corpus targets **recruiter-likely questions**, not coverage. A
|
|
| 36 |
question about etcd raft internals will be correctly refused — the
|
| 37 |
refusal mechanism is part of the demo story, not a failure mode.
|
| 38 |
|
| 39 |
-
Each ingested
|
| 40 |
|
| 41 |
-
- A URL (source of truth, for re-scraping
|
| 42 |
-
|
| 43 |
- A one-line rationale (why this page is in scope)
|
| 44 |
-
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
|
| 48 |
-
|
| 49 |
-
|
| 50 |
-
|
| 51 |
|
| 52 |
## Ingestion
|
| 53 |
|
| 54 |
-
Once
|
| 55 |
|
| 56 |
```bash
|
| 57 |
make ingest-k8s
|
| 58 |
```
|
| 59 |
|
| 60 |
-
This populates `.cache/store_k8s/` with embeddings + BM25 index
|
| 61 |
-
the FastAPI corpus's chunker settings (recursive, 512-token
|
| 62 |
-
64-token overlap).
|
| 1 |
# Kubernetes Corpus Sources
|
| 2 |
|
| 3 |
+
**Status:** Locked at the category level (v1.1 Week 1 step 2). Per-page
|
| 4 |
+
URL verification and pull dates are deferred to step 4 ingestion per
|
| 5 |
+
pilot-first discipline — committing to 25 specific kubernetes.io URLs
|
| 6 |
+
in this session without a verification pass would invert the
|
| 7 |
+
"draft small, validate, then bulk" rule documented in the plan's
|
| 8 |
+
cross-cutting #8.
|
| 9 |
+
|
| 10 |
+
**Target:** ~25–30 markdown files from kubernetes.io/docs — enough to
|
| 11 |
+
support 25 golden questions at ~1 question per page with headroom for
|
| 12 |
+
multi-hop questions that draw on 2–4 pages each.
|
| 13 |
+
|
| 14 |
+
**Content license:** All kubernetes.io/docs content is licensed under
|
| 15 |
+
[CC BY 4.0](https://git.k8s.io/website/LICENSE). License verification
|
| 16 |
+
happens per page at step 4 pull time; any page whose license terms
|
| 17 |
+
differ from the site default is flagged in the table below and
|
| 18 |
+
reassessed against the honest-evaluation brand's licensing discipline
|
| 19 |
+
(same pattern the v1.1 plan uses for Lynx/HaluBench CC BY-NC).
|
| 20 |
|
| 21 |
## Scope
|
| 22 |
|
| 23 |
**Include:**
|
| 24 |
|
| 25 |
+
- Core workload concepts: Pod, Deployment, StatefulSet, DaemonSet,
|
| 26 |
+
Job, CronJob, ReplicaSet, Init Containers, Pod Lifecycle
|
| 27 |
+
- Networking: Service, Ingress, NetworkPolicy, EndpointSlice, DNS
|
| 28 |
+
- Config + state: ConfigMap, Secret, Volumes, PersistentVolumes,
|
| 29 |
+
Namespaces
|
| 30 |
+
- Scheduling + resources: Resource Management, Node Assignment,
|
| 31 |
+
Taints and Tolerations, Node-pressure Eviction
|
| 32 |
+
- Access control: RBAC Authorization
|
| 33 |
+
- Health + autoscaling: Liveness/Readiness/Startup Probes,
|
| 34 |
+
Horizontal Pod Autoscaling
|
| 35 |
+
- Security: Pod Security Admission, Pod Security Standards
|
| 36 |
|
| 37 |
**Exclude:**
|
| 38 |
|
|
|
|
| 49 |
question about etcd raft internals will be correctly refused — the
|
| 50 |
refusal mechanism is part of the demo story, not a failure mode.
|
| 51 |
|
| 52 |
+
Each ingested page below must have:
|
| 53 |
|
| 54 |
+
- A canonical kubernetes.io/docs URL (source of truth, for re-scraping
|
| 55 |
+
if content drifts)
|
| 56 |
+
- A date pulled (provenance, for audit; verified at step 4)
|
| 57 |
- A one-line rationale (why this page is in scope)
|
| 58 |
+
- License confirmation (default CC BY 4.0 unless a per-page notice says
|
| 59 |
+
otherwise)
|
| 60 |
+
|
| 61 |
+
## Locked category breakdown
|
| 62 |
+
|
| 63 |
+
| Category | Target pages | Rationale |
|
| 64 |
+
|---|---|---|
|
| 65 |
+
| Core workloads | 9 | Pod, Pod Lifecycle, Deployment, ReplicaSet, StatefulSet, DaemonSet, Job, CronJob, Init Containers. The reranker-stressing multi-hop questions will draw on 2–4 of these per question. |
|
| 66 |
+
| Networking | 5 | Service, Ingress, NetworkPolicy, EndpointSlice, DNS for Services and Pods. NetworkPolicy is already validated as the pilot_005 flavor-B false_premise target. |
|
| 67 |
+
| Config + state | 5 | ConfigMap, Secret, Volumes, Persistent Volumes, Namespaces. Supports `simple_w_condition` questions where the answer depends on configuration context (volume type, secret mount mode, namespace scoping). |
|
| 68 |
+
| Scheduling + resources | 4 | Resource Management for Pods and Containers, Assigning Pods to Nodes, Taints and Tolerations, Node-pressure Eviction (already pulled). Good source for `comparison` questions (e.g. taints vs affinity) and `time_sensitive` questions (feature-state-bound scheduler behavior). |
|
| 69 |
+
| Access control | 1 | RBAC Authorization. Single page supports 1–2 `simple` questions about RBAC primitives. Not the reranker-stressing category. |
|
| 70 |
+
| Health + autoscaling | 2 | Liveness/Readiness/Startup Probes, Horizontal Pod Autoscaling. HPA is a `time_sensitive` candidate (autoscaling/v2 stable state). |
|
| 71 |
+
| Security | 2 | Pod Security Admission (already pulled), Pod Security Standards. Pod Security Admission is the `simple_w_condition` stressor where answer depends on enforcement level (enforce / audit / warn). |
|
| 72 |
+
| **Total** | **28** | Supports 25 questions with 3 pages of headroom for multi-hop fan-out. |
|
| 73 |
+
|
| 74 |
+
## Already-pulled pages (8 from the pilot corpus)
|
| 75 |
+
|
| 76 |
+
These were pulled during the pilot work and are the empirical grounding
|
| 77 |
+
for the threshold calibration at 0.015 and the flavor-B discipline for
|
| 78 |
+
pilot_005. No re-pull required unless content drift is detected at
|
| 79 |
+
step 4 verification.
|
| 80 |
+
|
| 81 |
+
| File | Category | Best-known URL | Pilot evidence |
|
| 82 |
+
|---|---|---|---|
|
| 83 |
+
| `k8s_configmap.md` | Config + state | `https://kubernetes.io/docs/concepts/configuration/configmap/` | — |
|
| 84 |
+
| `k8s_deployment.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/deployment/` | — |
|
| 85 |
+
| `k8s_network_policies.md` | Networking | `https://kubernetes.io/docs/concepts/services-networking/network-policies/` | **pilot_005 flavor-B target** — contains "Anything TLS related (use a service mesh or ingress controller for this)" at chunk_index 63 |
|
| 86 |
+
| `k8s_node_pressure_eviction.md` | Scheduling + resources | `https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/` | — |
|
| 87 |
+
| `k8s_pod_security_admission.md` | Security | `https://kubernetes.io/docs/concepts/security/pod-security-admission/` | — |
|
| 88 |
+
| `k8s_pods.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/pods/` | pilot_001 target (Pod IP + localhost communication) |
|
| 89 |
+
| `k8s_replicaset.md` | Core workloads | `https://kubernetes.io/docs/concepts/workloads/controllers/replicaset/` | — |
|
| 90 |
+
| `k8s_secret.md` | Config + state | `https://kubernetes.io/docs/concepts/configuration/secret/` | — |
|
| 91 |
+
|
| 92 |
+
## Pages to pull at step 4 (20 remaining)
|
| 93 |
+
|
| 94 |
+
**Core workloads (6 to add):**
|
| 95 |
+
- Pod Lifecycle
|
| 96 |
+
- StatefulSet
|
| 97 |
+
- DaemonSet
|
| 98 |
+
- Job
|
| 99 |
+
- CronJob
|
| 100 |
+
- Init Containers
|
| 101 |
+
|
| 102 |
+
**Networking (4 to add):**
|
| 103 |
+
- Service
|
| 104 |
+
- Ingress
|
| 105 |
+
- EndpointSlice
|
| 106 |
+
- DNS for Services and Pods
|
| 107 |
+
|
| 108 |
+
**Config + state (3 to add):**
|
| 109 |
+
- Volumes
|
| 110 |
+
- Persistent Volumes
|
| 111 |
+
- Namespaces
|
| 112 |
+
|
| 113 |
+
**Scheduling + resources (3 to add):**
|
| 114 |
+
- Resource Management for Pods and Containers
|
| 115 |
+
- Assigning Pods to Nodes
|
| 116 |
+
- Taints and Tolerations
|
| 117 |
+
|
| 118 |
+
**Access control (1 to add):**
|
| 119 |
+
- RBAC Authorization
|
| 120 |
+
|
| 121 |
+
**Health + autoscaling (2 to add):**
|
| 122 |
+
- Configure Liveness, Readiness and Startup Probes
|
| 123 |
+
- Horizontal Pod Autoscaling
|
| 124 |
+
|
| 125 |
+
**Security (1 to add):**
|
| 126 |
+
- Pod Security Standards
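
The per-category "to add" counts follow from the locked category targets minus the pages already pulled. A small sketch of that arithmetic, with the numbers copied from the tables in this file (the dictionary names are illustrative):

```python
# Target pages per category, from the locked category breakdown.
TARGET_PAGES = {
    "Core workloads": 9,
    "Networking": 5,
    "Config + state": 5,
    "Scheduling + resources": 4,
    "Access control": 1,
    "Health + autoscaling": 2,
    "Security": 2,
}
# Pages per category in the already-pulled table (8 pilot files).
ALREADY_PULLED = {
    "Core workloads": 3,          # pods, deployment, replicaset
    "Networking": 1,              # network policies
    "Config + state": 2,          # configmap, secret
    "Scheduling + resources": 1,  # node-pressure eviction
    "Security": 1,                # pod security admission
}
# Pages still to pull at step 4, per category.
REMAINING = {
    category: target - ALREADY_PULLED.get(category, 0)
    for category, target in TARGET_PAGES.items()
}
```

The totals come out to 28 target pages, 8 pulled, and 20 remaining, matching the section headings.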
|
| 127 |
+
|
| 128 |
+
**Step 4 checklist per page:**
|
| 129 |
+
1. Resolve kubernetes.io/docs URL — use the best-known path in the
|
| 130 |
+
table above as a starting point; confirm the page loads at that
|
| 131 |
+
path; if redirected, update SOURCES.md with the final URL and
|
| 132 |
+
a one-line note explaining the redirect.
|
| 133 |
+
2. Confirm CC BY 4.0 licensing (default); flag any exception.
|
| 134 |
+
3. Pull content using the same scraper used for the pilot 8 pages
|
| 135 |
+
(matching format with inline markdown links and structured
|
| 136 |
+
headings).
|
| 137 |
+
4. Record the pull date in the "date pulled" column.
|
| 138 |
+
5. Verify the one-line rationale still holds after reading the
|
| 139 |
+
page — if the page content doesn't support any planned
|
| 140 |
+
question (see `QUESTION_PLAN.md`), flag for replacement with a
|
| 141 |
+
reasoned alternative.
|
| 142 |
|
| 143 |
## Ingestion
|
| 144 |
|
| 145 |
+
Once all 28 files are in `data/k8s_docs/`, run:
|
| 146 |
|
| 147 |
```bash
|
| 148 |
make ingest-k8s
|
| 149 |
```
|
| 150 |
|
| 151 |
+
This populates `.cache/store_k8s/` with embeddings + BM25 index
|
| 152 |
+
matching the FastAPI corpus's chunker settings (recursive, 512-token
|
| 153 |
+
chunks, 64-token overlap).
|
| 154 |
+
|
| 155 |
+
**Post-ingest validation (pilot-first):** Before authoring the full
|
| 156 |
+
25-question golden set, run 2–3 smoke queries against the ingested
|
| 157 |
+
store (e.g. `"what is a StatefulSet"`, `"how does HPA scale
|
| 158 |
+
replicas"`, `"what happens when a Pod is evicted"`) and confirm that
|
| 159 |
+
the retrieval returns sensible chunks from the expected pages. Any
|
| 160 |
+
query that surfaces irrelevant chunks or hits the refusal gate
|
| 161 |
+
indicates a chunk-boundary or content-coverage issue that should be
|
| 162 |
+
debugged before the golden-set authoring session.
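
The smoke-query gate can be scripted so that it is repeatable. The sketch below does not use the project's retrieval API, which is not shown here. Instead, `retrieve` is a hypothetical callable the author would wire to the pipeline, assumed to return the source filenames of the top-ranked chunks for a query, best first. `run_smoke_queries` is likewise a hypothetical helper.

```python
def run_smoke_queries(retrieve, expectations, top_k=5):
    """Return {query: top hits} for every query that missed its pages.

    retrieve: hypothetical callable, query -> ranked source filenames.
    expectations: mapping of query -> set of acceptable source filenames.
    """
    failures = {}
    for query, expected_pages in expectations.items():
        hits = list(retrieve(query))[:top_k]
        # An empty hit list also fails here, covering the refusal-gate case.
        if not any(hit in expected_pages for hit in hits):
            failures[query] = hits
    return failures
```

For example, `{"what happens when a Pod is evicted": {"k8s_node_pressure_eviction.md"}}` is one expectation that uses a file already in the corpus. Entries for the step 4 pages would use whatever filenames those pages are saved under.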
|