# D5 — Observability & SLO Labs Manual
**Focus:** SLOs, telemetry, dashboards, and incident/drift response for EU-PAR-FR01.
We'll treat D5 as **two main labs** (each roughly a half-day):
1. **Lab 1 - Control Plane & Core Platform SLOs**
2. **Lab 2 - GPU Job Wait Time & Sustainability (PUE) SLOs**
Assumptions (from D1-D4):
* Repos: `infra-foundation`, `platform-clusters`, `policies-and-compliance`
* Pipelines: `lint_and_unit`, `policy_gates`, `integration_test`, `site_rollout`
* Site: **EU-PAR-FR01** with:
  * Mgmt K8s cluster GitOps-managed
  * Prometheus & Grafana deployable via `platform-clusters`
  * DCIM/PDUs exporting basic power metrics (or simulated) for PUE
---
## Shared Pre-Reqs for D5
* `platform-clusters/addons/monitoring-logging-security` exists with:
  * Prometheus `values.yaml` or equivalent config
  * Grafana dashboards folder
* CI can deploy/upgrade the monitoring stack via `site_rollout`
* Trainees know:
  * How to add metrics scrape configs (see the example after this list)
  * Basic Prometheus query syntax
  * How to commit dashboards (JSON) into Git
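As a quick refresher on the scrape-config point, a static scrape job for a hypothetical DCIM/PDU power exporter might look like this in `prometheus.yml` or the Helm values (the job name and target address are placeholders):
```yaml
# Fragment of a Prometheus scrape configuration.
scrape_configs:
  - job_name: dcim-power                                   # hypothetical exporter for PDU/DCIM power metrics
    static_configs:
      - targets: ["dcim-exporter.example.internal:9100"]   # placeholder target
        labels:
          site: EU-PAR-FR01
```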
---
## Lab 1 - Control Plane & Core Platform SLOs
**Theme:**
Define, implement, and operate SLOs for the **K8s control plane** and basic platform health.
### Learning Objectives
By the end of Lab 1, trainees can:
* Define an SLO for control-plane availability in measurable terms.
* Implement the SLO as:
  * Prometheus queries (SLIs)
  * Dashboard panels
  * Alert rules
* Respond to a simulated control-plane incident using **Git-only** changes.
### Timebox
~3-4 hours.
---
### Step 0 - Scenario
EU-PAR-FR01 must deliver:
* **Control-plane availability SLO**: 99.95% over 30 days
* Violations should:
  * Be visible in Grafana
  * Trigger alerts to the SRE on-call (see the routing sketch below)
* All configuration lives in the `platform-clusters` repo; no manual dashboard editing in production Grafana.
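For the "alerts to the SRE on-call" requirement, a minimal Alertmanager routing sketch could sit alongside the Prometheus config; receiver names and the webhook URL are placeholders, so wire it to whatever paging tool your on-call actually uses:
```yaml
# addons/monitoring-logging-security/alertmanager/config.yaml (hypothetical path)
route:
  receiver: default
  routes:
    - matchers:
        - 'site="EU-PAR-FR01"'
        - 'severity=~"warning|critical"'
      receiver: sre-oncall
receivers:
  - name: default
  - name: sre-oncall
    webhook_configs:
      - url: https://oncall.example.internal/alertmanager   # placeholder endpoint
```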
---
### Step 1 - Define SLO Object (Declarative Spec)
**Repo:** `platform-clusters`
**Branch:** `feat/d5-lab1-control-plane-slo`
Create a logical SLO spec:
`addons/monitoring-logging-security/slo/control-plane.yaml`:
```yaml
apiVersion: slo.example.io/v1
kind: ServiceLevelObjective
metadata:
  name: control-plane-availability
  labels:
    site: EU-PAR-FR01
spec:
  description: "Availability of the K8s API server for eu-par-fr01-mgmt"
  target: 99.95
  window: 30d
  sli:
    type: events
    source: prometheus
    success_query: |
      sum(rate(apiserver_request_total{code!~"5.."}[5m]))
    total_query: |
      sum(rate(apiserver_request_total[5m]))
  alerting:
    burn_rates:
      - window: 1h
        threshold: 14.4  # example: high burn rate
      - window: 6h
        threshold: 6
```
This can later be consumed by an SLO controller or just treated as documentation+convention.
Run:
```bash
./scripts/lint.sh
```
(You can choose whether to enforce schema via policy or not; for training, start lenient.)
---
### Step 2 - Prometheus Rules for SLI & Alerts
Add or extend:
`addons/monitoring-logging-security/prometheus/rules/control-plane-slo.yaml`:
```yaml
groups:
  - name: control-plane-slo-rules
    rules:
      - record: slo:apiserver_availability:ratio
        expr: |
          sum(rate(apiserver_request_total{code!~"5.."}[5m]))
          /
          sum(rate(apiserver_request_total[5m]))
      - alert: ControlPlaneAvailabilityLow
        expr: slo:apiserver_availability:ratio < 0.9995
        for: 5m
        labels:
          severity: warning
          site: EU-PAR-FR01
        annotations:
          summary: "Control plane availability SLO under target for EU-PAR-FR01"
          description: "apiserver availability ratio below 99.95% over the last 5m window."
```
Run local checks:
```bash
./scripts/lint.sh
```
If you lint Prometheus rules (e.g., via `promtool`), ensure that's included in `lint.sh`.
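For example, assuming a GitLab-style CI config (an assumption; adapt the job syntax to your CI system), a `lint_and_unit` job that runs `promtool check rules` could look like:
```yaml
# Fragment of a GitLab-style CI config.
prometheus_rules_lint:
  stage: lint_and_unit              # assumes a stage of this name exists
  image:
    name: prom/prometheus:latest    # the image ships promtool alongside prometheus
    entrypoint: [""]                # override the prometheus entrypoint for CI use
  script:
    - promtool check rules addons/monitoring-logging-security/prometheus/rules/*.yaml
```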
---
### Step 3 - Grafana Dashboard for SLO
Add a dashboard JSON under:
`addons/monitoring-logging-security/grafana/dashboards/control-plane-slo.json`
Include at minimum:
* Timeseries panel: `slo:apiserver_availability:ratio`
* Threshold line at 0.9995
* SingleStat (or equivalent) showing last 30d average:
Example query for panel:
```promql
avg_over_time(slo:apiserver_availability:ratio[30d])
```
Commit JSON as code (never edit directly in live Grafana without exporting back to Git).
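If Grafana picks up dashboards via the common ConfigMap-sidecar pattern (e.g., kube-prometheus-stack; the label key/value below are that chart's defaults and may differ in your deployment), the JSON committed to Git can be shipped like this:
```yaml
# Example of shipping the dashboard as a ConfigMap consumed by the Grafana
# dashboard sidecar. The JSON body here is a placeholder; the real dashboard
# lives in Git at the path above and can be injected via a generator.
apiVersion: v1
kind: ConfigMap
metadata:
  name: control-plane-slo-dashboard
  namespace: monitoring               # assumed monitoring namespace
  labels:
    grafana_dashboard: "1"            # default sidecar label in kube-prometheus-stack
data:
  control-plane-slo.json: |
    { "placeholder": "real dashboard JSON lives in Git" }
```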
---
### Step 4 - Deploy Monitoring & SLO
Push the branch, open an MR, and ensure CI passes:
* `lint_and_unit` (YAML + Prometheus lint)
* `policy_gates` (if you have basic policies for dashboards/rules)
After merge:
* Run `site_rollout EU-PAR-FR01` from `platform-clusters`
* Check that:
  * Prometheus has loaded the new recording rule
  * Grafana deploys/refreshes the dashboard from ConfigMaps or the sidecar
---
### Step 5 - Simulated Incident: Control Plane Degradation
Goal: create a realistic scenario where the control-plane SLO dips.
Options (depending on your lab environment):
* Easiest:
  * Run a script that either:
    * sends a high volume of failing requests to the `apiserver` (see the load-generator sketch below), or
    * temporarily blocks access from a synthetic probe pod.
* Or training-mode simulation:
  * Inject synthetic timeseries data (if using a test Prometheus) that mimics downtime.
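One possible shape for the "high volume of requests" option is a short-lived Job that hammers the API server with LIST calls. This is only a sketch: the image, RBAC, and namespace are assumptions, and whether it produces actual 5xx responses (rather than 429s from API priority and fairness) depends on your cluster, so treat it as a starting point.
```yaml
# Hypothetical lab load generator against the API server.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: apiserver-loadgen
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: apiserver-loadgen-view
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view                          # built-in read-only role so the LISTs succeed
subjects:
  - kind: ServiceAccount
    name: apiserver-loadgen
    namespace: default
---
apiVersion: batch/v1
kind: Job
metadata:
  name: apiserver-loadgen
  namespace: default
spec:
  parallelism: 10                     # ten pods issuing LISTs in parallel
  completions: 10
  template:
    spec:
      serviceAccountName: apiserver-loadgen
      restartPolicy: Never
      containers:
        - name: loadgen
          image: bitnami/kubectl:latest    # assumed image with kubectl and a shell
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Repeatedly list all pods cluster-wide to load the API server.
              for i in $(seq 1 500); do
                kubectl get pods --all-namespaces -o json > /dev/null 2>&1
              done
```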
**What trainees do:**
1. Observe the alert firing in Prometheus/Alertmanager.
2. Check the Grafana `control-plane-slo` dashboard:
   * The SLO ratio visibly dips below 99.95%.
3. Perform **root cause analysis**:
   * Logs, node states, recent config changes (via Git log).
4. Mitigate:
   * For example: revert the change that introduced the misconfiguration, or cordon and drain an unhealthy node.
   * All mitigations must be done via Git (manifests, config, etc.).
**Post-incident review steps:**
* Add a short `postmortems/control-plane-slo-incident-YYYYMMDD.md` in `platform-clusters/docs`:
  * Timeline (detection → mitigation → recovery)
  * Queries/dashboards used
  * Follow-up actions (e.g., more granular synthetics, tightened policies)
---
### Lab 1 Definition of Done
* `control-plane-availability` SLO spec exists in Git.
* Prometheus rules & Grafana dashboard are deployed from `platform-clusters`.
* The team has simulated a control-plane issue and used SLO signals, dashboards, and Git-based config changes to detect, respond to, and document the incident.
---
## Lab 2 - GPU Job Wait Time & Sustainability (PUE) SLOs
**Theme:**
Treat GPU performance and energy efficiency as first-class SLOs: **GPU job wait time** and **PUE** for EU-PAR-FR01.
### Learning Objectives
By the end of Lab 2, trainees can:
* Define an SLO for GPU job wait time that both infra and tenants understand.
* Define a sustainability SLO based on PUE.
* Wire metrics, dashboards, and alerts.
* Investigate and respond to both **performance** and **efficiency** SLO breaches.
### Timebox
~4 hours.
---
### Step 0 - Scenario
* EU-PAR-FR01 runs shared GPU clusters for multiple tenants.
* Commitments:
  * **GPU job wait time SLO**: 95% of jobs start within 5 minutes of submission.
  * **PUE SLO**: 30-day rolling average PUE ≤ 1.4.
We assume:
* The GPU scheduler exports a `gpu_job_queue_seconds_bucket` histogram.
* DC power metrics are available:
  * `dc_it_power_kw{site="EU-PAR-FR01"}`
  * `dc_facility_power_kw{site="EU-PAR-FR01"}`
---
### Step 1 - GPU Job Wait Time SLO Spec
**Repo:** `platform-clusters`
**Branch:** `feat/d5-lab2-gpu-pue-slos`
Create:
`addons/monitoring-logging-security/slo/gpu-job-wait-time.yaml`:
```yaml
apiVersion: slo.example.io/v1
kind: ServiceLevelObjective
metadata:
  name: gpu-job-wait-time
  labels:
    site: EU-PAR-FR01
spec:
  description: "95% of GPU jobs start within 5 minutes of submission"
  target: 0.95
  window: 30d
  sli:
    type: latency
    source: prometheus
    objective_seconds: 300
    query: |
      histogram_quantile(
        0.95,
        sum by (le) (
          rate(gpu_job_queue_seconds_bucket[5m])
        )
      )
```
This expresses the performance objective.
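If you prefer an events-style SLI over a p95 latency (i.e., the fraction of jobs that start within the 300 s objective), a recording rule along these lines is one option; it assumes the histogram exposes an `le="300"` bucket and a `gpu_job_queue_seconds_count` series, which may not hold for your scheduler:
```yaml
# Optional alternative SLI: share of jobs that started within 300 s.
groups:
  - name: gpu-job-wait-time-ratio-sli
    rules:
      - record: slo:gpu_job_wait_time_within_300s:ratio
        expr: |
          sum(rate(gpu_job_queue_seconds_bucket{le="300"}[5m]))
          /
          sum(rate(gpu_job_queue_seconds_count[5m]))
```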
---
### Step 2 - PUE SLO Spec
In same branch:
`addons/monitoring-logging-security/slo/pue.yaml`:
```yaml
apiVersion: slo.example.io/v1
kind: ServiceLevelObjective
metadata:
  name: site-pue
  labels:
    site: EU-PAR-FR01
spec:
  description: "30-day rolling PUE target for EU-PAR-FR01"
  target: 1.4
  window: 30d
  sli:
    type: ratio
    source: prometheus
    # A range selector cannot be applied directly to a binary expression,
    # so the 30d average is taken over a subquery of the instantaneous ratio.
    query: |
      avg_over_time(
        (
          dc_facility_power_kw{site="EU-PAR-FR01"}
          /
          dc_it_power_kw{site="EU-PAR-FR01"}
        )[30d:1h]
      )
```
---
### Step 3 - Prometheus Recording & Alerting Rules
`addons/monitoring-logging-security/prometheus/rules/gpu-and-pue-slo.yaml`:
```yaml
groups:
  - name: gpu-job-wait-time-slo
    rules:
      - record: slo:gpu_job_wait_time_p95_seconds
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (
              rate(gpu_job_queue_seconds_bucket[5m])
            )
          )
      - alert: GPUJobWaitTimeSLOBreached
        expr: slo:gpu_job_wait_time_p95_seconds > 300
        for: 10m
        labels:
          severity: warning
          site: EU-PAR-FR01
        annotations:
          summary: "GPU job wait time SLO breached on EU-PAR-FR01"
          description: "p95 wait time > 300s over the last 10 minutes."
  - name: pue-slo
    rules:
      # Record the instantaneous PUE first, then average the recorded series;
      # avg_over_time cannot be applied directly to a binary expression.
      - record: slo:dc_pue:ratio
        expr: |
          dc_facility_power_kw{site="EU-PAR-FR01"}
          /
          dc_it_power_kw{site="EU-PAR-FR01"}
      - record: slo:dc_pue_30d
        expr: avg_over_time(slo:dc_pue:ratio[30d])
      - alert: PUESLOBreached
        expr: slo:dc_pue_30d > 1.4
        for: 1h
        labels:
          severity: info
          site: EU-PAR-FR01
        annotations:
          summary: "PUE SLO breached for EU-PAR-FR01"
          description: "30d rolling PUE above 1.4; investigate cooling, capacity, or workload placement."
```
Run:
```bash
./scripts/lint.sh
```
Fix any errors.
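Optionally, if `promtool` is part of your lint tooling, a small rule unit test can confirm the GPU alert fires on bad data before rollout. A sketch, assuming the rules file above and a hypothetical `tests/` directory (run with `promtool test rules tests/gpu-slo-alert-test.yaml`):
```yaml
# tests/gpu-slo-alert-test.yaml (hypothetical path)
rule_files:
  - ../prometheus/rules/gpu-and-pue-slo.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Every job waits longer than 300 s: only the upper buckets grow.
      - series: 'gpu_job_queue_seconds_bucket{le="300"}'
        values: '0+0x60'
      - series: 'gpu_job_queue_seconds_bucket{le="600"}'
        values: '0+10x60'
      - series: 'gpu_job_queue_seconds_bucket{le="+Inf"}'
        values: '0+10x60'
    alert_rule_test:
      - eval_time: 30m
        alertname: GPUJobWaitTimeSLOBreached
        exp_alerts:
          - exp_labels:
              severity: warning
              site: EU-PAR-FR01
            exp_annotations:
              summary: "GPU job wait time SLO breached on EU-PAR-FR01"
              description: "p95 wait time > 300s over the last 10 minutes."
```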
---
### Step 4 - Dashboards for GPU & PUE
Add Grafana dashboards (JSON) under the `grafana` folder, e.g.:
* `gpu-performance-slo.json`
  * Panels:
    * `slo:gpu_job_wait_time_p95_seconds` over time
    * Histogram of `gpu_job_queue_seconds_bucket`
    * Breakdown by tenant (if labels exist)
* `pue-overview.json`
  * Panels:
    * `dc_it_power_kw`, `dc_facility_power_kw`
    * `dc_facility_power_kw / dc_it_power_kw` as instant PUE
    * `slo:dc_pue_30d` as the 30d rolling metric
Commit dashboards into Git.
---
### Step 5 - Deploy Monitoring Update
Push the branch, open an MR, and ensure:
* `lint_and_unit` passes
* `policy_gates` passes (if you have rules covering SLO objects or Prometheus rules)
After merge, run `site_rollout EU-PAR-FR01` to redeploy the monitoring stack.
Verify in Grafana:
* Dashboards are visible.
* Queries return data (even if the data is synthetic in a test environment).
---
### Step 6 - Simulated Incident: GPU Wait Time SLO Breach
**Goal:**
Show how a shared GPU cluster can breach SLO when overloaded or misconfigured.
Options:
* Flood the cluster with several long-running GPU jobs from multiple tenants.
* Misconfigure the scheduler to reduce available GPUs (e.g., taint nodes via Git, or reduce the replica count of the GPU compute pool).
**What trainees do:**
1. Observe increased `slo:gpu_job_wait_time_p95_seconds` in Grafana.
2. See `GPUJobWaitTimeSLOBreached` alert firing.
3. Correlate with:
   * Spike in queued jobs.
   * Recent config change (e.g., reduced GPU pool size).
**Mitigation paths (all via Git):**
* Increase GPU capacity:
  * Add GPU nodes in `infra-foundation/baremetal/profiles/compute-gpu.yaml` and re-run `site_rollout`.
* Adjust scheduling:
  * Change resource quotas or max jobs per tenant in `platform-clusters` workload configs (see the quota sketch below).
  * Or apply fair-share scheduling policies (if your lab environment supports them).
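One concrete form of the quota-based mitigation, as a sketch: the namespace, resource name, and limit are placeholders, and `nvidia.com/gpu` assumes the NVIDIA device plugin is in use.
```yaml
# platform-clusters/tenants/tenant-a/quota.yaml (hypothetical path)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: tenant-a                 # hypothetical tenant namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"      # cap concurrent GPU requests for this tenant
```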
**Post-incident mini-PIR:**
* Short doc: `docs/postmortems/gpu-wait-time-slo-incident-YYYYMMDD.md`
  * Describe whether capacity or fairness was the main issue.
  * Capture Prometheus queries/dashboards used for analysis.
---
### Step 7 - Simulated Drift Event: PUE SLO Breach
**Goal:**
Show how sustainability KPI drift is detected and investigated.
Options:
* Simulate a persistent increase in `dc_facility_power_kw` without corresponding IT load (e.g., cooling inefficiency).
* Reduce IT load significantly while keeping facility load high (underutilisation).
**What trainees do:**
1. Observe `slo:dc_pue_30d` creeping above 1.4 in Grafana.
2. See `PUESLOBreached` alert.
3. Investigate:
   * Are GPU nodes idle but powered? (Check utilisation metrics; see the helper rule sketch below.)
   * Are workloads being moved away from EU-PAR-FR01 unnecessarily?
   * Is the facility using a less efficient cooling mode?
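For the "idle but powered" question, a helper recording rule can make the check repeatable; this sketch assumes the NVIDIA DCGM exporter's `DCGM_FI_DEV_GPU_UTIL` metric is scraped, and the grouping label depends on your scrape/relabel config.
```yaml
# Optional helper rule to surface average GPU utilisation per scrape target.
groups:
  - name: gpu-utilisation-helpers
    rules:
      - record: gpu:node_utilisation_1h:avg
        expr: |
          avg by (instance) (
            avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h])
          )
```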
**Mitigation options (still respecting sovereignty):**
* Consolidate workloads onto fewer nodes and power down unused hardware.
* Adjust cooling setpoints or free-cooling thresholds if allowed.
* For non-personal / low-sensitivity workloads, consider offloading to another site where more efficient energy use is possible, *but only within regulatory constraints* (e.g., still within the EU/EEA).
**Drift documentation:**
* Add a short drift record to `infra-foundation/docs/drift/pue-eu-par-fr01-YYYYMMDD.md`:
  * When drift was detected
  * Probable causes
  * Actions taken (e.g., hardware consolidation, workload movements)
---
### Lab 2 Definition of Done
* `gpu-job-wait-time` and `site-pue` SLO specs exist in Git.
* Prometheus rules and Grafana dashboards are deployed and show data.
* GPU wait-time SLO has been intentionally breached and investigated.
* PUE SLO breach scenario has been explored and mitigations discussed & optionally simulated.
* Sustainability KPIs (PUE) are clearly treated as first-class signals, not afterthoughts.
---
## D5 Overall Definition of Done
Once you finish both labs:
1. **SLOs are codified, not hand-waved**
   * Control-plane, GPU job wait time, and PUE SLOs live as declarative specs.
   * Queries, dashboards, and alerts are versioned in Git.
2. **Telemetry is wired end-to-end**
   * Prometheus scrapes:
     * K8s API server and core components
     * GPU scheduler / job metrics
     * DC power metrics needed for PUE
   * Grafana dashboards are generated from Git, not manually built in prod.
3. **Teams have practised incident & drift response**
   * At least:
     * One control-plane SLO incident
     * One GPU wait-time incident
     * One PUE drift event
   * All handled via **Git + pipelines**, not ad-hoc manual tweaks.
4. **Sustainability is embedded**
   * PUE is monitored and has an SLO.
   * There's a narrative for how capacity planning and workload placement affect sustainability and sovereignty together.
---