# D5 — Observability & SLO Labs Manual
**Focus:** SLOs, telemetry, dashboards, and incident/drift response for EU-PAR-FR01.
We'll treat D5 as **two main labs** (each roughly a half-day):
1. **Lab 1 - Control Plane & Core Platform SLOs**
2. **Lab 2 - GPU Job Wait Time & Sustainability (PUE) SLOs**
Assumptions (from D1-D4):
* Repos: `infra-foundation`, `platform-clusters`, `policies-and-compliance`
* Pipelines: `lint_and_unit`, `policy_gates`, `integration_test`, `site_rollout`
* Site: **EU-PAR-FR01** with:
  * Mgmt K8s cluster GitOps-managed
  * Prometheus & Grafana deployable via `platform-clusters`
  * DCIM/PDUs exporting basic power metrics (or simulated) for PUE
---
## Shared Pre-Reqs for D5
* `platform-clusters/addons/monitoring-logging-security` exists with:
  * Prometheus `values.yaml` or equivalent config
  * Grafana dashboards folder
* CI can deploy/upgrade the monitoring stack via `site_rollout`
* Trainees know:
  * How to add metrics scrape configs (see the example after this list)
  * Basic Prometheus query syntax
  * How to commit dashboards (JSON) into Git
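As a quick refresher on the scrape-config point, a static scrape job for a hypothetical DCIM/PDU power exporter might look like this in `prometheus.yml` or the Helm values (the job name and target address are placeholders):
```yaml
# Fragment of a Prometheus scrape configuration.
scrape_configs:
  - job_name: dcim-power                                   # hypothetical exporter for PDU/DCIM power metrics
    static_configs:
      - targets: ["dcim-exporter.example.internal:9100"]   # placeholder target
        labels:
          site: EU-PAR-FR01
```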
---
## Lab 1 - Control Plane & Core Platform SLOs
**Theme:**
Define, implement, and operate SLOs for the **K8s control plane** and basic platform health.
### Learning Objectives
By the end of Lab 1, trainees can:
* Define an SLO for control-plane availability in measurable terms.
* Implement the SLO as:
  * Prometheus queries (SLIs)
  * Dashboard panels
  * Alert rules
* Respond to a simulated control-plane incident using **Git-only** changes.
### Timebox
~3-4 hours.
---
### Step 0 - Scenario
EU-PAR-FR01 must deliver:
* **Control-plane availability SLO**: 99.95% over 30 days
* Violations should:
  * Be visible in Grafana
  * Trigger alerts to the SRE on-call (see the routing sketch below)
* All configuration lives in the `platform-clusters` repo; no manual dashboard editing in production Grafana.
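For the "alerts to the SRE on-call" requirement, a minimal Alertmanager routing sketch could sit alongside the Prometheus config; receiver names and the webhook URL are placeholders, so wire it to whatever paging tool your on-call actually uses:
```yaml
# addons/monitoring-logging-security/alertmanager/config.yaml (hypothetical path)
route:
  receiver: default
  routes:
    - matchers:
        - 'site="EU-PAR-FR01"'
        - 'severity=~"warning|critical"'
      receiver: sre-oncall
receivers:
  - name: default
  - name: sre-oncall
    webhook_configs:
      - url: https://oncall.example.internal/alertmanager   # placeholder endpoint
```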
---
### Step 1 - Define SLO Object (Declarative Spec)
**Repo:** `platform-clusters`
**Branch:** `feat/d5-lab1-control-plane-slo`
Create a logical SLO spec:
`addons/monitoring-logging-security/slo/control-plane.yaml`:
```yaml
apiVersion: slo.example.io/v1
kind: ServiceLevelObjective
metadata:
  name: control-plane-availability
  labels:
    site: EU-PAR-FR01
spec:
  description: "Availability of the K8s API server for eu-par-fr01-mgmt"
  target: 99.95
  window: 30d
  sli:
    type: events
    source: prometheus
    success_query: |
      sum(rate(apiserver_request_total{code!~"5.."}[5m]))
    total_query: |
      sum(rate(apiserver_request_total[5m]))
  alerting:
    burn_rates:
      - window: 1h
        threshold: 14.4  # example: high burn rate
      - window: 6h
        threshold: 6
```
This can later be consumed by an SLO controller or just treated as documentation+convention.
Run:
```bash
./scripts/lint.sh
```
(You can choose whether to enforce schema via policy or not; for training, start lenient.)
---
### Step 2 - Prometheus Rules for SLI & Alerts
Add or extend:
`addons/monitoring-logging-security/prometheus/rules/control-plane-slo.yaml`:
```yaml
groups:
  - name: control-plane-slo-rules
    rules:
      - record: slo:apiserver_availability:ratio
        expr: |
          sum(rate(apiserver_request_total{code!~"5.."}[5m]))
          /
          sum(rate(apiserver_request_total[5m]))
      - alert: ControlPlaneAvailabilityLow
        expr: slo:apiserver_availability:ratio < 0.9995
        for: 5m
        labels:
          severity: warning
          site: EU-PAR-FR01
        annotations:
          summary: "Control plane availability SLO under target for EU-PAR-FR01"
          description: "apiserver availability ratio below 99.95% over the last 5m window."
```
Run local checks:
```bash
./scripts/lint.sh
```
If you lint Prometheus rules (e.g., via `promtool`), ensure that's included in `lint.sh`.
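For example, assuming a GitLab-style CI config (an assumption; adapt the job syntax to your CI system), a `lint_and_unit` job that runs `promtool check rules` could look like:
```yaml
# Fragment of a GitLab-style CI config.
prometheus_rules_lint:
  stage: lint_and_unit              # assumes a stage of this name exists
  image:
    name: prom/prometheus:latest    # the image ships promtool alongside prometheus
    entrypoint: [""]                # override the prometheus entrypoint for CI use
  script:
    - promtool check rules addons/monitoring-logging-security/prometheus/rules/*.yaml
```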
---
### Step 3 - Grafana Dashboard for SLO
Add a dashboard JSON under:
`addons/monitoring-logging-security/grafana/dashboards/control-plane-slo.json`
Include at minimum:
* Timeseries panel: `slo:apiserver_availability:ratio`
* Threshold line at 0.9995
* SingleStat (or equivalent) showing last 30d average:
Example query for panel:
```promql
avg_over_time(slo:apiserver_availability:ratio[30d])
```
Commit JSON as code (never edit directly in live Grafana without exporting back to Git).
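If Grafana picks up dashboards via the common ConfigMap-sidecar pattern (e.g., kube-prometheus-stack; the label key/value below are that chart's defaults and may differ in your deployment), the JSON committed to Git can be shipped like this:
```yaml
# Example of shipping the dashboard as a ConfigMap consumed by the Grafana
# dashboard sidecar. The JSON body here is a placeholder; the real dashboard
# lives in Git at the path above and can be injected via a generator.
apiVersion: v1
kind: ConfigMap
metadata:
  name: control-plane-slo-dashboard
  namespace: monitoring               # assumed monitoring namespace
  labels:
    grafana_dashboard: "1"            # default sidecar label in kube-prometheus-stack
data:
  control-plane-slo.json: |
    { "placeholder": "real dashboard JSON lives in Git" }
```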
---
### Step 4 - Deploy Monitoring & SLO
Push the branch, open an MR, and ensure CI passes:
* `lint_and_unit` (YAML + Prometheus lint)
* `policy_gates` (if you have basic policies for dashboards/rules)
After merge:
* Run `site_rollout EU-PAR-FR01` from `platform-clusters`
* Check that:
  * Prometheus has loaded the new recording rule
  * Grafana deploys/refreshes the dashboard from ConfigMaps or the sidecar
---
### Step 5 - Simulated Incident: Control Plane Degradation
Goal: create a realistic scenario where the control-plane SLO dips.
Options (depending on your lab environment):
* Easiest:
  * Run a script that either:
    * sends a high volume of failing requests to the `apiserver` (see the load-generator sketch below), or
    * temporarily blocks access from a synthetic probe pod.
* Or training-mode simulation:
  * Inject synthetic timeseries data (if using a test Prometheus) that mimics downtime.
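One possible shape for the "high volume of requests" option is a short-lived Job that hammers the API server with LIST calls. This is only a sketch: the image, RBAC, and namespace are assumptions, and whether it produces actual 5xx responses (rather than 429s from API priority and fairness) depends on your cluster, so treat it as a starting point.
```yaml
# Hypothetical lab load generator against the API server.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: apiserver-loadgen
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: apiserver-loadgen-view
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: view                          # built-in read-only role so the LISTs succeed
subjects:
  - kind: ServiceAccount
    name: apiserver-loadgen
    namespace: default
---
apiVersion: batch/v1
kind: Job
metadata:
  name: apiserver-loadgen
  namespace: default
spec:
  parallelism: 10                     # ten pods issuing LISTs in parallel
  completions: 10
  template:
    spec:
      serviceAccountName: apiserver-loadgen
      restartPolicy: Never
      containers:
        - name: loadgen
          image: bitnami/kubectl:latest    # assumed image with kubectl and a shell
          command: ["/bin/sh", "-c"]
          args:
            - |
              # Repeatedly list all pods cluster-wide to load the API server.
              for i in $(seq 1 500); do
                kubectl get pods --all-namespaces -o json > /dev/null 2>&1
              done
```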
**What trainees do:**
1. Observe the alert firing in Prometheus/Alertmanager.
2. Check the Grafana `control-plane-slo` dashboard:
   * The SLO ratio visibly dips below 99.95%.
3. Perform **root cause analysis**:
   * Logs, node states, recent config changes (via Git log).
4. Mitigate:
   * For example: revert the change that introduced the misconfiguration, or cordon and drain an unhealthy node.
   * All mitigations must be done via Git (manifests, config, etc.).
**Post-incident review steps:**
* Add a short `postmortems/control-plane-slo-incident-YYYYMMDD.md` in `platform-clusters/docs`:
  * Timeline (detection → mitigation → recovery)
  * Queries/dashboards used
  * Follow-up actions (e.g., more granular synthetics, tightened policies)
---
### Lab 1 Definition of Done
* `control-plane-availability` SLO spec exists in Git.
* Prometheus rules & Grafana dashboard are deployed from `platform-clusters`.
* The team has simulated a control-plane issue and used SLO signals, dashboards, and Git-based config changes to detect, respond to, and document the incident.
---
## Lab 2 - GPU Job Wait Time & Sustainability (PUE) SLOs
**Theme:**
Treat GPU performance and energy efficiency as first-class SLOs: **GPU job wait time** and **PUE** for EU-PAR-FR01.
### Learning Objectives
By the end of Lab 2, trainees can:
* Define an SLO for GPU job wait time that both infra and tenants understand.
* Define a sustainability SLO based on PUE.
* Wire metrics, dashboards, and alerts.
* Investigate and respond to both **performance** and **efficiency** SLO breaches.
### Timebox
~4 hours.
---
### Step 0 - Scenario
* EU-PAR-FR01 runs shared GPU clusters for multiple tenants.
* Commitments:
  * **GPU job wait time SLO**: 95% of jobs start within 5 minutes of submission.
  * **PUE SLO**: 30-day rolling average PUE ≤ 1.4.
We assume:
* The GPU scheduler exports a `gpu_job_queue_seconds_bucket` histogram.
* DC power metrics are available:
  * `dc_it_power_kw{site="EU-PAR-FR01"}`
  * `dc_facility_power_kw{site="EU-PAR-FR01"}`
---
### Step 1 - GPU Job Wait Time SLO Spec
**Repo:** `platform-clusters`
**Branch:** `feat/d5-lab2-gpu-pue-slos`
Create:
`addons/monitoring-logging-security/slo/gpu-job-wait-time.yaml`:
```yaml
apiVersion: slo.example.io/v1
kind: ServiceLevelObjective
metadata:
  name: gpu-job-wait-time
  labels:
    site: EU-PAR-FR01
spec:
  description: "95% of GPU jobs start within 5 minutes of submission"
  target: 0.95
  window: 30d
  sli:
    type: latency
    source: prometheus
    objective_seconds: 300
    query: |
      histogram_quantile(
        0.95,
        sum by (le) (
          rate(gpu_job_queue_seconds_bucket[5m])
        )
      )
```
This expresses the performance objective.
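If you prefer an events-style SLI over a p95 latency (i.e., the fraction of jobs that start within the 300 s objective), a recording rule along these lines is one option; it assumes the histogram exposes an `le="300"` bucket and a `gpu_job_queue_seconds_count` series, which may not hold for your scheduler:
```yaml
# Optional alternative SLI: share of jobs that started within 300 s.
groups:
  - name: gpu-job-wait-time-ratio-sli
    rules:
      - record: slo:gpu_job_wait_time_within_300s:ratio
        expr: |
          sum(rate(gpu_job_queue_seconds_bucket{le="300"}[5m]))
          /
          sum(rate(gpu_job_queue_seconds_count[5m]))
```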
---
### Step 2 - PUE SLO Spec
In same branch:
`addons/monitoring-logging-security/slo/pue.yaml`:
```yaml
apiVersion: slo.example.io/v1
kind: ServiceLevelObjective
metadata:
  name: site-pue
  labels:
    site: EU-PAR-FR01
spec:
  description: "30-day rolling PUE target for EU-PAR-FR01"
  target: 1.4
  window: 30d
  sli:
    type: ratio
    source: prometheus
    # A range selector cannot be applied directly to a binary expression,
    # so the 30d average is taken over a subquery of the instantaneous ratio.
    query: |
      avg_over_time(
        (
          dc_facility_power_kw{site="EU-PAR-FR01"}
          /
          dc_it_power_kw{site="EU-PAR-FR01"}
        )[30d:1h]
      )
```
---
### Step 3 - Prometheus Recording & Alerting Rules
`addons/monitoring-logging-security/prometheus/rules/gpu-and-pue-slo.yaml`:
```yaml
groups:
  - name: gpu-job-wait-time-slo
    rules:
      - record: slo:gpu_job_wait_time_p95_seconds
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (
              rate(gpu_job_queue_seconds_bucket[5m])
            )
          )
      - alert: GPUJobWaitTimeSLOBreached
        expr: slo:gpu_job_wait_time_p95_seconds > 300
        for: 10m
        labels:
          severity: warning
          site: EU-PAR-FR01
        annotations:
          summary: "GPU job wait time SLO breached on EU-PAR-FR01"
          description: "p95 wait time > 300s over the last 10 minutes."
  - name: pue-slo
    rules:
      # Record the instantaneous PUE first, then average the recorded series;
      # avg_over_time cannot be applied directly to a binary expression.
      - record: slo:dc_pue:ratio
        expr: |
          dc_facility_power_kw{site="EU-PAR-FR01"}
          /
          dc_it_power_kw{site="EU-PAR-FR01"}
      - record: slo:dc_pue_30d
        expr: avg_over_time(slo:dc_pue:ratio[30d])
      - alert: PUESLOBreached
        expr: slo:dc_pue_30d > 1.4
        for: 1h
        labels:
          severity: info
          site: EU-PAR-FR01
        annotations:
          summary: "PUE SLO breached for EU-PAR-FR01"
          description: "30d rolling PUE above 1.4; investigate cooling, capacity, or workload placement."
```
Run:
```bash
./scripts/lint.sh
```
Fix any errors.
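Optionally, if `promtool` is part of your lint tooling, a small rule unit test can confirm the GPU alert fires on bad data before rollout. A sketch, assuming the rules file above and a hypothetical `tests/` directory (run with `promtool test rules tests/gpu-slo-alert-test.yaml`):
```yaml
# tests/gpu-slo-alert-test.yaml (hypothetical path)
rule_files:
  - ../prometheus/rules/gpu-and-pue-slo.yaml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Every job waits longer than 300 s: only the upper buckets grow.
      - series: 'gpu_job_queue_seconds_bucket{le="300"}'
        values: '0+0x60'
      - series: 'gpu_job_queue_seconds_bucket{le="600"}'
        values: '0+10x60'
      - series: 'gpu_job_queue_seconds_bucket{le="+Inf"}'
        values: '0+10x60'
    alert_rule_test:
      - eval_time: 30m
        alertname: GPUJobWaitTimeSLOBreached
        exp_alerts:
          - exp_labels:
              severity: warning
              site: EU-PAR-FR01
            exp_annotations:
              summary: "GPU job wait time SLO breached on EU-PAR-FR01"
              description: "p95 wait time > 300s over the last 10 minutes."
```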
---
### Step 4 - Dashboards for GPU & PUE
Add Grafana dashboards (JSON) under the `grafana` folder, e.g.:
* `gpu-performance-slo.json`
  * Panels:
    * `slo:gpu_job_wait_time_p95_seconds` over time
    * Histogram of `gpu_job_queue_seconds_bucket`
    * Breakdown by tenant (if labels exist)
* `pue-overview.json`
  * Panels:
    * `dc_it_power_kw`, `dc_facility_power_kw`
    * `dc_facility_power_kw / dc_it_power_kw` as instant PUE
    * `slo:dc_pue_30d` as the 30d rolling metric
Commit dashboards into Git.
---
### Step 5 - Deploy Monitoring Update
Push the branch, open an MR, and ensure:
* `lint_and_unit` passes
* `policy_gates` passes (if you have rules covering SLO objects or Prometheus rules)
After merge, run `site_rollout EU-PAR-FR01` to redeploy the monitoring stack.
Verify in Grafana:
* Dashboards are visible.
* Queries return data (even if the data is synthetic in a test environment).
---
### Step 6 - Simulated Incident: GPU Wait Time SLO Breach
**Goal:**
Show how a shared GPU cluster can breach SLO when overloaded or misconfigured.
Options:
* Flood the cluster with several long-running GPU jobs from multiple tenants.
* Misconfigure the scheduler to reduce available GPUs (e.g., taint nodes via Git, or reduce the replica count of the GPU compute pool).
**What trainees do:**
1. Observe increased `slo:gpu_job_wait_time_p95_seconds` in Grafana.
2. See `GPUJobWaitTimeSLOBreached` alert firing.
3. Correlate with:
   * Spike in queued jobs.
   * Recent config change (e.g., reduced GPU pool size).
**Mitigation paths (all via Git):**
* Increase GPU capacity:
  * Add GPU nodes in `infra-foundation/baremetal/profiles/compute-gpu.yaml` and re-run `site_rollout`.
* Adjust scheduling:
  * Change resource quotas or max jobs per tenant in `platform-clusters` workload configs (see the quota sketch below).
  * Or apply fair-share scheduling policies (if your lab environment supports them).
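One concrete form of the quota-based mitigation, as a sketch: the namespace, resource name, and limit are placeholders, and `nvidia.com/gpu` assumes the NVIDIA device plugin is in use.
```yaml
# platform-clusters/tenants/tenant-a/quota.yaml (hypothetical path)
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: tenant-a                 # hypothetical tenant namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"      # cap concurrent GPU requests for this tenant
```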
**Post-incident mini-PIR:**
* Short doc: `docs/postmortems/gpu-wait-time-slo-incident-YYYYMMDD.md`
  * Describe whether capacity or fairness was the main issue.
  * Capture Prometheus queries/dashboards used for analysis.
---
### Step 7 - Simulated Drift Event: PUE SLO Breach
**Goal:**
Show how sustainability KPI drift is detected and investigated.
Options:
* Simulate a persistent increase in `dc_facility_power_kw` without corresponding IT load (e.g., cooling inefficiency).
* Reduce IT load significantly while keeping facility load high (underutilisation).
**What trainees do:**
1. Observe `slo:dc_pue_30d` creeping above 1.4 in Grafana.
2. See `PUESLOBreached` alert.
3. Investigate:
   * Are GPU nodes idle but powered? (Check utilisation metrics; see the helper rule sketch below.)
   * Are workloads being moved away from EU-PAR-FR01 unnecessarily?
   * Is the facility using a less efficient cooling mode?
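For the "idle but powered" question, a helper recording rule can make the check repeatable; this sketch assumes the NVIDIA DCGM exporter's `DCGM_FI_DEV_GPU_UTIL` metric is scraped, and the grouping label depends on your scrape/relabel config.
```yaml
# Optional helper rule to surface average GPU utilisation per scrape target.
groups:
  - name: gpu-utilisation-helpers
    rules:
      - record: gpu:node_utilisation_1h:avg
        expr: |
          avg by (instance) (
            avg_over_time(DCGM_FI_DEV_GPU_UTIL[1h])
          )
```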
**Mitigation options (still respecting sovereignty):**
* Consolidate workloads onto fewer nodes and power down unused hardware.
* Adjust cooling setpoints or free-cooling thresholds if allowed.
* For non-personal / low-sensitivity workloads, consider offloading to another site where more efficient energy use is possible, *but only within regulatory constraints* (e.g., still within the EU/EEA).
**Drift documentation:**
* Add a short drift record to `infra-foundation/docs/drift/pue-eu-par-fr01-YYYYMMDD.md`:
  * When drift was detected
  * Probable causes
  * Actions taken (e.g., hardware consolidation, workload movements)
---
### Lab 2 Definition of Done
* `gpu-job-wait-time` and `site-pue` SLO specs exist in Git.
* Prometheus rules and Grafana dashboards are deployed and show data.
* GPU wait-time SLO has been intentionally breached and investigated.
* PUE SLO breach scenario has been explored and mitigations discussed & optionally simulated.
* Sustainability KPIs (PUE) are clearly treated as first-class signals, not afterthoughts.
---
## D5 Overall Definition of Done
Once you finish both labs:
1. **SLOs are codified, not hand-waved**
   * Control-plane, GPU job wait time, and PUE SLOs live as declarative specs.
   * Queries, dashboards, and alerts are versioned in Git.
2. **Telemetry is wired end-to-end**
   * Prometheus scrapes:
     * K8s API server and core components
     * GPU scheduler / job metrics
     * DC power metrics needed for PUE
   * Grafana dashboards are generated from Git, not manually built in prod.
3. **Teams have practised incident & drift response**
   * At least:
     * One control-plane SLO incident
     * One GPU wait-time incident
     * One PUE drift event
   * All handled via **Git + pipelines**, not ad-hoc manual tweaks.
4. **Sustainability is embedded**
   * PUE is monitored and has an SLO.
   * There's a narrative for how capacity planning and workload placement affect sustainability and sovereignty together.
---