You already have D4 (Sovereignty & Compliance Labs), so I'll move to the next block in the sequence: **D5 - Observability & SLO Labs Manual**.

---

# D5 — Observability & SLO Labs Manual

**Focus:** SLOs, telemetry, dashboards, and incident/drift response for EU-PAR-FR01.

We'll treat D5 as **2 main labs** (each can be a half-day):

1. **Lab 1 - Control Plane & Core Platform SLOs**
2. **Lab 2 - GPU Job Wait Time & Sustainability (PUE) SLOs**

Assumptions (from D1-D4):

* Repos: `infra-foundation`, `platform-clusters`, `policies-and-compliance`
* Pipelines: `lint_and_unit`, `policy_gates`, `integration_test`, `site_rollout`
* Site: **EU-PAR-FR01** with:

  * Mgmt K8s cluster GitOps-managed
  * Prometheus & Grafana deployable via `platform-clusters`
  * DCIM/PDUs exporting basic power metrics (or simulated) for PUE

---

## Shared Pre-Reqs for D5

* `platform-clusters/addons/monitoring-logging-security` exists with:

  * Prometheus `values.yaml` or equivalent config
  * Grafana dashboards folder
* CI can deploy/upgrade the monitoring stack via `site_rollout`
* Trainees know:

  * How to add metrics scrape configs (a minimal sketch follows this list)
  * Basic Prometheus query syntax
  * How to commit dashboards (JSON) into Git
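A minimal sketch of what "adding a scrape config" can look like, assuming the monitoring addon uses a kube-prometheus-stack-style `values.yaml`; the exporter name, namespace, and port are placeholders for illustration only:

```yaml
# addons/monitoring-logging-security/prometheus/values.yaml (assumed layout, fragment)
prometheus:
  prometheusSpec:
    additionalScrapeConfigs:
      - job_name: dcim-exporter          # hypothetical exporter for facility/power metrics
        scrape_interval: 60s
        static_configs:
          - targets: ["dcim-exporter.monitoring.svc:9100"]   # placeholder service:port
```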
---

## Lab 1 - Control Plane & Core Platform SLOs

**Theme:**
Define, implement, and operate SLOs for the **K8s control plane** and basic platform health.

### Learning Objectives

By the end of Lab 1, trainees can:

* Define an SLO for control-plane availability in measurable terms.
* Implement the SLO as:

  * Prometheus queries (SLIs)
  * Dashboard panels
  * Alert rules
* Respond to a simulated control-plane incident using **Git-only** changes.

### Timebox

~3-4 hours.

---
### Step 0 - Scenario

EU-PAR-FR01 must deliver:

* **Control-plane availability SLO**: 99.95% over 30 days
* Violations should:

  * Be visible in Grafana
  * Trigger alerts to the SRE on-call (a minimal routing sketch follows this list)
* All configuration lives in the `platform-clusters` repo; no manual dashboard editing in production Grafana.
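A minimal sketch of how "alerts to the SRE on-call" could be routed, assuming Alertmanager ships as part of the monitoring addon; the file location, receiver name, and webhook URL are placeholders:

```yaml
# e.g. addons/monitoring-logging-security/alertmanager/alertmanager.yaml (assumed location)
route:
  receiver: default
  routes:
    - matchers:
        - site = "EU-PAR-FR01"
        - severity =~ "warning|critical"
      receiver: sre-oncall                 # on-call team for this site
receivers:
  - name: default
  - name: sre-oncall
    webhook_configs:
      - url: "https://oncall.example.internal/hook"   # placeholder paging integration
```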
---

### Step 1 - Define SLO Object (Declarative Spec)

**Repo:** `platform-clusters`
**Branch:** `feat/d5-lab1-control-plane-slo`

Create a logical SLO spec:

`addons/monitoring-logging-security/slo/control-plane.yaml`:

```yaml
apiVersion: slo.example.io/v1
kind: ServiceLevelObjective
metadata:
  name: control-plane-availability
  labels:
    site: EU-PAR-FR01
spec:
  description: "Availability of the K8s API server for eu-par-fr01-mgmt"
  target: 99.95
  window: 30d
  sli:
    type: events
    source: prometheus
    success_query: |
      sum(rate(apiserver_request_total{code!~"5.."}[5m]))
    total_query: |
      sum(rate(apiserver_request_total[5m]))
  alerting:
    burn_rates:
      - window: 1h
        threshold: 14.4 # example — high burn
      - window: 6h
        threshold: 6
```

This can later be consumed by an SLO controller or just treated as documentation + convention.

Run:

```bash
./scripts/lint.sh
```

(You can choose whether to enforce schema via policy or not; for training, start lenient.)

---
### Step 2 - Prometheus Rules for SLI & Alerts

Add or extend:

`addons/monitoring-logging-security/prometheus/rules/control-plane-slo.yaml`:

```yaml
groups:
  - name: control-plane-slo-rules
    rules:
      - record: slo:apiserver_availability:ratio
        expr: |
          sum(rate(apiserver_request_total{code!~"5.."}[5m]))
          /
          sum(rate(apiserver_request_total[5m]))

      - alert: ControlPlaneAvailabilityLow
        expr: slo:apiserver_availability:ratio < 0.9995
        for: 5m
        labels:
          severity: warning
          site: EU-PAR-FR01
        annotations:
          summary: "Control plane availability SLO under target for EU-PAR-FR01"
          description: "apiserver availability ratio below 99.95% over the last 5m window."
```

Run local checks:

```bash
./scripts/lint.sh
```

If you lint Prometheus rules (e.g., via `promtool`), ensure that's included in `lint.sh`.
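If the check lives in CI rather than (or in addition to) `lint.sh`, a sketch of a rule-lint step is below, assuming a GitLab-CI-style pipeline definition; the job name, stage name, and image tag are assumptions to adapt to however `lint_and_unit` is actually defined:

```yaml
# .gitlab-ci.yml fragment (assumed CI flavour)
prometheus_rules_lint:
  stage: lint_and_unit
  image:
    name: prom/prometheus:latest    # ships /bin/promtool
    entrypoint: [""]
  script:
    # syntax-check every committed rule file
    - promtool check rules addons/monitoring-logging-security/prometheus/rules/*.yaml
```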
---

### Step 3 - Grafana Dashboard for SLO

Add a dashboard JSON under:

`addons/monitoring-logging-security/grafana/dashboards/control-plane-slo.json`

Include at minimum:

* Timeseries panel: `slo:apiserver_availability:ratio`
* Threshold line at 0.9995
* SingleStat (or equivalent) showing the last 30d average

Example query for the panel:

```promql
avg_over_time(slo:apiserver_availability:ratio[30d])
```

Commit the JSON as code (never edit directly in live Grafana without exporting back to Git).

---
### Step 4 - Deploy Monitoring & SLO

Push branch, open MR, ensure CI passes:

* `lint_and_unit` (YAML + Prometheus lint)
* `policy_gates` (if you have basic policies for dashboards/rules)

After merge:

* Run `site_rollout EU-PAR-FR01` from `platform-clusters`
* Check:

  * Prometheus has the new recording rule
  * Grafana deploys/refreshes the dashboard from ConfigMaps or the sidecar (see the sketch after this list)
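A minimal sketch of the ConfigMap-plus-sidecar pattern mentioned above, assuming the Grafana dashboard sidecar is enabled and watches its default `grafana_dashboard` label; the namespace and the generation mechanism (hand-written manifest vs. kustomize `configMapGenerator`) are up to your repo conventions:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: control-plane-slo-dashboard
  namespace: monitoring               # assumed monitoring namespace
  labels:
    grafana_dashboard: "1"            # label the sidecar watches (chart default)
data:
  # in practice this is the full committed dashboard JSON, not this placeholder
  control-plane-slo.json: |
    { "title": "Control Plane SLO", "panels": [] }
```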
---

### Step 5 - Simulated Incident: Control Plane Degradation

Goal: create a realistic scenario where the control-plane SLO dips.

Options (depending on your lab environment):

* Easiest:

  * Run a script that:

    * Sends a high volume of failing requests to the `apiserver`, or
    * Temporarily blocks access from a synthetic probe pod (see the sketch after this list).
* Or training-mode simulation:

  * Inject synthetic timeseries data (if using a test Prometheus) that mimics downtime.
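For the probe-blocking option, a Git-managed fault injection could look like the sketch below: a deny-all egress NetworkPolicy scoped to the probe pod. This is an assumption-heavy illustration: the namespace and pod labels are placeholders, it requires a CNI that enforces NetworkPolicy, and it only degrades probe-based checks, so pair it with a probe-driven SLI or use one of the other options if your SLI is the server-side `apiserver_request_total` ratio.

```yaml
# applied (and later reverted) via Git, never via kubectl edit
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-synthetic-probe-egress
  namespace: monitoring               # assumed namespace of the probe pod
spec:
  podSelector:
    matchLabels:
      app: apiserver-probe            # hypothetical probe pod label
  policyTypes:
    - Egress
  egress: []                          # no egress rules = all outbound traffic denied
```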
**What trainees do:**

1. Observe the alert firing in Prometheus/Alertmanager.
2. Check the Grafana `control-plane-slo` dashboard:

   * SLO ratio visibly dips below 99.95%.
3. Perform **root cause analysis**:

   * Logs, node states, recent config changes (via Git log).
4. Mitigation:

   * Example: revert a recent change that caused the misconfiguration, or mark one node as unschedulable and drain it, etc.
   * All mitigations must be done via Git (manifests, config, etc.).

**Post-incident review steps:**

* Add a short `postmortems/control-plane-slo-incident-YYYYMMDD.md` in `platform-clusters/docs`:

  * Timeline (detection → mitigation → recovery)
  * Queries/dashboards used
  * Follow-up actions (e.g., more granular synthetics, tightening policies)

---
### Lab 1 Definition of Done

* `control-plane-availability` SLO spec exists in Git.
* Prometheus rules & Grafana dashboard are deployed from `platform-clusters`.
* Team has simulated a control-plane issue and used:

  * SLO signals
  * Dashboards
  * Git-based config changes

  to detect, respond to, and document the incident.

---
## Lab 2 - GPU Job Wait Time & Sustainability (PUE) SLOs

**Theme:**
Treat GPU performance and energy efficiency as first-class SLOs: **GPU job wait time** and **PUE** for EU-PAR-FR01.

### Learning Objectives

By the end of Lab 2, trainees can:

* Define an SLO for GPU job wait time that both infra and tenants understand.
* Define a sustainability SLO based on PUE.
* Wire metrics, dashboards, and alerts.
* Investigate and respond to both **performance** and **efficiency** SLO breaches.

### Timebox

~4 hours.

---

### Step 0 - Scenario

* EU-PAR-FR01 runs shared GPU clusters for multiple tenants.
* Commitments:

  * **GPU job wait time SLO**: 95% of jobs start within 5 minutes of submission.
  * **PUE SLO**: 30-day rolling average PUE ≤ 1.4.

We assume:

* GPU scheduler exports a `gpu_job_queue_seconds_bucket` histogram.
* DC power metrics:

  * `dc_it_power_kw{site="EU-PAR-FR01"}`
  * `dc_facility_power_kw{site="EU-PAR-FR01"}`

---
### Step 1 - GPU Job Wait Time SLO Spec

**Repo:** `platform-clusters`
**Branch:** `feat/d5-lab2-gpu-pue-slos`

Create:

`addons/monitoring-logging-security/slo/gpu-job-wait-time.yaml`:

```yaml
apiVersion: slo.example.io/v1
kind: ServiceLevelObjective
metadata:
  name: gpu-job-wait-time
  labels:
    site: EU-PAR-FR01
spec:
  description: "95% of GPU jobs start within 5 minutes of submission"
  target: 0.95
  window: 30d
  sli:
    type: latency
    source: prometheus
    objective_seconds: 300
    query: |
      histogram_quantile(
        0.95,
        sum by (le) (
          rate(gpu_job_queue_seconds_bucket[5m])
        )
      )
```

This expresses the performance objective.

---
### Step 2 - PUE SLO Spec

In the same branch:

`addons/monitoring-logging-security/slo/pue.yaml`:

```yaml
apiVersion: slo.example.io/v1
kind: ServiceLevelObjective
metadata:
  name: site-pue
  labels:
    site: EU-PAR-FR01
spec:
  description: "30-day rolling PUE target for EU-PAR-FR01"
  target: 1.4
  window: 30d
  sli:
    type: ratio
    source: prometheus
    query: |
      # PromQL subquery: average the facility/IT power ratio over 30 days
      avg_over_time(
        (
          dc_facility_power_kw{site="EU-PAR-FR01"}
          /
          dc_it_power_kw{site="EU-PAR-FR01"}
        )[30d:1h]
      )
```

---
### Step 3 - Prometheus Recording & Alerting Rules

`addons/monitoring-logging-security/prometheus/rules/gpu-and-pue-slo.yaml`:

```yaml
groups:
  - name: gpu-job-wait-time-slo
    rules:
      - record: slo:gpu_job_wait_time_p95_seconds
        expr: |
          histogram_quantile(
            0.95,
            sum by (le) (
              rate(gpu_job_queue_seconds_bucket[5m])
            )
          )

      - alert: GPUJobWaitTimeSLOBreached
        expr: slo:gpu_job_wait_time_p95_seconds > 300
        for: 10m
        labels:
          severity: warning
          site: EU-PAR-FR01
        annotations:
          summary: "GPU job wait time SLO breached on EU-PAR-FR01"
          description: "p95 wait time > 300s over the last 10 minutes."

  - name: pue-slo
    rules:
      - record: slo:dc_pue_30d
        expr: |
          avg_over_time(
            (
              dc_facility_power_kw{site="EU-PAR-FR01"}
              /
              dc_it_power_kw{site="EU-PAR-FR01"}
            )[30d:1h]
          )

      - alert: PUESLOBreached
        expr: slo:dc_pue_30d > 1.4
        for: 1h
        labels:
          severity: info
          site: EU-PAR-FR01
        annotations:
          summary: "PUE SLO breached for EU-PAR-FR01"
          description: "30d rolling PUE above 1.4; investigate cooling, capacity, or workload placement."
```

Run:

```bash
./scripts/lint.sh
```

Fix any errors.

---
### Step 4 - Dashboards for GPU & PUE

Add Grafana dashboards (JSON) under `grafana/dashboards/`, e.g.:

* `gpu-performance-slo.json`

  * Panels:

    * `slo:gpu_job_wait_time_p95_seconds` over time
    * Histogram of `gpu_job_queue_seconds_bucket`
    * Breakdown by tenant (if labels exist)
* `pue-overview.json`

  * Panels:

    * `dc_it_power_kw`, `dc_facility_power_kw`
    * `dc_facility_power_kw / dc_it_power_kw` as instant PUE
    * `slo:dc_pue_30d` as the 30d rolling metric

Commit the dashboards into Git.

---
### Step 5 - Deploy Monitoring Update

Push branch, open MR, ensure:

* `lint_and_unit` passes
* `policy_gates` passes (if any rules apply to SLO objects/rules)
* After merge, run `site_rollout EU-PAR-FR01` to redeploy the monitoring stack

Verify in Grafana:

* Dashboards are visible.
* Queries return data (even if the data is synthetic in a test environment).

---
### Step 6 - Simulated Incident: GPU Wait Time SLO Breach

**Goal:**
Show how a shared GPU cluster can breach its SLO when overloaded or misconfigured.

Options:

* Flood the cluster with several long-running GPU jobs from multiple tenants.
* Misconfigure the scheduler to reduce available GPUs (e.g., taint nodes via Git, or reduce the replica count of the GPU compute pool).

**What trainees do:**

1. Observe increased `slo:gpu_job_wait_time_p95_seconds` in Grafana.
2. See the `GPUJobWaitTimeSLOBreached` alert firing.
3. Correlate with:

   * Spike in queued jobs.
   * Recent config change (e.g., reduced GPU pool size).

**Mitigation paths (all via Git):**

* Increase GPU capacity:

  * Add GPU nodes in `infra-foundation/baremetal/profiles/compute-gpu.yaml` and re-run `site_rollout`.
* Adjust scheduling:

  * Change resource quotas or max jobs per tenant in `platform-clusters` workload configs (see the sketch after this list).
  * Or apply fair-share scheduling policies (if your lab environment supports them).
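A minimal sketch of the per-tenant quota option, assuming tenants map to namespaces and GPUs are requested as the `nvidia.com/gpu` extended resource; the namespace name and the cap are placeholders:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: tenant-a                  # hypothetical tenant namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"       # cap on concurrently requested GPUs for this tenant
```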
**Post-incident mini-PIR:**

* Short doc: `docs/postmortems/gpu-wait-time-slo-incident-YYYYMMDD.md`

  * Describe whether capacity or fairness was the main issue.
  * Capture Prometheus queries/dashboards used for analysis.

---
### Step 7 - Simulated Drift Event: PUE SLO Breach

**Goal:**
Show how sustainability KPI drift is detected and investigated.

Options:

* Simulate a persistent increase in `dc_facility_power_kw` without corresponding IT load (e.g., cooling inefficiency).
* Reduce IT load significantly while keeping facility load high (underutilisation).

**What trainees do:**

1. Observe `slo:dc_pue_30d` creeping above 1.4 in Grafana.
2. See the `PUESLOBreached` alert.
3. Investigate:

   * Are GPU nodes idle but powered? (Check utilisation metrics.)
   * Are workloads moved away from EU-PAR-FR01 unnecessarily?
   * Is the facility using a less efficient cooling mode?

**Mitigation options (still respecting sovereignty):**

* Consolidate workloads onto fewer nodes and power down unused hardware.
* Adjust cooling setpoints or free-cooling thresholds if allowed.
* For non-personal / low-sensitivity workloads, consider offloading to another site where more efficient energy use is possible — *but only within regulatory constraints* (e.g., still within the EU/EEA).

**Drift documentation:**

* Add a short drift record to `infra-foundation/docs/drift/pue-eu-par-fr01-YYYYMMDD.md`:

  * When drift was detected
  * Probable causes
  * Actions taken (e.g. hardware consolidation, workload movements)

---
### Lab 2 Definition of Done

* `gpu-job-wait-time` and `site-pue` SLO specs exist in Git.
* Prometheus rules and Grafana dashboards are deployed and show data.
* GPU wait-time SLO has been intentionally breached and investigated.
* PUE SLO breach scenario has been explored and mitigations discussed & optionally simulated.
* Sustainability KPIs (PUE) are clearly treated as first-class signals, not afterthoughts.

---
## D5 Overall Definition of Done

Once you finish both labs:

1. **SLOs are codified, not hand-waved**

   * Control-plane, GPU job wait time, and PUE SLOs live as declarative specs.
   * Queries, dashboards, and alerts are versioned in Git.

2. **Telemetry is wired end-to-end**

   * Prometheus scrapes:

     * K8s API server and core components
     * GPU scheduler / job metrics
     * DC power metrics needed for PUE
   * Grafana dashboards are generated from Git, not manually built in prod.

3. **Teams have practised incident & drift response**

   * At least:

     * One control-plane SLO incident
     * One GPU wait-time incident
     * One PUE drift event
   * All handled via **Git + pipelines**, not ad-hoc manual tweaks.

4. **Sustainability is embedded**

   * PUE is monitored and has an SLO.
   * There's a narrative for how capacity planning and workload placement affect sustainability and sovereignty together.

---