Alright, time for **D3: Pipelines & Policy Training** — this is where the Git-only story actually becomes *operational muscle memory*.

I'll structure it like D1/D2:

* Multi-day manual
* Each day = objectives, pre-reqs, concrete steps, and Definition of Done

Scope for D3:

* One shared CI pattern across the three repos
* Strong focus on `policy_gates` + how trainees **see, debug, and fix** failures
* Hands-on rollback drills & “bad change caught by policy” examples
---

# D3 — Pipelines & Policy Training Manual

**Focus:** CI/CD stages, policy-as-code enforcement, and safe rollback patterns.

We'll use **3 training days**:

* **Day 1:** Pipeline anatomy & local validation
* **Day 2:** Policy gate labs (residency, RBAC, naming)
* **Day 3:** Rollback & failure-handling drills

Assumed repos (as in D1/D2):

* `infra-foundation`
* `platform-clusters`
* `policies-and-compliance`

Standard stages:

* `lint_and_unit`
* `policy_gates`
* `integration_test`
* `promotion_to_template`
* `site_rollout`

---
## Day 1 — Pipeline Anatomy & Local Validation

### Learning Objectives

By the end of Day 1, the team can:

* Read and explain the **CI pipeline layout** for all three repos.
* Run **local equivalents** of `lint_and_unit` and `policy_gates`.
* Predict which stage will fail given a specific mistake.

### Pre-Reqs

* D1 and D2 manuals executed (basic repo layout & templates exist).
* CI runner available for all three repos.
* OPA (or Conftest) + basic linters pre-installed or containerised.

---
### 1.1 Standard CI Blueprint (per repo)

Pick one repo (e.g. `platform-clusters`) and ensure `.gitlab-ci.yml` (or similar) looks roughly like:

```yaml
stages:
  - lint_and_unit
  - policy_gates
  - integration_test
  - promotion_to_template
  - site_rollout

lint_and_unit:
  stage: lint_and_unit
  script:
    - ./scripts/lint.sh
  only:
    - merge_requests
    - main

policy_gates:
  stage: policy_gates
  script:
    - ./scripts/run_opa.sh
  needs: ["lint_and_unit"]
  only:
    - merge_requests
    - main

integration_test:
  stage: integration_test
  script:
    - ./scripts/integration_tests.sh
  needs: ["policy_gates"]
  when: manual
  only:
    - main

promotion_to_template:
  stage: promotion_to_template
  script:
    - ./scripts/promote_templates.sh
  only:
    - main

site_rollout:
  stage: site_rollout
  script:
    - ./scripts/apply_site.sh EU-PAR-FR01
  when: manual
  needs: ["integration_test"]
  only:
    - main
```
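To anchor the gating order, here is a toy shell model of how `needs:` chains the stages. Stage names come from the blueprint above; the stage bodies are placeholders, with `integration_test` forced to fail for illustration.

```shell
# Toy model of the gating order: each stage runs only if the previous one
# succeeded, which is what `needs:` expresses in the real pipeline.
run_stage() { echo "[stage] $1"; "$2"; }

lint_and_unit_ok() { true; }
policy_gates_ok()  { true; }
integration_bad()  { false; }   # pretend integration fails

pipeline() {
  run_stage lint_and_unit    lint_and_unit_ok &&
  run_stage policy_gates     policy_gates_ok  &&
  run_stage integration_test integration_bad
}

pipeline || echo "pipeline stopped at the first failing stage"
```

Later stages (`promotion_to_template`, `site_rollout`) simply never run, exactly as in CI.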
**Exercise:**

For each repo, have participants:

1. Open the CI file.
2. Identify:

   * Where `lint_and_unit` is defined.
   * Where `policy_gates` is defined and which script it calls.
   * Which stages are `manual` vs automatic.

3. In a short whiteboard or doc, write:

   * “If I break YAML → which stage fails?”
   * “If I violate data residency → which stage fails?”
   * “If the cluster fails health checks post-deploy → which stage fails?”

---

### 1.2 Local Lint & Policy Scripts

**Goal:** Engineers should be able to run the same checks *locally* before pushing.

Example `scripts/lint.sh` (for a K8s repo):

```bash
#!/usr/bin/env bash
set -e

echo "[lint] YAML syntax"
find k8s -name '*.yaml' -print0 | xargs -0 yamllint

echo "[lint] K8s schema (kubeconform or similar)"
kubeconform -summary -strict -ignore-missing-schemas \
  -kubernetes-version 1.29.0 \
  $(find k8s -name '*.yaml' -print)
```

Example `scripts/run_opa.sh`:

```bash
#!/usr/bin/env bash
set -e

echo "[policy] Running OPA policies"
conftest test \
  --policy ./opa-policies \
  $(find k8s -name '*.yaml' -print)
```

**Exercise:**

For each repo:

1. Open `scripts/lint.sh` and `scripts/run_opa.sh`.
2. Run them locally:

   ```bash
   ./scripts/lint.sh
   ./scripts/run_opa.sh
   ```

3. Intentionally break something small (e.g., a missing `:` in YAML), watch `lint_and_unit` fail locally, then fix it.
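To make the local-first habit stick, the same two scripts can be wired into a `pre-push` Git hook. A minimal sketch (the script paths are the ones this repo's CI uses; `run_local_checks` is a hypothetical helper name, and the body would live in `.git/hooks/pre-push` or your hook manager):

```shell
#!/usr/bin/env bash
# Hypothetical pre-push hook body: run the same checks CI runs, before pushing.
run_local_checks() {
  local check
  for check in ./scripts/lint.sh ./scripts/run_opa.sh; do
    if [ -x "$check" ]; then
      echo "[pre-push] running $check"
      "$check" || { echo "[pre-push] $check failed" >&2; return 1; }
    else
      echo "[pre-push] $check not found, skipping" >&2
    fi
  done
  echo "[pre-push] local checks done"
}

run_local_checks
```

A failing check blocks the push, so the feedback loop stays on the engineer's laptop instead of in CI.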
---

### 1.3 Day 1 Definition of Done

* Every engineer knows:

  * What each CI stage does.
  * Roughly which types of errors show up where.

* Every engineer can run `./scripts/lint.sh` and `./scripts/run_opa.sh` locally and interpret failures.

---

## Day 2 — Policy Gate Labs (Data Residency, RBAC, Naming)

### Learning Objectives

By the end of Day 2, the team can:

* Write *and* debug OPA policies.
* See how a bad change is blocked at `policy_gates`.
* Understand 3 core policy categories:

  * **Naming & structure**
  * **Data residency**
  * **RBAC/admin access**

We'll implement **three concrete labs**.
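The labs below cover residency, storage, and RBAC; for the **naming & structure** category, here is a sketch of the kind of convention such a gate encodes. The `<region>-<city>-<countryNN>` pattern is an assumption inferred from this manual's `eu-par-fr01` examples; the real gate would be a Rego rule in `opa-policies/`.

```shell
# Hypothetical naming check mirroring a "naming & structure" policy:
# lowercase <region>-<city>-<countryNN>, e.g. "eu-par-fr01".
valid_site_name() {
  echo "$1" | grep -Eq '^[a-z]{2}-[a-z]{3}-[a-z]{2}[0-9]{2}$'
}

valid_site_name "eu-par-fr01" && echo "ok: eu-par-fr01"
valid_site_name "paris-1"     || echo "rejected: paris-1"
```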
---

### Lab 1 — Data Residency: Illegal Backup Region

**Context**

Repo: `platform-clusters`
Policy file (already present from earlier): `opa-policies/data_residency.rego`

Example policy:

```rego
package data_residency

deny[msg] {
    input.kind == "BackupPolicy"
    input.metadata.labels["data_classification"] == "CRITICAL_SOVEREIGN_FR"
    not input.spec.target.region == "fr-central"
    msg := sprintf("critical FR data must backup to fr-central, got %v", [input.spec.target.region])
}
```

**Starting manifest** (intentionally wrong):

```yaml
# k8s/clusters/eu-par-fr01/backups.yaml
apiVersion: backup.example.io/v1
kind: BackupPolicy
metadata:
  name: fr-critical-sovereign-backup
  labels:
    data_classification: CRITICAL_SOVEREIGN_FR
spec:
  schedule: "0 * * * *"
  target:
    provider: "object-storage"
    region: "eu-central-1" # WRONG (non-FR region)
```

**Steps**

1. Create a branch:

   ```bash
   git checkout -b feat/lab-data-residency
   ```

2. Add the above `backups.yaml`.
3. Run the local check:

   ```bash
   ./scripts/run_opa.sh
   ```

4. Observe the failure message (from OPA).
5. Fix the region to `fr-central`.
6. Re-run `./scripts/run_opa.sh` → no deny messages.
7. Push the branch, open an MR, and ensure `policy_gates` passes in CI.

**Day 2 outcome from Lab 1**

* The team sees how **residency** constraints are enforced before deployment.
* The data classification label on `BackupPolicy` is now meaningful; it controls which regions are allowed.

---
### Lab 2 — Namespace Classification & Storage Mapping

**Context**

Namespace and StorageClass for critical FR workloads.

Manifest (intentionally wrong storage):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: fr-critical-sovereign-ai
  labels:
    data_classification: CRITICAL_SOVEREIGN_FR
    country: FR
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fr-critical-sovereign-sc
provisioner: "ceph.rbd"
parameters:
  pool: "gpu-block-eu" # WRONG - replicated to non-FR sites
```

Corresponding policy (example) in `data_residency.rego`:

```rego
package data_residency

deny[msg] {
    input.kind == "StorageClass"
    input.metadata.name == "fr-critical-sovereign-sc"
    input.parameters.pool != "gpu-block-fr-local"
    msg := sprintf("StorageClass %v must use gpu-block-fr-local for critical FR data", [input.metadata.name])
}
```

**Steps**

1. Branch:

   ```bash
   git checkout -b feat/lab-storage-residency
   ```

2. Add the above Namespace + StorageClass.
3. Run `./scripts/run_opa.sh` → see the deny message.
4. Fix:

   ```yaml
   parameters:
     pool: "gpu-block-fr-local"
   ```

5. Re-run locally, then push & confirm CI passes.

**Outcome**

* Engineers see how **storage-level locality** is enforced.
* They learn to correlate which policy file produced which error.

---
### Lab 3 — Admin RBAC Restriction

**Context**

A ClusterRoleBinding must only bind admin to a sovereign identity.

Manifest (intentionally wrong):

```yaml
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cluster-admin
subjects:
  - kind: User
    name: temp-admin@example.com # WRONG
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
```

Policy `opa-policies/rbac.rego`:

```rego
package rbac

deny[msg] {
    input.kind == "ClusterRoleBinding"
    input.metadata.name == "cluster-admin"
    s := input.subjects[_]
    s.kind == "User"
    not endswith(s.name, "@sovereign-ops.fr")
    msg := "cluster-admin bindings must target sovereign-ops.fr principals only"
}
```

**Steps**

1. Branch:

   ```bash
   git checkout -b feat/lab-rbac
   ```

2. Add the ClusterRoleBinding manifest.
3. Run `./scripts/run_opa.sh` → see the rbac deny message.
4. Fix to:

   ```yaml
   subjects:
     - kind: Group
       name: sovereign-ops-admins@sovereign-ops.fr
   ```

5. Re-run locally and in CI.
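For intuition only, the policy's `endswith` constraint can be mirrored as a shell check; the enforcement itself stays in `opa-policies/rbac.rego`, and `is_sovereign_principal` is a hypothetical helper name:

```shell
# Mirrors endswith(s.name, "@sovereign-ops.fr") from the Rego rule above.
is_sovereign_principal() {
  case "$1" in
    *@sovereign-ops.fr) return 0 ;;
    *)                  return 1 ;;
  esac
}

is_sovereign_principal "temp-admin@example.com" || echo "denied: temp-admin@example.com"
is_sovereign_principal "sovereign-ops-admins@sovereign-ops.fr" && echo "allowed"
```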
**Outcome**

* Engineers see how **RBAC & identity** constraints prevent dangerous bindings from ever reaching the cluster.

---

### Day 2 Definition of Done

* All three labs completed:

  * Data residency / backup region.
  * Storage-class locality.
  * RBAC/admin constraints.

* Engineers can:

  * Identify which OPA package is failing from CI logs.
  * Fix YAML to satisfy policy without disabling or bypassing it.

---

## Day 3 — Rollback & Failure Handling Drills

### Learning Objectives

By the end of Day 3, the team can:

* Understand what happens when `integration_test` or `site_rollout` fails.
* Practise **rollback via Git and pipelines**, not manual hotfixes.
* Distinguish between:

  * **Pre-merge failures** (lint, policy)
  * **Post-merge / post-deploy failures** (integration, runtime)

We'll run **two drills**:

1. K8s cluster misconfiguration & GitOps rollback.
2. Bare-metal provisioning error & image rollback.

---
### Drill 1 — K8s Misconfig & GitOps Rollback

**Context**

Repo: `platform-clusters`
We will introduce a bad CNI config that breaks cluster networking.

**Starting good config** (simplified):

```yaml
# k8s/clusters/eu-par-fr01/mgmt-cluster.yaml
cluster:
  name: eu-par-fr01-mgmt
  site: EU-PAR-FR01
  control_plane_nodes: 3
  node_pool_profile: compute-standard
  networking:
    cni: cilium
```

**Bad change** (training):

```yaml
  networking:
    cni: "nonexistent-cni" # WRONG
```

Integration test script example, `scripts/integration_tests.sh`:

```bash
#!/usr/bin/env bash
set -e

echo "[integration] Checking API server health"
kubectl --context eu-par-fr01-mgmt get --raw='/healthz' || {
  echo "API server health check failed"
  exit 1
}
```

**Steps**

1. Branch:

   ```bash
   git checkout -b exp/break-cni
   ```

2. Change the CNI to `nonexistent-cni`.
3. Run `./scripts/lint.sh` and `./scripts/run_opa.sh` — they still pass (no syntax/policy issue).
4. Open an MR and merge it (in the training env).
5. CI pipeline on `main`:

   * `lint_and_unit` → pass
   * `policy_gates` → pass
   * `integration_test` → **fails** because the API server becomes unhealthy after the GitOps sync.

**Observation**

* Trainees inspect the integration stage logs and see the API health failure.
* The GitOps UI (Argo/Flux) may show the sync succeeded even though the cluster is unhealthy.

**Rollback procedure (Git-first)**

1. `git revert <bad-commit>` on `main`, or via an MR in a training branch.
2. Push the revert; the pipeline re-runs:

   * `lint_and_unit` → pass
   * `policy_gates` → pass
   * `integration_test` → pass (API healthy again)

3. GitOps applies the reverted config and restores the previous working state.
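The revert flow can be rehearsed end-to-end in a throwaway repo before touching the real one. A self-contained sketch (file name and commit messages are illustrative):

```shell
# Rehearse the Git-first rollback pattern in a scratch repo.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email "trainee@example.com"
git config user.name  "trainee"

printf 'networking:\n  cni: cilium\n' > mgmt-cluster.yaml
git add mgmt-cluster.yaml
git commit -qm "good: cilium CNI"

printf 'networking:\n  cni: nonexistent-cni\n' > mgmt-cluster.yaml
git commit -qam "bad: broken CNI"

# History keeps both the mistake and the revert: no force-push, no rewrite.
git revert --no-edit HEAD >/dev/null
grep -q 'cni: cilium' mgmt-cluster.yaml && echo "config restored"
```

On the real repo the only extra step is `git push`, which re-triggers the pipeline and lets GitOps converge on the reverted state.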
**Important**

* No manual `kubectl edit` or direct patching allowed.
* If an emergency manual patch is used during the drill, it must be:

  * Logged.
  * Immediately codified as a commit, with the manual state reversed.

---

### Drill 2 — Bare-Metal Provisioning Error & Image Rollback

**Context**

Repo: `infra-foundation`
We simulate a wrong OS image channel leading to a bad kernel version.

Bare-metal profile (good):

```yaml
profile:
  name: compute-standard
  cpu: "2x32"
  ram_gb: 512
  role: "k8s-node"
  image:
    name: "ubuntu-22.04"
    channel: "stable"
```

**Bad change:**

```yaml
  image:
    name: "ubuntu-22.04"
    channel: "unstable-edge" # WRONG
```

Integration check for the OS image, in `scripts/integration_tests.sh`:

```bash
#!/usr/bin/env bash
set -e

echo "[integration] Checking kernel version on compute nodes"
ansible -i hypervisor/ansible/inventory/eu-par-fr01.ini k8s_nodes \
  -m shell -a "uname -r" \
  | tee /tmp/kernel_versions.txt

# Match "-rcN" specifically: a bare "rc" would also hit the "rc=0"
# in Ansible's own status lines and always fail.
if grep -Eq -- '-rc[0-9]' /tmp/kernel_versions.txt; then
  echo "Unstable kernel detected on k8s nodes"
  exit 1
fi
```
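One subtlety in this check: Ansible's ad-hoc output prefixes each result with a status line like `node1 | CHANGED | rc=0 >>`, which itself contains the substring `rc`, so the grep has to target the kernel's `-rcN` suffix specifically. A self-contained check of that pattern (`has_rc_kernel` is a hypothetical helper):

```shell
# Match release-candidate kernel suffixes like "-rc3", but not the "rc=0"
# field in Ansible's status lines.
has_rc_kernel() {
  echo "$1" | grep -Eq -- '-rc[0-9]+'
}

has_rc_kernel "node1 | CHANGED | rc=0 >> 6.5.0-rc3"         && echo "unstable"
has_rc_kernel "node2 | CHANGED | rc=0 >> 5.15.0-91-generic" || echo "stable"
```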
**Steps**

1. Branch:

   ```bash
   git checkout -b exp/bad-os-channel
   ```

2. Edit the profile to use `channel: unstable-edge`.
3. Run `./scripts/lint.sh` (passes) and `./scripts/run_opa.sh` (passes, assuming no policy on the channel).
4. Merge the MR (training env).
5. Trigger `site_rollout` for `EU-PAR-FR01`:

   * Nodes reprovision with the unstable kernel.
   * `integration_test` runs, finds a `-rc` kernel version → fail.

**Rollback via Git**

1. Revert the commit, or edit the profile back to `channel: stable`.
2. Push and rerun `site_rollout`.
3. Nodes are reprovisioned or updated; `integration_test` passes.

**Outcome**

* The team sees that **not all bad changes are caught by policy**; some require integration tests.
* The rollback pattern is always: **fix in Git → run pipeline → let automation revert.**

---

### Day 3 Definition of Done

* Two drills executed:

  * K8s cluster misconfig & GitOps rollback.
  * OS image/channel misconfig & provisioning rollback.

* Teams can describe:

  * Where a failure was caught (policy vs integration vs runtime).
  * The exact steps to revert via Git and pipelines.

* No one proposes “just SSH in and fix it” as a normal response anymore.

---
## D3 Overall Definition of Done

When you complete D3:

1. **Pipeline anatomy is understood**

   * Engineers know what each stage does and how to run its checks locally.

2. **Policy-as-code is operational**

   * At least 3 concrete policy categories are actively used and tested:

     * Naming/structure
     * Data residency & storage locality
     * RBAC/admin access

3. **Failure and rollback are practised**

   * Teams have experienced real CI failures and cluster/provisioning breakage.
   * Recovery was done **only** by:

     * Reverting/fixing commits.
     * Re-running pipelines.

4. **`zero_manual_provisioning` is enforced by culture**

   * CI & policies are not “annoying gates” but guardrails the team has trained with.
   * Manual interventions (if any) in drills are treated as exceptional and must be codified afterwards.