Alright, time for **D3: Pipelines & Policy Training** — this is where the Git-only story actually becomes *operational muscle memory*.

I'll structure it like D1/D2:

* Multi-day manual
* Each day = objectives, pre-reqs, concrete steps, and Definition of Done

Scope for D3:

* One shared CI pattern across the three repos
* Strong focus on `policy_gates` + how trainees **see, debug, and fix** failures
* Hands-on rollback drills & “bad change caught by policy” examples
---

# D3 — Pipelines & Policy Training Manual

**Focus:** CI/CD stages, policy-as-code enforcement, and safe rollback patterns.

We'll use **3 training days**:

* **Day 1:** Pipeline anatomy & local validation
* **Day 2:** Policy gate labs (residency, RBAC, naming)
* **Day 3:** Rollback & failure-handling drills

Assumed repos (as in D1/D2):

* `infra-foundation`
* `platform-clusters`
* `policies-and-compliance`

Standard stages:

* `lint_and_unit`
* `policy_gates`
* `integration_test`
* `promotion_to_template`
* `site_rollout`

---
## Day 1 — Pipeline Anatomy & Local Validation

### Learning Objectives

By the end of Day 1, the team can:

* Read and explain the **CI pipeline layout** for all three repos.
* Run **local equivalents** of `lint_and_unit` and `policy_gates`.
* Predict which stage will fail given a specific mistake.

### Pre-Reqs

* D1 and D2 manuals executed (basic repo layout & templates exist).
* CI runner available for all three repos.
* OPA (or Conftest) + basic linters pre-installed or containerised.

---
### 1.1 Standard CI Blueprint (per repo)

Pick one repo (e.g. `platform-clusters`) and ensure `.gitlab-ci.yml` (or similar) looks roughly like:

```yaml
stages:
  - lint_and_unit
  - policy_gates
  - integration_test
  - promotion_to_template
  - site_rollout

lint_and_unit:
  stage: lint_and_unit
  script:
    - ./scripts/lint.sh
  only:
    - merge_requests
    - main

policy_gates:
  stage: policy_gates
  script:
    - ./scripts/run_opa.sh
  needs: ["lint_and_unit"]
  only:
    - merge_requests
    - main

integration_test:
  stage: integration_test
  script:
    - ./scripts/integration_tests.sh
  needs: ["policy_gates"]
  when: manual
  only:
    - main

promotion_to_template:
  stage: promotion_to_template
  script:
    - ./scripts/promote_templates.sh
  only:
    - main

site_rollout:
  stage: site_rollout
  script:
    - ./scripts/apply_site.sh EU-PAR-FR01
  when: manual
  needs: ["integration_test"]
  only:
    - main
```
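To anchor the gating order, here is a toy shell model of how `needs:` chains the stages. Stage names come from the blueprint above; the stage bodies are placeholders, with `integration_test` forced to fail for illustration.

```shell
# Toy model of the gating order: each stage runs only if the previous one
# succeeded, which is what `needs:` expresses in the real pipeline.
run_stage() { echo "[stage] $1"; "$2"; }

lint_and_unit_ok() { true; }
policy_gates_ok()  { true; }
integration_bad()  { false; }   # pretend integration fails

pipeline() {
  run_stage lint_and_unit    lint_and_unit_ok &&
  run_stage policy_gates     policy_gates_ok  &&
  run_stage integration_test integration_bad
}

pipeline || echo "pipeline stopped at the first failing stage"
```

Later stages (`promotion_to_template`, `site_rollout`) simply never run, exactly as in CI.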
**Exercise:**

For each repo, have participants:

1. Open the CI file.
2. Identify:

   * Where `lint_and_unit` is defined.
   * Where `policy_gates` is defined and which script it calls.
   * Which stages are `manual` vs automatic.

3. In a short whiteboard or doc, write:

   * “If I break YAML → which stage fails?”
   * “If I violate data residency → which stage fails?”
   * “If the cluster fails health checks post-deploy → which stage fails?”

---

### 1.2 Local Lint & Policy Scripts

**Goal:** Engineers should be able to run the same checks *locally* before pushing.

Example `scripts/lint.sh` (for a K8s repo):

```bash
#!/usr/bin/env bash
set -e

echo "[lint] YAML syntax"
find k8s -name '*.yaml' -print0 | xargs -0 yamllint

echo "[lint] K8s schema (kubeconform or similar)"
kubeconform -summary -strict -ignore-missing-schemas \
  -kubernetes-version 1.29.0 \
  $(find k8s -name '*.yaml' -print)
```

Example `scripts/run_opa.sh`:

```bash
#!/usr/bin/env bash
set -e

echo "[policy] Running OPA policies"
conftest test \
  --policy ./opa-policies \
  $(find k8s -name '*.yaml' -print)
```

**Exercise:**

For each repo:

1. Open `scripts/lint.sh` and `scripts/run_opa.sh`.
2. Run them locally:

   ```bash
   ./scripts/lint.sh
   ./scripts/run_opa.sh
   ```

3. Intentionally break something small (e.g., a missing `:` in YAML), watch `lint_and_unit` fail locally, then fix it.
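To make the local-first habit stick, the same two scripts can be wired into a `pre-push` Git hook. A minimal sketch (the script paths are the ones this repo's CI uses; `run_local_checks` is a hypothetical helper name, and the body would live in `.git/hooks/pre-push` or your hook manager):

```shell
#!/usr/bin/env bash
# Hypothetical pre-push hook body: run the same checks CI runs, before pushing.
run_local_checks() {
  local check
  for check in ./scripts/lint.sh ./scripts/run_opa.sh; do
    if [ -x "$check" ]; then
      echo "[pre-push] running $check"
      "$check" || { echo "[pre-push] $check failed" >&2; return 1; }
    else
      echo "[pre-push] $check not found, skipping" >&2
    fi
  done
  echo "[pre-push] local checks done"
}

run_local_checks
```

A failing check blocks the push, so the feedback loop stays on the engineer's laptop instead of in CI.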
---

### 1.3 Day 1 Definition of Done

* Every engineer knows:

  * What each CI stage does.
  * Roughly which types of errors show up where.

* Every engineer can run `./scripts/lint.sh` and `./scripts/run_opa.sh` locally and interpret failures.

---

## Day 2 — Policy Gate Labs (Data Residency, RBAC, Naming)

### Learning Objectives

By the end of Day 2, the team can:

* Write *and* debug OPA policies.
* See how a bad change is blocked at `policy_gates`.
* Understand 3 core policy categories:

  * **Naming & structure**
  * **Data residency**
  * **RBAC/admin access**

We'll implement **three concrete labs**.
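The labs below cover residency, storage, and RBAC; for the **naming & structure** category, here is a sketch of the kind of convention such a gate encodes. The `<region>-<city>-<countryNN>` pattern is an assumption inferred from this manual's `eu-par-fr01` examples; the real gate would be a Rego rule in `opa-policies/`.

```shell
# Hypothetical naming check mirroring a "naming & structure" policy:
# lowercase <region>-<city>-<countryNN>, e.g. "eu-par-fr01".
valid_site_name() {
  echo "$1" | grep -Eq '^[a-z]{2}-[a-z]{3}-[a-z]{2}[0-9]{2}$'
}

valid_site_name "eu-par-fr01" && echo "ok: eu-par-fr01"
valid_site_name "paris-1"     || echo "rejected: paris-1"
```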
---

### Lab 1 — Data Residency: Illegal Backup Region

**Context**

Repo: `platform-clusters`
Policy file (already present from earlier): `opa-policies/data_residency.rego`

Example policy:

```rego
package data_residency

deny[msg] {
    input.kind == "BackupPolicy"
    input.metadata.labels["data_classification"] == "CRITICAL_SOVEREIGN_FR"
    not input.spec.target.region == "fr-central"
    msg := sprintf("critical FR data must backup to fr-central, got %v", [input.spec.target.region])
}
```

**Starting manifest** (intentionally wrong):

```yaml
# k8s/clusters/eu-par-fr01/backups.yaml
apiVersion: backup.example.io/v1
kind: BackupPolicy
metadata:
  name: fr-critical-sovereign-backup
  labels:
    data_classification: CRITICAL_SOVEREIGN_FR
spec:
  schedule: "0 * * * *"
  target:
    provider: "object-storage"
    region: "eu-central-1" # WRONG (non-FR region)
```

**Steps**

1. Create a branch:

   ```bash
   git checkout -b feat/lab-data-residency
   ```

2. Add the above `backups.yaml`.
3. Run the local check:

   ```bash
   ./scripts/run_opa.sh
   ```

4. Observe the failure message (from OPA).
5. Fix the region to `fr-central`.
6. Re-run `./scripts/run_opa.sh` → no deny messages.
7. Push the branch, open an MR, and ensure `policy_gates` passes in CI.

**Day 2 outcome from Lab 1**

* The team sees how **residency** constraints are enforced before deployment.
* The data classification label on `BackupPolicy` is now meaningful; it controls which regions are allowed.

---
### Lab 2 — Namespace Classification & Storage Mapping

**Context**

Namespace and StorageClass for critical FR workloads.

Manifest (intentionally wrong storage):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: fr-critical-sovereign-ai
  labels:
    data_classification: CRITICAL_SOVEREIGN_FR
    country: FR
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fr-critical-sovereign-sc
provisioner: "ceph.rbd"
parameters:
  pool: "gpu-block-eu" # WRONG - replicated to non-FR sites
```

Corresponding policy (example) in `data_residency.rego`:

```rego
package data_residency

deny[msg] {
    input.kind == "StorageClass"
    input.metadata.name == "fr-critical-sovereign-sc"
    input.parameters.pool != "gpu-block-fr-local"
    msg := sprintf("StorageClass %v must use gpu-block-fr-local for critical FR data", [input.metadata.name])
}
```

**Steps**

1. Branch:

   ```bash
   git checkout -b feat/lab-storage-residency
   ```

2. Add the above Namespace + StorageClass.
3. Run `./scripts/run_opa.sh` → see the deny message.
4. Fix:

   ```yaml
   parameters:
     pool: "gpu-block-fr-local"
   ```

5. Re-run locally, then push & confirm CI passes.

**Outcome**

* Engineers see how **storage-level locality** is enforced.
* They learn to correlate which policy file produced which error.

---
### Lab 3 — Admin RBAC Restriction

**Context**

A ClusterRoleBinding must only bind admin to a sovereign identity.

Manifest (intentionally wrong):

```yaml
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cluster-admin
subjects:
  - kind: User
    name: temp-admin@example.com # WRONG
roleRef:
  kind: ClusterRole
  name: cluster-admin
  apiGroup: rbac.authorization.k8s.io
```

Policy `opa-policies/rbac.rego`:

```rego
package rbac

deny[msg] {
    input.kind == "ClusterRoleBinding"
    input.metadata.name == "cluster-admin"
    s := input.subjects[_]
    s.kind == "User"
    not endswith(s.name, "@sovereign-ops.fr")
    msg := "cluster-admin bindings must target sovereign-ops.fr principals only"
}
```

**Steps**

1. Branch:

   ```bash
   git checkout -b feat/lab-rbac
   ```

2. Add the ClusterRoleBinding manifest.
3. Run `./scripts/run_opa.sh` → see the rbac deny message.
4. Fix to:

   ```yaml
   subjects:
     - kind: Group
       name: sovereign-ops-admins@sovereign-ops.fr
   ```

5. Re-run locally and in CI.
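For intuition only, the policy's `endswith` constraint can be mirrored as a shell check; the enforcement itself stays in `opa-policies/rbac.rego`, and `is_sovereign_principal` is a hypothetical helper name:

```shell
# Mirrors endswith(s.name, "@sovereign-ops.fr") from the Rego rule above.
is_sovereign_principal() {
  case "$1" in
    *@sovereign-ops.fr) return 0 ;;
    *)                  return 1 ;;
  esac
}

is_sovereign_principal "temp-admin@example.com" || echo "denied: temp-admin@example.com"
is_sovereign_principal "sovereign-ops-admins@sovereign-ops.fr" && echo "allowed"
```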
**Outcome**

* Engineers see how **RBAC & identity** constraints prevent dangerous bindings from ever reaching the cluster.

---

### Day 2 Definition of Done

* All three labs completed:

  * Data residency / backup region.
  * Storage-class locality.
  * RBAC/admin constraints.

* Engineers can:

  * Identify which OPA package is failing from CI logs.
  * Fix YAML to satisfy policy without disabling or bypassing it.

---

## Day 3 — Rollback & Failure Handling Drills

### Learning Objectives

By the end of Day 3, the team can:

* Understand what happens when `integration_test` or `site_rollout` fails.
* Practise **rollback via Git and pipelines**, not manual hotfixes.
* Distinguish between:

  * **Pre-merge failures** (lint, policy)
  * **Post-merge / post-deploy failures** (integration, runtime)

We'll run **two drills**:

1. K8s cluster misconfiguration & GitOps rollback.
2. Bare-metal provisioning error & image rollback.

---
### Drill 1 — K8s Misconfig & GitOps Rollback

**Context**

Repo: `platform-clusters`
We will introduce a bad CNI config that breaks cluster networking.

**Starting good config** (simplified):

```yaml
# k8s/clusters/eu-par-fr01/mgmt-cluster.yaml
cluster:
  name: eu-par-fr01-mgmt
  site: EU-PAR-FR01
  control_plane_nodes: 3
  node_pool_profile: compute-standard
  networking:
    cni: cilium
```

**Bad change** (training):

```yaml
  networking:
    cni: "nonexistent-cni" # WRONG
```

Integration test script example, `scripts/integration_tests.sh`:

```bash
#!/usr/bin/env bash
set -e

echo "[integration] Checking API server health"
kubectl --context eu-par-fr01-mgmt get --raw='/healthz' || {
  echo "API server health check failed"
  exit 1
}
```

**Steps**

1. Branch:

   ```bash
   git checkout -b exp/break-cni
   ```

2. Change the CNI to `nonexistent-cni`.
3. Run `./scripts/lint.sh` and `./scripts/run_opa.sh` — they still pass (no syntax/policy issue).
4. Open an MR and merge it (in the training env).
5. CI pipeline on `main`:

   * `lint_and_unit` → pass
   * `policy_gates` → pass
   * `integration_test` → **fails** because the API server becomes unhealthy after the GitOps sync.

**Observation**

* Trainees inspect the integration stage logs and see the API health failure.
* The GitOps UI (Argo/Flux) may show the sync succeeded even though the cluster is unhealthy.

**Rollback procedure (Git-first)**

1. `git revert <bad-commit>` on `main`, or via an MR in a training branch.
2. Push the revert; the pipeline re-runs:

   * `lint_and_unit` → pass
   * `policy_gates` → pass
   * `integration_test` → pass (API healthy again)

3. GitOps applies the reverted config and restores the previous working state.
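The revert flow can be rehearsed end-to-end in a throwaway repo before touching the real one. A self-contained sketch (file name and commit messages are illustrative):

```shell
# Rehearse the Git-first rollback pattern in a scratch repo.
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q
git config user.email "trainee@example.com"
git config user.name  "trainee"

printf 'networking:\n  cni: cilium\n' > mgmt-cluster.yaml
git add mgmt-cluster.yaml
git commit -qm "good: cilium CNI"

printf 'networking:\n  cni: nonexistent-cni\n' > mgmt-cluster.yaml
git commit -qam "bad: broken CNI"

# History keeps both the mistake and the revert: no force-push, no rewrite.
git revert --no-edit HEAD >/dev/null
grep -q 'cni: cilium' mgmt-cluster.yaml && echo "config restored"
```

On the real repo the only extra step is `git push`, which re-triggers the pipeline and lets GitOps converge on the reverted state.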
**Important**

* No manual `kubectl edit` or direct patching allowed.
* If an emergency manual patch is used during the drill, it must be:

  * Logged.
  * Immediately codified as a commit, with the manual state reversed.

---

### Drill 2 — Bare-Metal Provisioning Error & Image Rollback

**Context**

Repo: `infra-foundation`
We simulate a wrong OS image channel leading to a bad kernel version.

Bare-metal profile (good):

```yaml
profile:
  name: compute-standard
  cpu: "2x32"
  ram_gb: 512
  role: "k8s-node"
  image:
    name: "ubuntu-22.04"
    channel: "stable"
```

**Bad change:**

```yaml
  image:
    name: "ubuntu-22.04"
    channel: "unstable-edge" # WRONG
```

Integration check for the OS image, in `scripts/integration_tests.sh`:

```bash
#!/usr/bin/env bash
set -e

echo "[integration] Checking kernel version on compute nodes"
ansible -i hypervisor/ansible/inventory/eu-par-fr01.ini k8s_nodes \
  -m shell -a "uname -r" \
  | tee /tmp/kernel_versions.txt

# Match "-rcN" specifically: a bare "rc" would also hit the "rc=0"
# in Ansible's own status lines and always fail.
if grep -Eq -- '-rc[0-9]' /tmp/kernel_versions.txt; then
  echo "Unstable kernel detected on k8s nodes"
  exit 1
fi
```
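One subtlety in this check: Ansible's ad-hoc output prefixes each result with a status line like `node1 | CHANGED | rc=0 >>`, which itself contains the substring `rc`, so the grep has to target the kernel's `-rcN` suffix specifically. A self-contained check of that pattern (`has_rc_kernel` is a hypothetical helper):

```shell
# Match release-candidate kernel suffixes like "-rc3", but not the "rc=0"
# field in Ansible's status lines.
has_rc_kernel() {
  echo "$1" | grep -Eq -- '-rc[0-9]+'
}

has_rc_kernel "node1 | CHANGED | rc=0 >> 6.5.0-rc3"         && echo "unstable"
has_rc_kernel "node2 | CHANGED | rc=0 >> 5.15.0-91-generic" || echo "stable"
```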
**Steps**

1. Branch:

   ```bash
   git checkout -b exp/bad-os-channel
   ```

2. Edit the profile to use `channel: unstable-edge`.
3. Run `./scripts/lint.sh` (passes) and `./scripts/run_opa.sh` (passes, assuming no policy on the channel).
4. Merge the MR (training env).
5. Trigger `site_rollout` for `EU-PAR-FR01`:

   * Nodes reprovision with the unstable kernel.
   * `integration_test` runs, finds a `-rc` kernel version → fail.

**Rollback via Git**

1. Revert the commit, or edit the profile back to `channel: stable`.
2. Push and rerun `site_rollout`.
3. Nodes are reprovisioned or updated; `integration_test` passes.

**Outcome**

* The team sees that **not all bad changes are caught by policy**; some require integration tests.
* The rollback pattern is always: **fix in Git → run pipeline → let automation revert.**

---

### Day 3 Definition of Done

* Two drills executed:

  * K8s cluster misconfig & GitOps rollback.
  * OS image/channel misconfig & provisioning rollback.

* Teams can describe:

  * Where a failure was caught (policy vs integration vs runtime).
  * The exact steps to revert via Git and pipelines.

* No one proposes “just SSH in and fix it” as a normal response anymore.

---
## D3 Overall Definition of Done

When you complete D3:

1. **Pipeline anatomy is understood**

   * Engineers know what each stage does and how to run its checks locally.

2. **Policy-as-code is operational**

   * At least 3 concrete policy categories are actively used and tested:

     * Naming/structure
     * Data residency & storage locality
     * RBAC/admin access

3. **Failure and rollback are practised**

   * Teams have experienced real CI failures and cluster/provisioning breakage.
   * Recovery was done **only** by:

     * Reverting/fixing commits.
     * Re-running pipelines.

4. **`zero_manual_provisioning` is enforced by culture**

   * CI & policies are not “annoying gates” but guardrails the team has trained with.
   * Manual interventions (if any) in drills are treated as exceptional and must be codified afterwards.