Add Micro–DC/docs/training/D6

2025-12-05 13:09:22 +00:00
parent 55745d2b17
commit d220774716


@@ -0,0 +1,531 @@
Here we go — **D6 = the final boss fight**: proving you're really at zero_manual_provisioning and not just “vibes-compliant”.
I'll structure this as a **3-part final manual**:
1. **Day 1 - Evidence & Gap Scan** (what's *actually* automated vs still manual)
2. **Day 2 - End-to-End Replay Drill** (“empty site → running workloads” using only Git & pipelines)
3. **Day 3 - Regression Controls & Formal Sign-Off** (make it hard to backslide into snowflake land)
This is the training track that turns everything from D1-D5 into a **passed audit + repeatable pattern**.
---
# D6 — Zero-Manual-Provisioning Manual
**Focus:** Definition of Done, verification flow, and governance to keep it that way.
---
## Day 1 - Evidence & Gap Scan
**Goal:** Build a *brutally honest* picture of where manual actions still exist and convert them (or constrain and codify them) until you can reasonably claim zero_manual_provisioning for EU-PAR-FR01.
### Learning Objectives
By the end of Day 1, the team can:
* List all **infra and platform layers** and state whether they're:
    * Automated (Git → pipeline → infra),
    * One-time bootstrap (strictly documented),
    * Or still “mysteriously manual”.
* Create a **verification checklist** organized by repo / pipeline / layer.
* Identify gaps that must be closed before final D6 sign-off.
### 1.1 Pre-Reqs
* D1-D5 labs completed in at least a training or staging environment.
* Repos:
    * `infra-foundation`
    * `platform-clusters`
    * `policies-and-compliance`
* Pipelines wired and working:
    * `lint_and_unit`, `policy_gates`, `integration_test`, `promotion_to_template`, `site_rollout`
* Monitoring + SLOs in place (control plane, GPU wait time, PUE).
---
### 1.2 Build the “Stack Checklist”
Create a shared doc, e.g.:
`governance/checklists/zero-manual-provisioning-eu-par-fr01.md`
(in a dedicated governance repo or `policies-and-compliance/docs/`).
Structure it by **layers**:
```markdown
# Zero-Manual-Provisioning Checklist - EU-PAR-FR01
## 1. Facility & Physical Inventory
- [ ] Racks defined in infra-foundation (rack_layouts/eu-par-fr01-racks.yaml)
- [ ] Power & cooling defined in infra-foundation (power_and_cooling/eu-par-fr01-power.yaml)
- [ ] Telemetry mapping (PDU → metrics → Prometheus) codified
## 2. Network & OOB
- [ ] Underlay config defined in Terraform (network/terraform/eu-par-fr01/)
- [ ] OOB inventory defined as code (baremetal/profiles/oob-switches.yaml)
- [ ] First-time OOB bootstrap documented in facility/bootstrap/eu-par-fr01.md
- [ ] No ongoing config changes on network devices outside Terraform/Ansible
## 3. Bare Metal & Hypervisors
- [ ] Baremetal profiles defined (baremetal/profiles/*.yaml)
- [ ] MAAS/commissioning controlled via scripts/apply_site.sh
- [ ] Hypervisor config driven by Ansible (hypervisor/ansible/...)
- [ ] No “snowflake” nodes with manual config
## 4. Platform (K8s/GitOps)
- [ ] mgmt cluster defined as YAML (platform-clusters/k8s/clusters/eu-par-fr01/)
- [ ] GitOps app definition committed (gitops-app.yaml)
- [ ] All cluster add-ons configured via GitOps
## 5. Policies & Compliance
- [ ] Data classification defined (data-classification.yaml)
- [ ] Residency policies (data_residency.rego) enforced in CI
- [ ] RBAC policies (rbac.rego) enforced; JIT pattern defined
- [ ] Sustainability KPIs defined (sustainability-kpis.yaml)
## 6. Observability & SLOs
- [ ] Prometheus rules for control-plane SLO
- [ ] Prometheus rules for GPU wait-time SLO
- [ ] Prometheus rules for PUE SLO
- [ ] Dashboards in Grafana defined via Git
## 7. Process & Governance
- [ ] Manual-change policy documented
- [ ] Break-glass/JIT procedure documented and tested
- [ ] Drift detection in place (GitOps out-of-sync, config diff tools)
```
Your job today is to **fill this with reality**:
* Check each box honestly:
    * ✅ implemented
    * 🚧 partially implemented
    * ❌ missing
* Note for each ❌ / 🚧:
    * Which repo and path will own the fix.
    * Which team/role is responsible.
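
To keep the tally honest, you can count markers per layer instead of eyeballing the file. A minimal sketch, assuming reviewers put one of the three markers above somewhere on each checklist item line (the checklist itself does not prescribe where the marker goes); `summarize_checklist` and `open_items` are hypothetical helper names, not part of any repo. Items with no marker are counted as `unreviewed`, so an untouched line never passes silently:

```python
from collections import Counter

# Marker -> status label. The three markers come from the list above.
MARKERS = {"✅": "implemented", "🚧": "partial", "❌": "missing"}


def summarize_checklist(markdown: str) -> dict[str, Counter]:
    """Count status markers per '## ' section of a checklist document."""
    summary: dict[str, Counter] = {}
    section = "(no section)"
    for line in markdown.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            section = stripped[3:].strip()
            summary.setdefault(section, Counter())
        elif stripped.startswith("- "):
            counts = summary.setdefault(section, Counter())
            for marker, status in MARKERS.items():
                if marker in stripped:
                    counts[status] += 1
                    break
            else:
                # No marker found: nobody has reviewed this item yet.
                counts["unreviewed"] += 1
    return summary


def open_items(summary: dict[str, Counter]) -> dict[str, Counter]:
    """Sections that still contain anything other than 'implemented'."""
    return {
        name: counts
        for name, counts in summary.items()
        if sum(counts.values()) - counts["implemented"] > 0
    }
```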
---
### 1.3 Identify Remaining Manual Touchpoints
In a short workshop, list **every manual action** you can think of for EU-PAR-FR01:
* “We still log into switch X to tweak SNMP traps”
* “We create namespaces by hand via `kubectl` for testing”
* “We change Ceph pool settings via UI for urgent issues”
Categorise:
1. **One-time bootstrap (allowed)**
    * E.g. first console config on OOB devices; install of `mgmt01`.
    * Must be:
        * Listed in `facility/bootstrap/eu-par-fr01.md`
        * Idempotent and documented
        * Not used in steady state
2. **Steady-state manual (NOT allowed for zero_manual_provisioning)**
    * These must be either:
        * Automated (IaC / K8s config / operator), or
        * Constrained behind a JIT/break-glass procedure, then codified after the fact.
Create a table like:
```markdown
| Manual action | Layer | Category | Remediation owner | Target repo/path |
|--------------------------------------------|---------------|-----------------|-------------------|-----------------------------------------|
| Change OSPF cost on leaf switch | Network | Steady-state | Network Architect | infra-foundation/network/terraform/... |
| Create K8s namespaces via kubectl | Platform | Steady-state | Platform SRE | platform-clusters/k8s/clusters/... |
| Tune Ceph pool size via GUI | Storage | Steady-state | Storage Engineer | infra-foundation/storage-ceph.yaml |
| First-time OOB IP + login on switches | OOB bootstrap | One-time | Network Architect | facility/bootstrap/eu-par-fr01.md |
```
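
The table doubles as a remediation backlog: every steady-state row is work that must land in Git. A rough sketch, assuming the table keeps the exact column names shown above; `parse_markdown_table` and `remediation_backlog` are illustrative names, and the `UNASSIGNED` placeholder is an assumption for rows where the owner or target cell was left empty:

```python
def parse_markdown_table(table: str) -> list[dict[str, str]]:
    """Turn a pipe-delimited markdown table into a list of row dicts."""
    header = None
    rows = []
    for raw in table.strip().splitlines():
        line = raw.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the |-----|-----| separator row under the header.
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        if header is None:
            header = cells
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def remediation_backlog(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Keep only steady-state manual actions, which must be automated or gated."""
    return [
        {
            "action": row["Manual action"],
            "owner": row["Remediation owner"] or "UNASSIGNED",
            "target": row["Target repo/path"] or "UNASSIGNED",
        }
        for row in rows
        if row["Category"].lower().startswith("steady")
    ]
```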
---
### 1.4 Day 1 Definition of Done
* The checklist file exists and is **filled** with actual state.
* Manual actions are categorised into:
    * Allowed one-time bootstrap (documented)
    * To-be-eliminated steady-state manual changes
* For each non-compliant manual change, there's a **named owner** and target repo/path for remediation.
---
## Day 2 - End-to-End Replay Drill (“From Nothing to Workloads”)
**Goal:** Prove that EU-PAR-FR01 can be (re)built and made operational **purely** from Git + pipelines + documented one-time bootstrap.
You're not going to physically tear down the DC; you'll use:
* A **sandbox environment**, or
* A “virtual EU-PAR-FR01” representation, or
* A subset (e.g. one rack, minimal control-plane) as a proxy.
### Learning Objectives
By the end of Day 2, the team can:
* Run a **scripted end-to-end replay** of:
    * Facility → Network → Bare metal → Hypervisor → K8s → Policies → Workloads → SLOs.
* Observe each stage via pipelines and dashboards.
* Demonstrate that no hidden manual step is required in steady state.
### 2.1 Pre-Reqs
* Most gaps from Day 1 are addressed or consciously accepted as temporary and not affecting the drill scope.
* A “resettable” environment exists (virtual site or subset of nodes).
---
### 2.2 Design the Replay Script
Create a high-level step-by-step “runbook for the drill”, e.g.:
`governance/drills/eu-par-fr01-e2e-replay.md`
Example skeleton:
```markdown
# EU-PAR-FR01 - End-to-End Replay Drill
## Step 0 - Preconditions
- [ ] Lab environment reset to baseline (no cluster, nodes powered off or unprovisioned)
- [ ] CI runners reachable
- [ ] Git repos at tagged versions:
      infra-foundation@vX.Y.Z,
      platform-clusters@vA.B.C,
      policies-and-compliance@vM.N.P
## Step 1 - Facility & Inventory
- [ ] Validate facility manifests present in infra-foundation (no changes needed)
- [ ] Confirm telemetry endpoints for PDU simulated/available
## Step 2 - Network & OOB Bootstrap
- [ ] Run infra-foundation pipeline: terraform apply for eu-par-fr01
- [ ] Perform documented one-time console bootstrap actions (from facility/bootstrap/eu-par-fr01.md)
- [ ] Verify OOB & mgmt reachability
## Step 3 - Bare Metal & Hypervisor
- [ ] Trigger infra-foundation site_rollout for MAAS commission + hypervisor
- [ ] Verify nodes appear with correct tags and hypervisor cluster healthy
## Step 4 - K8s Mgmt Cluster & GitOps
- [ ] Trigger platform-clusters site_rollout for eu-par-fr01 mgmt cluster
- [ ] Confirm control-plane SLO panel shows baseline metrics
## Step 5 - Policies & Monitoring
- [ ] Ensure policies-and-compliance pipelines passed (no pending changes)
- [ ] Confirm OPA policies active (test a small intentional failure in a lab branch, not applied)
## Step 6 - Workloads & Sovereign Namespaces
- [ ] Deploy justice tenant workload via platform-clusters
- [ ] Confirm namespace & data residency policies in effect
## Step 7 - SLOs & Dashboards
- [ ] Confirm control-plane, GPU wait time, PUE SLO metrics live
- [ ] Simulate one small load burst to see metrics move
```
You'll use this script as the basis of the live drill.
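
Step 0 only means something if the repo refs really are tags and not a moving branch. A small sketch of a precondition check, assuming refs are written as `repo@tag` and that release tags follow a `v<major>.<minor>.<patch>` pattern (suggested by the placeholders in the skeleton, but not confirmed anywhere); `unpinned_refs` is a hypothetical helper:

```python
import re

# Assumption: a pinned ref looks like "repo-name@v1.2.3".
PINNED = re.compile(r"^[A-Za-z0-9._-]+@v\d+\.\d+\.\d+$")


def unpinned_refs(refs: list[str]) -> list[str]:
    """Return the refs that are NOT pinned to a version tag."""
    return [ref for ref in refs if not PINNED.match(ref.strip())]


def preconditions_ok(refs: list[str]) -> bool:
    """True only when every repo in the drill is pinned to a tag."""
    return not unpinned_refs(refs)
```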
---
### 2.3 Run the Drill as a “Game Day”
Treat this like a **Game Day**. Roles:
* **Conductor** - drives the drill, announces steps, timeboxes.
* **Observers** - note any manual actions, friction points, or undocumented steps.
* **Operators** - actually run pipelines, check logs, etc.
For each step:
1. Read the step from the drill doc.
2. Execute it strictly via:
    * Git commits (if needed)
    * CI pipelines (`lint_and_unit`, `policy_gates`, `site_rollout`)
    * Documented one-time bootstrap actions (if in Step 2).
3. Log:
    * Start time, end time
    * Any manual commands run
    * Any surprises (missing script, script out-of-date, etc.)
**Critical rule:**
If someone says “I'll just SSH in and fix X”:
* They stop and say it out loud.
* The Conductor marks this as a **failed requirement** and a **gap to fix in code** afterward.
* The team decides, for the purpose of the drill, whether to continue with that manual step (noting it as a gap) or stop and fix it properly in Git first.
---
### 2.4 Artifacts from the Drill
You want a reproducible evidence trail:
* CI pipeline links for:
    * `infra-foundation` `site_rollout`
    * `platform-clusters` `site_rollout`
    * `policies-and-compliance` `policy_gates` runs
* Screenshots or snapshots of:
    * K8s nodes, namespaces, workloads
    * SLO dashboards (control-plane, GPU, PUE)
* Drill log:
    * Saved as `governance/drills/logs/eu-par-fr01-e2e-replay-YYYYMMDD.md`
Example entry:
```markdown
## Step 3 - Bare Metal & Hypervisor
Time: 09:30-09:55
Pipeline: https://git/.../pipelines/1234
Manual actions:
- None (all MAAS commission via script+pipeline)
Notes:
- Proxmox cluster came up with minor warning about time sync; add NTP config to Ansible role.
```
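
If Observers type entries by hand, the format drifts within an hour. A minimal sketch of a helper that renders entries in the shape shown above and flags gaps, assuming a manual action counts as a gap unless the step is explicitly marked as documented bootstrap; the `DrillStep` class and its field names are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class DrillStep:
    """One logged step of the replay drill."""

    title: str
    start: str
    end: str
    pipeline_url: str
    manual_actions: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)
    # Assumption: only documented one-time bootstrap may be manual.
    documented_bootstrap: bool = False

    @property
    def is_gap(self) -> bool:
        """A manual action outside documented bootstrap is a gap to fix in code."""
        return bool(self.manual_actions) and not self.documented_bootstrap

    def to_markdown(self) -> str:
        """Render the step in the same layout as the example log entry."""
        lines = [
            f"## {self.title}",
            f"Time: {self.start}-{self.end}",
            f"Pipeline: {self.pipeline_url}",
            "Manual actions:",
        ]
        lines += [f"- {action}" for action in self.manual_actions] or ["- None"]
        lines.append("Notes:")
        lines += [f"- {note}" for note in self.notes] or ["- None"]
        return "\n".join(lines)
```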
---
### 2.5 Day 2 Definition of Done
* One complete **end-to-end replay drill** executed and logged.
* All unavoidable manual actions are:
    * Either in the one-time bootstrap doc, or
    * Clearly flagged as gaps to fix.
* You can point to:
    * A single doc describing the replay
    * A set of pipeline runs
    * A functioning platform with workloads, policies, and SLOs

  and say:

  > “This environment was built from Git and pipelines only.”
---
## Day 3 - Regression Controls & Formal Sign-Off
**Goal:** Turn “we did it once” into “this is how we operate by default and don't regress”.
This is where **governance, culture, and automation** interlock.
### Learning Objectives
By the end of Day 3, the team can:
* Define durable **controls against regression** (manual changes, drift, snowflakes).
* Set up a **recurring verification cadence** (e.g., quarterly drills).
* Produce a final “zero_manual_provisioning” sign-off package for EU-PAR-FR01.
---
### 3.1 Hardening Against Manual Drift
Work through these and implement what you don't yet have:
1. **Git protections**
    * Branch protection on `main`:
        * MR required
        * CI must pass (lint + policy at minimum)
    * Optional: code owners for sensitive paths (`opa-policies/`, network Terraform, etc.)
2. **GitOps enforcement on clusters**
    * Alert if GitOps shows out-of-sync for more than X minutes.
    * Optionally, **auto-revert** manual cluster changes by periodic reconciliation.
3. **Network / OS drift detection**
    * Periodic jobs that:
        * Pull running configs from devices
        * Compare to Terraform/Ansible expected state
        * Alert on differences
4. **Manual-change policy**
    * Document a short policy, e.g. `governance/policies/no-manual-changes.md`:
```markdown
# No Manual Changes Policy - EU-PAR-FR01
- All persistent infra/platform changes must be made via Git and CI/CD.
- Exceptions (break-glass, JIT) must:
  - Be pre-approved by on-call SRE and Security.
  - Be logged in an incident or change ticket.
  - Be codified via Git within 24h (or a defined SLA).
- Direct device or cluster access is only for:
  - Documented bootstrap tasks
  - Emergency remediation with follow-up codification
```
* Walk through the policy together and ensure **everyone at the table** acknowledges this is the standard.
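
The compare step of the network / OS drift job (item 3) can be as plain as a text diff. A minimal sketch using Python's standard `difflib`, assuming you already have the expected config rendered from Git and the running config pulled from the device as strings (how you fetch them is out of scope here), and assuming blank lines and lines starting with `!` or `#` are noise rather than drift; `normalize` and `config_drift` are hypothetical names. An empty result means no drift; anything else is what you alert on:

```python
import difflib


def normalize(config: str) -> list[str]:
    """Strip noise that should not count as drift.

    Assumption: blank lines and lines starting with '!' or '#' are comments.
    Leading indentation is kept because it is significant in many device configs.
    """
    lines = []
    for line in config.splitlines():
        line = line.rstrip()
        if not line or line.lstrip().startswith(("!", "#")):
            continue
        lines.append(line)
    return lines


def config_drift(expected: str, running: str, name: str = "device") -> list[str]:
    """Return unified-diff lines; an empty list means no drift."""
    return list(
        difflib.unified_diff(
            normalize(expected),
            normalize(running),
            fromfile=f"{name} (expected, from Git)",
            tofile=f"{name} (running)",
            lineterm="",
        )
    )
```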
---
### 3.2 Define Recurring Verification
You *don't* want D6 to be a one-off. Define a cadence, e.g.:
* **Quarterly “mini replay” drill**:
    * Not full bare metal → workloads, but at least:
        * Destroy/recreate a non-critical cluster
        * Validate policies and SLOs
* **Monthly drift review**:
    * Review GitOps drift, network/OS drift reports, RBAC exceptions.
Capture this in:
`governance/runbooks/zero-manual-verification-cadence.md`
Example:
```markdown
# Zero-Manual-Provisioning Verification Cadence - EU-PAR-FR01
## Monthly
- [ ] Review GitOps drift logs for all clusters.
- [ ] Review network/OS drift reports.
- [ ] Check for any undocumented manual actions in incident/change tickets.
## Quarterly
- [ ] Run a scoped replay drill:
  - Recreate a non-prod cluster from Git.
  - Validate policies & SLOs.
- [ ] Update zero-manual checklist with any new components.
## Yearly
- [ ] Full replay drill (or as close as feasible in a test environment).
- [ ] Re-affirm break-glass policies and JIT patterns.
- [ ] Refresh training for new team members.
```
---
### 3.3 Formal Zero-Manual-Provisioning Sign-Off
Now build the actual **D6 “Definition of Done” artifact set**.
Think of it as your internal audit dossier.
Create:
`governance/certificates/eu-par-fr01-zero-manual-provisioning.md`
Suggested sections:
```markdown
# Zero-Manual-Provisioning Certificate - EU-PAR-FR01
## 1. Scope
- Site: EU-PAR-FR01 (Paris, France)
- Regime: EU/EEA, GDPR + FR Data Protection Act
- Stack: Facility, Network, Bare Metal, Hypervisor, K8s, Policies, Monitoring
## 2. Evidence Summary
### 2.1 Repos & Pipelines
- infra-foundation: URL
  - Pipelines: lint_and_unit, policy_gates, site_rollout
- platform-clusters: URL
  - Pipelines: lint_and_unit, policy_gates, integration_test, site_rollout
- policies-and-compliance: URL
  - Pipelines: lint_and_unit, policy_gates
### 2.2 End-to-End Replay
- Drill document: governance/drills/eu-par-fr01-e2e-replay.md
- Execution date: YYYY-MM-DD
- Pipelines:
  - infra-foundation #1234
  - platform-clusters #5678
  - policies-and-compliance #91011
### 2.3 Policies & SLOs
- Residency: opa-policies/data_residency.rego (link)
- RBAC & JIT: opa-policies/rbac.rego (link)
- SLO specs:
  - control-plane-availability
  - gpu-job-wait-time
  - site-pue
- Dashboards: paths to Grafana JSON files
## 3. Residual Manual Elements
- One-time bootstrap tasks:
  - OOB device initial config (documented in facility/bootstrap/eu-par-fr01.md)
  - mgmt01 initial OS install (to be automated in next iteration)
- Current limitations:
  - <short list, aligned with LR1-LR4 where relevant>
## 4. Governance & Cadence
- No-manual-changes policy: governance/policies/no-manual-changes.md
- Verification cadence: governance/runbooks/zero-manual-verification-cadence.md
## 5. Approvals
- Sovereign Compliance & Sustainability Lead: ______ (date)
- Principal SRE / Platform Architect: ______ (date)
- Network & Facility Leads (as required): ______ (date)
```
Make this **short but precise**. It's your internal “badge” for EU-PAR-FR01.
---
### 3.4 Day 3 Definition of Done
* No-manual-changes policy written and adopted.
* Verification cadence defined and stored in Git.
* Zero-manual certificate for EU-PAR-FR01 written, with:
    * Evidence links
    * Residual limitations
    * Signatures/approvals
---
## D6 Overall Definition of Done (Training-Level)
When the D6 manual is completed **in practice**, you have:
1. **A complete, honest inventory** of automation vs manual for EU-PAR-FR01.
2. At least one **end-to-end replay drill** proving build-from-Git is real, not theoretical.
3. **Controls to prevent regression**:
    * Policy + process + drift detection.
4. A **formal sign-off package** that says, in concrete terms:
    * What is automated
    * What is still manually bootstrapped (and constrained)
    * How you ensure it stays that way over time.
That's your training cohort “graduation” — from bare metal to zero_manual_provisioning as a living practice, not just an architecture diagram.
---