Add Micro–DC/docs/training/D6

Here we go: **D6 is the final boss fight**, proving you're really at zero_manual_provisioning and not just “vibes-compliant”.

I'll structure this as a **3-part final manual**:

1. **Day 1 - Evidence & Gap Scan** (what's *actually* automated vs still manual)
2. **Day 2 - End-to-End Replay Drill** (“empty site → running workloads” using only Git & pipelines)
3. **Day 3 - Regression Controls & Formal Sign-Off** (make it hard to backslide into snowflake land)

This is the training track that turns everything from D1-D5 into a **passed audit + repeatable pattern**.

---

# D6 - Zero-Manual-Provisioning Manual

**Focus:** Definition of Done, verification flow, and the governance that keeps the site zero-manual after sign-off.

---

## Day 1 - Evidence & Gap Scan

**Goal:** Build a *brutally honest* picture of where manual actions still exist and convert them (or constrain and codify them) until you can reasonably claim zero_manual_provisioning for EU-PAR-FR01.

### Learning Objectives

By the end of Day 1, the team can:

* List all **infra and platform layers** and state whether they're:

  * Automated (Git → pipeline → infra),
  * One-time bootstrap (strictly documented),
  * Or still “mysteriously manual”.
* Create a **verification checklist** organized by repo / pipeline / layer.
* Identify gaps that must be closed before final D6 sign-off.

### 1.1 Pre-Reqs

* D1-D5 labs completed in at least a training or staging environment.
* Repos:

  * `infra-foundation`
  * `platform-clusters`
  * `policies-and-compliance`
* Pipelines wired and working:

  * `lint_and_unit`, `policy_gates`, `integration_test`, `promotion_to_template`, `site_rollout`
* Monitoring + SLOs in place (control plane, GPU wait time, PUE).

---

### 1.2 Build the “Stack Checklist”

Create a shared doc, e.g.:

`governance/checklists/zero-manual-provisioning-eu-par-fr01.md`
(in a dedicated governance repo or `policies-and-compliance/docs/`).

Structure it by **layers**:

```markdown
# Zero-Manual-Provisioning Checklist - EU-PAR-FR01

## 1. Facility & Physical Inventory
- [ ] Racks defined in infra-foundation (rack_layouts/eu-par-fr01-racks.yaml)
- [ ] Power & cooling defined in infra-foundation (power_and_cooling/eu-par-fr01-power.yaml)
- [ ] Telemetry mapping (PDU → metrics → Prometheus) codified

## 2. Network & OOB
- [ ] Underlay config defined in Terraform (network/terraform/eu-par-fr01/)
- [ ] OOB inventory defined as code (baremetal/profiles/oob-switches.yaml)
- [ ] First-time OOB bootstrap documented in facility/bootstrap/eu-par-fr01.md
- [ ] No ongoing config changes on network devices outside Terraform/Ansible

## 3. Bare Metal & Hypervisors
- [ ] Baremetal profiles defined (baremetal/profiles/*.yaml)
- [ ] MAAS/commissioning controlled via scripts/apply_site.sh
- [ ] Hypervisor config driven by Ansible (hypervisor/ansible/...)
- [ ] No “snowflake” nodes with manual config

## 4. Platform (K8s/GitOps)
- [ ] mgmt cluster defined as YAML (platform-clusters/k8s/clusters/eu-par-fr01/)
- [ ] GitOps app definition committed (gitops-app.yaml)
- [ ] All cluster add-ons configured via GitOps

## 5. Policies & Compliance
- [ ] Data classification defined (data-classification.yaml)
- [ ] Residency policies (data_residency.rego) enforced in CI
- [ ] RBAC policies (rbac.rego) enforced; JIT pattern defined
- [ ] Sustainability KPIs defined (sustainability-kpis.yaml)

## 6. Observability & SLOs
- [ ] Prometheus rules for control-plane SLO
- [ ] Prometheus rules for GPU wait-time SLO
- [ ] Prometheus rules for PUE SLO
- [ ] Dashboards in Grafana defined via Git

## 7. Process & Governance
- [ ] Manual-change policy documented
- [ ] Break-glass/JIT procedure documented and tested
- [ ] Drift detection in place (GitOps out-of-sync, config diff tools)
```

Your job today is to **fill this with reality**:

* Check each box honestly:

  * ✅ implemented
  * 🚧 partially implemented
  * ❌ missing
* Note for each ❌ / 🚧:

  * Which repo and path will own the fix.
  * Which team/role is responsible.

---

### 1.3 Identify Remaining Manual Touchpoints

In a short workshop, list **every manual action** you can think of for EU-PAR-FR01:

* “We still log into switch X to tweak SNMP traps”
* “We create namespaces by hand via `kubectl` for testing”
* “We change Ceph pool settings via UI for urgent issues”

Categorise:

1. **One-time bootstrap (allowed)**

   * E.g. first console config on OOB devices; install of `mgmt01`.
   * Must be:

     * Listed in `facility/bootstrap/eu-par-fr01.md`
     * Idempotent and documented
     * Not used in steady state

2. **Steady-state manual (NOT allowed for zero_manual_provisioning)**

   * These must be either:

     * Automated (IaC / K8s config / operator), or
     * Constrained behind a JIT/break-glass procedure, then codified after the fact.

Create a table like:

```markdown
| Manual action                              | Layer         | Category        | Remediation owner | Target repo/path                        |
|--------------------------------------------|---------------|-----------------|-------------------|-----------------------------------------|
| Change OSPF cost on leaf switch            | Network       | Steady-state    | Network Architect | infra-foundation/network/terraform/...  |
| Create K8s namespaces via kubectl          | Platform      | Steady-state    | Platform SRE      | platform-clusters/k8s/clusters/...      |
| Tune Ceph pool size via GUI                | Storage       | Steady-state    | Storage Engineer  | infra-foundation/storage-ceph.yaml      |
| First-time OOB IP + login on switches      | OOB bootstrap | One-time        | Network Architect | facility/bootstrap/eu-par-fr01.md       |
```
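The categorisation rules above (every steady-state action needs an owner and a target path; every one-time step must be listed in the bootstrap doc) are easy to check mechanically once the table exists. The following is a minimal sketch under those assumptions; `ManualAction`, `parse_table` and `find_problems` are hypothetical names, and the parser only handles simple pipe tables with exactly the five columns shown above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManualAction:
    action: str
    layer: str
    category: str  # "One-time" or "Steady-state", as in the table above
    owner: str
    target: str


def parse_table(markdown: str) -> list[ManualAction]:
    """Parse a five-column Markdown pipe table into ManualAction rows."""
    rows = []
    for line in markdown.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 5:
            continue
        first = cells[0]
        # Skip the header row, the |---|---| separator row, and empty rows.
        if not first or first == "Manual action" or set(first) <= {"-", ":"}:
            continue
        rows.append(ManualAction(*cells))
    return rows


def find_problems(actions: list[ManualAction]) -> list[str]:
    """Return one message per row that violates the Day 1 rules."""
    issues = []
    for a in actions:
        if a.category not in ("One-time", "Steady-state"):
            issues.append(f"{a.action}: unknown category {a.category!r}")
        elif not a.owner or not a.target:
            issues.append(f"{a.action}: missing owner or target repo/path")
        elif a.category == "One-time" and not a.target.startswith("facility/bootstrap/"):
            issues.append(f"{a.action}: one-time step not listed under facility/bootstrap/")
    return issues
```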

---

### 1.4 Day 1 Definition of Done

* The checklist file exists and is **filled** with actual state.
* Manual actions are categorised into:

  * Allowed one-time bootstrap (documented)
  * To-be-eliminated steady-state manual changes
* For each non-compliant manual change, there's a **named owner** and target repo/path for remediation.

---

## Day 2 - End-to-End Replay Drill (“From Nothing to Workloads”)

**Goal:** Prove that EU-PAR-FR01 can be (re)built and made operational **purely** from Git + pipelines + documented one-time bootstrap.

You're not going to physically tear down the DC; you'll use:

* A **sandbox environment**, or
* A “virtual EU-PAR-FR01” representation, or
* A subset (e.g. one rack, minimal control-plane) as a proxy.

### Learning Objectives

By the end of Day 2, the team can:

* Run a **scripted end-to-end replay** of:

  * Facility → Network → Bare metal → Hypervisor → K8s → Policies → Workloads → SLOs.
* Observe each stage via pipelines and dashboards.
* Demonstrate that no hidden manual step is required in steady state.

### 2.1 Pre-Reqs

* Most gaps from Day 1 are addressed or consciously accepted as temporary and not affecting the drill scope.
* A “resettable” environment exists (virtual site or subset of nodes).

---

### 2.2 Design the Replay Script

Create a high-level step-by-step “runbook for the drill”, e.g.:

`governance/drills/eu-par-fr01-e2e-replay.md`

Example skeleton:

```markdown
# EU-PAR-FR01 - End-to-End Replay Drill

## Step 0 - Preconditions
- [ ] Lab environment reset to baseline (no cluster, nodes powered off or unprovisioned)
- [ ] CI runners reachable
- [ ] Git repos at tagged versions:
      infra-foundation@vX.Y.Z,
      platform-clusters@vA.B.C,
      policies-and-compliance@vM.N.P

## Step 1 - Facility & Inventory
- [ ] Validate facility manifests present in infra-foundation (no changes needed)
- [ ] Confirm telemetry endpoints for PDU simulated/available

## Step 2 - Network & OOB Bootstrap
- [ ] Run infra-foundation pipeline: terraform apply for eu-par-fr01
- [ ] Perform documented one-time console bootstrap actions (from facility/bootstrap/eu-par-fr01.md)
- [ ] Verify OOB & mgmt reachability

## Step 3 - Bare Metal & Hypervisor
- [ ] Trigger infra-foundation site_rollout for MAAS commission + hypervisor
- [ ] Verify nodes appear with correct tags and hypervisor cluster healthy

## Step 4 - K8s Mgmt Cluster & GitOps
- [ ] Trigger platform-clusters site_rollout for eu-par-fr01 mgmt cluster
- [ ] Confirm control-plane SLO panel shows baseline metrics

## Step 5 - Policies & Monitoring
- [ ] Ensure policies-and-compliance pipelines passed (no pending changes)
- [ ] Confirm OPA policies active (test a small intentional failure in a lab branch, not applied)

## Step 6 - Workloads & Sovereign Namespaces
- [ ] Deploy justice tenant workload via platform-clusters
- [ ] Confirm namespace & data residency policies in effect

## Step 7 - SLOs & Dashboards
- [ ] Confirm control-plane, GPU wait time, PUE SLO metrics live
- [ ] Simulate one small load burst to see metrics move
```
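Step 0 only means something if the three repos are pinned to real tags rather than the `vX.Y.Z` placeholders from the skeleton. As a hedged illustration, the check below scans the drill doc for `repo@vMAJOR.MINOR.PATCH` pins; the `repo@tag` notation is taken from the skeleton, while the assumption that tags follow a three-part numeric scheme, and the names `pinned_versions` and `unpinned`, are mine.

```python
import re

# The three repos named in the pre-reqs of this manual.
REQUIRED_REPOS = ("infra-foundation", "platform-clusters", "policies-and-compliance")

# Matches e.g. "infra-foundation@v1.4.2"; placeholders like "@vX.Y.Z" do not match.
PIN_RE = re.compile(r"\b([a-z][a-z0-9-]*)@(v\d+\.\d+\.\d+)\b")


def pinned_versions(drill_doc: str) -> dict[str, str]:
    """Map each repo that has a concrete version pin to its tag."""
    return dict(PIN_RE.findall(drill_doc))


def unpinned(drill_doc: str) -> list[str]:
    """Required repos that have no concrete tag in the drill doc yet."""
    pins = pinned_versions(drill_doc)
    return [repo for repo in REQUIRED_REPOS if repo not in pins]
```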

You'll use this script as the basis of the live drill.

---

### 2.3 Run the Drill as a “Game Day”

Treat this like a **Game Day**. Roles:

* **Conductor** - drives the drill, announces steps, timeboxes.
* **Observers** - note any manual actions, friction points, or undocumented steps.
* **Operators** - actually run pipelines, check logs, etc.

For each step:

1. Read the step from the drill doc.
2. Execute it strictly via:

   * Git commits (if needed)
   * CI pipelines (`lint_and_unit`, `policy_gates`, `site_rollout`)
   * Documented one-time bootstrap actions (if in Step 2).
3. Log:

   * Start time, end time
   * Any manual commands run
   * Any surprises (missing script, script out-of-date, etc.)

**Critical rule:**
If someone says “I'll just SSH in and fix X”, the drill pauses:

* They stop and say it out loud.
* The Conductor marks this as a **failed requirement** and a **gap to fix in code** afterward.
* The team decides whether to continue the drill with that manual step (noting it as a gap) or to stop and fix it properly in Git first.

---

### 2.4 Artifacts from the Drill

You want a reproducible evidence trail:

* CI pipeline links for:

  * `infra-foundation` `site_rollout`
  * `platform-clusters` `site_rollout`
  * `policies-and-compliance` `policy_gates` runs
* Screenshots or snapshots of:

  * K8s nodes, namespaces, workloads
  * SLO dashboards (control-plane, GPU, PUE)
* Drill log:

  * Saved as `governance/drills/logs/eu-par-fr01-e2e-replay-YYYYMMDD.md`

Example entry:

```markdown
## Step 3 - Bare Metal & Hypervisor
Time: 09:30-09:55
Pipeline: https://git/.../pipelines/1234
Manual actions:
- None (all MAAS commission via script+pipeline)
Notes:
- Proxmox cluster came up with minor warning about time sync; add NTP config to Ansible role.
```
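If Observers capture each step in a small structured record, the log entries come out in the same shape every time and "no manual actions" becomes something you can assert rather than eyeball. This is only a sketch of that idea: the `StepLog` class and its fields are hypothetical, chosen to mirror the example entry above.

```python
from dataclasses import dataclass, field
from datetime import time


@dataclass
class StepLog:
    title: str
    start: time
    end: time
    pipeline: str
    manual_actions: list[str] = field(default_factory=list)
    notes: list[str] = field(default_factory=list)

    @property
    def passed(self) -> bool:
        """A step passes the zero-manual rule only if nothing was done by hand."""
        return not self.manual_actions

    def to_markdown(self) -> str:
        """Render the entry in the same layout as the example drill log."""
        lines = [
            f"## {self.title}",
            f"Time: {self.start:%H:%M}-{self.end:%H:%M}",
            f"Pipeline: {self.pipeline}",
            "Manual actions:",
        ]
        lines += [f"- {a}" for a in self.manual_actions] or ["- None"]
        lines.append("Notes:")
        lines += [f"- {n}" for n in self.notes] or ["- None"]
        return "\n".join(lines)


# Example: the Step 3 entry from the drill log above.
entry = StepLog(
    title="Step 3 - Bare Metal & Hypervisor",
    start=time(9, 30),
    end=time(9, 55),
    pipeline="https://git/.../pipelines/1234",
    notes=["Add NTP config to Ansible role."],
)
```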

---

### 2.5 Day 2 Definition of Done

* One complete **end-to-end replay drill** executed and logged.
* All manual actions performed during the drill are:

  * Either in the one-time bootstrap doc, or
  * Clearly flagged as gaps to fix.
* You can point to:

  * A single doc describing the replay
  * A set of pipeline runs
  * A functioning platform with workloads, policies, and SLOs

and say:

> “This environment was built from Git and pipelines only.”

---

## Day 3 - Regression Controls & Formal Sign-Off

**Goal:** Turn “we did it once” into “this is how we operate by default and don't regress”.

This is where **governance, culture, and automation** interlock.

### Learning Objectives

By the end of Day 3, the team can:

* Define durable **controls against regression** (manual changes, drift, snowflakes).
* Set up a **recurring verification cadence** (e.g., quarterly drills).
* Produce a final “zero_manual_provisioning” sign-off package for EU-PAR-FR01.

---

### 3.1 Hardening Against Manual Drift

Work through these and implement what you don't yet have:

1. **Git protections**

   * Branch protection on `main`:

     * MR required
     * CI must pass (lint + policy at minimum)
   * Optional: code owners for sensitive paths (`opa-policies/`, network Terraform, etc.)

2. **GitOps enforcement on clusters**

   * Alert if GitOps shows out-of-sync for more than X minutes.
   * Optionally, **auto-revert** manual cluster changes by periodic reconciliation.

3. **Network / OS drift detection**

   * Periodic jobs that:

     * Pull running configs from devices
     * Compare to Terraform/Ansible expected state
     * Alert on differences

4. **Manual-change policy**

   * Document a short policy, e.g. `governance/policies/no-manual-changes.md`:

     ```markdown
     # No Manual Changes Policy - EU-PAR-FR01

     - All persistent infra/platform changes must be made via Git and CI/CD.
     - Exceptions (break-glass, JIT) must:
       - Be pre-approved by on-call SRE and Security.
       - Be logged in an incident or change ticket.
       - Be codified via Git within 24h (or a defined SLA).
     - Direct device or cluster access is only for:
       - Documented bootstrap tasks
       - Emergency remediation with follow-up codification
     ```

   * Go through and ensure **everyone at the table** acknowledges this is the standard.

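The "compare and alert" part of the drift-detection job in item 3 can be as simple as a normalised text diff. The sketch below uses Python's standard `difflib` and deliberately leaves out how configs are pulled from devices or rendered from Terraform/Ansible, since that depends on your tooling. Treating lines that start with `!` as comments is an assumption borrowed from common switch CLIs, and `leaf01` is a hypothetical device name.

```python
import difflib


def normalize(config: str) -> list[str]:
    """Drop blank lines, trailing whitespace and '!' comment lines (assumed convention)."""
    lines = []
    for line in config.splitlines():
        line = line.rstrip()
        if not line or line.lstrip().startswith("!"):
            continue
        lines.append(line)
    return lines


def config_drift(expected: str, running: str, device: str) -> list[str]:
    """Return a unified diff of expected vs running config; empty list means no drift."""
    return list(
        difflib.unified_diff(
            normalize(expected),
            normalize(running),
            fromfile=f"{device}/expected",
            tofile=f"{device}/running",
            lineterm="",
        )
    )


# Example: someone changed an OSPF cost by hand on a leaf switch.
expected_cfg = "interface eth1\n ip ospf cost 10\n!\n"
running_cfg = "interface eth1\n ip ospf cost 50\n"
drift = config_drift(expected_cfg, running_cfg, "leaf01")
if drift:
    print("\n".join(drift))  # a real job would raise an alert here instead
```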

---

### 3.2 Define Recurring Verification

You *don't* want D6 to be a one-off. Define a cadence, e.g.:

* **Quarterly “mini replay” drill**:

  * Not full bare metal → workloads, but at least:

    * Destroy/recreate a non-critical cluster
    * Validate policies and SLOs
* **Monthly drift review**:

  * Review GitOps drift, network/OS drift reports, RBAC exceptions.

Capture this in:

`governance/runbooks/zero-manual-verification-cadence.md`

Example:

```markdown
# Zero-Manual-Provisioning Verification Cadence - EU-PAR-FR01

## Monthly
- [ ] Review GitOps drift logs for all clusters.
- [ ] Review network/OS drift reports.
- [ ] Check for any undocumented manual actions in incident/change tickets.

## Quarterly
- [ ] Run a scoped replay drill:
  - Recreate a non-prod cluster from Git.
  - Validate policies & SLOs.
- [ ] Update zero-manual checklist with any new components.

## Yearly
- [ ] Full replay drill (or as close as feasible in a test environment).
- [ ] Re-affirm break-glass policies and JIT patterns.
- [ ] Refresh training for new team members.
```
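The monthly ticket review pairs naturally with the "codified via Git within 24h" rule from the no-manual-changes policy: every break-glass change should have a follow-up merge inside the SLA. As a rough sketch, assuming you can export break-glass tickets with a change timestamp and an optional codification timestamp, the check could look like this. The `BreakGlassEvent` record, the `overdue` helper and the ticket IDs are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# SLA from the no-manual-changes policy; adjust if your defined SLA differs.
CODIFY_SLA = timedelta(hours=24)


@dataclass
class BreakGlassEvent:
    ticket: str
    changed_at: datetime                     # when the manual change was made
    codified_at: Optional[datetime] = None   # merge time of the follow-up MR, if any


def overdue(
    events: list[BreakGlassEvent],
    now: datetime,
    sla: timedelta = CODIFY_SLA,
) -> list[str]:
    """Tickets codified after the deadline, or still uncodified past it."""
    late = []
    for event in events:
        deadline = event.changed_at + sla
        # For uncodified changes, measure against the current time.
        effective = event.codified_at if event.codified_at is not None else now
        if effective > deadline:
            late.append(event.ticket)
    return late
```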

---

### 3.3 Formal Zero-Manual-Provisioning Sign-Off

Now build the actual **D6 “Definition of Done” artifact set**.
Think of it as your internal audit dossier.

Create:
`governance/certificates/eu-par-fr01-zero-manual-provisioning.md`

Suggested sections:

```markdown
# Zero-Manual-Provisioning Certificate - EU-PAR-FR01

## 1. Scope
- Site: EU-PAR-FR01 (Paris, France)
- Regime: EU/EEA, GDPR + FR Data Protection Act
- Stack: Facility, Network, Bare Metal, Hypervisor, K8s, Policies, Monitoring

## 2. Evidence Summary

### 2.1 Repos & Pipelines
- infra-foundation: URL
  - Pipelines: lint_and_unit, policy_gates, site_rollout
- platform-clusters: URL
  - Pipelines: lint_and_unit, policy_gates, integration_test, site_rollout
- policies-and-compliance: URL
  - Pipelines: lint_and_unit, policy_gates

### 2.2 End-to-End Replay
- Drill document: governance/drills/eu-par-fr01-e2e-replay.md
- Execution date: YYYY-MM-DD
- Pipelines:
  - infra-foundation #1234
  - platform-clusters #5678
  - policies-and-compliance #91011

### 2.3 Policies & SLOs
- Residency: opa-policies/data_residency.rego (link)
- RBAC & JIT: opa-policies/rbac.rego (link)
- SLO specs:
  - control-plane-availability
  - gpu-job-wait-time
  - site-pue
- Dashboards: paths to Grafana JSON files

## 3. Residual Manual Elements

- One-time bootstrap tasks:
  - OOB device initial config (documented in facility/bootstrap/eu-par-fr01.md)
  - mgmt01 initial OS install (to be automated in next iteration)

- Current limitations:
  - <short list, aligned with LR1-LR4 where relevant>

## 4. Governance & Cadence

- No-manual-changes policy: governance/policies/no-manual-changes.md
- Verification cadence: governance/runbooks/zero-manual-verification-cadence.md

## 5. Approvals

- Sovereign Compliance & Sustainability Lead: ______ (date)
- Principal SRE / Platform Architect: ______ (date)
- Network & Facility Leads (as required): ______ (date)
```
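Before anyone signs, it helps to confirm the certificate is actually filled in rather than still a template. The check below is a minimal sketch: it looks for the five top-level sections suggested above and for leftover template markers. The list of placeholder strings and the name `certificate_gaps` are assumptions drawn from this template, not an established format.

```python
import re

# Top-level sections from the suggested certificate layout.
REQUIRED_SECTIONS = (
    "Scope",
    "Evidence Summary",
    "Residual Manual Elements",
    "Governance & Cadence",
    "Approvals",
)

# Strings that indicate a template line was never filled in (assumed list).
PLACEHOLDERS = ("______", "YYYY-MM-DD", ": URL", "(link)")


def certificate_gaps(markdown: str) -> list[str]:
    """List missing sections and lines that still contain template placeholders."""
    gaps = []
    # Collect "## N. Title" headings and strip the leading number.
    headings = {
        re.sub(r"^\d+\.\s*", "", h.strip())
        for h in re.findall(r"^## (.+)$", markdown, flags=re.MULTILINE)
    }
    for section in REQUIRED_SECTIONS:
        if section not in headings:
            gaps.append(f"missing section: {section}")
    for number, line in enumerate(markdown.splitlines(), start=1):
        if any(marker in line for marker in PLACEHOLDERS):
            gaps.append(f"line {number}: unfilled placeholder")
    return gaps
```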

Make this **short but precise**. It's your internal “badge” for EU-PAR-FR01.

---

### 3.4 Day 3 Definition of Done

* No-manual-changes policy written and adopted.
* Verification cadence defined and stored in Git.
* Zero-manual certificate for EU-PAR-FR01 written, with:

  * Evidence links
  * Residual limitations
  * Signatures/approvals

---

## D6 Overall Definition of Done (Training-Level)

When the D6 manual is completed **in practice**, you have:

1. **A complete, honest inventory** of automation vs manual for EU-PAR-FR01.
2. At least one **end-to-end replay drill** proving build-from-Git is real, not theoretical.
3. **Controls to prevent regression**:

   * Policy + process + drift detection.
4. A **formal sign-off package** that says, in concrete terms:

   * What is automated
   * What is still manually bootstrapped (and constrained)
   * How you ensure it stays that way over time.

That's your training cohort “graduation”: from bare metal to zero_manual_provisioning as a living practice, not just an architecture diagram.

---