Let's treat this as a **day-by-day training manual for D1 only**.
You can literally follow it in order, run it with a small team, and we'll cover D2-D6 later.
Below is **D1 - Narrative Training Walkthrough: Complete Manual**.
Target site: **EU-PAR-FR01**
---
# How to Use This Manual
* You'll run this as a **multi-day enablement** (7 days, covering Stages 0-7).
* Each day/stage has:
* **Learning objectives**
* **Inputs & pre-reqs**
* **Concrete actions** (Git changes + pipeline runs)
* **Observable outcomes & DoD**
Repos assumed:
* `infra-foundation`
* `platform-clusters`
* `policies-and-compliance`
Pipelines (in each repo):
* `lint_and_unit`
* `policy_gates`
* `integration_test`
* `promotion_to_template`
* `site_rollout`
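These stages map onto a CI definition roughly like the sketch below (GitLab CI syntax is shown only as an example; `scripts/lint.sh` is an assumed helper, `scripts/run_opa.sh` appears later in Day 6):
```yaml
# .gitlab-ci.yml (sketch) - the same five stages exist in every repo
stages:
  - lint_and_unit
  - policy_gates
  - integration_test
  - promotion_to_template
  - site_rollout

lint_and_unit:
  stage: lint_and_unit
  script:
    - ./scripts/lint.sh

policy_gates:
  stage: policy_gates
  script:
    - ./scripts/run_opa.sh
```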
> ✅ Golden rule: **No long-lived manual changes on infra.**
> Consoles and UIs are for *inspection*, not configuration (except the narrow bootstrap steps explicitly listed).
---
## Day 1 — Stage 0: Policy & Site Definition
### Learning Objectives
By the end of Day 1, the team can:
* Describe the **sovereign context** for EU-PAR-FR01 (jurisdiction, data classes, residency).
* Encode site metadata and data-classification rules as YAML in Git.
* Run pipelines and interpret **policy gate failures**.
### Inputs & Pre-Reqs
* Git access to:
* `infra-foundation`
* `policies-and-compliance`
* CI/CD already wired for these repos (pipelines available but not yet richly used).
* Agreement on:
* Site code: `EU-PAR-FR01`
* Country: `FR`
* Regime: EU/EEA, GDPR + FR law
### Step 1 — Create/Update Site Manifest
**Repo:** `infra-foundation`
1. Create a feature branch:
```bash
git checkout -b feat/eu-par-fr01-site
```
2. Create file:
`facility/site_manifests/eu-par-fr01.yaml`
```yaml
site:
  code: EU-PAR-FR01
  country: FR
  city: PAR
  it_load_kw: 80
  racks:
    gpu: 2
    compute: 4
    storage: 2
regulatory:
  regime: EU_EEA
  primary_privacy_law: GDPR
  local_law: "FR Data Protection Act"
sovereignty:
  allowed_regions:
    - FR
    - EU_EEA
```
3. Commit and push:
```bash
git add facility/site_manifests/eu-par-fr01.yaml
git commit -m "Add site manifest for EU-PAR-FR01"
git push origin feat/eu-par-fr01-site
```
4. Open an MR/PR in your Git platform.
5. CI runs `lint_and_unit` and `policy_gates`.
### Step 2 — Define Data Classification & Residency
**Repo:** `policies-and-compliance`
1. Branch:
```bash
git checkout -b feat/data-classification-fr
```
2. Edit/create `data-classification.yaml`:
```yaml
levels:
  - name: PUBLIC
  - name: INTERNAL
  - name: PERSONAL
  - name: SENSITIVE_PERSONAL
  - name: CRITICAL_SOVEREIGN_FR
residency:
  CRITICAL_SOVEREIGN_FR:
    must_stay_in_country: FR
  SENSITIVE_PERSONAL:
    must_stay_in_region: EU_EEA
  PERSONAL:
    allowed_transfers:
      - mechanism: "SCCs / approved mechanisms"
        approval_required_from: "DPO"
```
3. Add sustainability targets (placeholder) to `sustainability-kpis.yaml`:
```yaml
targets:
  pue_max: 1.4
  renewable_share_min_percent: 70
```
4. Commit & push:
```bash
git add data-classification.yaml sustainability-kpis.yaml
git commit -m "Define data classification and sustainability targets for EU-PAR-FR01"
git push origin feat/data-classification-fr
```
5. Open MR/PR; CI runs `lint_and_unit` + `policy_gates`.
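The `policy_gates` job will later assert that these targets are present (see the Day 1 outcomes below). A minimal Rego sketch of that presence check (package and file names are illustrative, not yet in the repo) could look like:
```rego
package sustainability

# Deny if the KPI file does not declare a PUE ceiling.
deny[msg] {
  not input.targets.pue_max
  msg := "sustainability-kpis.yaml must define targets.pue_max"
}

# Deny if no minimum renewable share is declared.
deny[msg] {
  not input.targets.renewable_share_min_percent
  msg := "sustainability-kpis.yaml must define targets.renewable_share_min_percent"
}
```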
### Step 3 — Expected Policy Failures (Optional Training Twist)
To train the team on **CI feedback**, deliberately introduce a mistake:
* In `eu-par-fr01.yaml`, set `site.code: EU-PAR-FR1` (missing `0`).
An OPA policy (`site_naming.rego`) will then deny the change:
```rego
package site_naming

deny[msg] {
  not regex.match(`^EU-[A-Z]{3}-FR0[1-9]$`, input.site.code)
  msg := sprintf("invalid site code: %v", [input.site.code])
}
```
**Exercise:**
1. Observe pipeline failure message.
2. Fix `EU-PAR-FR1` → `EU-PAR-FR01`.
3. Re-push, watch `policy_gates` pass.
### Day 1 Observable Outcomes
* **Green MRs/PRs** for:
* `eu-par-fr01.yaml`
* `data-classification.yaml`
* `sustainability-kpis.yaml`
* `policy_gates` successfully enforces:
* Site naming pattern
* Presence of PUE and renewable targets
### Day 1 Definition of Done
* Site EU-PAR-FR01 is **formally defined in Git**.
* Data classes & residency rules for FR sovereign data are codified.
* At least one **intentional CI policy failure** has been observed and fixed by the team.
---
## Day 2 — Stage 1: Facility Build-Out (Logical Model)
### Learning Objectives
By the end of Day 2, the team can:
* Represent physical racks, power, and cooling **as code**.
* Link facility telemetry to future observability (PUE inputs).
* Use policy gates to validate rack power density and naming.
### Inputs & Pre-Reqs
* Day 1 completed and merged.
* High-level facility design (rack count, power, cooling, UPS, generator).
* Agreement on rack codes: `EU-PAR-FR01-RK01` … `EU-PAR-FR01-RK08`.
### Step 1 — Rack Layout YAML
**Repo:** `infra-foundation`
1. Branch:
```bash
git checkout -b feat/eu-par-fr01-racks
```
2. Create `facility/rack_layouts/eu-par-fr01-racks.yaml`:
```yaml
racks:
  - code: EU-PAR-FR01-RK01
    role: gpu
    max_kw: 20
  - code: EU-PAR-FR01-RK02
    role: gpu
    max_kw: 20
  - code: EU-PAR-FR01-RK03
    role: compute
    max_kw: 8
  - code: EU-PAR-FR01-RK04
    role: compute
    max_kw: 8
  - code: EU-PAR-FR01-RK05
    role: compute
    max_kw: 8
  - code: EU-PAR-FR01-RK06
    role: compute
    max_kw: 8
  - code: EU-PAR-FR01-RK07
    role: storage
    max_kw: 6
  - code: EU-PAR-FR01-RK08
    role: storage
    max_kw: 6
```
3. Commit & push; open MR/PR; CI runs.
### Step 2 — Power & Cooling YAML
Create `facility/power_and_cooling/eu-par-fr01-power.yaml`:
```yaml
power:
  ups_topology: N+1
  generator:
    enabled: true
    autonomy_hours: 8
  per_rack_pdu:
    type: "intelligent"
    metered: true
    switched: true
cooling:
  type: in-row
  free_cooling: true
  gpu_rack_density_kw: 20
  cpu_rack_density_kw: 8
telemetry:
  sensors:
    - rack_inlet_temp
    - rack_exhaust_temp
    - room_temp
    - room_humidity
    - pdu_power_kw
  exports:
    - name: rack_pdu_power_kw
      protocol: snmp
      prometheus:
        metric: dc_pdu_power_kw
```
Commit & push on same branch.
### Step 3 — Policy Gate & Validation
Assume a policy that checks:
* Rack codes match `<SITE>-RK<rr>`.
* `max_kw` does not exceed `gpu_rack_density_kw` or `cpu_rack_density_kw` per role.
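A minimal Rego sketch of such a gate, assuming the pipeline hands it a single document containing both the rack layout and the power/cooling YAML (package name and input shape are illustrative):
```rego
package rack_layout

# Rack codes must follow <SITE>-RKnn.
deny[msg] {
  rack := input.racks[_]
  not regex.match(`^EU-[A-Z]{3}-FR0[1-9]-RK[0-9]{2}$`, rack.code)
  msg := sprintf("invalid rack code: %v", [rack.code])
}

# GPU racks may not exceed the designed GPU rack density.
deny[msg] {
  rack := input.racks[_]
  rack.role == "gpu"
  rack.max_kw > input.cooling.gpu_rack_density_kw
  msg := sprintf("rack max_kw (%v) exceeds gpu_rack_density_kw (%v)", [rack.max_kw, input.cooling.gpu_rack_density_kw])
}

# Compute racks are checked against the CPU density the same way.
deny[msg] {
  rack := input.racks[_]
  rack.role == "compute"
  rack.max_kw > input.cooling.cpu_rack_density_kw
  msg := sprintf("rack max_kw (%v) exceeds cpu_rack_density_kw (%v)", [rack.max_kw, input.cooling.cpu_rack_density_kw])
}
```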
**Optional training mistake:**
* Set `max_kw: 25` on a GPU rack.
* CI will fail with “rack max_kw (25) exceeds gpu_rack_density_kw (20)”.
Fix, re-run pipeline.
### Day 2 Observable Outcomes
* Facility model for EU-PAR-FR01 racks, power, cooling lives in Git.
* Policy gates verify:
* Rack codes
* Power density constraints
### Day 2 Definition of Done
* Racks & power are **fully codified**.
* Sustainability-related telemetry fields (power metrics) are defined.
* Team understands how facility YAML ties into later PUE/KPI observability.
---
## Day 3 — Stage 2: Network & Out-of-Band Bring-Up
### Learning Objectives
* Model underlay network (leaf/spine, VRFs, VLANs) with Terraform.
* Define OOB management network and inventories via Ansible.
* Understand the **narrow, documented bootstrap** for OOB devices.
### Inputs & Pre-Reqs
* Day 1-2 merged.
* Network design (ASNs, subnets, VRFs).
* Access to the physical network devices (powered on but not yet configured, reachable via console).
### Step 1 — Network Terraform Skeleton
**Repo:** `infra-foundation`
Branch: `feat/eu-par-fr01-network`
Create:
`network/terraform/eu-par-fr01/main.tf`:
```hcl
module "leaf_spine" {
  source    = "../modules/leaf_spine"
  site_code = "EU-PAR-FR01"

  vrfs = [
    "INFRA_MGMT",
    "TENANT",
    "STORAGE",
    "OUT_OF_BAND",
  ]

  # Example inputs - adjust to your design
  mgmt_cidr    = "10.10.0.0/24"
  tenant_cidr  = "10.20.0.0/16"
  storage_cidr = "10.30.0.0/24"
  oob_cidr     = "10.99.0.0/24"
}
```
`network/terraform/eu-par-fr01/variables.tf` and `outputs.tf` as needed.
### Step 2 — OOB Inventory & Bootstrap Runbook
Create:
`baremetal/profiles/oob-switches.yaml`:
```yaml
oob_switches:
  - name: EU-PAR-FR01-RK01-oob01
    mgmt_ip: 10.99.0.10
    rack: EU-PAR-FR01-RK01
    role: oob
```
Create a **bootstrap runbook** (Markdown):
`facility/bootstrap/eu-par-fr01.md`:
* List **explicit allowed manual actions**:
* Connect serial console to OOB switch/firewall.
* Set initial mgmt IP, subnet, default gateway.
* Set initial credentials/SSH keys.
* Immediately codify these settings in Terraform/Ansible.
### Step 3 — Bootstrap Actions (One-Time, Documented)
**Physical/console actions (allowed):**
* For each OOB device:
* Set mgmt IP from `oob_cidr` (`10.99.0.0/24`).
* Configure credentials to match Ansible inventory.
**Important:**
After this step, **no more manual config** on network devices. Everything else is via pipelines.
### Step 4 — CI & Policy Gates
* Run `terraform validate` via `lint_and_unit`.
* Run `policy_gates` to check:
* No overlapping CIDRs between VRFs.
* OOB network is distinct from TENANT/STORAGE/INFRA_MGMT.
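A sketch of the overlap check in Rego, assuming `policy_gates` receives the module inputs as JSON (e.g. a rendering of the tfvars; package name is illustrative):
```rego
package network_cidrs

# VRF CIDRs as passed to the leaf_spine module.
cidrs := {
  "INFRA_MGMT": input.mgmt_cidr,
  "TENANT": input.tenant_cidr,
  "STORAGE": input.storage_cidr,
  "OUT_OF_BAND": input.oob_cidr,
}

# Any two distinct VRFs whose CIDRs intersect are a violation.
deny[msg] {
  c1 := cidrs[a]
  c2 := cidrs[b]
  a != b
  net.cidr_intersects(c1, c2)
  msg := sprintf("CIDR overlap between %v (%v) and %v (%v)", [a, c1, b, c2])
}
```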
**Optional mistake:**
Set `oob_cidr = "10.20.0.0/24"` (same as TENANT).
Policy gate fails → fix.
### Day 3 Observable Outcomes
* Terraform underlay definition for EU-PAR-FR01 exists.
* OOB inventory & initial bootstrap runbook committed.
* Network policies catch invalid CIDR overlaps.
### Day 3 Definition of Done
* OOB devices are reachable via mgmt IPs defined in YAML.
* Leaf/spine config can be generated and applied **by pipeline**.
* Bootstrap runbook clearly documents the **only** manual config allowed.
---
## Day 4 — Stage 3: Bare-Metal & Hypervisor Provisioning
### Learning Objectives
* Drive server discovery and OS install via MAAS (or similar) and Ansible.
* Understand how `mgmt01` is used as an automation anchor.
* Handle failed provisioning via **pipeline-driven rollback**.
### Inputs & Pre-Reqs
* Network connectivity (OOB + mgmt) is in place.
* Management node: `EU-PAR-FR01-RK01-mgmt01` exists (OS can be installed manually once, or pre-provisioned).
### Step 1 — Define Bare-Metal Profiles
**Repo:** `infra-foundation`
Branch: `feat/eu-par-fr01-baremetal`
Create:
`baremetal/profiles/compute-standard.yaml`:
```yaml
profile:
  name: compute-standard
  cpu: "2x32"
  ram_gb: 512
  role: "k8s-node"
  tags:
    - "site:EU-PAR-FR01"
    - "country:FR"
```
`baremetal/profiles/compute-gpu.yaml`:
```yaml
profile:
  name: compute-gpu
  cpu: "2x32"
  ram_gb: 768
  gpus: 4
  role: "k8s-gpu-node"
  tags:
    - "site:EU-PAR-FR01"
    - "country:FR"
    - "gpu:true"
```
### Step 2 — Hypervisor Ansible Playbook
`hypervisor/ansible/proxmox_cluster.yml`:
```yaml
- hosts: eu-par-fr01-proxmox
  become: true
  roles:
    - proxmox-base
    - proxmox-cluster-join
```
Inventory could be generated from MAAS hostnames.
### Step 3 — Pipeline-Driven Commissioning
* `site_rollout` job script in CI:
```bash
#!/usr/bin/env bash
# scripts/apply_site.sh
set -euo pipefail

SITE_CODE="$1" # e.g. EU-PAR-FR01
./scripts/maas_commission_nodes.sh "$SITE_CODE"
ansible-playbook hypervisor/ansible/proxmox_cluster.yml -l "$SITE_CODE"
```
Run:
* From MR or manually in CI: `site_rollout EU-PAR-FR01`.
### Optional Failure Scenario (Rollback Training)
* Mistake: profile references wrong image channel (e.g., `hwe-20.04` not available).
* Commissioning fails; `integration_test` checks kernel version and fails.
* Fix YAML to a supported image; re-run `site_rollout`.
* If needed, script reverts to last known-good image.
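One way the kernel check inside `integration_test` could look, as a small script (illustrative; the inventory group, output parsing, and expected kernel series are assumptions):
```bash
#!/usr/bin/env bash
# scripts/check_kernel_version.sh - illustrative sketch of the integration_test kernel check
set -euo pipefail

SITE_GROUP="$1"              # Ansible group for the site, e.g. eu-par-fr01
EXPECTED_PREFIX="${2:-5.15}" # kernel series the chosen image channel should deliver (assumption)

# Query every node; any line whose stdout is not on the expected series is a mismatch.
if ansible "$SITE_GROUP" -m command -a "uname -r" -o | grep -v "(stdout) ${EXPECTED_PREFIX}"; then
  echo "FAIL: at least one node is not on kernel ${EXPECTED_PREFIX}.x" >&2
  exit 1
fi
echo "All ${SITE_GROUP} nodes report kernel ${EXPECTED_PREFIX}.x"
```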
### Day 4 Observable Outcomes
* MAAS lists all servers with correct tags (site, role, etc.).
* Proxmox (or chosen hypervisor) cluster installed via Ansible.
### Day 4 Definition of Done
* Adding a new server and powering it on can be driven to OS-installed state **only via** `site_rollout` pipeline.
* No direct SSH configuration is required.
---
## Day 5 — Stage 4: Platform Bootstrap (First K8s Control Plane)
### Learning Objectives
* Define the management K8s cluster as YAML.
* Bootstrap GitOps controller and use it as the **only** way to change cluster state.
* Understand bootstrap vs steady-state behavior.
### Inputs & Pre-Reqs
* Hypervisor cluster or dedicated K8s nodes available.
* `EU-PAR-FR01-RK01-mgmt01` has access to Git and CI.
### Step 1 — Cluster Definition
**Repo:** `platform-clusters`
Branch: `feat/eu-par-fr01-mgmt-cluster`
`k8s/clusters/eu-par-fr01/mgmt-cluster.yaml`:
```yaml
cluster:
  name: eu-par-fr01-mgmt
  site: EU-PAR-FR01
  control_plane_nodes: 3
  node_pool_profile: compute-standard
  networking:
    cni: cilium
  sovereign_overlays:
    country: FR
    region: EU_EEA
```
### Step 2 — GitOps Application Definition
For Argo CD (example):
`k8s/clusters/eu-par-fr01/gitops-app.yaml`:
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: eu-par-fr01-mgmt
spec:
  project: default
  source:
    repoURL: 'ssh://git@your-git/platform-clusters.git'
    path: 'k8s/clusters/eu-par-fr01'
    targetRevision: main
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
### Step 3 — Bootstrap Playbook (Run via Pipeline)
Create a script/playbook in `platform-clusters` that:
* Installs K8s binaries on designated nodes.
* Deploys Argo CD (or Flux).
* Applies `gitops-app.yaml`.
CI job (e.g., `site_rollout` for platform):
```bash
./scripts/bootstrap_k8s_cluster.sh eu-par-fr01-mgmt
```
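A sketch of what `bootstrap_k8s_cluster.sh` might do; kubeadm (driven by a hypothetical Ansible playbook) and Argo CD's upstream install manifest are assumptions here, so substitute your chosen distribution and GitOps tool:
```bash
#!/usr/bin/env bash
# scripts/bootstrap_k8s_cluster.sh - illustrative sketch only
set -euo pipefail

CLUSTER_NAME="$1"   # e.g. eu-par-fr01-mgmt
CLUSTER_DIR="k8s/clusters/eu-par-fr01"

# 1. Install the control plane on the designated nodes (kubeadm via a playbook assumed here).
ansible-playbook k8s/ansible/kubeadm_control_plane.yml -e "cluster_name=${CLUSTER_NAME}"

# 2. Deploy Argo CD into the new cluster.
kubectl create namespace argocd --dry-run=client -o yaml | kubectl apply -f -
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# 3. Register the GitOps Application; from here on, Git is the only interface for cluster state.
kubectl apply -n argocd -f "${CLUSTER_DIR}/gitops-app.yaml"
```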
**Important:**
Even this bootstrap script is **versioned** and run from CI, not executed manually from someone's laptop.
### Step 4 — Integration Tests
* `integration_test` job:
* Runs `kubectl get nodes`, `kubectl get pods -A`.
* Simple HTTP probe to API server.
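For example, the `integration_test` job can stay very small (sketch; the readiness endpoint may differ in your setup):
```bash
#!/usr/bin/env bash
# Illustrative integration_test step for the management cluster.
set -euo pipefail

# All nodes registered and Ready?
kubectl get nodes
kubectl wait --for=condition=Ready nodes --all --timeout=300s

# Anything obviously broken across namespaces?
kubectl get pods -A

# API server answers its readiness probe.
kubectl get --raw /readyz
```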
### Day 5 Observable Outcomes
* Management cluster `eu-par-fr01-mgmt` exists and is healthy.
* GitOps tool is deployed and syncing from `platform-clusters/k8s/clusters/eu-par-fr01/`.
* Any change in that path is reflected in the cluster via GitOps.
### Day 5 Definition of Done
* Management cluster can be **re-created** from scratch using:
* Git repos
* CI jobs
* Bootstrap scripts (no manual `kubectl apply` from laptops).
---
## Day 6 — Stage 5: Compliance & Telemetry Validation
### Learning Objectives
* Enforce data residency & RBAC via policy-as-code.
* Wire telemetry for platform + facility (PUE input metrics).
* See policy failures in CI and fix them.
### Inputs & Pre-Reqs
* K8s mgmt cluster running and managed via GitOps.
* Prometheus/Grafana charts available to deploy.
### Step 1 — Policies in `policies-and-compliance`
Branch: `feat/eu-par-fr01-policies`
Create `opa-policies/data_residency.rego` (simplified):
```rego
package data_residency

deny[msg] {
  input.kind == "BackupPolicy"
  input.metadata.labels["data_classification"] == "CRITICAL_SOVEREIGN_FR"
  not input.spec.target.region == "fr-central"
  msg := sprintf("critical FR data must backup to fr-central, got %v", [input.spec.target.region])
}
```
Create `opa-policies/rbac.rego`:
```rego
package rbac

deny[msg] {
  input.kind == "ClusterRoleBinding"
  input.metadata.name == "cluster-admin"
  s := input.subjects[_]
  s.kind == "User"
  not endswith(s.name, "@sovereign-ops.fr")
  msg := "cluster-admin bindings must target sovereign-ops.fr principals only"
}
```
### Step 2 — Monitoring Stack in `platform-clusters`
Branch: `feat/eu-par-fr01-observability`
`addons/monitoring-logging-security/prometheus/values.yaml`:
```yaml
extraScrapeConfigs:
  - job_name: "pdu-power"
    static_configs:
      - targets: ["eu-par-fr01-pdu-metrics:9100"]
    metrics_path: /metrics
```
Plus typical K8s scrape configs (apiserver, nodes, etc.).
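With `dc_pdu_power_kw` (IT load, exported since Day 2) plus a metric for total facility draw, PUE can be derived as a recording rule. A sketch, assuming a `dc_facility_power_kw` metric exists and the chart exposes a `serverFiles`/`recording_rules.yml` section (both are assumptions):
```yaml
serverFiles:
  recording_rules.yml:
    groups:
      - name: sustainability
        rules:
          # PUE = total facility power / IT equipment power
          - record: dc:pue:ratio
            expr: sum(dc_facility_power_kw) / sum(dc_pdu_power_kw)
```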
### Step 3 — CI Integration
* Update `scripts/run_opa.sh` to run OPA/Conftest against rendered manifests (K8s & backup policies).
* Ensure `policy_gates` uses the new rego files.
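A minimal version of `scripts/run_opa.sh` using Conftest (assuming an earlier job renders manifests into `rendered/`; paths are illustrative):
```bash
#!/usr/bin/env bash
# scripts/run_opa.sh - sketch: evaluate rendered manifests against the rego policies
set -euo pipefail

POLICY_DIR="opa-policies"
RENDERED_DIR="${1:-rendered}"  # output of the manifest-rendering job (assumed)

# Conftest exits non-zero if any deny rule fires, which fails the policy_gates job.
conftest test "${RENDERED_DIR}"/*.yaml -p "${POLICY_DIR}" --all-namespaces
```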
### Optional Lab Mistakes
1. Backup policy targeting a region outside `fr-central` (the Day 6 residency rule should catch it).
2. `ClusterRoleBinding` with external email domain.
Observe CI failures, fix YAML, re-run.
### Day 6 Observable Outcomes
* Policies catch non-compliant backup regions and admin bindings.
* Prometheus scrapes PDU metrics; PUE calculation possible.
### Day 6 Definition of Done
* Compliance and telemetry for EU-PAR-FR01 are **enforced via pipelines**, not manual review alone.
* Team can confidently interpret and resolve policy failures.
---
## Day 7 — Stage 6 & 7: Workload Onboarding & Scale-Out Pattern
### Learning Objectives
* Onboard initial AI/ML + SaaS workloads in sovereign-compliant namespaces.
* Reflect on how EU-PAR-FR01 patterns will be cloned to other sites.
### Inputs & Pre-Reqs
* Mgmt cluster ready; policies and monitoring active.
### Step 1 — Namespaces & Storage Classes
**Repo:** `platform-clusters`
Branch: `feat/eu-par-fr01-workloads`
Create:
`k8s/clusters/eu-par-fr01/namespaces/fr-critical-sovereign-ai.yaml`:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: fr-critical-sovereign-ai
  labels:
    data_classification: CRITICAL_SOVEREIGN_FR
    country: FR
```
StorageClass:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fr-critical-sovereign-sc
provisioner: "ceph.rbd"
parameters:
  pool: "gpu-block-fr-local"
```
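The training Job in Step 2 below mounts a PersistentVolumeClaim; a matching claim bound to this StorageClass could look like this (the claim name comes from the Job manifest, the requested size is an assumption):
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fr-critical-sovereign-ai-pvc
  namespace: fr-critical-sovereign-ai
  labels:
    data_classification: CRITICAL_SOVEREIGN_FR
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fr-critical-sovereign-sc
  resources:
    requests:
      storage: 500Gi  # assumption - size to your training data
```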
### Step 2 — Example AI/ML Job
`k8s/clusters/eu-par-fr01/workloads/ai-ml/job.yaml`:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fr-ai-training-job
  namespace: fr-critical-sovereign-ai
  labels:
    data_classification: CRITICAL_SOVEREIGN_FR
spec:
  template:
    spec:
      nodeSelector:
        gpu: "true"
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.local/ai/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: training-data
              mountPath: /data
      volumes:
        - name: training-data
          persistentVolumeClaim:
            claimName: fr-critical-sovereign-ai-pvc
```
### Step 3 — CI & GitOps Rollout
* Push branch, open MR, run:
* `lint_and_unit`
* `policy_gates`
* `integration_test`
* `site_rollout` (applies via GitOps)
**Optional lab mistake:**
* Use storage pool `gpu-block-eu` that replicates outside FR.
* `data_residency` policy denies, pipeline fails → fix to `gpu-block-fr-local`.
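The Day 6 `data_residency` package only covers `BackupPolicy`; a sketch of an extra rule that would catch this StorageClass mistake (the naming convention it relies on is illustrative):
```rego
package data_residency

# FR-sovereign storage classes must point at FR-local pools.
deny[msg] {
  input.kind == "StorageClass"
  startswith(input.metadata.name, "fr-critical-sovereign")
  not endswith(input.parameters.pool, "-fr-local")
  msg := sprintf("FR-sovereign StorageClass must use an FR-local pool, got %v", [input.parameters.pool])
}
```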
### Step 4 — Scale-Out Pattern (Preview for Later Work)
* Copy EU-PAR-FR01 manifests into a new path `k8s/clusters/eu-fra-de01/` with site and country changed.
* Ensure all site-specific values are parameterized; confirm policy gates also handle DE site.
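A rough sketch of the cloning step (purely illustrative; a templating approach such as Helm or Kustomize is the better long-term answer than copy-and-sed):
```bash
#!/usr/bin/env bash
# One-off clone of the EU-PAR-FR01 manifests for a hypothetical DE site.
set -euo pipefail

SRC="k8s/clusters/eu-par-fr01"
DST="k8s/clusters/eu-fra-de01"

cp -r "$SRC" "$DST"

# Swap site code and country in every manifest; review the diff before committing.
grep -rl "EU-PAR-FR01" "$DST" | xargs sed -i 's/EU-PAR-FR01/EU-FRA-DE01/g'
grep -rl "eu-par-fr01" "$DST" | xargs sed -i 's/eu-par-fr01/eu-fra-de01/g'
grep -rl "country: FR" "$DST" | xargs sed -i 's/country: FR/country: DE/g'
```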
### Day 7 Observable Outcomes
* Critical sovereign workload runs in FR-only namespace and storage.
* Policy gates prevent cross-border misconfigurations.
* Initial pattern for **cloning EU-PAR-FR01** to another site is established.
### Day 7 Definition of Done
* EU-PAR-FR01 can host **real workloads**, compliant with data residency and RBAC rules.
* A repeatable pattern exists for future sites with minimal changes.
---
## D1 Wrap-Up: What You Achieve After Following All Days
After you run **Day 1-7** as above:
* EU-PAR-FR01 is modeled end-to-end in Git:
* Facility, network, bare metal, hypervisor, K8s, policies, observability.
* The first control plane is running and GitOps-managed.
* Sovereign and sustainability constraints are **built-in**, not retrofitted.
* You have a **repeatable, teachable journey** that maps exactly to the D1 requirements and sets the stage for D2-D6.
---