# Micro–DC/docs/training/D1
Let's treat this as a **day-by-day training manual for D1 only**.

You can literally follow this in order, run it with a small team, and we'll do D2-D6 later.

Below is **D1 - Narrative Training Walkthrough: Complete Manual**

Target site: **EU-PAR-FR01**

---

# How to Use This Manual

* You'll run this as a **multi-day enablement** (7 days = 7 stages).
* Each day/stage has:

  * **Learning objectives**
  * **Inputs & pre-reqs**
  * **Concrete actions** (Git changes + pipeline runs)
  * **Observable outcomes & DoD**

Repos assumed:

* `infra-foundation`
* `platform-clusters`
* `policies-and-compliance`

Pipelines (in each repo):

* `lint_and_unit`
* `policy_gates`
* `integration_test`
* `promotion_to_template`
* `site_rollout`

> ✅ Golden rule: **No long-lived manual changes on infra.**
> Consoles and UIs are for *inspection*, not configuration (except the narrow bootstrap steps explicitly listed).

---

## Day 1 — Stage 0: Policy & Site Definition

### Learning Objectives

By the end of Day 1, the team can:

* Describe the **sovereign context** for EU-PAR-FR01 (jurisdiction, data classes, residency).
* Encode site metadata and data-classification rules as YAML in Git.
* Run pipelines and interpret **policy gate failures**.

### Inputs & Pre-Reqs

* Git access to:

  * `infra-foundation`
  * `policies-and-compliance`
* CI/CD already wired for these repos (pipelines available but not yet richly used).
* Agreement on:

  * Site code: `EU-PAR-FR01`
  * Country: `FR`
  * Regime: EU/EEA, GDPR + FR law

### Step 1 — Create/Update Site Manifest

**Repo:** `infra-foundation`

1. Create a feature branch:

```bash
git checkout -b feat/eu-par-fr01-site
```

2. Create file:

`facility/site_manifests/eu-par-fr01.yaml`

```yaml
site:
  code: EU-PAR-FR01
  country: FR
  city: PAR
  it_load_kw: 80
  racks:
    gpu: 2
    compute: 4
    storage: 2
regulatory:
  regime: EU_EEA
  primary_privacy_law: GDPR
  local_law: "FR Data Protection Act"
sovereignty:
  allowed_regions:
    - FR
    - EU_EEA
```

3. Commit and push:

```bash
git add facility/site_manifests/eu-par-fr01.yaml
git commit -m "Add site manifest for EU-PAR-FR01"
git push origin feat/eu-par-fr01-site
```

4. Open an MR/PR in your Git platform.

5. CI runs `lint_and_unit` and `policy_gates`.
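As a sketch, the two jobs might be wired up in a GitLab-CI-style config like this (the job names come from this manual; the script lines and the `opa-policies/` path are illustrative and assume `yamllint` and `conftest` are available in the runner image):

```yaml
# .gitlab-ci.yml (sketch) - script lines are illustrative
stages:
  - lint_and_unit
  - policy_gates

lint_and_unit:
  stage: lint_and_unit
  script:
    - yamllint facility/

policy_gates:
  stage: policy_gates
  script:
    - conftest test --policy opa-policies/ facility/site_manifests/
```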
### Step 2 — Define Data Classification & Residency

**Repo:** `policies-and-compliance`

1. Branch:

```bash
git checkout -b feat/data-classification-fr
```

2. Edit/create `data-classification.yaml`:

```yaml
levels:
  - name: PUBLIC
  - name: INTERNAL
  - name: PERSONAL
  - name: SENSITIVE_PERSONAL
  - name: CRITICAL_SOVEREIGN_FR

residency:
  CRITICAL_SOVEREIGN_FR:
    must_stay_in_country: FR
  SENSITIVE_PERSONAL:
    must_stay_in_region: EU_EEA
  PERSONAL:
    allowed_transfers:
      - mechanism: "SCCs / approved mechanisms"
        approval_required_from: "DPO"
```

3. Add sustainability targets (placeholder) to `sustainability-kpis.yaml`:

```yaml
targets:
  pue_max: 1.4
  renewable_share_min_percent: 70
```

4. Commit & push:

```bash
git add data-classification.yaml sustainability-kpis.yaml
git commit -m "Define data classification and sustainability targets for EU-PAR-FR01"
git push origin feat/data-classification-fr
```

5. Open an MR/PR; CI runs `lint_and_unit` + `policy_gates`.

### Step 3 — Expected Policy Failures (Optional Training Twist)

To train the team on **CI feedback**, deliberately introduce a mistake:

* In `eu-par-fr01.yaml`, set `site.code: EU-PAR-FR1` (missing `0`).

An OPA-style policy (`site_naming.rego`) will deny it:

```rego
package site_naming

deny[msg] {
  not regex.match(`^EU-[A-Z]{3}-FR0[1-9]$`, input.site.code)
  msg := sprintf("invalid site code: %v", [input.site.code])
}
```

**Exercise:**

1. Observe the pipeline failure message.
2. Fix `EU-PAR-FR1` → `EU-PAR-FR01`.
3. Re-push and watch `policy_gates` pass.

### Day 1 Observable Outcomes

* **Green MRs/PRs** for:

  * `eu-par-fr01.yaml`
  * `data-classification.yaml`
  * `sustainability-kpis.yaml`
* `policy_gates` successfully enforces:

  * Site naming pattern
  * Presence of PUE and renewable targets
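The "presence of targets" gate could be as simple as the following rego sketch, assuming `sustainability-kpis.yaml` is fed to OPA as the input document (the package name is an assumption):

```rego
package sustainability_kpis

deny[msg] {
  not input.targets.pue_max
  msg := "sustainability-kpis.yaml must define targets.pue_max"
}

deny[msg] {
  not input.targets.renewable_share_min_percent
  msg := "sustainability-kpis.yaml must define targets.renewable_share_min_percent"
}
```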
### Day 1 Definition of Done

* Site EU-PAR-FR01 is **formally defined in Git**.
* Data classes & residency rules for FR sovereign data are codified.
* At least one **intentional CI policy failure** has been observed and fixed by the team.

---

## Day 2 — Stage 1: Facility Build-Out (Logical Model)

### Learning Objectives

By the end of Day 2, the team can:

* Represent physical racks, power, and cooling **as code**.
* Link facility telemetry to future observability (PUE inputs).
* Use policy gates to validate rack power density and naming.

### Inputs & Pre-Reqs

* Day 1 completed and merged.
* High-level facility design (rack count, power, cooling, UPS, generator).
* Agreement on rack codes: `EU-PAR-FR01-RK01` … `EU-PAR-FR01-RK08`.

### Step 1 — Rack Layout YAML

**Repo:** `infra-foundation`

1. Branch:

```bash
git checkout -b feat/eu-par-fr01-racks
```

2. Create `facility/rack_layouts/eu-par-fr01-racks.yaml`:

```yaml
racks:
  - code: EU-PAR-FR01-RK01
    role: gpu
    max_kw: 20
  - code: EU-PAR-FR01-RK02
    role: gpu
    max_kw: 20
  - code: EU-PAR-FR01-RK03
    role: compute
    max_kw: 8
  - code: EU-PAR-FR01-RK04
    role: compute
    max_kw: 8
  - code: EU-PAR-FR01-RK05
    role: compute
    max_kw: 8
  - code: EU-PAR-FR01-RK06
    role: compute
    max_kw: 8
  - code: EU-PAR-FR01-RK07
    role: storage
    max_kw: 6
  - code: EU-PAR-FR01-RK08
    role: storage
    max_kw: 6
```

3. Commit & push; open an MR/PR; CI runs.

### Step 2 — Power & Cooling YAML

Create `facility/power_and_cooling/eu-par-fr01-power.yaml`:

```yaml
power:
  ups_topology: N+1
  generator:
    enabled: true
    autonomy_hours: 8
  per_rack_pdu:
    type: "intelligent"
    metered: true
    switched: true

cooling:
  type: in-row
  free_cooling: true
  gpu_rack_density_kw: 20
  cpu_rack_density_kw: 8

telemetry:
  sensors:
    - rack_inlet_temp
    - rack_exhaust_temp
    - room_temp
    - room_humidity
    - pdu_power_kw
  exports:
    - name: rack_pdu_power_kw
      protocol: snmp
      prometheus:
        metric: dc_pdu_power_kw
```

Commit & push on the same branch.

### Step 3 — Policy Gate & Validation

Assume a policy that checks:

* Rack codes match `<SITE>-RK<rr>`.
* `max_kw` does not exceed `gpu_rack_density_kw` or `cpu_rack_density_kw` for the rack's role.
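A minimal rego sketch of the density check, assuming the rack layout and the cooling section are merged into one input document (an analogous rule would cover `compute` racks against `cpu_rack_density_kw`):

```rego
package rack_power

deny[msg] {
  rack := input.racks[_]
  rack.role == "gpu"
  rack.max_kw > input.cooling.gpu_rack_density_kw
  msg := sprintf("rack %v max_kw (%v) exceeds gpu_rack_density_kw (%v)",
    [rack.code, rack.max_kw, input.cooling.gpu_rack_density_kw])
}
```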
**Optional training mistake:**

* Set `max_kw: 25` on a GPU rack.
* CI will fail with “rack max_kw (25) exceeds gpu_rack_density_kw (20)”.

Fix and re-run the pipeline.

### Day 2 Observable Outcomes

* The facility model for EU-PAR-FR01 racks, power, and cooling lives in Git.
* Policy gates verify:

  * Rack codes
  * Power density constraints

### Day 2 Definition of Done

* Racks & power are **fully codified**.
* Sustainability-related telemetry fields (power metrics) are defined.
* Team understands how facility YAML ties into later PUE/KPI observability.
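The tie-in is simple arithmetic: PUE is total facility power divided by IT equipment power, which is why the PDU power metrics matter. A tiny shell sketch with illustrative numbers (the 112 kW utility-feed figure is made up; 80 kW matches the manifest's `it_load_kw`):

```bash
#!/usr/bin/env bash
# PUE = total facility power / IT equipment power.
# 112 kW at the utility feed is an illustrative number;
# 80 kW of IT load matches the site manifest's it_load_kw.
total_facility_kw=112
it_load_kw=80

pue=$(awk "BEGIN { printf \"%.2f\", $total_facility_kw / $it_load_kw }")
echo "PUE: $pue"   # exactly at the pue_max target of 1.4
```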
---

## Day 3 — Stage 2: Network & Out-of-Band Bring-Up

### Learning Objectives

* Model the underlay network (leaf/spine, VRFs, VLANs) with Terraform.
* Define the OOB management network and inventories via Ansible.
* Understand the **narrow, documented bootstrap** for OOB devices.

### Inputs & Pre-Reqs

* Days 1-2 merged.
* Network design (ASNs, subnets, VRFs).
* Access to physical network devices (offline/powered but reachable via console).

### Step 1 — Network Terraform Skeleton

**Repo:** `infra-foundation`
Branch: `feat/eu-par-fr01-network`

Create `network/terraform/eu-par-fr01/main.tf`:

```hcl
module "leaf_spine" {
  source    = "../modules/leaf_spine"
  site_code = "EU-PAR-FR01"

  vrfs = [
    "INFRA_MGMT",
    "TENANT",
    "STORAGE",
    "OUT_OF_BAND",
  ]

  # Example inputs - adjust to your design
  mgmt_cidr    = "10.10.0.0/24"
  tenant_cidr  = "10.20.0.0/16"
  storage_cidr = "10.30.0.0/24"
  oob_cidr     = "10.99.0.0/24"
}
```

Add `network/terraform/eu-par-fr01/variables.tf` and `outputs.tf` as needed.

### Step 2 — OOB Inventory & Bootstrap Runbook

Create `baremetal/profiles/oob-switches.yaml`:

```yaml
oob_switches:
  - name: EU-PAR-FR01-RK01-oob01
    mgmt_ip: 10.99.0.10
    rack: EU-PAR-FR01-RK01
    role: oob
```

Create a **bootstrap runbook** (Markdown), `facility/bootstrap/eu-par-fr01.md`:

* List the **explicit allowed manual actions**:

  * Connect a serial console to the OOB switch/firewall.
  * Set the initial mgmt IP, subnet, and default gateway.
  * Set initial credentials/SSH keys.
* Immediately codify these settings in Terraform/Ansible.

### Step 3 — Bootstrap Actions (One-Time, Documented)

**Physical/console actions (allowed):**

* For each OOB device:

  * Set the mgmt IP from `oob_cidr` (`10.99.0.0/24`).
  * Configure credentials to match the Ansible inventory.

**Important:**
After this step, **no more manual config** on network devices. Everything else is via pipelines.

### Step 4 — CI & Policy Gates

* Run `terraform validate` via `lint_and_unit`.
* Run `policy_gates` to check:

  * No overlapping CIDRs between VRFs.
  * The OOB network is distinct from TENANT/STORAGE/INFRA_MGMT.
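A sketch of the overlap check using OPA's `net.cidr_intersects` builtin, assuming the Terraform inputs are exported to the policy as a flat map of VRF name → CIDR (the `input.cidrs` shape is an assumption):

```rego
package network_cidrs

# Assumes input.cidrs = {"mgmt": "10.10.0.0/24", "tenant": "10.20.0.0/16", ...}
deny[msg] {
  some a, b
  input.cidrs[a]
  input.cidrs[b]
  a < b   # compare each unordered pair once
  net.cidr_intersects(input.cidrs[a], input.cidrs[b])
  msg := sprintf("CIDR overlap between %v (%v) and %v (%v)",
    [a, input.cidrs[a], b, input.cidrs[b]])
}
```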
**Optional mistake:**
Set `oob_cidr = "10.20.0.0/24"` (inside the TENANT range).
The policy gate fails → fix.

### Day 3 Observable Outcomes

* The Terraform underlay definition for EU-PAR-FR01 exists.
* The OOB inventory & initial bootstrap runbook are committed.
* Network policies catch invalid CIDR overlaps.

### Day 3 Definition of Done

* OOB devices are reachable via the mgmt IPs defined in YAML.
* Leaf/spine config can be generated and applied **by pipeline**.
* The bootstrap runbook clearly documents the **only** manual config allowed.

---

## Day 4 — Stage 3: Bare-Metal & Hypervisor Provisioning

### Learning Objectives

* Drive server discovery and OS install via MAAS (or similar) and Ansible.
* Understand how `mgmt01` is used as an automation anchor.
* Handle failed provisioning via **pipeline-driven rollback**.

### Inputs & Pre-Reqs

* Network connectivity (OOB + mgmt) is in place.
* Management node `EU-PAR-FR01-RK01-mgmt01` exists (OS can be installed manually once, or pre-provisioned).

### Step 1 — Define Bare-Metal Profiles

**Repo:** `infra-foundation`
Branch: `feat/eu-par-fr01-baremetal`

Create `baremetal/profiles/compute-standard.yaml`:

```yaml
profile:
  name: compute-standard
  cpu: "2x32"
  ram_gb: 512
  role: "k8s-node"
  tags:
    - "site:EU-PAR-FR01"
    - "country:FR"
```

`baremetal/profiles/compute-gpu.yaml`:

```yaml
profile:
  name: compute-gpu
  cpu: "2x32"
  ram_gb: 768
  gpus: 4
  role: "k8s-gpu-node"
  tags:
    - "site:EU-PAR-FR01"
    - "country:FR"
    - "gpu:true"
```

### Step 2 — Hypervisor Ansible Playbook

`hypervisor/ansible/proxmox_cluster.yml`:

```yaml
- hosts: eu-par-fr01-proxmox
  become: true
  roles:
    - proxmox-base
    - proxmox-cluster-join
```

The inventory could be generated from MAAS hostnames.
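For illustration, such a generated inventory might look like the following (hostnames and IPs are hypothetical, following the site's rack naming scheme):

```yaml
# Hypothetical generated Ansible inventory
eu-par-fr01-proxmox:
  hosts:
    eu-par-fr01-rk03-cmp01:
      ansible_host: 10.10.0.31
    eu-par-fr01-rk04-cmp01:
      ansible_host: 10.10.0.32
```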
### Step 3 — Pipeline-Driven Commissioning

* `site_rollout` job script in CI:

```bash
# scripts/apply_site.sh
set -e
SITE_CODE="$1"   # EU-PAR-FR01
./scripts/maas_commission_nodes.sh "$SITE_CODE"
ansible-playbook hypervisor/ansible/proxmox_cluster.yml -l "$SITE_CODE"
```
Run:

* From an MR or manually in CI: `site_rollout EU-PAR-FR01`.

### Optional Failure Scenario (Rollback Training)

* Mistake: a profile references the wrong image channel (e.g., `hwe-20.04` not available).
* Commissioning fails; `integration_test` checks the kernel version and fails.
* Fix the YAML to a supported image; re-run `site_rollout`.
* If needed, a script reverts to the last known-good image.

### Day 4 Observable Outcomes

* MAAS lists all servers with correct tags (site, role, etc.).
* The Proxmox (or chosen hypervisor) cluster is installed via Ansible.

### Day 4 Definition of Done

* Adding a new server and powering it on can be driven to an OS-installed state **only via** the `site_rollout` pipeline.
* No direct SSH configuration is required.

---

## Day 5 — Stage 4: Platform Bootstrap (First K8s Control Plane)

### Learning Objectives

* Define the management K8s cluster as YAML.
* Bootstrap the GitOps controller and use it as the **only** way to change cluster state.
* Understand bootstrap vs steady-state behavior.

### Inputs & Pre-Reqs

* Hypervisor cluster or dedicated K8s nodes available.
* `EU-PAR-FR01-RK01-mgmt01` has access to Git and CI.

### Step 1 — Cluster Definition

**Repo:** `platform-clusters`
Branch: `feat/eu-par-fr01-mgmt-cluster`

`k8s/clusters/eu-par-fr01/mgmt-cluster.yaml`:

```yaml
cluster:
  name: eu-par-fr01-mgmt
  site: EU-PAR-FR01
  control_plane_nodes: 3
  node_pool_profile: compute-standard
  networking:
    cni: cilium
  sovereign_overlays:
    country: FR
    region: EU_EEA
```

### Step 2 — GitOps Application Definition

For Argo CD (example), `k8s/clusters/eu-par-fr01/gitops-app.yaml`:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: eu-par-fr01-mgmt
spec:
  project: default
  source:
    repoURL: 'ssh://git@your-git/platform-clusters.git'
    path: 'k8s/clusters/eu-par-fr01'
    targetRevision: main
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: argocd
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

### Step 3 — Bootstrap Playbook (Run via Pipeline)

Create a script/playbook in `platform-clusters` that:

* Installs K8s binaries on the designated nodes.
* Deploys Argo CD (or Flux).
* Applies `gitops-app.yaml`.

CI job (e.g., `site_rollout` for platform):

```bash
./scripts/bootstrap_k8s_cluster.sh eu-par-fr01-mgmt
```

**Important:**
Even this bootstrap script is **versioned** and run from CI, not executed manually from someone’s laptop.

### Step 4 — Integration Tests

* `integration_test` job:

  * Runs `kubectl get nodes` and `kubectl get pods -A`.
  * Makes a simple HTTP probe to the API server.
### Day 5 Observable Outcomes

* The management cluster `eu-par-fr01-mgmt` exists and is healthy.
* The GitOps tool is deployed and syncing from `platform-clusters/k8s/clusters/eu-par-fr01/`.
* Any change in that path is reflected in the cluster via GitOps.

### Day 5 Definition of Done

* The management cluster can be **re-created** from scratch using:

  * Git repos
  * CI jobs
  * Bootstrap scripts (no manual `kubectl apply` from laptops).

---

## Day 6 — Stage 5: Compliance & Telemetry Validation

### Learning Objectives

* Enforce data residency & RBAC via policy-as-code.
* Wire telemetry for platform + facility (PUE input metrics).
* See policy failures in CI and fix them.

### Inputs & Pre-Reqs

* K8s mgmt cluster running and managed via GitOps.
* Prometheus/Grafana charts available to deploy.

### Step 1 — Policies in `policies-and-compliance`

Branch: `feat/eu-par-fr01-policies`

Create `opa-policies/data_residency.rego` (simplified):

```rego
package data_residency

deny[msg] {
  input.kind == "BackupPolicy"
  input.metadata.labels["data_classification"] == "CRITICAL_SOVEREIGN_FR"
  input.spec.target.region != "fr-central"
  msg := sprintf("critical FR data must backup to fr-central, got %v", [input.spec.target.region])
}
```

Create `opa-policies/rbac.rego`:

```rego
package rbac

deny[msg] {
  input.kind == "ClusterRoleBinding"
  input.metadata.name == "cluster-admin"
  s := input.subjects[_]
  s.kind == "User"
  not endswith(s.name, "@sovereign-ops.fr")
  msg := "cluster-admin bindings must target sovereign-ops.fr principals only"
}
```
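These rules can themselves be unit-tested in `lint_and_unit` with `opa test`; a minimal sketch (the file name `opa-policies/data_residency_test.rego` is an assumption):

```rego
package data_residency

test_deny_wrong_backup_region {
  deny[_] with input as {
    "kind": "BackupPolicy",
    "metadata": {"labels": {"data_classification": "CRITICAL_SOVEREIGN_FR"}},
    "spec": {"target": {"region": "eu-west"}}
  }
}
```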
### Step 2 — Monitoring Stack in `platform-clusters`

Branch: `feat/eu-par-fr01-observability`

`addons/monitoring-logging-security/prometheus/values.yaml`:

```yaml
extraScrapeConfigs:
  - job_name: "pdu-power"
    static_configs:
      - targets: ["eu-par-fr01-pdu-metrics:9100"]
    metrics_path: /metrics
```

Plus the typical K8s scrape configs (apiserver, nodes, etc.).

### Step 3 — CI Integration

* Update `scripts/run_opa.sh` to run OPA/Conftest against rendered manifests (K8s & backup policies).
* Ensure `policy_gates` uses the new rego files.

### Optional Lab Mistakes

1. A backup policy with the wrong region (as above).
2. A `ClusterRoleBinding` with an external email domain.

Observe the CI failures, fix the YAML, and re-run.

### Day 6 Observable Outcomes

* Policies catch non-compliant backup regions and admin bindings.
* Prometheus scrapes PDU metrics; PUE calculation is possible.
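The PUE calculation can be expressed as a Prometheus recording rule, assuming a utility-feed metric also exists alongside the PDU export from Day 2 (`dc_facility_power_kw` is a hypothetical metric name; `dc_pdu_power_kw` comes from the telemetry YAML):

```yaml
groups:
  - name: sustainability
    rules:
      - record: dc:pue:ratio
        # PUE = total facility power / IT power.
        # dc_facility_power_kw is a hypothetical utility-feed metric;
        # dc_pdu_power_kw is the Day 2 telemetry export.
        expr: |
          sum(dc_facility_power_kw{site="EU-PAR-FR01"})
          /
          sum(dc_pdu_power_kw{site="EU-PAR-FR01"})
```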
### Day 6 Definition of Done

* Compliance and telemetry for EU-PAR-FR01 are **enforced via pipelines**, not by manual review alone.
* The team can confidently interpret and resolve policy failures.

---

## Day 7 — Stage 6 & 7: Workload Onboarding & Scale-Out Pattern

### Learning Objectives

* Onboard the initial AI/ML + SaaS workloads in sovereign-compliant namespaces.
* Reflect on how EU-PAR-FR01 patterns will be cloned to other sites.

### Inputs & Pre-Reqs

* Mgmt cluster ready; policies and monitoring active.

### Step 1 — Namespaces & Storage Classes

**Repo:** `platform-clusters`
Branch: `feat/eu-par-fr01-workloads`

Create `k8s/clusters/eu-par-fr01/namespaces/fr-critical-sovereign-ai.yaml`:

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: fr-critical-sovereign-ai
  labels:
    data_classification: CRITICAL_SOVEREIGN_FR
    country: FR
```

StorageClass:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fr-critical-sovereign-sc
provisioner: "ceph.rbd"
parameters:
  pool: "gpu-block-fr-local"
```

### Step 2 — Example AI/ML Job

`k8s/clusters/eu-par-fr01/workloads/ai-ml/job.yaml`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: fr-ai-training-job
  namespace: fr-critical-sovereign-ai
  labels:
    data_classification: CRITICAL_SOVEREIGN_FR
spec:
  template:
    spec:
      nodeSelector:
        gpu: "true"
      restartPolicy: Never
      containers:
        - name: trainer
          image: registry.local/ai/trainer:latest
          resources:
            limits:
              nvidia.com/gpu: 1
          volumeMounts:
            - name: training-data
              mountPath: /data
      volumes:
        - name: training-data
          persistentVolumeClaim:
            claimName: fr-critical-sovereign-ai-pvc
```

### Step 3 — CI & GitOps Rollout

* Push the branch, open an MR, and run:

  * `lint_and_unit`
  * `policy_gates`
  * `integration_test`
  * `site_rollout` (applies via GitOps)

**Optional lab mistake:**

* Use the storage pool `gpu-block-eu`, which replicates outside FR.
* The `data_residency` policy denies it, the pipeline fails → fix to `gpu-block-fr-local`.

### Step 4 — Scale-Out Pattern (Preview for Later Work)

* Copy the EU-PAR-FR01 manifests into a new path `k8s/clusters/eu-fra-de01/` with the site and country changed.
* Ensure all site-specific values are parameterized; confirm policy gates also handle the DE site.
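"Parameterized" might look like a small per-site values file consumed by your templating tool of choice (the file path, keys, and DE values below are illustrative, not prescribed by this manual):

```yaml
# Hypothetical per-site values file, e.g. k8s/clusters/eu-fra-de01/site-values.yaml
site:
  code: EU-FRA-DE01
  country: DE
  region: EU_EEA
storage:
  sovereign_pool: gpu-block-de-local
gitops:
  path: k8s/clusters/eu-fra-de01
```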
### Day 7 Observable Outcomes

* The critical sovereign workload runs in an FR-only namespace and storage.
* Policy gates prevent cross-border misconfigurations.
* The initial pattern for **cloning EU-PAR-FR01** to another site is established.

### Day 7 Definition of Done

* EU-PAR-FR01 can host **real workloads**, compliant with data residency and RBAC rules.
* A repeatable pattern exists for future sites with minimal changes.

---

## D1 Wrap-Up: What You Achieve After Following All Days

After you run **Days 1-7** as above:

* EU-PAR-FR01 is modeled end-to-end in Git:

  * Facility, network, bare metal, hypervisor, K8s, policies, observability.
* The first control plane is running and GitOps-managed.
* Sovereign and sustainability constraints are **built-in**, not retrofitted.
* You have a **repeatable, teachable journey** that maps exactly to the D1 requirements and sets the stage for D2-D6.