Add Micro–DC/docs/training/staging_environment.txt

2025-12-05 13:11:11 +00:00
parent d220774716
commit fe31d875a4


@@ -0,0 +1,173 @@
<details>
<summary>Staging environment - minimum requirements to run the full D1-D6 course</summary>
## 1. Scope & Isolation
- **Dedicated staging / lab environment** — no shared production clusters or shared infra.
- **Full control over**:
- Git repos (`infra-foundation`, `platform-clusters`, `policies-and-compliance`)
- CI/CD pipelines
- Cluster lifecycle (create, destroy, rebuild)
- **Safe to break**: we must be allowed to cause failures (bad configs, SLO breaches) without impacting real users.
---
## 2. Hardware / Compute Baseline
You can run the course in two modes:
### A. Full-fidelity bare-metal lab (preferred)
- **Servers**
- 6-8 physical servers, x86_64, with IPMI/BMC:
- 2x GPU nodes (ideally 4 GPUs each, at minimum 1 GPU per node)
- 3-4x general compute nodes
- 1-2x storage/monitoring/infra nodes
- 256-512 GB RAM per GPU node (or more if available)
- 128-256 GB RAM per general compute node
- **Storage**
- Local NVMe/SSD on each server
- 3+ nodes with extra disks to simulate a **Ceph** cluster (or equivalent)
- **Power + environment**
- At least 1 intelligent PDU or a **simulated power metrics source** to feed PUE metrics
### B. Virtual-only lab (reduced footprint)
If you don't have bare metal:
- 1-2 large hypervisors:
- 32+ vCPUs
- 256+ GB RAM
- Enough disk to run:
- 1 mgmt K8s cluster (3 control-plane nodes + workers)
- 1 “GPU” cluster (GPUs can be emulated with node labels if none are available; see the sketch below)
- 1 monitoring stack (Prometheus + Grafana)
- Local or network storage that can host a small Ceph or equivalent cluster for the residency labs.
> For sovereignty & GPU labs, **real hardware is ideal** but you can still run the course logically with virtualized / emulated resources.
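To make the emulated-GPU option concrete, here is a minimal sketch: label a worker node (the node name and label key are arbitrary choices for this lab) and steer “GPU” workloads onto it with a plain `nodeSelector` instead of a real GPU resource request.
```yaml
# Hypothetical: mark a node as an emulated GPU node, e.g.
#   kubectl label node lab-worker-01 gpu=emulated
# then schedule "GPU" workloads onto it via a nodeSelector.
apiVersion: v1
kind: Pod
metadata:
  name: fake-gpu-training-job
spec:
  nodeSelector:
    gpu: emulated          # label stands in for a real GPU resource
  containers:
    - name: trainer
      image: busybox:1.36
      command: ["sh", "-c", "echo 'pretending to train' && sleep 3600"]
```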
---
## 3. Network & Connectivity
- **Topology**
- At least:
- 1x “ToR”/leaf switch (or virtual equivalent)
- 1x OOB management network (even if simulated as a separate VLAN)
- **Segmentation**
- Separate logical networks (VLANs or VRFs), sketched at the end of this section, for:
- `INFRA_MGMT`
- `TENANT`
- `STORAGE`
- `OUT_OF_BAND`
- **Remote access**
- VPN or secure jump host into the lab environment
- Engineers can reach:
- Git
- CI
- K8s API
- MAAS/Proxmox/VM platform UI (for *observation*, not config)
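As a sketch of the segmentation above, a netplan-style config with hypothetical VLAN IDs and addressing; only the four network roles come from the list, everything else is an assumption to adapt to your switch or virtual fabric.
```yaml
# Hypothetical VLAN IDs and subnets; the role mapping is in the comments.
network:
  version: 2
  ethernets:
    eno1:
      dhcp4: false
  vlans:
    vlan110:                 # INFRA_MGMT
      id: 110
      link: eno1
      addresses: [10.10.110.10/24]
    vlan120:                 # TENANT
      id: 120
      link: eno1
      addresses: [10.10.120.10/24]
    vlan130:                 # STORAGE
      id: 130
      link: eno1
      addresses: [10.10.130.10/24]
    vlan140:                 # OUT_OF_BAND (often a physically separate NIC/switch)
      id: 140
      link: eno1
      addresses: [10.10.140.10/24]
```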
---
## 4. Platform & Tooling Stack
- **OS baseline**
- Modern Linux distro on all servers/VMs (e.g. Ubuntu 20.04/22.04 or equivalent)
- **Bare metal / virtualisation**
- One of:
- MAAS (or equivalent) for discovery/commissioning, **or**
- A virtualisation stack where you can script VM creation (Proxmox, vSphere, etc.)
- **Kubernetes**
- At least one **management cluster** playing the EU-PAR-FR01 “role”
- Optionally a separate “GPU/tenant” cluster
- **GitOps**
- Argo CD or Flux installed in the mgmt cluster and managed via Git (see the sketch at the end of this section)
- **Storage**
- Either:
- Small Ceph cluster, **or**
- Another block/object storage system with clear per-pool/zone locality for the residency labs
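For the GitOps item above, a minimal Argo CD `Application` sketch; the repo URL, path, and namespaces are placeholders for whatever layout you choose in `platform-clusters`.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-clusters-mgmt
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.lab.example/platform/platform-clusters.git  # hypothetical lab Git URL
    targetRevision: main
    path: clusters/eu-par-fr01/mgmt        # hypothetical repo layout
  destination:
    server: https://kubernetes.default.svc # the mgmt cluster itself
    namespace: platform-addons
  syncPolicy:
    automated:
      prune: true      # keep cluster state converged on Git
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```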
---
## 5. Git & CI/CD
- **Git platform**
- GitLab, GitHub, Gitea, or similar with:
- Repo-level CI/CD
- Branch protections
- **Repos created**
- `infra-foundation`
- `platform-clusters`
- `policies-and-compliance`
- **CI runners**
- Runners with:
- `terraform`, `ansible`, `kubectl`, `helm` (or equivalent)
- `opa`/`conftest` for policy-as-code
- `yamllint`, `kubeconform` or similar
- **Pipelines**
- Standard stages configured (sketched below):
- `lint_and_unit`
- `policy_gates`
- `integration_test`
- `promotion_to_template`
- `site_rollout`
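A skeleton `.gitlab-ci.yml` wiring up those stages; the runner image and script contents are placeholders, only the stage names are fixed by the course.
```yaml
# Hypothetical CI skeleton. The image is a placeholder for anything that
# ships yamllint, kubeconform, conftest, terraform and ansible.
default:
  image: registry.lab.example/ci/platform-tools:latest

stages:
  - lint_and_unit
  - policy_gates
  - integration_test
  - promotion_to_template
  - site_rollout

lint_and_unit:
  stage: lint_and_unit
  script:
    - yamllint .
    - kubeconform -summary -ignore-missing-schemas manifests/

policy_gates:
  stage: policy_gates
  script:
    - conftest test manifests/ --policy policies/

# integration_test, promotion_to_template and site_rollout follow the same
# pattern and are left out of this sketch.
```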
---
## 6. Observability Stack
- **Prometheus**
- Scrape:
- K8s components & nodes
- DCIM/PDU power metrics (or simulated)
- GPU scheduler/job metrics (or simulated)
- **Grafana**
- Deployed via `platform-clusters` (dashboards as JSON in Git)
- **Alerting**
- Alertmanager or equivalent for:
- Control-plane SLO alerts
- GPU wait-time SLO alerts
- PUE SLO alerts
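As one example of these SLO alerts, a Prometheus rule sketch for the PUE target; the metric names and the 1.5 threshold are assumptions that depend on your (real or simulated) power metrics source, and the control-plane and GPU wait-time alerts follow the same pattern.
```yaml
groups:
  - name: training-slo-alerts
    rules:
      - alert: PUEAboveTarget
        # Hypothetical series: total facility power divided by IT power.
        expr: sum(dc_facility_power_watts) / sum(dc_it_power_watts) > 1.5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "PUE has been above the 1.5 training target for 30 minutes"
```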
---
## 7. Identity, Access & Security
- **Identity provider** (even a simple one) for:
- Differentiating sovereign ops groups from others (e.g. `@sovereign-ops.fr` accounts)
- **RBAC**
- K8s configured to use groups from the IdP (or fake groups in the lab); see the binding sketch at the end of this section
- **Access model**
- Engineers can access:
- Git & pipelines
- Read-only UIs for infra/platform
- Direct SSH/root access to infra reserved for:
- Lab facilitators
- Documented bootstrap/drift-fix exercises
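A minimal RBAC sketch for the group-based access model, assuming the IdP (or a fake lab group) delivers a `sovereign-ops.fr` group claim and read-only access maps to the built-in `view` role.
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: sovereign-ops-readonly
subjects:
  - kind: Group
    name: sovereign-ops.fr            # group name as delivered by the IdP (assumption)
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view                          # built-in read-only ClusterRole
  apiGroup: rbac.authorization.k8s.io
```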
---
## 8. Data & Safety Constraints
- **No real personal / production data.**
- Use synthetic or anonymized datasets for AI/ML and SaaS workloads.
- **Sovereignty simulation**
- Regions (e.g. `fr-central`, `eu-central-1`) can be logical; the key is:
- Storage classes & backup targets are labelled and policy-checked as if real.
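For example, a residency-labelled storage class might look like the sketch below; the label key, pool name, and provisioner are assumptions, and a real Ceph CSI class needs additional cluster and secret parameters.
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-fr-central
  labels:
    residency.lab/region: fr-central   # hypothetical label that the policy gates check
provisioner: rbd.csi.ceph.com          # Ceph CSI RBD; swap for your storage backend
parameters:
  pool: fr-central-rbd                 # pool kept in the "fr-central" logical region
  # Real Ceph CSI classes also need clusterID and CSI secret references.
reclaimPolicy: Delete
allowVolumeExpansion: true
```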
---
## 9. Operational Capabilities (to make the course repeatable)
- Ability to:
- **Reset** the environment or a subset (e.g. destroy/recreate clusters) on demand
- Tag and restore specific Git states (e.g. `v-training-d1`, `v-training-d6`)
- Run **Game Day / replay drills** without impacting other labs
If these minimums are in place, you can realistically complete the **full D1-D6 training**:
- from bare-metal/virtual “nothing”
- to a GitOps-managed, sovereign-aware platform
- with SLOs, policies, and `zero_manual_provisioning` verified by drills.
</details>