From fe31d875a43c7722dcee4b316686e7b3d224833b Mon Sep 17 00:00:00 2001
From: sbanszky
Date: Fri, 5 Dec 2025 13:11:11 +0000
Subject: [PATCH] Add Micro–DC/docs/training/staging_environment.txt
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 .../docs/training/staging_environment.txt | 173 ++++++++++++++++++
 1 file changed, 173 insertions(+)
 create mode 100644 Micro–DC/docs/training/staging_environment.txt

diff --git a/Micro–DC/docs/training/staging_environment.txt b/Micro–DC/docs/training/staging_environment.txt
new file mode 100644
index 0000000..58e00c2
--- /dev/null
+++ b/Micro–DC/docs/training/staging_environment.txt
@@ -0,0 +1,173 @@
# Staging environment - minimum requirements to run the full D1-D6 course

## 1. Scope & Isolation

- **Dedicated staging / lab environment**: no shared production clusters or shared infra.
- **Full control over**:
  - Git repos (`infra-foundation`, `platform-clusters`, `policies-and-compliance`)
  - CI/CD pipelines
  - Cluster lifecycle (create, destroy, rebuild)
- **Safe to break**: we must be allowed to cause failures (bad configs, SLO breaches) without impacting real users.

---

## 2. Hardware / Compute Baseline

You can run the course in two modes:

### A. Full-fidelity bare-metal lab (preferred)

- **Servers**
  - 6-8 physical servers, x86_64, with IPMI/BMC:
    - 2x GPU nodes (4 GPUs each, or at least 1 per node)
    - 3-4x general compute nodes
    - 1-2x storage/monitoring/infra nodes
  - 256-512 GB RAM per GPU node (or more if available)
  - 128-256 GB RAM per general compute node
- **Storage**
  - Local NVMe/SSD on each server
  - 3+ nodes with extra disks to simulate a **Ceph** cluster (or equivalent)
- **Power + environment**
  - At least 1 intelligent PDU or a **simulated power metrics source** to feed PUE metrics

### B. Virtual-only lab (reduced footprint)

If you don't have bare metal:

- 1-2 large hypervisors:
  - 32+ vCPUs
  - 256+ GB RAM
  - Enough disk to run:
    - 1 mgmt K8s cluster (3 control-plane nodes + workers)
    - 1 “GPU” cluster (GPUs can be emulated with node labels if no real GPUs are available)
    - 1 monitoring stack (Prometheus + Grafana)
- Local or network storage that can host a small Ceph (or equivalent) cluster for the residency labs.

> For the sovereignty & GPU labs, **real hardware is ideal**, but you can still run the course logically with virtualized / emulated resources.

---

## 3. Network & Connectivity

- **Topology**
  - At least:
    - 1x “ToR”/leaf switch (or virtual equivalent)
    - 1x OOB management network (even if simulated as a separate VLAN)
- **Segmentation**
  - Separate logical networks (VLANs or VRFs) for:
    - `INFRA_MGMT`
    - `TENANT`
    - `STORAGE`
    - `OUT_OF_BAND`
- **Remote access**
  - VPN or secure jump host into the lab environment
  - Engineers can reach:
    - Git
    - CI
    - K8s API
    - MAAS/Proxmox/VM platform UI (for *observation*, not config)

---

## 4. Platform & Tooling Stack

- **OS baseline**
  - Modern Linux distro on all servers/VMs (e.g. Ubuntu 20.04/22.04 or equivalent)
- **Bare metal / virtualisation**
  - One of:
    - MAAS (or equivalent) for discovery/commissioning, **or**
    - A virtualisation stack where you can script VM creation (Proxmox, vSphere, etc.)
- **Kubernetes**
  - At least one **management cluster** in the EU-PAR-FR01 “role”
  - Optionally a separate “GPU/tenant” cluster
- **GitOps**
  - Argo CD or Flux installed in the mgmt cluster and managed via Git
- **Storage**
  - Either:
    - A small Ceph cluster, **or**
    - Another block/object storage system with clear per-pool/zone locality for the residency labs

---

## 5. Git & CI/CD

- **Git platform**
  - GitLab, GitHub, Gitea, or similar with:
    - Repo-level CI/CD
    - Branch protections
- **Repos created**
  - `infra-foundation`
  - `platform-clusters`
  - `policies-and-compliance`
- **CI runners**
  - Runners with:
    - `terraform`, `ansible`, `kubectl`, `helm` (or equivalent)
    - `opa`/`conftest` for policy-as-code
    - `yamllint`, `kubeconform`, or similar
- **Pipelines**
  - Standard stages configured (see the sketch after this list):
    - `lint_and_unit`
    - `policy_gates`
    - `integration_test`
    - `promotion_to_template`
    - `site_rollout`
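The sketch below shows one way these five stages could be wired up, assuming a GitLab CI runner that has the tools listed above. Only the stage names come from this document; the job names, container images, repo paths, and the `promote_to_template.sh` / `site_rollout.sh` helpers are illustrative placeholders, not part of the course repos.

```yaml
# .gitlab-ci.yml - minimal sketch only; images, paths and scripts are assumptions.
stages:
  - lint_and_unit
  - policy_gates
  - integration_test
  - promotion_to_template
  - site_rollout

lint:
  stage: lint_and_unit
  script:
    # Basic hygiene checks on YAML and rendered Kubernetes manifests.
    - yamllint .
    - kubeconform -summary -ignore-missing-schemas manifests/

policy_check:
  stage: policy_gates
  script:
    # Policy-as-code gate: evaluate manifests against the policies-and-compliance rules.
    - conftest test manifests/ --policy policies/

integration:
  stage: integration_test
  script:
    # Plan/validate only in this stage; nothing is applied to the lab yet.
    - terraform init
    - terraform plan -out=tfplan
    - ansible-playbook --syntax-check site.yml

promote:
  stage: promotion_to_template
  when: manual
  script:
    - ./scripts/promote_to_template.sh   # hypothetical helper

rollout:
  stage: site_rollout
  when: manual
  environment: staging
  script:
    - ./scripts/site_rollout.sh          # hypothetical helper
```

In a GitOps setup the `site_rollout` job would typically only update the Git state that Argo CD or Flux reconciles, rather than applying manifests directly from the runner.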

---

## 6. Observability Stack

- **Prometheus**
  - Scrape:
    - K8s components & nodes
    - DCIM/PDU power metrics (or simulated)
    - GPU scheduler/job metrics (or simulated)
- **Grafana**
  - Deployed via `platform-clusters` (dashboards as JSON in Git)
- **Alerting**
  - Alertmanager or equivalent for:
    - Control-plane SLO alerts
    - GPU wait-time SLO alerts
    - PUE SLO alerts
  - An example rule sketch for these appears at the end of this document.

---

## 7. Identity, Access & Security

- **Identity provider** (even a simple one) for:
  - Differentiating sovereign ops groups from others (e.g. `@sovereign-ops.fr` accounts)
- **RBAC**
  - K8s configured to use groups from the IdP (or mock groups in the lab)
- **Access model**
  - Engineers can access:
    - Git & pipelines
    - Read-only UIs for infra/platform
  - Direct SSH/root access to infra reserved for:
    - Lab facilitators
    - Documented bootstrap/drift-fix exercises

---

## 8. Data & Safety Constraints

- **No real personal / production data.**
  - Use synthetic or anonymized datasets for AI/ML and SaaS workloads.
- **Sovereignty simulation**
  - Regions (e.g. `fr-central`, `eu-central-1`) can be purely logical; the key is:
    - Storage classes & backup targets are labelled and policy-checked as if they were real.

---

## 9. Operational Capabilities (to make the course repeatable)

- Ability to:
  - **Reset** the environment or a subset (e.g. destroy/recreate clusters) on demand
  - Tag and restore specific Git states (e.g. `v-training-d1`, `v-training-d6`)
  - Run **Game Day / replay drills** without impacting other labs

If these minimums are in place, you can realistically complete the **full D1-D6 training**:
- from bare-metal/virtual “nothing”
- to a GitOps-managed, sovereign-aware platform
- with SLOs, policies, and zero_manual_provisioning verified by drills.
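As a concrete illustration of the three SLO alert classes from section 6, here is a minimal `PrometheusRule` sketch that could live in `platform-clusters` and be reconciled by Argo CD/Flux, assuming the Prometheus Operator CRDs are present in the monitoring stack. The metrics `gpu_job_wait_seconds` and `dc_pue_ratio`, the namespace, and every threshold are lab assumptions; only `apiserver_request_duration_seconds_bucket` is a standard API server metric.

```yaml
# training-slo-alerts.yaml - illustrative only; tune expressions and thresholds
# to whatever your (possibly simulated) exporters actually emit.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: training-slo-alerts
  namespace: monitoring
spec:
  groups:
    - name: control-plane-slo
      rules:
        - alert: ControlPlaneLatencyHigh
          expr: |
            histogram_quantile(0.99,
              sum(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) by (le)
            ) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "API server p99 latency above 1s (illustrative threshold)"
    - name: gpu-slo
      rules:
        - alert: GpuJobWaitTimeHigh
          # gpu_job_wait_seconds is a hypothetical metric exported by the lab's
          # GPU scheduler (or its simulator), not a standard exporter metric.
          expr: avg_over_time(gpu_job_wait_seconds[15m]) > 600
          for: 15m
          labels:
            severity: warning
          annotations:
            summary: "Average GPU job wait time above 10 minutes"
    - name: pue-slo
      rules:
        - alert: PueAboveTarget
          # dc_pue_ratio is a hypothetical gauge fed from the PDU/DCIM (or simulated) source.
          expr: avg_over_time(dc_pue_ratio[1h]) > 1.5
          for: 1h
          labels:
            severity: info
          annotations:
            summary: "Site PUE above the 1.5 training target (illustrative)"
```

Thresholds should be set so that the Game Day / replay drills in section 9 can deliberately breach them without paging anyone outside the lab.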