Add Micro–DC/docs/training/staging_environment.txt
<details>
<summary>Staging environment - minimum requirements to run the full D1-D6 course</summary>

## 1. Scope & Isolation

- **Dedicated staging / lab environment** — no shared production clusters or shared infra.
- **Full control over**:
  - Git repos (`infra-foundation`, `platform-clusters`, `policies-and-compliance`)
  - CI/CD pipelines
  - Cluster lifecycle (create, destroy, rebuild)
- **Safe to break**: we must be allowed to cause failures (bad configs, SLO breaches) without impacting real users.

---

## 2. Hardware / Compute Baseline

You can run the course in two modes:

### A. Full-fidelity bare-metal lab (preferred)

- **Servers**
  - 6-8 physical servers, x86_64, with IPMI/BMC (see the reachability sketch after this list):
    - 2x GPU nodes (4 GPUs each, or at least 1 per node)
    - 3-4x general compute nodes
    - 1-2x storage/monitoring/infra nodes
  - 256-512 GB RAM per GPU node (or more if available)
  - 128-256 GB RAM per general compute node
- **Storage**
  - Local NVMe/SSD on each server
  - 3+ nodes with extra disks to simulate a **Ceph** cluster (or equivalent)
- **Power + environment**
  - At least 1 intelligent PDU or a **simulated power metrics source** to feed PUE metrics

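Before D1 it is worth confirming that the out-of-band plumbing actually works. The following is a minimal sketch, assuming `ipmitool` is installed on the jump host and a hypothetical `bmc_hosts.txt` file lists one BMC address per line; it simply checks that every BMC answers a power-status query.

```python
#!/usr/bin/env python3
"""Check that every lab BMC answers an IPMI query (sketch, not production code)."""
import os
import subprocess
import sys

BMC_LIST = "bmc_hosts.txt"                          # hypothetical: one BMC IP/hostname per line
IPMI_USER = os.environ.get("IPMI_USER", "admin")    # lab credentials, assumed to be set
IPMI_PASS = os.environ.get("IPMI_PASS", "changeme")

def bmc_reachable(host: str) -> bool:
    """Return True if `ipmitool chassis power status` succeeds against the BMC."""
    cmd = [
        "ipmitool", "-I", "lanplus",
        "-H", host, "-U", IPMI_USER, "-P", IPMI_PASS,
        "chassis", "power", "status",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=15)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0

def main() -> int:
    with open(BMC_LIST) as f:
        hosts = [line.strip() for line in f if line.strip()]
    failures = []
    for host in hosts:
        ok = bmc_reachable(host)
        print(f"{host}: {'OK' if ok else 'UNREACHABLE'}")
        if not ok:
            failures.append(host)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```
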
### B. Virtual-only lab (reduced footprint)

If you don't have bare metal:

- 1-2 large hypervisors:
  - 32+ vCPUs
  - 256+ GB RAM
  - Enough disk to run:
    - 1 mgmt K8s cluster (3 control-plane nodes + workers)
    - 1 “GPU” cluster (GPUs can be emulated with node labels if none are available; see the sketch after this list)
    - 1 monitoring stack (Prometheus + Grafana)
- Local or network storage that can host a small Ceph or equivalent cluster for the residency labs.

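Where no physical GPUs exist, the “GPU” cluster can still be made to look schedulable by labelling a few workers, so later labs have something to select on. A minimal sketch using the kubernetes Python client; the `node-role.micro-dc.io/gpu` and `gpu.micro-dc.io/emulated` label keys are illustrative only, not something the course prescribes.

```python
"""Label selected worker nodes as emulated GPU nodes (sketch)."""
from kubernetes import client, config

# Hypothetical label keys used only for this illustration.
GPU_ROLE_LABEL = "node-role.micro-dc.io/gpu"
EMULATED_LABEL = "gpu.micro-dc.io/emulated"

def label_as_emulated_gpu(node_names: list[str]) -> None:
    config.load_kube_config()          # uses your current kubeconfig context
    v1 = client.CoreV1Api()
    patch = {"metadata": {"labels": {GPU_ROLE_LABEL: "true", EMULATED_LABEL: "true"}}}
    for name in node_names:
        v1.patch_node(name, patch)     # merge-patches the node's labels
        print(f"labelled {name} as an emulated GPU node")

if __name__ == "__main__":
    # Example: pretend two workers of the tenant cluster carry GPUs.
    label_as_emulated_gpu(["worker-01", "worker-02"])
```
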
> For sovereignty & GPU labs, **real hardware is ideal**, but you can still run the course logically with virtualized / emulated resources.

---

## 3. Network & Connectivity

- **Topology**
  - At least:
    - 1x “ToR”/leaf switch (or virtual equivalent)
    - 1x OOB management network (even if simulated as a separate VLAN)
- **Segmentation**
  - Separate logical networks (VLANs or VRFs) for:
    - `INFRA_MGMT`
    - `TENANT`
    - `STORAGE`
    - `OUT_OF_BAND`
- **Remote access**
  - VPN or secure jump host into the lab environment
  - Engineers can reach:
    - Git
    - CI
    - K8s API
    - MAAS/Proxmox/VM platform UI (for *observation*, not config)

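Keeping the addressing plan for these networks in code makes lab rebuilds less error-prone. A minimal sketch, using only the Python standard library, that checks the four logical networks do not overlap; the subnets shown are invented lab values, not prescribed by the course.

```python
"""Sanity-check that the lab's logical networks do not overlap (sketch)."""
import ipaddress
from itertools import combinations

# Example addressing plan; the subnets are illustrative lab values only.
NETWORKS = {
    "INFRA_MGMT":  "10.10.0.0/24",
    "TENANT":      "10.20.0.0/22",
    "STORAGE":     "10.30.0.0/24",
    "OUT_OF_BAND": "10.99.0.0/24",
}

def check_no_overlap(networks: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the plan is consistent."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    problems = []
    for (a, net_a), (b, net_b) in combinations(parsed.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} ({net_a}) overlaps {b} ({net_b})")
    return problems

if __name__ == "__main__":
    issues = check_no_overlap(NETWORKS)
    for issue in issues:
        print("ERROR:", issue)
    print("OK: no overlaps" if not issues else f"{len(issues)} problem(s) found")
```
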
---

## 4. Platform & Tooling Stack

- **OS baseline**
  - Modern Linux distro on all servers/VMs (e.g. Ubuntu 20.04/22.04 or equivalent)
- **Bare metal / virtualisation**
  - One of:
    - MAAS (or equivalent) for discovery/commissioning, **or**
    - A virtualisation stack where you can script VM creation (Proxmox, vSphere, etc.)
- **Kubernetes**
  - At least one **management cluster** in the EU-PAR-FR01 “role”
  - Optionally a separate “GPU/tenant” cluster
- **GitOps**
  - Argo CD or Flux installed in the mgmt cluster and managed via Git
- **Storage**
  - Either:
    - Small Ceph cluster, **or**
    - Another block/object storage with clear per-pool/zone locality for residency labs

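A useful smoke test for the GitOps requirement is that every Argo CD Application points at one of the three course repos rather than an ad hoc source. A minimal sketch with the kubernetes Python client, assuming Argo CD is the chosen tool; a Flux setup would inspect GitRepository sources instead, and Applications using the newer multi-source `spec.sources` field are not covered here.

```python
"""List Argo CD Applications and flag any whose source repo is not a course repo (sketch)."""
from kubernetes import client, config

# The three repos created for the course; matching is by name substring here.
ALLOWED_REPOS = (
    "infra-foundation",
    "platform-clusters",
    "policies-and-compliance",
)

def check_applications() -> None:
    config.load_kube_config()
    api = client.CustomObjectsApi()
    apps = api.list_cluster_custom_object(
        group="argoproj.io", version="v1alpha1", plural="applications"
    )
    for app in apps.get("items", []):
        name = app["metadata"]["name"]
        repo = app.get("spec", {}).get("source", {}).get("repoURL", "")
        ok = any(allowed in repo for allowed in ALLOWED_REPOS)
        print(f"{'OK  ' if ok else 'WARN'} {name}: {repo or '<no repoURL>'}")

if __name__ == "__main__":
    check_applications()
```
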
---

## 5. Git & CI/CD

- **Git platform**
  - GitLab, GitHub, Gitea, or similar with:
    - Repo-level CI/CD
    - Branch protections
- **Repos created**
  - `infra-foundation`
  - `platform-clusters`
  - `policies-and-compliance`
- **CI runners**
  - Runners with:
    - `terraform`, `ansible`, `kubectl`, `helm` (or equivalent)
    - `opa`/`conftest` for policy-as-code
    - `yamllint`, `kubeconform` or similar
- **Pipelines**
  - Standard stages configured:
    - `lint_and_unit`
    - `policy_gates`
    - `integration_test`
    - `promotion_to_template`
    - `site_rollout`

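Engineers will run the `lint_and_unit` and `policy_gates` stages constantly, so being able to reproduce them locally before pushing saves pipeline round-trips. A minimal sketch that chains the tools listed above and stops at the first failure; the `manifests/` and `policies/` paths are placeholders for wherever your repos actually keep them.

```python
"""Run the lint_and_unit and policy_gates checks locally, failing fast (sketch)."""
import subprocess
import sys

# Placeholder paths; adjust to the layout of your infra-foundation / platform-clusters repos.
CHECKS = [
    ("lint_and_unit", ["yamllint", "."]),
    ("lint_and_unit", ["kubeconform", "-summary", "manifests/"]),
    ("policy_gates",  ["conftest", "test", "--policy", "policies/", "manifests/"]),
]

def main() -> int:
    for stage, cmd in CHECKS:
        print(f"[{stage}] $ {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"[{stage}] FAILED: {' '.join(cmd)}")
            return result.returncode
    print("all local checks passed")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```
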
---

## 6. Observability Stack

- **Prometheus**
  - Scrape:
    - K8s components & nodes
    - DCIM/PDU power metrics (or simulated)
    - GPU scheduler/job metrics (or simulated)
- **Grafana**
  - Deployed via `platform-clusters` (dashboards as JSON in Git)
- **Alerting**
  - Alertmanager or equivalent for:
    - Control-plane SLO alerts
    - GPU wait-time SLO alerts
    - PUE SLO alerts

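If no intelligent PDU is available, the power feed for the PUE dashboards and alerts can come from a small simulated exporter that Prometheus scrapes like any other target. A minimal sketch using the `prometheus_client` library; the metric names and wattage ranges are invented lab values.

```python
"""Serve simulated facility/IT power metrics so Prometheus can compute a PUE (sketch)."""
import random
import time

from prometheus_client import Gauge, start_http_server

# Invented metric names for the lab; align them with whatever your dashboards expect.
FACILITY_POWER = Gauge("lab_facility_power_watts", "Simulated total facility power draw")
IT_POWER = Gauge("lab_it_power_watts", "Simulated IT equipment power draw")

def main() -> None:
    start_http_server(9105)                    # scrape target: http://<host>:9105/metrics
    while True:
        it = random.uniform(8000, 12000)       # IT load wanders around 10 kW
        overhead = random.uniform(1.3, 1.6)    # cooling/overhead factor, i.e. PUE ~1.3-1.6
        IT_POWER.set(it)
        FACILITY_POWER.set(it * overhead)
        time.sleep(15)

if __name__ == "__main__":
    main()
```

A PUE series is then simply the ratio `lab_facility_power_watts / lab_it_power_watts`, which can back the PUE SLO alert listed above.
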
---

## 7. Identity, Access & Security

- **Identity provider** (even a simple one) for:
  - Differentiating sovereign ops groups from others (e.g. `@sovereign-ops.fr` accounts)
- **RBAC**
  - K8s configured to use groups from the IdP (or fake groups in the lab)
- **Access model**
  - Engineers can access:
    - Git & pipelines
    - Read-only UIs for infra/platform
  - Direct SSH/root access to infra reserved for:
    - Lab facilitators
    - Documented bootstrap/drift-fix exercises

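One quick drill for the access model is auditing who holds `cluster-admin` and flagging any subject outside the sovereign ops group. A minimal sketch with the kubernetes Python client; the `sovereign-ops` group name mirrors the example accounts above and is illustrative only.

```python
"""Flag cluster-admin bindings whose subjects are outside the sovereign ops group (sketch)."""
from kubernetes import client, config

ALLOWED_GROUPS = {"sovereign-ops"}     # illustrative IdP group name

def audit_cluster_admin() -> None:
    config.load_kube_config()
    rbac = client.RbacAuthorizationV1Api()
    for binding in rbac.list_cluster_role_binding().items:
        if binding.role_ref.name != "cluster-admin":
            continue
        for subject in binding.subjects or []:
            if subject.kind == "Group" and subject.name in ALLOWED_GROUPS:
                continue
            print(f"REVIEW: {binding.metadata.name} grants cluster-admin to "
                  f"{subject.kind}/{subject.name}")

if __name__ == "__main__":
    audit_cluster_admin()
```
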
---

## 8. Data & Safety Constraints

- **No real personal / production data.**
  - Use synthetic or anonymized datasets for AI/ML and SaaS workloads.
- **Sovereignty simulation**
  - Regions (e.g. `fr-central`, `eu-central-1`) can be logical; the key is:
    - Storage classes & backup targets are labelled and policy-checked as if they were real.

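Because the regions are logical, the residency labs hinge on labels being present and policy-checked, not on physical placement. A minimal sketch that verifies every StorageClass carries an allowed region label; the `micro-dc.io/region` key is invented for this example, and in the course the real check would normally live in `policies-and-compliance` as policy-as-code.

```python
"""Check that every StorageClass declares an allowed residency region label (sketch)."""
from kubernetes import client, config

REGION_LABEL = "micro-dc.io/region"          # hypothetical label key
ALLOWED_REGIONS = {"fr-central", "eu-central-1"}

def check_storage_classes() -> int:
    """Print violations and return how many were found."""
    config.load_kube_config()
    storage = client.StorageV1Api()
    violations = 0
    for sc in storage.list_storage_class().items:
        labels = sc.metadata.labels or {}
        region = labels.get(REGION_LABEL)
        if region not in ALLOWED_REGIONS:
            violations += 1
            print(f"VIOLATION: StorageClass {sc.metadata.name} has region={region!r}")
    return violations

if __name__ == "__main__":
    raise SystemExit(1 if check_storage_classes() else 0)
```
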
---

## 9. Operational Capabilities (to make the course repeatable)

- Ability to:
  - **Reset** the environment or a subset (e.g. destroy/recreate clusters) on demand
  - Tag and restore specific Git states (e.g. `v-training-d1`, `v-training-d6`)
  - Run **Game Day / replay drills** without impacting other labs

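Tagging each repo at a known-good end-of-day state makes resets cheap: restoring a lab becomes a checkout plus a cluster rebuild. A minimal sketch of a snapshot/restore helper for a single repo, following the `v-training-dN` tag convention above; run it from inside the repo concerned.

```python
"""Tag or restore a repo's training state, e.g. v-training-d1 .. v-training-d6 (sketch)."""
import subprocess
import sys

def git(*args: str) -> None:
    subprocess.run(["git", *args], check=True)

def snapshot(day: str) -> None:
    """Tag the current commit as the reference state for a training day."""
    git("tag", "-f", f"v-training-{day}")
    git("push", "--force", "origin", f"v-training-{day}")

def restore(day: str) -> None:
    """Move the working tree back to a training day's tagged state on a reset branch."""
    git("fetch", "--tags", "origin")
    git("checkout", "-B", f"reset-{day}", f"v-training-{day}")

if __name__ == "__main__":
    action, day = sys.argv[1], sys.argv[2]       # e.g. `snapshot d1` or `restore d6`
    {"snapshot": snapshot, "restore": restore}[action](day)
```
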
If these minimums are in place, you can realistically complete the **full D1-D6 training**:

- from bare-metal/virtual “nothing”
- to a GitOps-managed, sovereign-aware platform
- with SLOs, policies, and `zero_manual_provisioning` verified by drills.

</details>