Add Micro–DC/docs/training/staging_environment.txt

2025-12-05 13:11:11 +00:00
parent d220774716
commit fe31d875a4


@@ -0,0 +1,173 @@
<details>
<summary>Staging environment - minimum requirements to run the full D1-D6 course</summary>
## 1. Scope & Isolation
- **Dedicated staging / lab environment** — no shared production clusters or shared infra.
- **Full control over**:
- Git repos (`infra-foundation`, `platform-clusters`, `policies-and-compliance`)
- CI/CD pipelines
- Cluster lifecycle (create, destroy, rebuild)
- **Safe to break**: we must be allowed to cause failures (bad configs, SLO breaches) without impacting real users.
---
## 2. Hardware / Compute Baseline
You can run the course in two modes:
### A. Full-fidelity bare-metal lab (preferred)
- **Servers**
- 6-8 physical servers, x86_64, with IPMI/BMC:
- 2x GPU nodes (ideally 4 GPUs each, at minimum 1 GPU per node)
- 3-4x general compute nodes
- 1-2x storage/monitoring/infra nodes
- 256-512 GB RAM per GPU node (or more if available)
- 128-256 GB RAM per general compute node
- **Storage**
- Local NVMe/SSD on each server
- 3+ nodes with extra disks to simulate a **Ceph** cluster (or equivalent)
- **Power + environment**
- At least 1 intelligent PDU or a **simulated power metrics source** to feed PUE metrics
### B. Virtual-only lab (reduced footprint)
If you don't have bare metal:
- 1-2 large hypervisors:
- 32+ vCPUs
- 256+ GB RAM
- Enough disk to run:
- 1 mgmt K8s cluster (3 control-plane nodes + workers)
- 1 “GPU” cluster (GPUs can be emulated with node labels if none are available; see the sketch below)
- 1 monitoring stack (Prometheus + Grafana)
- Local or network storage that can host a small Ceph or equivalent cluster for the residency labs.
> For sovereignty & GPU labs, **real hardware is ideal** but you can still run the course logically with virtualized / emulated resources.
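To make the emulated-GPU option concrete, here is a minimal sketch: label a worker node (the node name and label key are arbitrary choices for this lab) and steer “GPU” workloads onto it with a plain `nodeSelector` instead of a real GPU resource request.
```yaml
# Hypothetical: mark a node as an emulated GPU node, e.g.
#   kubectl label node lab-worker-01 gpu=emulated
# then schedule "GPU" workloads onto it via a nodeSelector.
apiVersion: v1
kind: Pod
metadata:
  name: fake-gpu-training-job
spec:
  nodeSelector:
    gpu: emulated          # label stands in for a real GPU resource
  containers:
    - name: trainer
      image: busybox:1.36
      command: ["sh", "-c", "echo 'pretending to train' && sleep 3600"]
```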
---
## 3. Network & Connectivity
- **Topology**
- At least:
- 1x “ToR”/leaf switch (or virtual equivalent)
- 1x OOB management network (even if simulated as a separate VLAN)
- **Segmentation**
- Separate logical networks (VLANs or VRFs), sketched at the end of this section, for:
- `INFRA_MGMT`
- `TENANT`
- `STORAGE`
- `OUT_OF_BAND`
- **Remote access**
- VPN or secure jump host into the lab environment
- Engineers can reach:
- Git
- CI
- K8s API
- MAAS/Proxmox/VM platform UI (for *observation*, not config)
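As a sketch of the segmentation above, a netplan-style config with hypothetical VLAN IDs and addressing; only the four network roles come from the list, everything else is an assumption to adapt to your switch or virtual fabric.
```yaml
# Hypothetical VLAN IDs and subnets; the role mapping is in the comments.
network:
  version: 2
  ethernets:
    eno1:
      dhcp4: false
  vlans:
    vlan110:                 # INFRA_MGMT
      id: 110
      link: eno1
      addresses: [10.10.110.10/24]
    vlan120:                 # TENANT
      id: 120
      link: eno1
      addresses: [10.10.120.10/24]
    vlan130:                 # STORAGE
      id: 130
      link: eno1
      addresses: [10.10.130.10/24]
    vlan140:                 # OUT_OF_BAND (often a physically separate NIC/switch)
      id: 140
      link: eno1
      addresses: [10.10.140.10/24]
```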
---
## 4. Platform & Tooling Stack
- **OS baseline**
- Modern Linux distro on all servers/VMs (e.g. Ubuntu 20.04/22.04 or equivalent)
- **Bare metal / virtualisation**
- One of:
- MAAS (or equivalent) for discovery/commissioning, **or**
- A virtualisation stack where you can script VM creation (Proxmox, vSphere, etc.)
- **Kubernetes**
- At least one **management cluster** playing the EU-PAR-FR01 “role”
- Optionally a separate “GPU/tenant” cluster
- **GitOps**
- Argo CD or Flux installed in the mgmt cluster and managed via Git (see the sketch at the end of this section)
- **Storage**
- Either:
- Small Ceph cluster, **or**
- Another block/object storage system with clear per-pool/zone locality for the residency labs
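For the GitOps item above, a minimal Argo CD `Application` sketch; the repo URL, path, and namespaces are placeholders for whatever layout you choose in `platform-clusters`.
```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-clusters-mgmt
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.lab.example/platform/platform-clusters.git  # hypothetical lab Git URL
    targetRevision: main
    path: clusters/eu-par-fr01/mgmt        # hypothetical repo layout
  destination:
    server: https://kubernetes.default.svc # the mgmt cluster itself
    namespace: platform-addons
  syncPolicy:
    automated:
      prune: true      # keep cluster state converged on Git
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```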
---
## 5. Git & CI/CD
- **Git platform**
- GitLab, GitHub, Gitea, or similar with:
- Repo-level CI/CD
- Branch protections
- **Repos created**
- `infra-foundation`
- `platform-clusters`
- `policies-and-compliance`
- **CI runners**
- Runners with:
- `terraform`, `ansible`, `kubectl`, `helm` (or equivalent)
- `opa`/`conftest` for policy-as-code
- `yamllint`, `kubeconform` or similar
- **Pipelines**
- Standard stages configured (sketched below):
- `lint_and_unit`
- `policy_gates`
- `integration_test`
- `promotion_to_template`
- `site_rollout`
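A skeleton `.gitlab-ci.yml` wiring up those stages; the runner image and script contents are placeholders, only the stage names are fixed by the course.
```yaml
# Hypothetical CI skeleton. The image is a placeholder for anything that
# ships yamllint, kubeconform, conftest, terraform and ansible.
default:
  image: registry.lab.example/ci/platform-tools:latest

stages:
  - lint_and_unit
  - policy_gates
  - integration_test
  - promotion_to_template
  - site_rollout

lint_and_unit:
  stage: lint_and_unit
  script:
    - yamllint .
    - kubeconform -summary -ignore-missing-schemas manifests/

policy_gates:
  stage: policy_gates
  script:
    - conftest test manifests/ --policy policies/

# integration_test, promotion_to_template and site_rollout follow the same
# pattern and are left out of this sketch.
```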
---
## 6. Observability Stack
- **Prometheus**
- Scrape:
- K8s components & nodes
- DCIM/PDU power metrics (or simulated)
- GPU scheduler/job metrics (or simulated)
- **Grafana**
- Deployed via `platform-clusters` (dashboards as JSON in Git)
- **Alerting**
- Alertmanager or equivalent for:
- Control-plane SLO alerts
- GPU wait-time SLO alerts
- PUE SLO alerts
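As one example of these SLO alerts, a Prometheus rule sketch for the PUE target; the metric names and the 1.5 threshold are assumptions that depend on your (real or simulated) power metrics source, and the control-plane and GPU wait-time alerts follow the same pattern.
```yaml
groups:
  - name: training-slo-alerts
    rules:
      - alert: PUEAboveTarget
        # Hypothetical series: total facility power divided by IT power.
        expr: sum(dc_facility_power_watts) / sum(dc_it_power_watts) > 1.5
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "PUE has been above the 1.5 training target for 30 minutes"
```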
---
## 7. Identity, Access & Security
- **Identity provider** (even a simple one) for:
- Differentiating sovereign ops groups from others (e.g. `@sovereign-ops.fr` accounts)
- **RBAC**
- K8s configured to use groups from the IdP (or fake groups in the lab); see the binding sketch at the end of this section
- **Access model**
- Engineers can access:
- Git & pipelines
- Read-only UIs for infra/platform
- Direct SSH/root access to infra reserved for:
- Lab facilitators
- Documented bootstrap/drift-fix exercises
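A minimal RBAC sketch for the group-based access model, assuming the IdP (or a fake lab group) delivers a `sovereign-ops.fr` group claim and read-only access maps to the built-in `view` role.
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: sovereign-ops-readonly
subjects:
  - kind: Group
    name: sovereign-ops.fr            # group name as delivered by the IdP (assumption)
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: view                          # built-in read-only ClusterRole
  apiGroup: rbac.authorization.k8s.io
```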
---
## 8. Data & Safety Constraints
- **No real personal / production data.**
- Use synthetic or anonymized datasets for AI/ML and SaaS workloads.
- **Sovereignty simulation**
- Regions (e.g. `fr-central`, `eu-central-1`) can be logical; the key is:
- Storage classes & backup targets are labelled and policy-checked as if real.
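For example, a residency-labelled storage class might look like the sketch below; the label key, pool name, and provisioner are assumptions, and a real Ceph CSI class needs additional cluster and secret parameters.
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: ceph-block-fr-central
  labels:
    residency.lab/region: fr-central   # hypothetical label that the policy gates check
provisioner: rbd.csi.ceph.com          # Ceph CSI RBD; swap for your storage backend
parameters:
  pool: fr-central-rbd                 # pool kept in the "fr-central" logical region
  # Real Ceph CSI classes also need clusterID and CSI secret references.
reclaimPolicy: Delete
allowVolumeExpansion: true
```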
---
## 9. Operational Capabilities (to make the course repeatable)
- Ability to:
- **Reset** the environment or a subset (e.g. destroy/recreate clusters) on demand
- Tag and restore specific Git states (e.g. `v-training-d1`, `v-training-d6`)
- Run **Game Day / replay drills** without impacting other labs
If these minimums are in place, you can realistically complete the **full D1-D6 training**:
- from bare-metal/virtual “nothing”
- to a GitOps-managed, sovereign-aware platform
- with SLOs, policies, and `zero_manual_provisioning` verified by drills.
</details>