Add Micro–DC/docs/training/staging_environment.txt
<details>
<summary>Staging environment - minimum requirements to run the full D1-D6 course</summary>

## 1. Scope & Isolation

- **Dedicated staging / lab environment** — no shared production clusters or shared infra.
- **Full control over**:
  - Git repos (`infra-foundation`, `platform-clusters`, `policies-and-compliance`)
  - CI/CD pipelines
  - Cluster lifecycle (create, destroy, rebuild)
- **Safe to break**: we must be allowed to cause failures (bad configs, SLO breaches) without impacting real users.

---

## 2. Hardware / Compute Baseline

You can run the course in two modes:

### A. Full-fidelity bare-metal lab (preferred)

- **Servers**
  - 6-8 physical servers, x86_64, with IPMI/BMC (see the reachability sketch after this list):
    - 2x GPU nodes (4 GPUs each, or at least 1 per node)
    - 3-4x general compute nodes
    - 1-2x storage/monitoring/infra nodes
  - 256-512 GB RAM per GPU node (or more if available)
  - 128-256 GB RAM per general compute node
- **Storage**
  - Local NVMe/SSD on each server
  - 3+ nodes with extra disks to simulate a **Ceph** cluster (or equivalent)
- **Power + environment**
  - At least 1 intelligent PDU or a **simulated power metrics source** to feed PUE metrics

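Before D1 it is worth confirming that the out-of-band plumbing actually works. The following is a minimal sketch, assuming `ipmitool` is installed on the jump host and a hypothetical `bmc_hosts.txt` file lists one BMC address per line; it simply checks that every BMC answers a power-status query.

```python
#!/usr/bin/env python3
"""Check that every lab BMC answers an IPMI query (sketch, not production code)."""
import os
import subprocess
import sys

BMC_LIST = "bmc_hosts.txt"                          # hypothetical: one BMC IP/hostname per line
IPMI_USER = os.environ.get("IPMI_USER", "admin")    # lab credentials, assumed to be set
IPMI_PASS = os.environ.get("IPMI_PASS", "changeme")

def bmc_reachable(host: str) -> bool:
    """Return True if `ipmitool chassis power status` succeeds against the BMC."""
    cmd = [
        "ipmitool", "-I", "lanplus",
        "-H", host, "-U", IPMI_USER, "-P", IPMI_PASS,
        "chassis", "power", "status",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=15)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0

def main() -> int:
    with open(BMC_LIST) as f:
        hosts = [line.strip() for line in f if line.strip()]
    failures = []
    for host in hosts:
        ok = bmc_reachable(host)
        print(f"{host}: {'OK' if ok else 'UNREACHABLE'}")
        if not ok:
            failures.append(host)
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```
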
### B. Virtual-only lab (reduced footprint)

If you don't have bare metal:

- 1-2 large hypervisors:
  - 32+ vCPUs
  - 256+ GB RAM
  - Enough disk to run:
    - 1 mgmt K8s cluster (3 control-plane nodes + workers)
    - 1 “GPU” cluster (GPUs can be emulated with node labels if none are available; see the sketch after this list)
    - 1 monitoring stack (Prometheus + Grafana)
- Local or network storage that can host a small Ceph or equivalent cluster for the residency labs.

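Where no physical GPUs exist, the “GPU” cluster can still be made to look schedulable by labelling a few workers, so later labs have something to select on. A minimal sketch using the kubernetes Python client; the `node-role.micro-dc.io/gpu` and `gpu.micro-dc.io/emulated` label keys are illustrative only, not something the course prescribes.

```python
"""Label selected worker nodes as emulated GPU nodes (sketch)."""
from kubernetes import client, config

# Hypothetical label keys used only for this illustration.
GPU_ROLE_LABEL = "node-role.micro-dc.io/gpu"
EMULATED_LABEL = "gpu.micro-dc.io/emulated"

def label_as_emulated_gpu(node_names: list[str]) -> None:
    config.load_kube_config()          # uses your current kubeconfig context
    v1 = client.CoreV1Api()
    patch = {"metadata": {"labels": {GPU_ROLE_LABEL: "true", EMULATED_LABEL: "true"}}}
    for name in node_names:
        v1.patch_node(name, patch)     # merge-patches the node's labels
        print(f"labelled {name} as an emulated GPU node")

if __name__ == "__main__":
    # Example: pretend two workers of the tenant cluster carry GPUs.
    label_as_emulated_gpu(["worker-01", "worker-02"])
```
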
> For sovereignty & GPU labs, **real hardware is ideal**, but you can still run the course logically with virtualized / emulated resources.

---

## 3. Network & Connectivity

- **Topology**
  - At least:
    - 1x “ToR”/leaf switch (or virtual equivalent)
    - 1x OOB management network (even if simulated as a separate VLAN)
- **Segmentation**
  - Separate logical networks (VLANs or VRFs) for:
    - `INFRA_MGMT`
    - `TENANT`
    - `STORAGE`
    - `OUT_OF_BAND`
- **Remote access**
  - VPN or secure jump host into the lab environment
  - Engineers can reach:
    - Git
    - CI
    - K8s API
    - MAAS/Proxmox/VM platform UI (for *observation*, not config)

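Keeping the addressing plan for these networks in code makes lab rebuilds less error-prone. A minimal sketch, using only the Python standard library, that checks the four logical networks do not overlap; the subnets shown are invented lab values, not prescribed by the course.

```python
"""Sanity-check that the lab's logical networks do not overlap (sketch)."""
import ipaddress
from itertools import combinations

# Example addressing plan; the subnets are illustrative lab values only.
NETWORKS = {
    "INFRA_MGMT":  "10.10.0.0/24",
    "TENANT":      "10.20.0.0/22",
    "STORAGE":     "10.30.0.0/24",
    "OUT_OF_BAND": "10.99.0.0/24",
}

def check_no_overlap(networks: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the plan is consistent."""
    parsed = {name: ipaddress.ip_network(cidr) for name, cidr in networks.items()}
    problems = []
    for (a, net_a), (b, net_b) in combinations(parsed.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{a} ({net_a}) overlaps {b} ({net_b})")
    return problems

if __name__ == "__main__":
    issues = check_no_overlap(NETWORKS)
    for issue in issues:
        print("ERROR:", issue)
    print("OK: no overlaps" if not issues else f"{len(issues)} problem(s) found")
```
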
---

## 4. Platform & Tooling Stack

- **OS baseline**
  - Modern Linux distro on all servers/VMs (e.g. Ubuntu 20.04/22.04 or equivalent)
- **Bare metal / virtualisation**
  - One of:
    - MAAS (or equivalent) for discovery/commissioning, **or**
    - A virtualisation stack where you can script VM creation (Proxmox, vSphere, etc.)
- **Kubernetes**
  - At least one **management cluster** in the EU-PAR-FR01 “role”
  - Optionally a separate “GPU/tenant” cluster
- **GitOps**
  - Argo CD or Flux installed in the mgmt cluster and managed via Git
- **Storage**
  - Either:
    - Small Ceph cluster, **or**
    - Another block/object storage with clear per-pool/zone locality for residency labs

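A useful smoke test for the GitOps requirement is that every Argo CD Application points at one of the three course repos rather than an ad hoc source. A minimal sketch with the kubernetes Python client, assuming Argo CD is the chosen tool; a Flux setup would inspect GitRepository sources instead, and Applications using the newer multi-source `spec.sources` field are not covered here.

```python
"""List Argo CD Applications and flag any whose source repo is not a course repo (sketch)."""
from kubernetes import client, config

# The three repos created for the course; matching is by name substring here.
ALLOWED_REPOS = (
    "infra-foundation",
    "platform-clusters",
    "policies-and-compliance",
)

def check_applications() -> None:
    config.load_kube_config()
    api = client.CustomObjectsApi()
    apps = api.list_cluster_custom_object(
        group="argoproj.io", version="v1alpha1", plural="applications"
    )
    for app in apps.get("items", []):
        name = app["metadata"]["name"]
        repo = app.get("spec", {}).get("source", {}).get("repoURL", "")
        ok = any(allowed in repo for allowed in ALLOWED_REPOS)
        print(f"{'OK  ' if ok else 'WARN'} {name}: {repo or '<no repoURL>'}")

if __name__ == "__main__":
    check_applications()
```
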
---

## 5. Git & CI/CD

- **Git platform**
  - GitLab, GitHub, Gitea, or similar with:
    - Repo-level CI/CD
    - Branch protections
- **Repos created**
  - `infra-foundation`
  - `platform-clusters`
  - `policies-and-compliance`
- **CI runners**
  - Runners with:
    - `terraform`, `ansible`, `kubectl`, `helm` (or equivalent)
    - `opa`/`conftest` for policy-as-code
    - `yamllint`, `kubeconform` or similar
- **Pipelines**
  - Standard stages configured:
    - `lint_and_unit`
    - `policy_gates`
    - `integration_test`
    - `promotion_to_template`
    - `site_rollout`

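Engineers will run the `lint_and_unit` and `policy_gates` stages constantly, so being able to reproduce them locally before pushing saves pipeline round-trips. A minimal sketch that chains the tools listed above and stops at the first failure; the `manifests/` and `policies/` paths are placeholders for wherever your repos actually keep them.

```python
"""Run the lint_and_unit and policy_gates checks locally, failing fast (sketch)."""
import subprocess
import sys

# Placeholder paths; adjust to the layout of your infra-foundation / platform-clusters repos.
CHECKS = [
    ("lint_and_unit", ["yamllint", "."]),
    ("lint_and_unit", ["kubeconform", "-summary", "manifests/"]),
    ("policy_gates",  ["conftest", "test", "--policy", "policies/", "manifests/"]),
]

def main() -> int:
    for stage, cmd in CHECKS:
        print(f"[{stage}] $ {' '.join(cmd)}")
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(f"[{stage}] FAILED: {' '.join(cmd)}")
            return result.returncode
    print("all local checks passed")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```
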
---

## 6. Observability Stack

- **Prometheus**
  - Scrape:
    - K8s components & nodes
    - DCIM/PDU power metrics (or simulated)
    - GPU scheduler/job metrics (or simulated)
- **Grafana**
  - Deployed via `platform-clusters` (dashboards as JSON in Git)
- **Alerting**
  - Alertmanager or equivalent for:
    - Control-plane SLO alerts
    - GPU wait-time SLO alerts
    - PUE SLO alerts

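If no intelligent PDU is available, the power feed for the PUE dashboards and alerts can come from a small simulated exporter that Prometheus scrapes like any other target. A minimal sketch using the `prometheus_client` library; the metric names and wattage ranges are invented lab values.

```python
"""Serve simulated facility/IT power metrics so Prometheus can compute a PUE (sketch)."""
import random
import time

from prometheus_client import Gauge, start_http_server

# Invented metric names for the lab; align them with whatever your dashboards expect.
FACILITY_POWER = Gauge("lab_facility_power_watts", "Simulated total facility power draw")
IT_POWER = Gauge("lab_it_power_watts", "Simulated IT equipment power draw")

def main() -> None:
    start_http_server(9105)                    # scrape target: http://<host>:9105/metrics
    while True:
        it = random.uniform(8000, 12000)       # IT load wanders around 10 kW
        overhead = random.uniform(1.3, 1.6)    # cooling/overhead factor, i.e. PUE ~1.3-1.6
        IT_POWER.set(it)
        FACILITY_POWER.set(it * overhead)
        time.sleep(15)

if __name__ == "__main__":
    main()
```

A PUE series is then simply the ratio `lab_facility_power_watts / lab_it_power_watts`, which can back the PUE SLO alert listed above.
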
---

## 7. Identity, Access & Security

- **Identity provider** (even a simple one) for:
  - Differentiating sovereign ops groups from others (e.g. `@sovereign-ops.fr` accounts)
- **RBAC**
  - K8s configured to use groups from the IdP (or fake groups in the lab)
- **Access model**
  - Engineers can access:
    - Git & pipelines
    - Read-only UIs for infra/platform
  - Direct SSH/root access to infra reserved for:
    - Lab facilitators
    - Documented bootstrap/drift-fix exercises

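One quick drill for the access model is auditing who holds `cluster-admin` and flagging any subject outside the sovereign ops group. A minimal sketch with the kubernetes Python client; the `sovereign-ops` group name mirrors the example accounts above and is illustrative only.

```python
"""Flag cluster-admin bindings whose subjects are outside the sovereign ops group (sketch)."""
from kubernetes import client, config

ALLOWED_GROUPS = {"sovereign-ops"}     # illustrative IdP group name

def audit_cluster_admin() -> None:
    config.load_kube_config()
    rbac = client.RbacAuthorizationV1Api()
    for binding in rbac.list_cluster_role_binding().items:
        if binding.role_ref.name != "cluster-admin":
            continue
        for subject in binding.subjects or []:
            if subject.kind == "Group" and subject.name in ALLOWED_GROUPS:
                continue
            print(f"REVIEW: {binding.metadata.name} grants cluster-admin to "
                  f"{subject.kind}/{subject.name}")

if __name__ == "__main__":
    audit_cluster_admin()
```
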
---

## 8. Data & Safety Constraints

- **No real personal / production data.**
  - Use synthetic or anonymized datasets for AI/ML and SaaS workloads.
- **Sovereignty simulation**
  - Regions (e.g. `fr-central`, `eu-central-1`) can be logical; the key is:
    - Storage classes & backup targets are labelled and policy-checked as if they were real.

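Because the regions are logical, the residency labs hinge on labels being present and policy-checked, not on physical placement. A minimal sketch that verifies every StorageClass carries an allowed region label; the `micro-dc.io/region` key is invented for this example, and in the course the real check would normally live in `policies-and-compliance` as policy-as-code.

```python
"""Check that every StorageClass declares an allowed residency region label (sketch)."""
from kubernetes import client, config

REGION_LABEL = "micro-dc.io/region"          # hypothetical label key
ALLOWED_REGIONS = {"fr-central", "eu-central-1"}

def check_storage_classes() -> int:
    """Print violations and return how many were found."""
    config.load_kube_config()
    storage = client.StorageV1Api()
    violations = 0
    for sc in storage.list_storage_class().items:
        labels = sc.metadata.labels or {}
        region = labels.get(REGION_LABEL)
        if region not in ALLOWED_REGIONS:
            violations += 1
            print(f"VIOLATION: StorageClass {sc.metadata.name} has region={region!r}")
    return violations

if __name__ == "__main__":
    raise SystemExit(1 if check_storage_classes() else 0)
```
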
---

## 9. Operational Capabilities (to make the course repeatable)

- Ability to:
  - **Reset** the environment or a subset (e.g. destroy/recreate clusters) on demand
  - Tag and restore specific Git states (e.g. `v-training-d1`, `v-training-d6`)
  - Run **Game Day / replay drills** without impacting other labs

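Tagging each repo at a known-good end-of-day state makes resets cheap: restoring a lab becomes a checkout plus a cluster rebuild. A minimal sketch of a snapshot/restore helper for a single repo, following the `v-training-dN` tag convention above; run it from inside the repo concerned.

```python
"""Tag or restore a repo's training state, e.g. v-training-d1 .. v-training-d6 (sketch)."""
import subprocess
import sys

def git(*args: str) -> None:
    subprocess.run(["git", *args], check=True)

def snapshot(day: str) -> None:
    """Tag the current commit as the reference state for a training day."""
    git("tag", "-f", f"v-training-{day}")
    git("push", "--force", "origin", f"v-training-{day}")

def restore(day: str) -> None:
    """Move the working tree back to a training day's tagged state on a reset branch."""
    git("fetch", "--tags", "origin")
    git("checkout", "-B", f"reset-{day}", f"v-training-{day}")

if __name__ == "__main__":
    action, day = sys.argv[1], sys.argv[2]       # e.g. `snapshot d1` or `restore d6`
    {"snapshot": snapshot, "restore": restore}[action](day)
```
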
If these minimums are in place, you can realistically complete the **full D1-D6 training**:

- from bare-metal/virtual “nothing”
- to a GitOps-managed, sovereign-aware platform
- with SLOs, policies, and `zero_manual_provisioning` verified by drills.

</details>