Add Micro–DC/docs/training/D2

2025-12-05 12:33:02 +00:00
parent 36d383b02a
commit 2a798b7565

Micro–DC/docs/training/D2 Normal file

@@ -0,0 +1,528 @@
Let's level up the Git side of this.
Below is the **D2 - Git & Repo Training Manual** you can run after D1.
It assumes EU-PAR-FR01 is your current “active lab site”.
---
# D2 — Git & Repo Training Manual
**Focus:** Git & repo structure, cross-repo workflows, and how engineers actually work day-to-day.
Repos in scope:
* `infra-foundation`
* `platform-clusters`
* `policies-and-compliance`
Pipeline stages (already defined in D1):
* `lint_and_unit`
* `policy_gates`
* `integration_test`
* `promotion_to_template`
* `site_rollout`
We'll treat this as a **3-day enablement block**:
* **Day 1:** Repo roles & layout + branch/PR strategy
* **Day 2:** Site overlays & template reuse (EU-PAR-FR01 as the first “instantiation”)
* **Day 3:** Cross-repo workflows & ownership (who touches what, when)
---
## Day 1 — Repo Roles, Layout & Branch Strategy
### Learning Objectives
By the end of Day 1, the team can:
* Explain the purpose of each repo and what *belongs* there.
* Navigate the directory tree for EU-PAR-FR01.
* Use a consistent **branch naming + PR** pattern for infra changes.
### 1.1 Repo Responsibilities (Mental Model)
**`infra-foundation`**
* Physical + logical infra:
  * Facility, racks, power, cooling.
  * Network (Terraform).
  * Bare metal profiles and hypervisor provisioning (Ansible).
* “Below” cluster level.

**`platform-clusters`**
* Everything “on top” of the infra:
  * K8s cluster definitions (mgmt + workload clusters).
  * Namespaces, storage classes, workloads.
  * GitOps app definitions.
  * Monitoring/logging/security add-ons.

**`policies-and-compliance`**
* Cross-cutting governance:
  * Data classification and residency rules.
  * OPA/Kyverno policies.
  * RBAC & IAM guidelines / templates.
  * Sustainability KPIs & policy checks.
> Rule of thumb:
> **If it configures metal/network → `infra-foundation`.
> If it configures K8s / workloads → `platform-clusters`.
> If it *constrains* any of the above → `policies-and-compliance`.**
### 1.2 Standard Directory Layouts (Training Baseline)
#### `infra-foundation` (target baseline)
```text
infra-foundation/
  facility/
    site_manifests/
      eu-par-fr01.yaml
    rack_layouts/
      eu-par-fr01-racks.yaml
    power_and_cooling/
      eu-par-fr01-power.yaml
    bootstrap/
      eu-par-fr01.md
  network/
    terraform/
      modules/
        leaf_spine/
        ...
      eu-par-fr01/
        main.tf
        variables.tf
        outputs.tf
  baremetal/
    profiles/
      compute-standard.yaml
      compute-gpu.yaml
      storage-ceph.yaml
  hypervisor/
    ansible/
      inventory/
        eu-par-fr01.ini
      proxmox_cluster.yml
      roles/
        proxmox-base/
        proxmox-cluster-join/
  .gitlab-ci.yml (or equivalent)
```
#### `platform-clusters` (target baseline)
```text
platform-clusters/
  k8s/
    clusters/
      eu-par-fr01/
        mgmt-cluster.yaml
        gitops-app.yaml
        namespaces/
          fr-public.yaml
          fr-internal.yaml
          fr-personal.yaml
          fr-sensitive.yaml
          fr-critical-sovereign-ai.yaml
        storageclasses/
          fr-critical-sovereign-sc.yaml
        workloads/
          ai-ml/
            job.yaml
          saas/
            app-deployment.yaml
  addons/
    monitoring-logging-security/
      prometheus/
        values.yaml
      grafana/
        dashboards/
          pue-overview.json
          control-plane-slo.json
  .gitlab-ci.yml
```
#### `policies-and-compliance` (target baseline)
```text
policies-and-compliance/
  data-classification.yaml
  sustainability-kpis.yaml
  rbac-and-iam.yaml
  opa-policies/
    site_naming.rego
    data_residency.rego
    naming_conventions.rego
    rbac.rego
    platform-bootstrap.rego
  scripts/
    run_opa.sh
  .gitlab-ci.yml
```
### 1.3 Branch & PR Pattern (Day-to-Day)
**Naming convention (recommendation):**
* Features: `feat/<area>/<short-desc>`
  * `feat/infra/eu-par-fr01-racks`
  * `feat/platform/eu-par-fr01-mgmt-cluster`
  * `feat/policy/data-residency-fr`
* Fixes: `fix/<area>/<issue-id-or-short-desc>`
  * `fix/platform/eu-par-fr01-backup-region`
* Experiments/spikes: `exp/<topic>`
**Rules (enforced socially + via repo settings):**
* All changes go via MR/PR.
* `main` (or `master`) is **protected**:
  * Must pass `lint_and_unit` + `policy_gates` at minimum.
  * Optional: require 1-2 reviewers for infra repos.
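
The naming convention above can be checked mechanically, for example in a pre-push hook or an early CI job. The following is a minimal sketch, not part of the existing pipelines: the `AREAS` tuple is an assumption taken from the example branch names, and the lowercase-with-hyphens rule for the description is likewise inferred from those examples.

```python
import re

# Areas seen in this manual's example branch names (assumption: extend as needed).
AREAS = ("infra", "platform", "policy")

# A lowercase slug such as "eu-par-fr01-racks" (assumed style, based on the examples).
_SLUG = r"[a-z0-9]+(?:-[a-z0-9]+)*"

# feat/<area>/<short-desc>, fix/<area>/<issue-id-or-short-desc>, exp/<topic>
BRANCH_RE = re.compile(
    rf"^(?:(?:feat|fix)/(?:{'|'.join(AREAS)})/{_SLUG}|exp/{_SLUG})$"
)


def is_valid_branch(name: str) -> bool:
    """Return True if `name` follows the naming convention sketched above."""
    return BRANCH_RE.fullmatch(name) is not None


for branch in ("feat/infra/eu-par-fr01-racks", "feature/infra/racks"):
    print(branch, is_valid_branch(branch))
```

A check like this keeps the convention from depending on reviewers spotting a badly named branch by eye.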
### 1.4 Day 1 Hands-On Exercise
**Goal:** Give everyone muscle memory for “change → MR → CI → merge”.
For each repo:
1. Create a trivial but real change (e.g., add a comment, small documentation note, or fix a label).
2. Use the naming scheme for branches.
3. Open PR/MR.
4. Confirm `lint_and_unit` + `policy_gates` stages pass.
5. Merge.
**Definition of Done (Day 1)**
* All engineers can:
  * Clone all three repos.
  * Navigate the EU-PAR-FR01 directories.
  * Cut a branch, push, open MR, and reason about CI results.
---
## Day 2 — Site Overlays & Template Reuse (EU-PAR-FR01 as Example)
### Learning Objectives
By the end of Day 2, the team can:
* Understand the difference between **global templates** and **site overlays**.
* Instantiate EU-PAR-FR01 from a global template with minimal duplication.
* Use `promotion_to_template` to push good patterns “upstream”.
### 2.1 Concepts: Template vs Overlay
We model:
* **Global template** (reusable patterns across sites):
  * Stored under `templates/` or similar.
  * Site-agnostic; parametrised by `site_code`, `country`, etc.
* **Site overlay**:
  * Specific to `EU-PAR-FR01` (or other site).
  * Only overrides per-site values (site code, CIDRs, rack counts, local laws).
#### Example structure in `infra-foundation`
```text
infra-foundation/
  facility/
    templates/
      site_manifest_base.yaml
    site_manifests/
      eu-par-fr01.yaml
```
`site_manifest_base.yaml`:
```yaml
site:
  code: "<SITE_CODE>"
  country: "<COUNTRY_CODE>"
  city: "<CITY_CODE>"
  it_load_kw: 80
  racks:
    gpu: 2
    compute: 4
    storage: 2
  regulatory:
    regime: EU_EEA
    primary_privacy_law: GDPR
    local_law: "<LOCAL_LAW>"
  sovereignty:
    allowed_regions:
      - "<COUNTRY_CODE>"
      - EU_EEA
```
`eu-par-fr01.yaml` (overlay) provides concrete values:
```yaml
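site:
  <<: *base   # conceptual only: the base is merged in by a templating step, not by a real YAML anchor
  code: EU-PAR-FR01
  country: FR
  city: PAR
  regulatory:
    local_law: "FR Data Protection Act"   # overrides the <LOCAL_LAW> placeholder in the base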
```
(Implementation detail can be Helm, Kustomize, simple generator script, etc. Training-wise, it's enough that trainees understand there is a **base** and an **overlay**.)
Same pattern applies in `platform-clusters`:
```text
platform-clusters/
  k8s/
    templates/
      mgmt-cluster-base.yaml
    clusters/
      eu-par-fr01/
        mgmt-cluster.yaml
```
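
To make the base + overlay idea from this section concrete, here is a minimal sketch of the "simple generator script" option. It is an illustration under stated assumptions, not the project's actual tooling: the manifests are written as plain Python dicts to stay dependency-free (a real script would load the YAML files), the nesting of the keys under `site` is assumed, and `deep_merge` and `unresolved_placeholders` are hypothetical helper names.

```python
import copy


def deep_merge(base: dict, overlay: dict) -> dict:
    """Return a new dict where overlay values win and nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            # Scalars and lists from the overlay replace the base value wholesale.
            merged[key] = copy.deepcopy(value)
    return merged


def unresolved_placeholders(node, path=""):
    """List dotted paths whose value still looks like "<PLACEHOLDER>"."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            found += unresolved_placeholders(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            found += unresolved_placeholders(value, f"{path}[{index}]")
    elif isinstance(node, str) and node.startswith("<") and node.endswith(">"):
        found.append(path)
    return found


# Subset of site_manifest_base.yaml (key nesting is an assumption).
base = {
    "site": {
        "code": "<SITE_CODE>",
        "country": "<COUNTRY_CODE>",
        "city": "<CITY_CODE>",
        "it_load_kw": 80,
        "regulatory": {
            "regime": "EU_EEA",
            "primary_privacy_law": "GDPR",
            "local_law": "<LOCAL_LAW>",
        },
        "sovereignty": {"allowed_regions": ["<COUNTRY_CODE>", "EU_EEA"]},
    }
}

# The EU-PAR-FR01 overlay: only the per-site values.
overlay = {
    "site": {
        "code": "EU-PAR-FR01",
        "country": "FR",
        "city": "PAR",
        "regulatory": {"local_law": "FR Data Protection Act"},
        "sovereignty": {"allowed_regions": ["FR", "EU_EEA"]},
    }
}

rendered = deep_merge(base, overlay)
# An empty list means every placeholder in the base was overridden by the overlay.
print(unresolved_placeholders(rendered))
```

A placeholder check like `unresolved_placeholders` is a natural candidate for the `lint_and_unit` stage, since a forgotten override then fails CI instead of reaching a site.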
### 2.2 `promotion_to_template` Stage (How Trainees Use It)
Intent:
* When a site overlay proves stable and useful, we **generalise** it:
  * Extract common bits into `templates/`.
  * Keep site-specific bits as overlays.
**Example pipeline sketch:**
```yaml
promotion_to_template:
  stage: promotion_to_template
  script:
    - ./scripts/promote_to_templates.sh
  only:
    - main
```
`promote_to_templates.sh` might:
* Diff `clusters/eu-par-fr01/` against `templates/`.
* Suggest or apply updates to `templates/` where patterns are identical.
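
The diff step described above could look roughly like the following. This is a hedged sketch of one possible heuristic, not the contents of `promote_to_templates.sh`: it treats any setting that exists in the site file but not in the template as a promotion candidate, and skips values that mention the site code. The function names and the `logging/enabled` annotation are hypothetical.

```python
def flatten(node: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into {"a.b.c": value}; lists are kept as leaf values."""
    flat = {}
    for key, value in node.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def promotion_candidates(site: dict, template: dict, site_code: str) -> dict:
    """Settings present in the site file but missing from the template.

    Values that mention the site code are skipped: they are site-specific
    and belong in the overlay, not in the global template.
    """
    site_flat, template_flat = flatten(site), flatten(template)
    return {
        path: value
        for path, value in site_flat.items()
        if path not in template_flat and site_code.lower() not in str(value).lower()
    }


# mgmt-cluster-base.yaml, written as a dict to keep the sketch dependency-free.
template = {
    "cluster": {
        "name": "<SITE_CODE>-mgmt",
        "control_plane_nodes": 3,
        "node_pool_profile": "compute-standard",
        "networking": {"cni": "cilium"},
    }
}

# EU-PAR-FR01 mgmt cluster with one extra, hypothetical annotation.
site = {
    "cluster": {
        "name": "EU-PAR-FR01-mgmt",
        "control_plane_nodes": 3,
        "node_pool_profile": "compute-standard",
        "networking": {"cni": "cilium"},
        "annotations": {"logging/enabled": "true"},
    }
}

print(promotion_candidates(site, template, "EU-PAR-FR01"))
```

In this sketch the output is only a suggestion list; a human still reviews the resulting diff before anything lands in `templates/`.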
### 2.3 Day 2 Hands-On Tasks
**Task 1 — Introduce templates in `infra-foundation`**
1. Create `facility/templates/site_manifest_base.yaml`.
2. Refactor `eu-par-fr01.yaml` to logically inherit from the base (even if the “inheritance” is just a comment plus a small script):
```yaml
# generated/from: facility/templates/site_manifest_base.yaml
site:
  code: EU-PAR-FR01
  country: FR
  city: PAR
  it_load_kw: 80
  racks:
    gpu: 2
    compute: 4
    storage: 2
  regulatory:
    regime: EU_EEA
    primary_privacy_law: GDPR
    local_law: "FR Data Protection Act"
  sovereignty:
    allowed_regions:
      - FR
      - EU_EEA
```
3. Run CI: `lint_and_unit`, `policy_gates`.
**Task 2 — Introduce templates in `platform-clusters`**
1. Create `k8s/templates/mgmt-cluster-base.yaml`:
```yaml
cluster:
  name: "<SITE_CODE>-mgmt"
  control_plane_nodes: 3
  node_pool_profile: compute-standard
  networking:
    cni: cilium
```
2. Refactor `k8s/clusters/eu-par-fr01/mgmt-cluster.yaml` to be clearly derived from base.
3. Again, run CI and confirm nothing breaks.
**Task 3 — Use `promotion_to_template`**
Simulate a change:
* Update `eu-par-fr01/mgmt-cluster.yaml` to add a cluster-wide annotation or logging sidecar pattern.
* Merge; `promotion_to_template` stage suggests/extracts this into `mgmt-cluster-base.yaml`.
* Trainees review the diff and approve.
### Day 2 Definition of Done
* Templates directories exist and are used.
* EU-PAR-FR01 manifests clearly follow “base + overlay” pattern.
* Team understands when a change belongs to:
  * A **site overlay** vs
  * The **global template**.
---
## Day 3 — Cross-Repo Workflows & Ownership
### Learning Objectives
By the end of Day 3, the team can:
* Decide **which repo** to change for a given requirement.
* Understand the end-to-end flow when a business/tenant asks for something.
* See how a single business change can touch all three repos in a controlled way.
### 3.1 Ownership Matrix (Who Does What, Where)
Create and share a simple matrix like:
| Change type | Repo(s) | Primary owner(s) |
| ----------------------------- | ----------------------------------------- | ------------------------- |
| New site (e.g., EU-FRA-DE01) | infra-foundation, policies-and-compliance | Facility + Compliance |
| New network VRF for tenant | infra-foundation, policies-and-compliance | Network + Security |
| New K8s cluster for site | platform-clusters | Platform/SRE |
| New data classification level | policies-and-compliance | Compliance |
| New workload (app / AI job) | platform-clusters | Tenant/product team + SRE |
| Change in residency rules | policies-and-compliance | Compliance + Security |
### 3.2 Example End-to-End Change: New Sovereign AI Tenant
**Scenario:**
Public-sector AI team wants a new tenant in EU-PAR-FR01 with:
* Namespace `fr-critical-sovereign-justice`.
* Storage pinned to FR-only pools.
* Backups to FR-only object storage.
**Workflows across repos:**
1. **`policies-and-compliance`**
   * Update `data-classification.yaml` if needed (e.g., add `CRITICAL_SOVEREIGN_JUSTICE_FR` or treat as `CRITICAL_SOVEREIGN_FR`).
   * Ensure `data_residency.rego` knows new labels/constraints.
2. **`platform-clusters`**
   * Add namespace manifest under `k8s/clusters/eu-par-fr01/namespaces/`.
   * Add StorageClass/BackupPolicy for this tenant.
   * Add initial workload skeleton (Deployment or Job).
3. **`infra-foundation`**
   * If new dedicated Ceph pool is required, update `storage-ceph.yaml` and any relevant Terraform/Ansible to create and map that pool.
**Lab Script for Day 3**
* Split group into 3 subteams (Infra, Platform, Policy).
* Provide them with the scenario.
* Have each team:
  * Propose changes in their repo.
  * Open MRs from clearly named branches such as:
    * `feat/policy/justice-tenant-classification`
    * `feat/platform/eu-par-fr01-justice-tenant`
    * `feat/infra/justice-tenant-ceph-pool`
* Run CI on all three; ensure policy gates line up.
* Merge in order:
  1. Policies
  2. Infra
  3. Platform
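
The merge order above follows from which repo depends on which. As a small illustration, the order can be derived rather than memorised. The dependency map below is an assumption read off the lab script (policies first, then infra, then platform); it is not stored anywhere in the repos.

```python
from graphlib import TopologicalSorter

# repo -> repos whose MR must be merged first (assumed from the lab's merge order)
DEPENDS_ON = {
    "policies-and-compliance": set(),
    "infra-foundation": {"policies-and-compliance"},
    "platform-clusters": {"policies-and-compliance", "infra-foundation"},
}


def merge_order(depends_on: dict) -> list:
    """Return repos in an order where every dependency is merged before its dependents."""
    return list(TopologicalSorter(depends_on).static_order())


print(merge_order(DEPENDS_ON))
# ['policies-and-compliance', 'infra-foundation', 'platform-clusters']
```

Writing the dependencies down this way also makes it easy to discuss in the lab why platform MRs go last: their policy gates and storage references assume the other two are already on `main`.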
### 3.3 Cross-Repo Integration Points
Make explicit where repos **reference each other**:
* `infra-foundation` site codes and device naming → used by:
  * `platform-clusters` (labels: `site:EU-PAR-FR01`)
  * `policies-and-compliance` (pattern checks in `site_naming.rego`)
* `data-classification.yaml` → used by:
  * `opa-policies/data_residency.rego`
  * Labels in namespaces and workloads (`platform-clusters`)
* Sustainability targets in `sustainability-kpis.yaml` →
  * PUE SLO definitions in `platform-clusters/addons/.../pue-overview.json` dashboards.
For training, add a **short map file** (e.g. in `policies-and-compliance/docs/xref.md`) listing:
```text
- site.code (infra-foundation/facility/site_manifests/*)
    → validated by opa-policies/site_naming.rego
    → used as label in platform-clusters/k8s/clusters/*
- data_classification levels (data-classification.yaml)
    → referenced in namespaces, StorageClasses, BackupPolicies, RBAC policies
```
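
One of these cross-references can be turned into an automated check. The sketch below flags namespaces whose classification label is not defined in `data-classification.yaml`. It is illustrative only: the `data-classification` label key, the `PUBLIC` and `INTERNAL` level names, and the dict shapes are assumptions, and in the real setup this kind of rule would more likely live in `data_residency.rego`.

```python
def undefined_classifications(defined_levels, namespaces):
    """Return {namespace_name: label} for labels that are not a defined level."""
    known = set(defined_levels)
    return {
        ns["name"]: ns["labels"].get("data-classification")
        for ns in namespaces
        if ns["labels"].get("data-classification") not in known
    }


# Hypothetical values for illustration; the real levels live in data-classification.yaml.
levels = ["PUBLIC", "INTERNAL", "CRITICAL_SOVEREIGN_FR"]

namespaces = [
    {
        "name": "fr-public",
        "labels": {"site": "EU-PAR-FR01", "data-classification": "PUBLIC"},
    },
    {
        "name": "fr-critical-sovereign-justice",
        "labels": {
            "site": "EU-PAR-FR01",
            "data-classification": "CRITICAL_SOVEREIGN_JUSTICE_FR",
        },
    },
]

# The justice namespace is reported until the policy repo defines its level.
print(undefined_classifications(levels, namespaces))
```

This mirrors the Day 3 scenario: if the platform MR is merged before the policy MR, the new namespace references a level that does not exist yet.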
### 3.4 Day 3 Definition of Done
* Teams can talk through a **business request** and map it to:
  * Which repos to touch.
  * Which CI stages matter.
  * Which policies will likely fail if they get it wrong.
* At least one **multi-repo change** has been executed successfully e2e in training.
---
## D2 Overall Definition of Done
When the D2 manual is completed:
1. **Repo structure is stable and documented**
   * `infra-foundation`, `platform-clusters`, `policies-and-compliance` all have:
     * Clear directory layout.
     * README or docs mapping structure to intent.
2. **Templates vs overlays are understood and used**
   * At least:
     * `site_manifest_base.yaml` + `eu-par-fr01.yaml`
     * `mgmt-cluster-base.yaml` + `mgmt-cluster.yaml`
   * `promotion_to_template` is hooked into CI and has been exercised at least once.
3. **Cross-repo workflows are practised**
   * One or more end-to-end scenarios have been implemented touching all three repos.
   * Ownership matrix exists and is agreed.
4. **Everything remains Git- and pipeline-first**
   * No local-only scripts; anything used in training is committed.
   * All meaningful changes go through:
     * Branch → MR/PR → CI → merge → `site_rollout`.
---