Add Micro–DC/minimum-toolset/README.md

2025-12-04 22:30:59 +00:00
parent ba6c1bede2
commit e3f793c7fe

@@ -0,0 +1,320 @@
# Sovereign Micro-DC — Minimum Toolset Profile (MVP)
This document defines the **minimum viable toolset (MVP)** for operating sovereign
micro data center (micro-DC) modules using Git-first, pipeline-first practices.
The goal is to:
- Keep the toolchain **small, understandable, and operable** by small teams.
- Ensure **GDPR/data sovereignty alignment** and sustainability KPIs are baked in.
- Avoid “tool sprawl” and overlapping products that increase operational risk.
> If a tool is not listed here as **canonical**, it is **non-critical** and must
> not become a dependency for production without a Toolset RFC.
---
## 1. Scope & Audience
This profile applies to:
- All sovereign micro-DC modules built from the global blueprint.
- All environments: lab, staging, and production.
- All teams working on:
  - Infrastructure as Code (IaC)
  - GitOps and platform configuration
  - Security and compliance
  - SRE / observability
  - Network engineering
Primary owners:
- **CI/CD & GitOps Governance Lead** (accountable)
- **Principal SRE / DevOps Architect**
- **Sovereign Compliance & Sustainability Lead**
- **Security Architect**
---
## 2. Canonical Tool Stack (MVP)
### 2.1 Infrastructure as Code (IaC)
**Canonical tools**
- **Terraform**
  - Purpose: Declarative infra configuration for network, security, IPAM, DCIM (where API-driven).
  - Typical scope:
    - L2/L3 network config and VRFs
    - Firewalls, load balancers, VPN gateways
    - IPAM, DNS records
    - Cloud/virtual resources (if used)
  - Conventions:
    - Shared modules for naming, tagging, security baselines.
    - Per-site root modules under `infra-foundation/network/terraform/sites/<SITE_CODE>`.
    - Remote, encrypted state that respects data residency.
- **Ansible**
  - Purpose: Host configuration management and bootstrap.
  - Typical scope:
    - Bare-metal OS install and base hardening
    - Hypervisor configuration (e.g. Proxmox VE)
    - K8s node bootstrap and cluster joins
    - Ceph node configuration and initial cluster bring-up
  - Conventions:
    - Playbooks live in `infra-foundation/hypervisor/ansible/` and `infra-foundation/baremetal/profiles/`.
    - Avoid using Ansible for topology/state that is already managed by Terraform.
> **Rule:** No additional IaC frameworks (e.g. Pulumi, CloudFormation) in production
> without an approved RFC.
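
The per-site root module convention above lends itself to scoping CI work to the sites a change actually touches. The following is a minimal sketch, assuming changed file paths are reported with the same prefix used in this profile; the helper names `site_for_path` and `sites_touched` are hypothetical and not part of any existing tooling.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Optional

# Path convention taken from this profile. If your CI reports repo-relative
# paths, drop the leading "infra-foundation" component (assumption).
SITES_ROOT = PurePosixPath("infra-foundation/network/terraform/sites")


def site_for_path(changed_file: str) -> Optional[str]:
    """Return the site code that owns a changed file, or None if it is not under a site root."""
    try:
        relative = PurePosixPath(changed_file).relative_to(SITES_ROOT)
    except ValueError:
        return None  # the file is not inside sites/
    # Expect at least <SITE_CODE>/<something>; a bare file in sites/ has no site.
    if len(relative.parts) < 2:
        return None
    return relative.parts[0]


def sites_touched(changed_files: Iterable[str]) -> List[str]:
    """Return the sorted, de-duplicated site codes affected by a change set."""
    sites = {site_for_path(f) for f in changed_files}
    sites.discard(None)
    return sorted(sites)
```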
---
### 2.2 GitOps
**Canonical tool**
- **Argo CD**
  - Purpose: Declarative GitOps controller for Kubernetes and platform configs.
  - Scope:
    - Cluster bootstrapping
    - Platform services (monitoring, logging, ingress, policy)
    - Tenant workloads and namespaces
  - Patterns:
    - App-of-apps per site: `platform-clusters/k8s/clusters/<SITE_CODE>/apps.yaml`.
    - Separate Argo Projects for:
      - **Infra & platform**
      - **Tenant workloads**
  - Governance:
    - Production changes must be applied via Git and Argo CD.
    - Direct `kubectl apply` to production is an exception that must be logged and remediated.
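
To make the app-of-apps pattern concrete, here is a minimal sketch that builds an Argo CD `Application` manifest for one site as a plain dictionary. The field names follow the Argo CD `Application` resource (`argoproj.io/v1alpha1`); the project name `infra-platform`, the `argocd` namespace, the application naming scheme, and the repository URL are assumptions for illustration only. Automated sync is left off by default to match the manual promotion step for production.

```python
import json
from typing import Any, Dict


def site_app_of_apps(site_code: str, repo_url: str,
                     revision: str = "main",
                     automated: bool = False) -> Dict[str, Any]:
    """Build an app-of-apps Application manifest for a site (illustrative only)."""
    spec: Dict[str, Any] = {
        "project": "infra-platform",  # hypothetical Argo Project name
        "source": {
            "repoURL": repo_url,
            "targetRevision": revision,
            # Directory inside the platform-clusters repo, per this profile.
            "path": f"k8s/clusters/{site_code}",
        },
        "destination": {
            "server": "https://kubernetes.default.svc",  # in-cluster API server
            "namespace": "argocd",
        },
    }
    if automated:
        # Only enable for non-production; production uses manual promotion.
        spec["syncPolicy"] = {"automated": {"prune": True, "selfHeal": True}}
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"{site_code.lower()}-apps", "namespace": "argocd"},
        "spec": spec,
    }


if __name__ == "__main__":
    # JSON is valid YAML, so this output can be committed as a manifest.
    manifest = site_app_of_apps("XX01", "https://git.example.invalid/platform-clusters.git")
    print(json.dumps(manifest, indent=2))
```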
---
### 2.3 Policy-as-Code
**Canonical tool**
- **Kyverno** (or OPA/Gatekeeper in an alternative profile; this MVP assumes Kyverno)
  - Purpose: Kubernetes-native policies for security, residency, and consistency.
  - Typical policies:
    - Namespace naming and labels (`data_classification`, `country_code`).
    - StorageClass ↔ namespace binding for residency.
    - Require NetworkPolicies in non-public namespaces.
    - Ban/limit privileged containers and risky capabilities.
  - CI integration:
    - Policies stored in `policies-and-compliance/opa-policies-or-kyverno/`.
    - CI runs Kyverno CLI tests on PRs modifying policies/manifests.
> **Rule:** All new namespaces and workloads must pass policy checks in CI before
> being deployed via Argo CD.
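
The namespace label rule can be illustrated outside the cluster. The sketch below is not a Kyverno policy and does not use the Kyverno CLI; it is a plain Python pre-check, under the assumption that namespace manifests have already been parsed into dictionaries, showing the logic such a policy would enforce. The function name `missing_namespace_labels` is hypothetical.

```python
from typing import Any, Dict, List

# Label keys named in this profile's "Typical policies" list.
REQUIRED_LABELS = ("data_classification", "country_code")


def missing_namespace_labels(namespace_manifest: Dict[str, Any]) -> List[str]:
    """Return required label keys that are absent or empty on a Namespace manifest."""
    labels = (namespace_manifest.get("metadata") or {}).get("labels") or {}
    return [key for key in REQUIRED_LABELS if not labels.get(key)]


if __name__ == "__main__":
    ns = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": "tenant-a", "labels": {"data_classification": "INTERNAL"}},
    }
    problems = missing_namespace_labels(ns)
    if problems:
        print(f"namespace tenant-a is missing labels: {', '.join(problems)}")
```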
---
### 2.4 Observability
**Canonical stack**
- **Prometheus** — metrics collection
- **Alertmanager** — alert routing
- **Loki** — log aggregation
- **Tempo** — traces
- **Grafana** — dashboards and visualizations
**Scope**
- K8s cluster health, node and app metrics
- Storage and network metrics
- Facility metrics:
  - PDU / UPS power
  - Rack temperatures
  - Estimates for PUE, WUE, renewable share, heat reuse
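
As a worked example of the facility estimates, the sketch below uses the standard definitions: PUE is total facility energy divided by IT equipment energy, and WUE is site water use in litres divided by IT equipment energy in kWh. Renewable share is computed here as renewable energy over total facility energy, which is an assumption about how this profile defines it. The input figures in the usage block are made up.

```python
def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return total_facility_kwh / it_kwh


def wue(water_litres: float, it_kwh: float) -> float:
    """Water Usage Effectiveness in litres per kWh of IT energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return water_litres / it_kwh


def renewable_share(renewable_kwh: float, total_facility_kwh: float) -> float:
    """Fraction of total facility energy from renewable sources (assumed definition)."""
    if total_facility_kwh <= 0:
        raise ValueError("total energy must be positive")
    return renewable_kwh / total_facility_kwh


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from any site.
    print(f"PUE: {pue(1300.0, 1000.0):.2f}")
    print(f"WUE: {wue(1800.0, 1000.0):.2f} L/kWh")
    print(f"Renewable share: {renewable_share(650.0, 1300.0):.0%}")
```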
**Conventions**
- “Infra” observability stack runs on one or more **infra clusters** (per country/region).
- Data residency:
  - Metrics/logs/traces for **CRITICAL_SOVEREIGN** and **SENSITIVE_PERSONAL**
    workloads remain within the approved jurisdiction.
- SLOs:
  - SLOs/SLIs (including AI/ML fabric SLOs) are:
    - Defined in Git (Prometheus rule files)
    - Visualized in Grafana dashboards
    - Linked from runbooks and on-call docs
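
When writing SLOs into rule files, it helps to see the error budget arithmetic behind a target. This is a small sketch assuming an availability-style SLO measured over a rolling window; the 30-day window and the function names are assumptions, not values set by this profile.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once the budget is blown)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - bad_minutes / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(f"{error_budget_minutes(0.999):.1f} minutes")
    print(f"{budget_remaining(0.999, bad_minutes=10.8):.0%} of budget left")
```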
---
### 2.5 Network Verification
**Canonical tools**
- **Batfish**
  - Purpose: Static analysis of network configs pre-change.
  - Use cases:
    - Verify reachability/isolation between VRFs (TENANT, INFRA_MGMT, STORAGE, OOB).
    - Confirm no unintended paths for CRITICAL_SOVEREIGN networks leaving country.
  - Conventions:
    - Test suites under `infra-foundation/network/tests/batfish/`.
    - CI stage runs Batfish on every PR that touches network configuration.
    - Merge is blocked on failed verification.
- **Synthetic probes**
  - Purpose: Runtime path validation and basic performance checks.
  - Implementation examples:
    - K8s Jobs/DaemonSets running ping, traceroute, HTTP checks, throughput tests.
    - Metrics exported to Prometheus.
  - Scope:
    - Intra-site fabric health (leaf/spine/ToR)
    - Critical east-west and north-south paths
    - Site-to-site links for DR and federation
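
The VRF isolation use case comes down to comparing observed or computed reachability against an intent matrix. The sketch below shows that comparison in plain Python. It does not call Batfish; it assumes reachability results (from a static analysis run or from synthetic probes) have already been reduced to `(source_vrf, destination_vrf)` pairs. The allowed pair in the usage block is a made-up intent, not a rule from this profile.

```python
from typing import Iterable, List, Set, Tuple

VrfPair = Tuple[str, str]

# VRF names taken from this profile.
KNOWN_VRFS = {"TENANT", "INFRA_MGMT", "STORAGE", "OOB"}


def isolation_violations(reachable: Iterable[VrfPair],
                         allowed: Set[VrfPair]) -> List[VrfPair]:
    """Return cross-VRF reachability pairs that the intent matrix does not allow."""
    violations = []
    for src, dst in reachable:
        if src not in KNOWN_VRFS or dst not in KNOWN_VRFS:
            raise ValueError(f"unknown VRF in pair: {(src, dst)}")
        if src == dst:
            continue  # traffic inside one VRF is not an isolation concern
        if (src, dst) not in allowed:
            violations.append((src, dst))
    return violations


if __name__ == "__main__":
    # Hypothetical intent: management may reach storage, nothing else crosses.
    intent = {("INFRA_MGMT", "STORAGE")}
    observed = [("INFRA_MGMT", "STORAGE"), ("TENANT", "OOB"), ("TENANT", "TENANT")]
    for src, dst in isolation_violations(observed, intent):
        print(f"unexpected path: {src} -> {dst}")
```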
---
## 3. Repository Layout & Where Things Live
This README assumes the global Git structure from the blueprint.
### 3.1 `infra-foundation` repo
- `facility/site_manifests/`
  Site-specific overlays (power, cooling, capacity, regulatory details).
- `facility/rack_layouts/`
  Rack and cabling maps, logically referenced by IaC.
- `network/terraform/`
  - `modules/`: shared TF modules for network primitives.
  - `sites/<SITE_CODE>/`: site root modules and environment configs.
  - `tests/batfish/`: Batfish configs and test suites.
- `hypervisor/ansible/`
  Playbooks/roles for Proxmox and related host config.
- `baremetal/profiles/`
  Bare-metal provisioning profiles and Ansible roles.
### 3.2 `platform-clusters` repo
- `k8s/clusters/<SITE_CODE>/`
  - `cluster-bootstrap/`: base cluster manifests.
  - `apps.yaml`: Argo CD app-of-apps for the site.
- `addons/monitoring-logging-security/`
  - Helm charts/manifests for Prometheus, Loki, Tempo, Grafana, Kyverno, etc.
### 3.3 `policies-and-compliance` repo
- `data-classification.yaml`
  Definitions for `PUBLIC`, `INTERNAL`, `PERSONAL`, `SENSITIVE_PERSONAL`, `CRITICAL_SOVEREIGN_<COUNTRY_CODE>`.
- `opa-policies-or-kyverno/`
  Policy definitions and tests.
- `sustainability-kpis.yaml`
  KPIs and thresholds for PUE, WUE, renewable share, reuse.
- `rbac-and-iam.yaml`
  Roles, groups, and access models across tools.
- `toolset-profiles/`
  - `minimum-toolset-profile.toon.yaml` (this profile)
  - Future profiles (e.g. `extended-observability`, `enterprise-policy-suite`)
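
The classification names listed for `data-classification.yaml` can be parsed mechanically, which is useful when tooling needs the country code embedded in `CRITICAL_SOVEREIGN_<COUNTRY_CODE>`. The following sketch assumes the country code is a two-letter uppercase code (ISO 3166-1 alpha-2 style); that format and the function name `parse_classification` are assumptions, since this profile does not spell them out.

```python
from typing import Optional, Tuple

BASE_CLASSES = {"PUBLIC", "INTERNAL", "PERSONAL", "SENSITIVE_PERSONAL"}
SOVEREIGN_PREFIX = "CRITICAL_SOVEREIGN_"


def parse_classification(label: str) -> Tuple[str, Optional[str]]:
    """Split a classification label into (class_name, country_code or None)."""
    if label in BASE_CLASSES:
        return label, None
    if label.startswith(SOVEREIGN_PREFIX):
        country = label[len(SOVEREIGN_PREFIX):]
        # Assumed format: exactly two ASCII uppercase letters.
        if len(country) == 2 and country.isascii() and country.isalpha() and country.isupper():
            return "CRITICAL_SOVEREIGN", country
    raise ValueError(f"unknown data classification: {label!r}")


def residency_ok(label: str, storage_country: str) -> bool:
    """Sovereign data must stay in its own country; other classes are not checked here."""
    _, country = parse_classification(label)
    return country is None or country == storage_country
```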
---
## 4. Change Workflow (Git-First, Pipeline-First)
### 4.1 Typical change flow
1. **Engineer opens a PR** in the relevant repo:
   - `infra-foundation` for network or facility changes.
   - `platform-clusters` for K8s/app changes.
   - `policies-and-compliance` for policy/toolset updates.
2. **CI pipeline runs:**
   - Lint and unit tests.
   - Kyverno/OPA policy checks (where applicable).
   - Batfish network tests (for network changes).
   - Integration tests or dry-run Argo CD sync where feasible.
3. **Review & approvals:**
   - At least one peer reviewer.
   - Additional approvals for:
     - Policy changes (Compliance/Security).
     - Toolset changes (CI/CD Governance Lead).
4. **Merge to main / protected branch.**
5. **Argo CD syncs** changes to the target environment:
   - For production, use a **manual promotion** step (e.g. tag or branch).
6. **Post-deploy verification:**
   - Synthetic probes and SLO dashboards checked.
   - Any policy violations or drift are investigated and corrected.
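
The approval rules in step 3 can be derived from the files a PR touches. Here is a minimal sketch for the `policies-and-compliance` repo, assuming paths are relative to the repo root and that the directory names from the repository layout section identify policy and toolset changes. The approver labels are hypothetical placeholders for whatever groups your Git platform defines.

```python
from typing import Iterable, List

POLICY_PREFIX = "opa-policies-or-kyverno/"
TOOLSET_PREFIX = "toolset-profiles/"


def required_approvals(changed_files: Iterable[str]) -> List[str]:
    """Return the approver groups a PR needs, based on which paths it changes."""
    files = list(changed_files)
    approvals = ["peer-reviewer"]  # always at least one peer reviewer
    if any(f.startswith(POLICY_PREFIX) for f in files):
        approvals.append("compliance-or-security")  # policy changes
    if any(f.startswith(TOOLSET_PREFIX) for f in files):
        approvals.append("cicd-governance-lead")  # toolset changes
    return approvals


if __name__ == "__main__":
    pr_files = [
        "opa-policies-or-kyverno/require-labels.yaml",
        "toolset-profiles/minimum-toolset-profile.toon.yaml",
    ]
    print(required_approvals(pr_files))
```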
---
## 5. Toolset Governance & RFCs
This MVP profile deliberately **limits** the core stack. Introducing new tools
or replacing canonical ones must go through a **Toolset RFC** process.
### 5.1 When you must write an RFC
- Adding:
  - A second GitOps controller.
  - A new observability backend (logs, metrics, or traces).
  - Another IaC framework in production.
- Replacing:
  - Terraform, Ansible, Argo CD, Kyverno, Prometheus, Loki, Tempo, Grafana, or Batfish.
- Introducing:
  - New data paths that might affect sovereignty or residency.
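
The adding and replacing triggers above reduce to a lookup against the canonical tool list. The sketch below encodes just those two cases; it treats Alertmanager as canonical because it appears in the observability stack, which is an interpretation, and it cannot detect the third trigger (new data paths), which still needs human review. The function name `rfc_required` is hypothetical.

```python
from typing import Optional

# Canonical tools named in section 2 of this profile, lowercased for comparison.
CANONICAL_TOOLS = {
    "terraform", "ansible", "argo cd", "kyverno",
    "prometheus", "alertmanager", "loki", "tempo", "grafana", "batfish",
}


def _normalize(name: str) -> str:
    return " ".join(name.strip().lower().split())


def rfc_required(tool: str, replaces: Optional[str] = None) -> bool:
    """Return True if adopting `tool` in production needs a Toolset RFC."""
    if replaces is not None and _normalize(replaces) in CANONICAL_TOOLS:
        return True  # replacing a canonical tool always needs an RFC
    # Any tool outside the canonical list needs an RFC before production use.
    return _normalize(tool) not in CANONICAL_TOOLS


if __name__ == "__main__":
    for candidate in ("Terraform", "Pulumi", "argo  cd"):
        print(candidate, "->", "RFC required" if rfc_required(candidate) else "canonical")
```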
### 5.2 RFC contents (minimal)
- Problem statement and motivation.
- Proposed tool and how it fits the existing stack.
- Security, sovereignty, and sustainability impact.
- Operational complexity impact:
  - New skills required
  - Runbooks and documentation needs
- Migration and rollback plan.
---
## 6. Rollout Plan for the MVP Toolset
High-level phases:
1. **T0: Assessment & Inventory**
   - Document current tools in use per site.
   - Map overlaps/conflicts with this MVP profile.
2. **T1: Pilot Site**
   - Choose one non-critical site (or lab environment).
   - Implement the full MVP stack and workflows.
   - Run at least one full change cycle (PR → CI → deploy → verify).
3. **T2: Template Hardening**
   - Extract reusable modules and patterns.
   - Update `minimum-toolset-profile.toon.yaml` with real-world findings.
   - Finalize SLOs, runbooks, and on-call procedures.
4. **T3: Broad Adoption**
   - Adopt MVP toolset as default for new sites.
   - Gradually migrate existing sites, prioritizing those with:
     - Highest sovereignty requirements.
     - Highest operational pain due to tool sprawl.
---
## 7. How to Use This Profile
- **New site?** Start from the MVP profile. Only request additional tools if strictly necessary.
- **Existing site?** Use the MVP as a north star; plan migrations away from overlapping tools.
- **New engineer?** Read this README, then:
  - Explore `infra-foundation`, `platform-clusters`, `policies-and-compliance`.
  - Run through a non-production change end-to-end under supervision.

If anything in this document doesn't match reality, fix the mismatch by updating
either the code or this README. The MVP is only useful if it is **live doctrine**,
not shelfware.