# Sovereign Micro-DC — Minimum Toolset Profile (MVP)

This document defines the **minimum viable toolset (MVP)** for operating sovereign
micro–data center modules using Git-first, pipeline-first practices.

The goal is to:

- Keep the toolchain **small, understandable, and operable** by small teams.
- Ensure **GDPR/data sovereignty alignment** and sustainability KPIs are baked in.
- Avoid “tool sprawl” and overlapping products that increase operational risk.

> If a tool is not listed here as **canonical**, it is **non-critical** and must
> not become a dependency for production without a Toolset RFC.

---

## 1. Scope & Audience

This profile applies to:

- All sovereign micro-DC modules built from the global blueprint.
- All environments: lab, staging, and production.
- All teams working on:
  - Infrastructure as Code (IaC)
  - GitOps and platform configuration
  - Security and compliance
  - SRE / observability
  - Network engineering

Primary owners:

- **CI/CD & GitOps Governance Lead** (accountable)
- **Principal SRE / DevOps Architect**
- **Sovereign Compliance & Sustainability Lead**
- **Security Architect**

---

## 2. Canonical Tool Stack (MVP)

### 2.1 Infrastructure as Code (IaC)

**Canonical tools**

- **Terraform**
  - Purpose: Declarative infra configuration for network, security, IPAM, DCIM (where API-driven).
  - Typical scope:
    - L2/L3 network config and VRFs
    - Firewalls, load balancers, VPN gateways
    - IPAM, DNS records
    - Cloud/virtual resources (if used)
  - Conventions:
    - Shared modules for naming, tagging, security baselines.
    - Per-site root modules under `infra-foundation/network/terraform/sites/<SITE_CODE>`.
    - Remote, encrypted state that respects data residency.

- **Ansible**
  - Purpose: Host configuration management and bootstrap.
  - Typical scope:
    - Bare-metal OS install and base hardening
    - Hypervisor configuration (e.g. Proxmox VE)
    - K8s node bootstrap and cluster joins
    - Ceph node configuration and initial cluster bring-up
  - Conventions:
    - Playbooks live in `infra-foundation/hypervisor/ansible/` and `infra-foundation/baremetal/profiles/`.
    - Avoid using Ansible for topology/state that is already managed by Terraform.

> **Rule:** No additional IaC frameworks (e.g. Pulumi, CloudFormation) in production
> without an approved RFC.

---

### 2.2 GitOps

**Canonical tool**

- **Argo CD**
  - Purpose: Declarative GitOps controller for Kubernetes and platform configs.
  - Scope:
    - Cluster bootstrapping
    - Platform services (monitoring, logging, ingress, policy)
    - Tenant workloads and namespaces
  - Patterns:
    - App-of-apps per site: `platform-clusters/k8s/clusters/<SITE_CODE>/apps.yaml`.
    - Separate Argo Projects for:
      - **Infra & platform**
      - **Tenant workloads**
  - Governance:
    - Production changes must be applied via Git and Argo CD.
    - Direct `kubectl apply` to production is an exception that must be logged and remediated.

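As a rough illustration of the app-of-apps pattern above, the following Python sketch builds the parent Argo CD `Application` for one site and prints it as JSON for readability. The repository URL, the project name `infra-platform`, the site code `de-fra1`, the `argocd` namespace, and the `apps/` subdirectory holding the child Application manifests are all assumptions made for this example and are not defined by this profile. No automated sync policy is set, in line with the manual promotion step used for production.

```python
import json


def site_root_application(site_code: str, repo_url: str, project: str) -> dict:
    """Build the parent (app-of-apps) Argo CD Application for one site.

    repo_url, project and the "apps" subdirectory are hypothetical;
    only the k8s/clusters/<SITE_CODE>/ prefix comes from this README.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": f"{site_code}-root", "namespace": "argocd"},
        "spec": {
            "project": project,
            "source": {
                "repoURL": repo_url,
                "targetRevision": "main",
                # Directory that would contain the child Application manifests.
                "path": f"k8s/clusters/{site_code}/apps",
            },
            "destination": {
                # In-cluster API server address, i.e. the cluster Argo CD runs in.
                "server": "https://kubernetes.default.svc",
                "namespace": "argocd",
            },
        },
    }


if __name__ == "__main__":
    app = site_root_application(
        "de-fra1",
        "https://git.example.internal/platform-clusters.git",
        "infra-platform",
    )
    print(json.dumps(app, indent=2))
```
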
---

### 2.3 Policy-as-Code

**Canonical tool**

- **Kyverno** (or OPA/Gatekeeper in an alternative profile; this MVP assumes Kyverno)
  - Purpose: Kubernetes-native policies for security, residency, and consistency.
  - Typical policies:
    - Namespace naming and labels (`data_classification`, `country_code`).
    - StorageClass ↔ namespace binding for residency.
    - Ensure NetworkPolicies for non-public namespaces.
    - Ban/limit privileged containers and risky capabilities.
  - CI integration:
    - Policies stored in `policies-and-compliance/opa-policies-or-kyverno/`.
    - CI runs Kyverno CLI tests on PRs modifying policies/manifests.

> **Rule:** All new namespaces and workloads must pass policy checks in CI before
> being deployed via Argo CD.

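To make the namespace labelling rule concrete, here is a minimal Python sketch of the logic such a policy would enforce. It is not Kyverno and does not use the Kyverno CLI; it only mirrors the check in plain code. The label names come from this README; the assumption that `<COUNTRY_CODE>` is a two-letter uppercase code, and the function name itself, are illustrative.

```python
import re

# Label names taken from the "Typical policies" list above.
REQUIRED_LABELS = ("data_classification", "country_code")

# Classification values as listed for data-classification.yaml.
BASE_CLASSES = {"PUBLIC", "INTERNAL", "PERSONAL", "SENSITIVE_PERSONAL"}

# Assumption: the country suffix is a two-letter uppercase code.
SOVEREIGN = re.compile(r"^CRITICAL_SOVEREIGN_[A-Z]{2}$")


def namespace_violations(manifest: dict) -> list[str]:
    """Return human-readable problems for one Namespace manifest."""
    labels = manifest.get("metadata", {}).get("labels") or {}
    problems = [
        f"missing label: {name}" for name in REQUIRED_LABELS if name not in labels
    ]
    cls = labels.get("data_classification")
    if cls is not None and cls not in BASE_CLASSES and not SOVEREIGN.match(cls):
        problems.append(f"unknown data_classification: {cls}")
    return problems


if __name__ == "__main__":
    namespace = {
        "kind": "Namespace",
        "metadata": {
            "name": "tenant-a",
            "labels": {"data_classification": "INTERNAL"},
        },
    }
    for problem in namespace_violations(namespace):
        print(problem)
```
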
---

### 2.4 Observability

**Canonical stack**

- **Prometheus** — metrics collection
- **Alertmanager** — alert routing
- **Loki** — log aggregation
- **Tempo** — traces
- **Grafana** — dashboards and visualizations

**Scope**

- K8s cluster health, node and app metrics
- Storage and network metrics
- Facility metrics:
  - PDU / UPS power
  - Rack temperatures
  - Estimates for PUE, WUE, renewable share, heat reuse

**Conventions**

- “Infra” observability stack runs on one or more **infra clusters** (per country/region).
- Data residency:
  - Metrics/logs/traces for **CRITICAL_SOVEREIGN** and **SENSITIVE_PERSONAL**
    workloads remain within the approved jurisdiction.
- SLOs:
  - SLOs/SLIs (including AI/ML fabric SLOs) are:
    - Defined in Git (Prometheus rule files)
    - Visualized in Grafana dashboards
    - Linked from runbooks and on-call docs

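The facility estimates listed under Scope follow standard ratios: PUE is total facility energy divided by IT equipment energy, and WUE is site water use in litres divided by IT equipment energy in kWh. The sketch below computes them, plus a simple renewable share, from energy totals over the same period. The function names and the input figures are illustrative; how the totals are derived from PDU/UPS readings is outside the scope of this example.

```python
def pue(total_facility_kwh: float, it_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return total_facility_kwh / it_kwh


def wue(water_litres: float, it_kwh: float) -> float:
    """Water Usage Effectiveness in litres per kWh of IT equipment energy."""
    if it_kwh <= 0:
        raise ValueError("IT energy must be positive")
    return water_litres / it_kwh


def renewable_share(renewable_kwh: float, total_facility_kwh: float) -> float:
    """Fraction of total facility energy that came from renewable sources."""
    if total_facility_kwh <= 0:
        raise ValueError("total facility energy must be positive")
    return renewable_kwh / total_facility_kwh


if __name__ == "__main__":
    # Illustrative monthly totals, not measurements from any real site.
    print(f"PUE: {pue(1500.0, 1000.0):.2f}")
    print(f"WUE: {wue(1800.0, 1000.0):.2f} L/kWh")
    print(f"Renewable share: {renewable_share(600.0, 1500.0):.0%}")
```
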
---

### 2.5 Network Verification

**Canonical tools**

- **Batfish**
  - Purpose: Static analysis of network configs pre-change.
  - Use cases:
    - Verify reachability/isolation between VRFs (TENANT, INFRA_MGMT, STORAGE, OOB).
    - Confirm no unintended paths for CRITICAL_SOVEREIGN networks leaving country.
  - Conventions:
    - Test suites under `infra-foundation/network/tests/batfish/`.
    - CI stage runs Batfish on every PR that touches network configuration.
    - Merge is blocked on failed verification.

- **Synthetic probes**
  - Purpose: Runtime path validation and basic performance checks.
  - Implementation examples:
    - K8s Jobs/DaemonSets running ping, traceroute, HTTP checks, throughput tests.
    - Metrics exported to Prometheus.
  - Scope:
    - Intra-site fabric health (leaf/spine/ToR)
    - Critical east-west and north-south paths
    - Site-to-site links for DR and federation

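The VRF isolation intent that the Batfish suites verify can be written down as data and compared against whatever paths a test run or probe reports. The sketch below does that comparison in plain Python. It does not call the Batfish API, and the `ALLOWED` pairs are a made-up intent for illustration only; the real matrix would come from the site's network design.

```python
# VRF names taken from the Batfish use cases above.
VRFS = ("TENANT", "INFRA_MGMT", "STORAGE", "OOB")

# Hypothetical intent: ordered (source, destination) VRF pairs that may communicate.
ALLOWED = {("INFRA_MGMT", "STORAGE"), ("INFRA_MGMT", "TENANT")}

Pair = tuple[str, str]


def _check_names(pairs: set[Pair]) -> None:
    """Reject pairs that mention a VRF this profile does not know about."""
    for src, dst in pairs:
        if src not in VRFS or dst not in VRFS:
            raise ValueError(f"unknown VRF in pair: {(src, dst)}")


def isolation_violations(observed: set[Pair]) -> list[Pair]:
    """Observed cross-VRF paths that the intent does not allow."""
    _check_names(observed)
    return sorted(
        (src, dst)
        for src, dst in observed
        if src != dst and (src, dst) not in ALLOWED
    )


def unreachable_paths(observed: set[Pair]) -> list[Pair]:
    """Allowed paths that were expected but not observed."""
    _check_names(observed)
    return sorted(ALLOWED - observed)


if __name__ == "__main__":
    observed = {("INFRA_MGMT", "STORAGE"), ("TENANT", "OOB")}
    print("violations:", isolation_violations(observed))
    print("unreachable:", unreachable_paths(observed))
```
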
---

## 3. Repository Layout & Where Things Live

This README assumes the global Git structure from the blueprint.

### 3.1 `infra-foundation` repo

- `facility/site_manifests/`
  Site-specific overlays (power, cooling, capacity, regulatory details).

- `facility/rack_layouts/`
  Rack and cabling maps, logically referenced by IaC.

- `network/terraform/`
  - `modules/` — shared TF modules for network primitives.
  - `sites/<SITE_CODE>/` — site root modules and environment configs.

- `network/tests/batfish/`
  Batfish configs and test suites.

- `hypervisor/ansible/`
  Playbooks/roles for Proxmox and related host config.

- `baremetal/profiles/`
  Bare-metal provisioning profiles and Ansible roles.

### 3.2 `platform-clusters` repo

- `k8s/clusters/<SITE_CODE>/`
  - `cluster-bootstrap/` — base cluster manifests.
  - `apps.yaml` — Argo CD app-of-apps for the site.
- `addons/monitoring-logging-security/`
  - Helm charts/manifests for Prometheus, Loki, Tempo, Grafana, Kyverno, etc.

### 3.3 `policies-and-compliance` repo

- `data-classification.yaml`
  Definitions for `PUBLIC`, `INTERNAL`, `PERSONAL`, `SENSITIVE_PERSONAL`, `CRITICAL_SOVEREIGN_<COUNTRY_CODE>`.

- `opa-policies-or-kyverno/`
  Policy definitions and tests.

- `sustainability-kpis.yaml`
  KPIs and thresholds for PUE, WUE, renewable share, heat reuse.

- `rbac-and-iam.yaml`
  Roles, groups, and access models across tools.

- `toolset-profiles/`
  - `minimum-toolset-profile.toon.yaml` (this profile)
  - Future profiles (e.g. `extended-observability`, `enterprise-policy-suite`)

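A layout like the one above is easy to drift from, so a small check can help. The following sketch reports which expected entries are missing. It assumes, purely for illustration, that the three repos are checked out side by side under one workspace directory, and it covers only a subset of the paths named in this section.

```python
from pathlib import Path

# Subset of the entries described in section 3, keyed by repo name.
EXPECTED = {
    "infra-foundation": [
        "facility/site_manifests",
        "facility/rack_layouts",
        "network/terraform/modules",
        "hypervisor/ansible",
        "baremetal/profiles",
    ],
    "platform-clusters": ["k8s/clusters"],
    "policies-and-compliance": [
        "data-classification.yaml",
        "opa-policies-or-kyverno",
        "sustainability-kpis.yaml",
        "rbac-and-iam.yaml",
        "toolset-profiles",
    ],
}


def missing_entries(workspace: Path) -> list[str]:
    """Return "<repo>/<path>" for every expected entry that does not exist."""
    return [
        f"{repo}/{rel}"
        for repo, rels in EXPECTED.items()
        for rel in rels
        if not (workspace / repo / rel).exists()
    ]


if __name__ == "__main__":
    # Checks the current directory; an empty result means nothing is missing.
    for entry in missing_entries(Path(".")):
        print("missing:", entry)
```
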
---

## 4. Change Workflow (Git-First, Pipeline-First)

### 4.1 Typical change flow

1. **Engineer opens a PR** in the relevant repo:
   - `infra-foundation` for network or facility changes.
   - `platform-clusters` for K8s/app changes.
   - `policies-and-compliance` for policy/toolset updates.

2. **CI pipeline runs:**
   - Lint and unit tests.
   - Kyverno/OPA policy checks (where applicable).
   - Batfish network tests (for network changes).
   - Integration tests or dry-run Argo CD sync where feasible.

3. **Review & approvals:**
   - At least one peer reviewer.
   - Additional approvals for:
     - Policy changes (Compliance/Security).
     - Toolset changes (CI/CD Governance Lead).

4. **Merge to main / protected branch.**

5. **Argo CD syncs** changes to the target environment:
   - For production, use a **manual promotion** step (e.g. tag or branch).

6. **Post-deploy verification:**
   - Synthetic probes and SLO dashboards checked.
   - Any policy violations or drift are investigated and corrected.

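Steps 2 and 3 of the flow above amount to a mapping from "which repo and which files changed" to "which checks run and who must approve". A minimal sketch of that mapping follows. The check and approver identifiers are hypothetical labels, and the exact path prefixes used for matching are assumptions based on the repository layout described in this README rather than a defined pipeline configuration.

```python
def select_checks(repo: str, changed_files: list[str]) -> list[str]:
    """Pick CI checks for a PR (step 2). Check names are hypothetical labels."""
    checks = ["lint", "unit-tests"]
    # Assumption: policy checks apply to repos holding manifests or policies.
    if repo in ("platform-clusters", "policies-and-compliance"):
        checks.append("kyverno-policy-checks")
    # Batfish runs only when network configuration is touched.
    if repo == "infra-foundation" and any(
        path.startswith("network/") for path in changed_files
    ):
        checks.append("batfish-network-tests")
    return checks


def required_approvers(repo: str, changed_files: list[str]) -> list[str]:
    """Pick required approvals for a PR (step 3). Names are hypothetical labels."""
    approvers = ["peer-reviewer"]
    if repo == "policies-and-compliance":
        if any(p.startswith("toolset-profiles/") for p in changed_files):
            approvers.append("cicd-governance-lead")
        if any(p.startswith("opa-policies-or-kyverno/") for p in changed_files):
            approvers.append("compliance-or-security")
    return approvers


if __name__ == "__main__":
    files = ["network/terraform/sites/SITE/main.tf"]
    print(select_checks("infra-foundation", files))
    print(required_approvers("infra-foundation", files))
```
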
---

## 5. Toolset Governance & RFCs

This MVP profile deliberately **limits** the core stack. Introducing new tools
or replacing canonical ones must go through a **Toolset RFC** process.

### 5.1 When you must write an RFC

- Adding:
  - A second GitOps controller.
  - A new observability backend (logs, metrics, or traces).
  - Another IaC framework in production.
- Replacing:
  - Terraform, Ansible, Argo CD, Kyverno, Prometheus, Loki, Tempo, Grafana, or Batfish.
- Introducing:
  - New data paths that might affect sovereignty or residency.

### 5.2 RFC contents (minimal)

- Problem statement and motivation.
- Proposed tool and how it fits the existing stack.
- Security, sovereignty, and sustainability impact.
- Operational complexity impact:
  - New skills required
  - Runbooks and documentation needs
- Migration and rollback plan.

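The "adding a tool" trigger can be checked mechanically against the canonical list in section 2. The sketch below flags proposed tools that are not canonical. It covers only that one trigger: replacing a canonical tool or introducing new data paths still needs human judgement. The function name and the example inputs are illustrative.

```python
# Canonical tools named in section 2 of this profile, lowercased for comparison.
CANONICAL = {
    "terraform",
    "ansible",
    "argo cd",
    "kyverno",
    "prometheus",
    "alertmanager",
    "loki",
    "tempo",
    "grafana",
    "batfish",
}


def tools_needing_rfc(proposed: list[str]) -> list[str]:
    """Return the proposed tools that are not canonical, sorted by name."""
    return sorted(
        tool for tool in proposed if tool.strip().lower() not in CANONICAL
    )


if __name__ == "__main__":
    proposed = ["Terraform", "Pulumi", "Argo CD", "Flux"]
    for tool in tools_needing_rfc(proposed):
        print(f"Toolset RFC required before adopting: {tool}")
```
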
---

## 6. Rollout Plan for the MVP Toolset

High-level phases:

1. **T0 — Assessment & Inventory**
   - Document current tools in use per site.
   - Map overlaps/conflicts with this MVP profile.

2. **T1 — Pilot Site**
   - Choose one non-critical site (or lab environment).
   - Implement the full MVP stack and workflows.
   - Run at least one full change cycle (PR → CI → deploy → verify).

3. **T2 — Template Hardening**
   - Extract reusable modules and patterns.
   - Update `minimum-toolset-profile.toon.yaml` with real-world findings.
   - Finalize SLOs, runbooks, and on-call procedures.

4. **T3 — Broad Adoption**
   - Adopt MVP toolset as default for new sites.
   - Gradually migrate existing sites, prioritizing those with:
     - Highest sovereignty requirements.
     - Highest operational pain due to tool sprawl.

---

## 7. How to Use This Profile

- **New site?** Start from the MVP profile. Only request additional tools if strictly necessary.
- **Existing site?** Use the MVP as a north star; plan migrations away from overlapping tools.
- **New engineer?** Read this README, then:
  - Explore `infra-foundation`, `platform-clusters`, `policies-and-compliance`.
  - Run through a non-production change end-to-end under supervision.

If something in this document doesn’t match reality, the mismatch must be fixed
by changing either the code or this README. The MVP is only useful if it is
**live doctrine**, not shelfware.