Sovereign Micro-DC — Minimum Toolset Profile (MVP)

This document defines the minimum viable toolset (MVP) for operating sovereign micro-DC modules using Git-first, pipeline-first practices.

The goals are to:

  • Keep the toolchain small, understandable, and operable by small teams.
  • Ensure GDPR/data sovereignty alignment and sustainability KPIs are baked in.
  • Avoid “tool sprawl” and overlapping products that increase operational risk.

If a tool is not listed here as canonical, it is non-critical and must not become a dependency for production without a Toolset RFC.


1. Scope & Audience

This profile applies to:

  • All sovereign micro-DC modules built from the global blueprint.
  • All environments: lab, staging, and production.
  • All teams working on:
    • Infrastructure as Code (IaC)
    • GitOps and platform configuration
    • Security and compliance
    • SRE / observability
    • Network engineering

Primary owners:

  • CI/CD & GitOps Governance Lead (accountable)
  • Principal SRE / DevOps Architect
  • Sovereign Compliance & Sustainability Lead
  • Security Architect

2. Canonical Tool Stack (MVP)

2.1 Infrastructure as Code (IaC)

Canonical tools

  • Terraform

    • Purpose: Declarative infra configuration for network, security, IPAM, DCIM (where API-driven).
    • Typical scope:
      • L2/L3 network config and VRFs
      • Firewalls, load balancers, VPN gateways
      • IPAM, DNS records
      • Cloud/virtual resources (if used)
    • Conventions:
      • Shared modules for naming, tagging, security baselines.
      • Per-site root modules under infra-foundation/network/terraform/sites/<SITE_CODE>.
      • Remote, encrypted state that respects data residency.
  • Ansible

    • Purpose: Host configuration management and bootstrap.
    • Typical scope:
      • Bare-metal OS install and base hardening
      • Hypervisor configuration (e.g. Proxmox VE)
      • K8s node bootstrap and cluster joins
      • Ceph node configuration and initial cluster bring-up
    • Conventions:
      • Playbooks live in infra-foundation/hypervisor/ansible/ and infra-foundation/baremetal/profiles/.
      • Avoid using Ansible for topology/state that is already managed by Terraform.
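
    As an illustration of the Ansible conventions above, a minimal hardening playbook under infra-foundation/hypervisor/ansible/ might look like the following sketch. The host group, role name, and variables are assumptions for illustration, not part of the blueprint:

    ```yaml
    # Illustrative only: a minimal base-hardening playbook sketch.
    # Host group "hypervisors", the role name, and the variable are hypothetical.
    - name: Base hardening for hypervisor hosts
      hosts: hypervisors
      become: true
      roles:
        - role: base_hardening
          vars:
            ssh_permit_root_login: "no"
    ```

    Note that the playbook configures hosts only; network topology and addressing stay in Terraform, per the rule above about not duplicating Terraform-managed state.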

Rule: No additional IaC frameworks (e.g. Pulumi, CloudFormation) in production without an approved RFC.
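
A per-site Terraform root module's state configuration could be sketched as follows. The backend type, bucket name, and region here are illustrative assumptions; the actual requirement is simply encrypted remote state stored in the approved jurisdiction:

```hcl
# Sketch of a backend block for infra-foundation/network/terraform/sites/<SITE_CODE>/.
# Backend type, bucket, and region are assumptions (e.g. an in-country
# S3-compatible object store such as MinIO would also satisfy the rule).
terraform {
  backend "s3" {
    bucket  = "tfstate-example-site"    # hypothetical, in-country object store
    key     = "network/terraform.tfstate"
    region  = "eu-central-1"            # must match the site's residency rules
    encrypt = true                      # encrypted state is mandatory
  }
}
```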


2.2 GitOps

Canonical tool

  • Argo CD
    • Purpose: Declarative GitOps controller for Kubernetes and platform configs.
    • Scope:
      • Cluster bootstrapping
      • Platform services (monitoring, logging, ingress, policy)
      • Tenant workloads and namespaces
    • Patterns:
      • App-of-apps per site: platform-clusters/k8s/clusters/<SITE_CODE>/apps.yaml.
      • Separate Argo Projects for:
        • Infra & platform
        • Tenant workloads
    • Governance:
      • Production changes must be applied via Git and Argo CD.
      • Direct kubectl apply to production is an exception that must be logged and remediated.
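
The app-of-apps pattern described above can be sketched as a single root Argo CD Application that points at a directory of child Application manifests. The repoURL, project name, and child path below are illustrative assumptions:

```yaml
# Sketch of an app-of-apps root Application for one site.
# repoURL, project name, and the child "path" are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: site-apps
  namespace: argocd
spec:
  project: infra-platform            # separate Argo Project for infra & platform
  source:
    repoURL: https://git.example.org/platform-clusters.git
    targetRevision: main
    path: k8s/clusters/<SITE_CODE>/platform-apps   # directory of child Applications
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true      # drift is reverted from Git, per the governance rule
      selfHeal: true
```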

2.3 Policy-as-Code

Canonical tool

  • Kyverno (or OPA/Gatekeeper in an alternative profile; this MVP assumes Kyverno)
    • Purpose: Kubernetes-native policies for security, residency, and consistency.
    • Typical policies:
      • Namespace naming and labels (data_classification, country_code).
      • StorageClass ↔ namespace binding for residency.
      • Require NetworkPolicies for non-public namespaces.
      • Ban/limit privileged containers and risky capabilities.
    • CI integration:
      • Policies stored in policies-and-compliance/opa-policies-or-kyverno/.
      • CI runs Kyverno CLI tests on PRs modifying policies/manifests.

Rule: All new namespaces and workloads must pass policy checks in CI before being deployed via Argo CD.
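
As a sketch of the namespace-labeling policy described above, a Kyverno ClusterPolicy could enforce the two residency labels. The policy name and message wording are assumptions; the label keys come from this document:

```yaml
# Illustrative Kyverno ClusterPolicy requiring residency labels on namespaces.
# Policy and rule names are hypothetical.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-residency-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-namespace-labels
      match:
        any:
          - resources:
              kinds:
                - Namespace
      validate:
        message: "Namespaces must carry data_classification and country_code labels."
        pattern:
          metadata:
            labels:
              data_classification: "?*"   # any non-empty value
              country_code: "?*"
```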


2.4 Observability

Canonical stack

  • Prometheus — metrics collection
  • Alertmanager — alert routing
  • Loki — log aggregation
  • Tempo — traces
  • Grafana — dashboards and visualizations

Scope

  • K8s cluster health, node and app metrics
  • Storage and network metrics
  • Facility metrics:
    • PDU / UPS power
    • Rack temperatures
    • Estimates for PUE, WUE, renewable share, heat reuse
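
For reference, the two most-cited facility KPIs above have standard definitions (per The Green Grid), which the estimated metrics should follow:

```latex
% PUE: total facility energy over IT equipment energy (ideal = 1.0).
% WUE: annual site water usage per unit of IT energy.
\mathrm{PUE} = \frac{E_{\mathrm{facility,total}}}{E_{\mathrm{IT}}}
\qquad
\mathrm{WUE} = \frac{\text{annual water usage (L)}}{E_{\mathrm{IT}}\ \text{(kWh)}}
```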

Conventions

  • “Infra” observability stack runs on one or more infra clusters (per country/region).
  • Data residency:
    • Metrics/logs/traces for CRITICAL_SOVEREIGN and SENSITIVE_PERSONAL workloads remain within the approved jurisdiction.
  • SLOs:
    • SLOs/SLIs (including AI/ML fabric SLOs) are:
      • Defined in Git (Prometheus rule files)
      • Visualized in Grafana dashboards
      • Linked from runbooks and on-call docs
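
A Prometheus rule file implementing such an SLO might look like the following sketch. The metric names, threshold, and labels are hypothetical; real SLIs are defined per service in Git:

```yaml
# Illustrative SLO recording + alerting rules; all names and thresholds
# are assumptions for the sketch.
groups:
  - name: site-availability-slo
    rules:
      - record: slo:request_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{code!~"5.."}[5m]))
            /
          sum(rate(http_requests_total[5m]))
      - alert: AvailabilitySLOBurn
        expr: slo:request_availability:ratio_rate5m < 0.995
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "Availability below SLO target (hypothetical threshold)"
```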

2.5 Network Verification

Canonical tools

  • Batfish

    • Purpose: Static analysis of network configs pre-change.
    • Use cases:
      • Verify reachability/isolation between VRFs (TENANT, INFRA_MGMT, STORAGE, OOB).
      • Confirm no unintended paths allow CRITICAL_SOVEREIGN networks to leave the country.
    • Conventions:
      • Test suites under infra-foundation/network/tests/batfish/.
      • CI stage runs Batfish on every PR that touches network configuration.
      • Merge is blocked on failed verification.
  • Synthetic probes

    • Purpose: Runtime path validation and basic performance checks.
    • Implementation examples:
      • K8s Jobs/DaemonSets running ping, traceroute, HTTP checks, throughput tests.
      • Metrics exported to Prometheus.
    • Scope:
      • Intra-site fabric health (leaf/spine/ToR)
      • Critical east-west and north-south paths
      • Site-to-site links for DR and federation
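
One minimal way to implement such a probe is a Kubernetes CronJob that exercises a critical path on a schedule. The name, namespace, schedule, image, and target URL below are all assumptions:

```yaml
# Illustrative synthetic HTTP probe; every identifier here is hypothetical.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: probe-east-west
  namespace: observability
spec:
  schedule: "*/5 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: http-probe
              image: curlimages/curl:latest
              # Fail the Job (and surface an alertable signal) if the path is down.
              args: ["-fsS", "--max-time", "5", "http://ingress.internal.example/healthz"]
```

In practice the Prometheus blackbox_exporter is a common way to turn such checks into scrapeable metrics rather than bare Job successes/failures.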

3. Repository Layout & Where Things Live

This README assumes the global Git structure from the blueprint.

3.1 infra-foundation repo

  • facility/site_manifests/
    Site-specific overlays (power, cooling, capacity, regulatory details).

  • facility/rack_layouts/
    Rack and cabling maps, logically referenced by IaC.

  • network/terraform/

    • modules/ — shared TF modules for network primitives.
    • sites/<SITE_CODE>/ — site root modules and environment configs.
    • tests/batfish/ — Batfish configs and test suites.
  • hypervisor/ansible/
    Playbooks/roles for Proxmox and related host config.

  • baremetal/profiles/
    Bare-metal provisioning profiles and Ansible roles.

3.2 platform-clusters repo

  • k8s/clusters/<SITE_CODE>/
    • cluster-bootstrap/ — base cluster manifests.
    • apps.yaml — Argo CD app-of-apps for the site.
  • addons/monitoring-logging-security/
    • Helm charts/manifests for Prometheus, Loki, Tempo, Grafana, Kyverno, etc.

3.3 policies-and-compliance repo

  • data-classification.yaml
    Definitions for PUBLIC, INTERNAL, PERSONAL, SENSITIVE_PERSONAL, CRITICAL_SOVEREIGN_<COUNTRY_CODE>.

  • opa-policies-or-kyverno/
    Policy definitions and tests.

  • sustainability-kpis.yaml
    KPIs and thresholds for PUE, WUE, renewable share, reuse.

  • rbac-and-iam.yaml
    Roles, groups, and access models across tools.

  • toolset-profiles/

    • minimum-toolset-profile.toon.yaml (this profile)
    • Future profiles (e.g. extended-observability, enterprise-policy-suite)
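
As one hypothetical sketch of what data-classification.yaml could contain (the field names, schema, and the DE example are illustrative assumptions, not the blueprint's actual schema):

```yaml
# Hypothetical structure for data-classification.yaml; the real schema
# is defined by the blueprint.
classes:
  - name: PUBLIC
    residency_required: false
  - name: SENSITIVE_PERSONAL
    residency_required: true
  - name: CRITICAL_SOVEREIGN_DE     # example with <COUNTRY_CODE> = DE
    residency_required: true
    allowed_jurisdictions: ["DE"]
```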

4. Change Workflow (Git-First, Pipeline-First)

4.1 Typical change flow

  1. Engineer opens a PR in the relevant repo:

    • infra-foundation for network or facility changes.
    • platform-clusters for K8s/app changes.
    • policies-and-compliance for policy/toolset updates.
  2. CI pipeline runs:

    • Lint and unit tests.
    • Kyverno/OPA policy checks (where applicable).
    • Batfish network tests (for network changes).
    • Integration tests or dry-run Argo CD sync where feasible.
  3. Review & approvals:

    • At least one peer reviewer.
    • Additional approvals for:
      • Policy changes (Compliance/Security).
      • Toolset changes (CI/CD Governance Lead).
  4. Merge to main / protected branch.

  5. Argo CD syncs changes to the target environment:

    • For production, use a manual promotion step (e.g. tag or branch).
  6. Post-deploy verification:

    • Synthetic probes and SLO dashboards checked.
    • Any policy violations or drift are investigated and corrected.
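
The CI stages in step 2 could be wired up roughly as in the following GitHub Actions sketch. The CI system, job names, script paths, and invocations are all assumptions and will differ per repo; the shape of the gates is what matters:

```yaml
# Illustrative CI skeleton for a network-change PR; all names are assumptions.
name: pr-checks
on: [pull_request]
jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: terraform fmt -check -recursive network/terraform
  policy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Kyverno CLI runs the policy test suites on the PR.
      - run: kyverno test policies-and-compliance/opa-policies-or-kyverno/
  network-verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Batfish tests run against a Batfish service; this invocation is
      # illustrative, and a failing suite blocks the merge.
      - run: python network/tests/batfish/run_tests.py
```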

5. Toolset Governance & RFCs

This MVP profile deliberately limits the core stack. Introducing new tools or replacing canonical ones must go through a Toolset RFC process.

5.1 When you must write an RFC

  • Adding:
    • A second GitOps controller.
    • A new observability backend (logs, metrics, or traces).
    • Another IaC framework in production.
  • Replacing:
    • Terraform, Ansible, Argo CD, Kyverno, Prometheus, Loki, Tempo, Grafana, or Batfish.
  • Introducing:
    • New data paths that might affect sovereignty or residency.

5.2 RFC contents (minimal)

  • Problem statement and motivation.
  • Proposed tool and how it fits the existing stack.
  • Security, sovereignty, and sustainability impact.
  • Operational complexity impact:
    • New skills required
    • Runbooks and documentation needs
  • Migration and rollback plan.

6. Rollout Plan for the MVP Toolset

High-level phases:

  1. T0 — Assessment & Inventory

    • Document current tools in use per site.
    • Map overlaps/conflicts with this MVP profile.
  2. T1 — Pilot Site

    • Choose one non-critical site (or lab environment).
    • Implement the full MVP stack and workflows.
    • Run at least one full change cycle (PR → CI → deploy → verify).
  3. T2 — Template Hardening

    • Extract reusable modules and patterns.
    • Update minimum-toolset-profile.toon.yaml with real-world findings.
    • Finalize SLOs, runbooks, and on-call procedures.
  4. T3 — Broad Adoption

    • Adopt MVP toolset as default for new sites.
    • Gradually migrate existing sites, prioritizing those with:
      • Highest sovereignty requirements.
      • Highest operational pain due to tool sprawl.

7. How to Use This Profile

  • New site? Start from the MVP profile. Only request additional tools if strictly necessary.
  • Existing site? Use the MVP as a north star; plan migrations away from overlapping tools.
  • New engineer? Read this README, then:
    • Explore infra-foundation, platform-clusters, policies-and-compliance.
    • Run through a non-production change end-to-end under supervision.

If something in this document doesn't match reality, the mismatch must be fixed: either the code or this README. The MVP is only useful if it is live doctrine, not shelfware.