Add checklists/platform-architecture.md

2025-12-04 13:35:36 +00:00
parent 1002c2ad9d
commit fd184d4255

# Platform Architecture Checklist

A work-through checklist, from Metal → Network → Virtualization → Self-service.
---
## 0. Foundation: Goals, Contracts, and Users
* [ ] **Define platform goals**
* [ ] Write 3–5 bullet points for what the platform must enable (e.g. “fast provisioning”, “reliable GPU workloads”).
* [ ] Define initial SLOs for the *platform itself* (e.g. “VM provision succeeds within 15 minutes 99% of the time”).
* [ ] **Define platform customers**
* [ ] List main user types: backend devs, ML engineers, data engineers, etc.
* [ ] For each, write 3 typical tasks (e.g. “deploy a new microservice”, “run a 3-day GPU training job”).
* [ ] **Write a one-page Platform Contract**
* [ ] What the platform guarantees (uptime, basic security, monitoring defaults).
* [ ] What product teams must do (health checks, logging, deployment pattern, secrets usage).
* [ ] Store this in version control and share it.
---
## 1. Metal Layer Checklist (Racks → MAAS → Images)
### 1.1 Hardware & Physical Layout
* [ ] **Inventory physical assets**
* [ ] List all servers with: CPU, RAM, disk(s), NICs, GPU(s) where applicable.
* [ ] Identify roles: `compute`, `gpu`, `storage`, `infra`, `control-plane`.
* [ ] **Define physical topology**
* [ ] Map which racks and switches each server connects to.
* [ ] Document power feeds and any redundancy (A/B feeds, UPS, etc).
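A quick way to seed the inventory above is to pull what MAAS already knows. A minimal sketch, assuming a logged-in MAAS CLI profile named `admin` and `jq` on the path:

```bash
# List every machine MAAS has enlisted, keeping only the fields that
# matter for role classification (field names per the MAAS machines API).
maas admin machines read | jq '.[] | {hostname, cpu_count, memory, status_name}'
```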
### 1.2 MAAS / Bare-Metal Provisioning
* [ ] **Design MAAS/Ironic architecture**
* [ ] Decide MAAS region(s) and rack controllers per site.
* [ ] Decide where the MAAS database/API lives and how it's backed up.
* [ ] Define access rules to MAAS (who can log in, via what SSO/LDAP/etc).
* [ ] **Standardize provisioning pipeline**
* [ ] Confirm **single flow**: power on → PXE → MAAS → Preseed/cloud-init → config management.
* [ ] Remove or document every deviation / legacy path.
* [ ] Create a flow diagram and store it in the repo.
* [ ] **Set up node classification**
* [ ] Define MAAS tags / resource pools for: `compute`, `gpu`, `storage`, `infra`, `test`.
* [ ] Ensure every node has:
* [ ] Role tag.
* [ ] Site/room/rack metadata.
* [ ] Any special hardware flags (GPU type, NVMe, etc).
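The classification step above might look like this with the MAAS CLI (a sketch; assumes a logged-in profile named `admin`, and the system ID is hypothetical):

```bash
# Create role tags once, then attach them to machines as they enroll.
maas admin tags create name=gpu comment="Nodes with GPUs"
maas admin tags create name=compute comment="General-purpose compute"
# Attach a tag to a machine by its MAAS system ID.
maas admin tag update-nodes gpu add=abc123
```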
### 1.3 OS Images & Base Configuration
* [ ] **Define image catalogue**
* [ ] Choose base OS (e.g. Ubuntu LTS / Debian stable).
* [ ] Define 3–5 golden images (max), e.g.:
* [ ] `base-os` (minimal hardened image).
* [ ] `infra-node` (for MAAS/OpenStack/Proxmox controllers).
* [ ] `gpu-node` (GPU drivers, CUDA stack).
* [ ] (Optional) `storage-node`.
* [ ] Set naming/versioning convention (e.g. `eu-baseos-2025.01`).
* [ ] **Harden base image**
* [ ] Baseline SSH config:
* [ ] Key-based auth only (no passwords).
* [ ] No direct root SSH (use sudo).
* [ ] Remove obviously unnecessary packages and services.
* [ ] Standard logging + monitoring agent baked in (or installed by config management).
* [ ] **Rebuild confidence**
* [ ] Confirm you can rebuild *any* node via MAAS + config management:
* [ ] Pick one node per role and do a full reinstall.
* [ ] Verify the node comes back in the expected state (role, monitoring, access).
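The SSH baseline and the rebuild spot-check above can both be scripted. A minimal sketch for an Ubuntu/Debian image (assumes an OpenSSH new enough to read `sshd_config.d` drop-ins; agent/service names are illustrative):

```bash
# Apply the SSH baseline as a drop-in, then reload the daemon.
cat <<'EOF' | sudo tee /etc/ssh/sshd_config.d/99-hardening.conf
PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes
EOF
sudo systemctl reload ssh

# After a reinstall, confirm the baseline actually took effect.
sudo sshd -T | grep -E 'passwordauthentication|permitrootlogin'
systemctl is-active node-exporter || echo "monitoring agent missing"
```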
---
## 2. Network Layer Checklist (Segmentation → Access → Security)
### 2.1 Logical Network Design
* [ ] **Define network segments**
* [ ] List core VLANs/networks:
* [ ] `mgmt` (infra control plane).
* [ ] `storage`.
* [ ] `tenant` (workload traffic).
* [ ] `public` (north-south).
* [ ] `backup`.
* [ ] `oob` (BMC / IPMI).
* [ ] Assign CIDRs and names to each.
* [ ] **Routing & L3 boundaries**
* [ ] Decide where routing happens (e.g. ToR vs core).
* [ ] Identify L3 gateways and firewall points.
* [ ] Document which networks can speak to which (and why).
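For illustration, a common pattern is one /16 per site carved into per-segment /24s (e.g. `mgmt` = 10.10.0.0/24, `storage` = 10.10.20.0/24, `oob` = 10.10.90.0/24); the exact plan matters less than writing it down and reserving room to grow.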
### 2.2 Admin & Operator Access
* [ ] **Bastion / jump hosts**
* [ ] Create at least one bastion per site.
* [ ] Restrict SSH to infra nodes so it must go through a bastion (see the sketch at the end of this subsection).
* [ ] Enforce key-based auth for all bastion logins.
* [ ] **VPN design**
* [ ] Choose VPN solution for remote admins.
* [ ] Restrict VPN access to admin subnets only (no full corporate LAN free-for-all).
* [ ] Document joining/removal procedure for admin devices.
* [ ] **Out-of-band (OOB)**
* [ ] Put BMC/IPMI interfaces on a dedicated OOB network.
* [ ] Restrict OOB access to admin/VPN ranges only.
* [ ] Document OOB procedure in case primary network is down.
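Day-to-day operator access then reduces to one pattern. A sketch (hostnames illustrative):

```bash
# All SSH to infra nodes hops through the bastion via ProxyJump.
ssh -J admin@bastion.site1.example admin@node01.mgmt.example
```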
### 2.3 Security & Policies
* [ ] **Access control**
* [ ] Define who can:
* [ ] Access MAAS.
* [ ] Access virtualization APIs (OpenStack/Proxmox).
* [ ] Access network gear.
* [ ] Implement least privilege (roles/groups).
* [ ] **Network policy baselines**
* [ ] Default deny for inbound traffic from the `public` network.
* [ ] Clear rules for:
* [ ] SSH to infra.
* [ ] DB access (from which networks, via which services).
* [ ] Document exceptions and their owners.
* [ ] **Document network diagram**
* [ ] Produce one main L2/L3 + VLAN diagram.
* [ ] Store it in git and link from platform docs.
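The default-deny baseline above can be expressed as explicit allows. A sketch using OpenStack security groups (group names and CIDRs are illustrative):

```bash
# Only the admin network may reach SSH on infra instances; everything
# else stays blocked by the group's default-deny ingress posture.
openstack security group create infra-ssh --description "SSH from admin net only"
openstack security group rule create --protocol tcp --dst-port 22 \
  --remote-ip 10.10.0.0/24 infra-ssh
```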
---
## 3. Virtualization / Platform Layer Checklist (OpenStack/Proxmox)
### 3.1 Platform Scope & Roles
* [ ] **Choose role split**
* [ ] Decide what runs on:
* [ ] OpenStack (multi-tenant workloads).
* [ ] Proxmox (infra VMs, special cases).
* [ ] Bare metal only (DBs, specific storage, heavy GPU jobs, etc).
* [ ] Write this down as “What runs where” guidance.
### 3.2 OpenStack (or primary cloud platform)
* [ ] **Projects & Tenancy model**
* [ ] Decide project structure:
* [ ] `project = team + environment` (e.g. `payments-prod`) **or**
* [ ] `project = team` (environments as labels).
* [ ] Define naming conventions:
* [ ] Projects, security groups, networks, instances.
* [ ] **Flavors & quotas**
* [ ] Define a small set of standard flavors:
* [ ] `small`, `medium`, `large`, `gpu-small`, `gpu-large`, etc.
* [ ] Set default quotas per project:
* [ ] CPU, RAM, disk, number of instances, GPUs.
* [ ] Document process for requesting quota increases.
* [ ] **Networks for tenants**
* [ ] Standard pattern for tenant networks:
* [ ] One internal network per project.
* [ ] Optional external/public network attachment rules.
* [ ] Standard floating IP usage rules (who can get one, how many).
* [ ] **Control plane hardening & HA**
* [ ] Run key components in HA (if feasible):
* [ ] API, schedulers, message queue, DB.
* [ ] Enable TLS where possible for dashboard/API endpoints.
* [ ] Ensure backups for OpenStack DB + configs.
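The tenancy, flavor/quota, and tenant-network patterns above might look like this with the OpenStack CLI (all names, sizes, and CIDRs are illustrative):

```bash
# One project per team+environment, standard flavors, default quotas.
openstack project create --description "Payments, production" payments-prod
openstack flavor create small  --vcpus 2 --ram 4096 --disk 40
openstack flavor create medium --vcpus 4 --ram 8192 --disk 80
openstack quota set --cores 40 --ram 102400 --instances 20 payments-prod

# Standard per-project internal network, plus a floating IP from "public".
openstack network create payments-prod-net
openstack subnet create --network payments-prod-net \
  --subnet-range 10.50.0.0/24 payments-prod-subnet
openstack floating ip create public
```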
### 3.3 Proxmox (or secondary virtualization platform)
* [ ] **Scope definition**
* [ ] Decide clearly:
* [ ] Which workloads belong here (infra services? special vendor appliances?).
* [ ] Avoid overlapping with OpenStack use cases when possible.
* [ ] **Resource & naming policy**
* [ ] Define naming for Proxmox clusters and VMs.
* [ ] Decide whether teams get self-service Proxmox or it's SRE-only.
### 3.4 Configuration Management
* [ ] **IaC coverage**
* [ ] Ensure the following are stored as code (Ansible, Terraform, etc.), not just in the UI:
* [ ] OpenStack projects, networks, flavors.
* [ ] Proxmox clusters and key VMs.
* [ ] **Reproducibility**
* [ ] Test that you can:
* [ ] Recreate a project and its associated resources from code.
* [ ] Rebuild a critical controller VM (from base image + config management).
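A sketch of that reproducibility test, assuming the platform repo uses Terraform with the OpenStack provider (the repo URL and variable name are hypothetical):

```bash
# Recreating a project from code should be a plan/apply, not a UI session.
git clone git@git.internal:infra/infra-platform.git && cd infra-platform
terraform init
terraform plan -var 'project=payments-prod'   # dry run: expect no drift
```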
---
## 4. Self-Service Layer Checklist (APIs → CLI → UX)
### 4.1 Define Core Self-Service Use Cases
* [ ] **List top flows**
* [ ] “Create new project/environment for a team.”
* [ ] “Provision compute (VM) for a service.”
* [ ] “Request GPU capacity.”
* [ ] “Onboard a new service to monitoring.”
* [ ] “See my project's resource usage.”
* [ ] **For each flow:**
* [ ] Define required inputs.
* [ ] Define outputs and completion condition.
* [ ] Identify which platform components are touched (OpenStack, MAAS, observability, etc).
### 4.2 API / CLI Design
* [ ] **Choose primary interface**
* [ ] Decide: CLI, internal API, or both as canonical interface.
* [ ] Document: “All self-service flows must be available via X.”
* [ ] **Implement minimal CLI/API for key flows**
* [ ] `create-project` / `create-namespace`.
* [ ] `request-vm` (or template-based: `create-service`).
* [ ] `request-gpu` with constraints and limits.
* [ ] `show-usage` (CPU/RAM/GPU/storage per project).
* [ ] **Guardrails**
* [ ] Enforce:
* [ ] Naming standards (team/env in names).
* [ ] Quotas (fail fast if over).
* [ ] Log all actions centrally.
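A minimal sketch of what a guardrail looks like in a CLI wrapper (every name, pattern, and convention here is hypothetical):

```bash
#!/usr/bin/env bash
# request-vm: enforce the <team>-<env>-<service> naming standard up front,
# then delegate to the platform; quota violations fail fast server-side.
set -euo pipefail
name="$1"; flavor="${2:-small}"
if [[ ! "$name" =~ ^[a-z0-9]+-(dev|staging|prod)-[a-z0-9-]+$ ]]; then
  echo "error: name must be <team>-<env>-<service>" >&2
  exit 1
fi
# Attach to the team's internal network (naming illustrative).
openstack server create --flavor "$flavor" --image base-os \
  --network "${name%%-*}-net" "$name"
```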
### 4.3 Golden Paths & Templates
* [ ] **Service templates**
* [ ] Provide:
* [ ] Example app repo with CI/CD pipeline ready.
* [ ] Example deployment manifest (VM/K8s/etc).
* [ ] Built-in monitoring/logging configuration.
* [ ] **Onboarding checklists**
* [ ] New service checklist:
* [ ] Project created.
* [ ] Monitoring enabled.
* [ ] Alerts defined.
* [ ] Dashboards created.
* [ ] Secret management integrated.
### 4.4 Documentation & Feedback
* [ ] **Platform docs**
* [ ] “How to get started” guide for:
* [ ] New engineers.
* [ ] New services.
* [ ] FAQ for:
* [ ] “Where do I run X?”
* [ ] “How do I get more quota?”
* [ ] **Feedback loop**
* [ ] Set up a channel (Slack/discussions/form) for platform feedback.
* [ ] Review and triage feedback monthly into a platform backlog.
---
## 5. Cross-Cutting Checklist (Observability + GitOps + Failure)
### 5.1 Observability
* [ ] **Telemetry baseline**
* [ ] Every node:
* [ ] Exposes metrics (node exporter or equivalent).
* [ ] Sends logs to a central store with site/role tags.
* [ ] Every platform service (MAAS, OpenStack, Proxmox, VPN, bastion):
* [ ] Has metrics and basic dashboards.
* [ ] **Platform dashboards**
* [ ] Cluster capacity overview (CPU/RAM/storage/GPU).
* [ ] Provisioning pipeline health (errors, durations).
* [ ] Per-project usage dashboards.
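A quick spot-check for the telemetry baseline (9100 is node_exporter's usual default port; adjust if your deployment differs):

```bash
# A node that meets the baseline answers with Prometheus-format metrics.
curl -sf http://localhost:9100/metrics | head -n 5
```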
### 5.2 GitOps
* [ ] **Repositories**
* [ ] At least:
* [ ] `infra-baremetal`.
* [ ] `infra-network`.
* [ ] `infra-platform` (OpenStack/Proxmox).
* [ ] `infra-observability`.
* [ ] Each has clear README and ownership.
* [ ] **Change process**
* [ ] All changes go through PRs.
* [ ] CI validates syntax, lint, and (where possible) dry runs.
* [ ] Changes deployed via pipelines, not ad-hoc scripts.
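The CI validation step might run checks like these (tool choice and directory layout are illustrative; match whatever the repo actually uses):

```bash
# Lint and dry-run before anything merges.
ansible-lint playbooks/
terraform -chdir=environments/prod validate
terraform -chdir=environments/prod plan -lock=false   # dry run only
```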
### 5.3 Failure & Recovery
* [ ] **Document failure scenarios**
* [ ] Single node failure.
* [ ] Rack switch failure.
* [ ] Loss of MAAS / OpenStack API for a period.
* [ ] Partial network partition (e.g. mgmt vs tenant).
* [ ] **For each scenario:**
* [ ] Define expected behavior.
* [ ] Define manual/automatic recovery steps.
* [ ] Run at least one game day per quarter to validate.
---