Add checklists/platform-architecture.md

2025-12-04 13:35:36 +00:00
parent 1002c2ad9d
commit fd184d4255

# Platform Architecture Checklist

A work-through checklist, from Metal → Network → Virtualization → Self-service.
---
## 0. Foundation: Goals, Contracts, and Users
* [ ] **Define platform goals**
* [ ] Write 3–5 bullet points for what the platform must enable (e.g. “fast provisioning”, “reliable GPU workloads”).
* [ ] Define initial SLOs for the *platform itself* (e.g. “VM provision succeeds within 15 minutes 99% of the time”).
* [ ] **Define platform customers**
* [ ] List main user types: backend devs, ML engineers, data engineers, etc.
* [ ] For each, write 3 typical tasks (e.g. “deploy a new microservice”, “run a 3-day GPU training job”).
* [ ] **Write a one-page Platform Contract**
* [ ] What the platform guarantees (uptime, basic security, monitoring defaults).
* [ ] What product teams must do (health checks, logging, deployment pattern, secrets usage).
* [ ] Store this in version control and share it.
---
## 1. Metal Layer Checklist (Racks → MAAS → Images)
### 1.1 Hardware & Physical Layout
* [ ] **Inventory physical assets**
* [ ] List all servers with: CPU, RAM, disk(s), NICs, GPU(s) where applicable.
* [ ] Identify roles: `compute`, `gpu`, `storage`, `infra`, `control-plane`.
* [ ] **Define physical topology**
* [ ] Map which racks and switches each server connects to.
* [ ] Document power feeds and any redundancy (A/B feeds, UPS, etc).
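A quick way to seed the inventory above is to pull what MAAS already knows. A minimal sketch, assuming a logged-in MAAS CLI profile named `admin` and `jq` on the path:

```bash
# List every machine MAAS has enlisted, keeping only the fields that
# matter for role classification (field names per the MAAS machines API).
maas admin machines read | jq '.[] | {hostname, cpu_count, memory, status_name}'
```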
### 1.2 MAAS / Bare-Metal Provisioning
* [ ] **Design MAAS/Ironic architecture**
* [ ] Decide MAAS region(s) and rack controllers per site.
* [ ] Decide where the MAAS database/API lives and how it's backed up.
* [ ] Define access rules to MAAS (who can log in, via what SSO/LDAP/etc).
* [ ] **Standardize provisioning pipeline**
* [ ] Confirm **single flow**: power on → PXE → MAAS → Preseed/cloud-init → config management.
* [ ] Remove or document every deviation / legacy path.
* [ ] Create a flow diagram and store it in the repo.
* [ ] **Set up node classification**
* [ ] Define MAAS tags / resource pools for: `compute`, `gpu`, `storage`, `infra`, `test`.
* [ ] Ensure every node has:
* [ ] Role tag.
* [ ] Site/room/rack metadata.
* [ ] Any special hardware flags (GPU type, NVMe, etc).
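The classification step above might look like this with the MAAS CLI (a sketch; assumes a logged-in profile named `admin`, and the system ID is hypothetical):

```bash
# Create role tags once, then attach them to machines as they enroll.
maas admin tags create name=gpu comment="Nodes with GPUs"
maas admin tags create name=compute comment="General-purpose compute"
# Attach a tag to a machine by its MAAS system ID.
maas admin tag update-nodes gpu add=abc123
```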
### 1.3 OS Images & Base Configuration
* [ ] **Define image catalogue**
* [ ] Choose base OS (e.g. Ubuntu LTS / Debian stable).
* [ ] Define 3–5 golden images (max), e.g.:
* [ ] `base-os` (minimal hardened image).
* [ ] `infra-node` (for MAAS/OpenStack/Proxmox controllers).
* [ ] `gpu-node` (GPU drivers, CUDA stack).
* [ ] (Optional) `storage-node`.
* [ ] Set naming/versioning convention (e.g. `eu-baseos-2025.01`).
* [ ] **Harden base image**
* [ ] Baseline SSH config:
* [ ] Key-based auth only (no passwords).
* [ ] No direct root SSH (use sudo).
* [ ] Remove obviously unnecessary packages and services.
* [ ] Standard logging + monitoring agent baked in (or installed by config management).
* [ ] **Rebuild confidence**
* [ ] Confirm you can rebuild *any* node via MAAS + config management:
* [ ] Pick one node per role and do a full reinstall.
* [ ] Verify the node comes back in the expected state (role, monitoring, access).
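The SSH baseline and the rebuild spot-check above can both be scripted. A minimal sketch for an Ubuntu/Debian image (assumes an OpenSSH new enough to read `sshd_config.d` drop-ins; agent/service names are illustrative):

```bash
# Apply the SSH baseline as a drop-in, then reload the daemon.
cat <<'EOF' | sudo tee /etc/ssh/sshd_config.d/99-hardening.conf
PasswordAuthentication no
PermitRootLogin no
PubkeyAuthentication yes
EOF
sudo systemctl reload ssh

# After a reinstall, confirm the baseline actually took effect.
sudo sshd -T | grep -E 'passwordauthentication|permitrootlogin'
systemctl is-active node-exporter || echo "monitoring agent missing"
```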
---
## 2. Network Layer Checklist (Segmentation → Access → Security)
### 2.1 Logical Network Design
* [ ] **Define network segments**
* [ ] List core VLANs/networks:
* [ ] `mgmt` (infra control plane).
* [ ] `storage`.
* [ ] `tenant` (workload traffic).
* [ ] `public` (north-south).
* [ ] `backup`.
* [ ] `oob` (BMC / IPMI).
* [ ] Assign CIDRs and names to each.
* [ ] **Routing & L3 boundaries**
* [ ] Decide where routing happens (e.g. ToR vs core).
* [ ] Identify L3 gateways and firewall points.
* [ ] Document which networks can speak to which (and why).
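For illustration, a common pattern is one /16 per site carved into per-segment /24s (e.g. `mgmt` = 10.10.0.0/24, `storage` = 10.10.20.0/24, `oob` = 10.10.90.0/24); the exact plan matters less than writing it down and reserving room to grow.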
### 2.2 Admin & Operator Access
* [ ] **Bastion / jump hosts**
* [ ] Create at least one bastion per site.
* [ ] Restrict SSH to infra nodes so it must go through a bastion (see the sketch at the end of this subsection).
* [ ] Enforce key-based auth for all bastion logins.
* [ ] **VPN design**
* [ ] Choose VPN solution for remote admins.
* [ ] Restrict VPN access to admin subnets only (no full corporate LAN free-for-all).
* [ ] Document joining/removal procedure for admin devices.
* [ ] **Out-of-band (OOB)**
* [ ] Put BMC/IPMI interfaces on a dedicated OOB network.
* [ ] Restrict OOB access to admin/VPN ranges only.
* [ ] Document OOB procedure in case primary network is down.
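Day-to-day operator access then reduces to one pattern. A sketch (hostnames illustrative):

```bash
# All SSH to infra nodes hops through the bastion via ProxyJump.
ssh -J admin@bastion.site1.example admin@node01.mgmt.example
```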
### 2.3 Security & Policies
* [ ] **Access control**
* [ ] Define who can:
* [ ] Access MAAS.
* [ ] Access virtualization APIs (OpenStack/Proxmox).
* [ ] Access network gear.
* [ ] Implement least privilege (roles/groups).
* [ ] **Network policy baselines**
* [ ] Default deny for inbound traffic from the `public` network.
* [ ] Clear rules for:
* [ ] SSH to infra.
* [ ] DB access (from which networks, via which services).
* [ ] Document exceptions and their owners.
* [ ] **Document network diagram**
* [ ] Produce one main L2/L3 + VLAN diagram.
* [ ] Store it in git and link from platform docs.
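The default-deny baseline above can be expressed as explicit allows. A sketch using OpenStack security groups (group names and CIDRs are illustrative):

```bash
# Only the admin network may reach SSH on infra instances; everything
# else stays blocked by the group's default-deny ingress posture.
openstack security group create infra-ssh --description "SSH from admin net only"
openstack security group rule create --protocol tcp --dst-port 22 \
  --remote-ip 10.10.0.0/24 infra-ssh
```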
---
## 3. Virtualization / Platform Layer Checklist (OpenStack/Proxmox)
### 3.1 Platform Scope & Roles
* [ ] **Choose role split**
* [ ] Decide what runs on:
* [ ] OpenStack (multi-tenant workloads).
* [ ] Proxmox (infra VMs, special cases).
* [ ] Bare metal only (DBs, specific storage, heavy GPU jobs, etc).
* [ ] Write this down as “What runs where” guidance.
### 3.2 OpenStack (or primary cloud platform)
* [ ] **Projects & Tenancy model**
* [ ] Decide project structure:
* [ ] `project = team + environment` (e.g. `payments-prod`) **or**
* [ ] `project = team` (environments as labels).
* [ ] Define naming conventions:
* [ ] Projects, security groups, networks, instances.
* [ ] **Flavors & quotas**
* [ ] Define a small set of standard flavors:
* [ ] `small`, `medium`, `large`, `gpu-small`, `gpu-large`, etc.
* [ ] Set default quotas per project:
* [ ] CPU, RAM, disk, number of instances, GPUs.
* [ ] Document process for requesting quota increases.
* [ ] **Networks for tenants**
* [ ] Standard pattern for tenant networks:
* [ ] One internal network per project.
* [ ] Optional external/public network attachment rules.
* [ ] Standard floating IP usage rules (who can get one, how many).
* [ ] **Control plane hardening & HA**
* [ ] Run key components in HA (if feasible):
* [ ] API, schedulers, message queue, DB.
* [ ] Enable TLS where possible for dashboard/API endpoints.
* [ ] Ensure backups for OpenStack DB + configs.
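The tenancy, flavor/quota, and tenant-network patterns above might look like this with the OpenStack CLI (all names, sizes, and CIDRs are illustrative):

```bash
# One project per team+environment, standard flavors, default quotas.
openstack project create --description "Payments, production" payments-prod
openstack flavor create small  --vcpus 2 --ram 4096 --disk 40
openstack flavor create medium --vcpus 4 --ram 8192 --disk 80
openstack quota set --cores 40 --ram 102400 --instances 20 payments-prod

# Standard per-project internal network, plus a floating IP from "public".
openstack network create payments-prod-net
openstack subnet create --network payments-prod-net \
  --subnet-range 10.50.0.0/24 payments-prod-subnet
openstack floating ip create public
```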
### 3.3 Proxmox (or secondary virtualization platform)
* [ ] **Scope definition**
* [ ] Decide clearly:
* [ ] Which workloads belong here (infra services? special vendor appliances?).
* [ ] Avoid overlapping with OpenStack use cases when possible.
* [ ] **Resource & naming policy**
* [ ] Define naming for Proxmox clusters and VMs.
* [ ] Decide whether teams get self-service Proxmox or it's SRE-only.
### 3.4 Configuration Management
* [ ] **IaC coverage**
* [ ] Ensure the following are stored as code (Ansible, Terraform, etc.), not just in the UI:
* [ ] OpenStack projects, networks, flavors.
* [ ] Proxmox clusters and key VMs.
* [ ] **Reproducibility**
* [ ] Test that you can:
* [ ] Recreate a project and its associated resources from code.
* [ ] Rebuild a critical controller VM (from base image + config management).
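A sketch of that reproducibility test, assuming the platform repo uses Terraform with the OpenStack provider (the repo URL and variable name are hypothetical):

```bash
# Recreating a project from code should be a plan/apply, not a UI session.
git clone git@git.internal:infra/infra-platform.git && cd infra-platform
terraform init
terraform plan -var 'project=payments-prod'   # dry run: expect no drift
```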
---
## 4. Self-Service Layer Checklist (APIs → CLI → UX)
### 4.1 Define Core Self-Service Use Cases
* [ ] **List top flows**
* [ ] “Create new project/environment for a team.”
* [ ] “Provision compute (VM) for a service.”
* [ ] “Request GPU capacity.”
* [ ] “Onboard a new service to monitoring.”
* [ ] “See my project's resource usage.”
* [ ] **For each flow:**
* [ ] Define required inputs.
* [ ] Define outputs and completion condition.
* [ ] Identify which platform components are touched (OpenStack, MAAS, observability, etc).
### 4.2 API / CLI Design
* [ ] **Choose primary interface**
* [ ] Decide: CLI, internal API, or both as canonical interface.
* [ ] Document: “All self-service flows must be available via X.”
* [ ] **Implement minimal CLI/API for key flows**
* [ ] `create-project` / `create-namespace`.
* [ ] `request-vm` (or template-based: `create-service`).
* [ ] `request-gpu` with constraints and limits.
* [ ] `show-usage` (CPU/RAM/GPU/storage per project).
* [ ] **Guardrails**
* [ ] Enforce:
* [ ] Naming standards (team/env in names).
* [ ] Quotas (fail fast if over).
* [ ] Log all actions centrally.
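A minimal sketch of what a guardrail looks like in a CLI wrapper (every name, pattern, and convention here is hypothetical):

```bash
#!/usr/bin/env bash
# request-vm: enforce the <team>-<env>-<service> naming standard up front,
# then delegate to the platform; quota violations fail fast server-side.
set -euo pipefail
name="$1"; flavor="${2:-small}"
if [[ ! "$name" =~ ^[a-z0-9]+-(dev|staging|prod)-[a-z0-9-]+$ ]]; then
  echo "error: name must be <team>-<env>-<service>" >&2
  exit 1
fi
# Attach to the team's internal network (naming illustrative).
openstack server create --flavor "$flavor" --image base-os \
  --network "${name%%-*}-net" "$name"
```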
### 4.3 Golden Paths & Templates
* [ ] **Service templates**
* [ ] Provide:
* [ ] Example app repo with CI/CD pipeline ready.
* [ ] Example deployment manifest (VM/K8s/etc).
* [ ] Built-in monitoring/logging configuration.
* [ ] **Onboarding checklists**
* [ ] New service checklist:
* [ ] Project created.
* [ ] Monitoring enabled.
* [ ] Alerts defined.
* [ ] Dashboards created.
* [ ] Secret management integrated.
### 4.4 Documentation & Feedback
* [ ] **Platform docs**
* [ ] “How to get started” guide for:
* [ ] New engineers.
* [ ] New services.
* [ ] FAQ for:
* [ ] “Where do I run X?”
* [ ] “How do I get more quota?”
* [ ] **Feedback loop**
* [ ] Set up a channel (Slack/discussions/form) for platform feedback.
* [ ] Review and triage feedback monthly into a platform backlog.
---
## 5. Cross-Cutting Checklist (Observability + GitOps + Failure)
### 5.1 Observability
* [ ] **Telemetry baseline**
* [ ] Every node:
* [ ] Exposes metrics (node exporter or equivalent).
* [ ] Sends logs to a central store with site/role tags.
* [ ] Every platform service (MAAS, OpenStack, Proxmox, VPN, bastion):
* [ ] Has metrics and basic dashboards.
* [ ] **Platform dashboards**
* [ ] Cluster capacity overview (CPU/RAM/storage/GPU).
* [ ] Provisioning pipeline health (errors, durations).
* [ ] Per-project usage dashboards.
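A quick spot-check for the telemetry baseline (9100 is node_exporter's usual default port; adjust if your deployment differs):

```bash
# A node that meets the baseline answers with Prometheus-format metrics.
curl -sf http://localhost:9100/metrics | head -n 5
```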
### 5.2 GitOps
* [ ] **Repositories**
* [ ] At least:
* [ ] `infra-baremetal`.
* [ ] `infra-network`.
* [ ] `infra-platform` (OpenStack/Proxmox).
* [ ] `infra-observability`.
* [ ] Each has clear README and ownership.
* [ ] **Change process**
* [ ] All changes go through PRs.
* [ ] CI validates syntax, lint, and (where possible) dry runs.
* [ ] Changes deployed via pipelines, not ad-hoc scripts.
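The CI validation step might run checks like these (tool choice and directory layout are illustrative; match whatever the repo actually uses):

```bash
# Lint and dry-run before anything merges.
ansible-lint playbooks/
terraform -chdir=environments/prod validate
terraform -chdir=environments/prod plan -lock=false   # dry run only
```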
### 5.3 Failure & Recovery
* [ ] **Document failure scenarios**
* [ ] Single node failure.
* [ ] Rack switch failure.
* [ ] Loss of MAAS / OpenStack API for a period.
* [ ] Partial network partition (e.g. mgmt vs tenant).
* [ ] **For each scenario:**
* [ ] Define expected behavior.
* [ ] Define manual/automatic recovery steps.
* [ ] Run at least one game day per quarter to validate.
---