# Platform Architecture Checklist

A work-through checklist, from Metal → Network → Virtualization → Self-service.

---

## 0. Foundation: Goals, Contracts, and Users

* [ ] **Define platform goals**

  * [ ] Write 3–5 bullet points for what the platform must enable (e.g. “fast provisioning”, “reliable GPU workloads”).
  * [ ] Define initial SLOs for the *platform itself* (e.g. “VM provision succeeds within 15 minutes 99% of the time”); a measurement sketch follows this list.
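
An SLO like this only helps if it is measurable from day one. A minimal sketch in Python, assuming hypothetical provisioning-event records (`requested_at`, `ready_at`, `succeeded`) pulled from wherever your pipeline logs attempts:

```python
# Minimal sketch: compute "VM provision succeeds within 15 minutes, 99% of
# the time" from provisioning events. The event source and field names
# (requested_at, ready_at, succeeded) are hypothetical placeholders.
from datetime import datetime, timedelta

TARGET = timedelta(minutes=15)

def provision_slo(events: list[dict]) -> float:
    """Fraction of provisioning attempts that succeeded within TARGET."""
    if not events:
        return 1.0
    ok = sum(
        1
        for e in events
        if e["succeeded"] and (e["ready_at"] - e["requested_at"]) <= TARGET
    )
    return ok / len(events)

events = [
    {"requested_at": datetime(2025, 1, 1, 12, 0), "ready_at": datetime(2025, 1, 1, 12, 9), "succeeded": True},
    {"requested_at": datetime(2025, 1, 1, 13, 0), "ready_at": datetime(2025, 1, 1, 13, 40), "succeeded": True},
]
print(f"SLO attainment: {provision_slo(events):.2%}")  # 50.00% for this sample
```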

* [ ] **Define platform customers**

  * [ ] List main user types: backend devs, ML engineers, data engineers, etc.
  * [ ] For each, write 3 typical tasks (e.g. “deploy a new microservice”, “run a 3-day GPU training job”).

* [ ] **Write a one-page Platform Contract**

  * [ ] What the platform guarantees (uptime, basic security, monitoring defaults).
  * [ ] What product teams must do (health checks, logging, deployment pattern, secrets usage).
  * [ ] Store this in version control and share it.

---

## 1. Metal Layer Checklist (Racks → MAAS → Images)

### 1.1 Hardware & Physical Layout

* [ ] **Inventory physical assets**

  * [ ] List all servers with: CPU, RAM, disk(s), NICs, GPU(s) where applicable.
  * [ ] Identify roles: `compute`, `gpu`, `storage`, `infra`, `control-plane`.

* [ ] **Define physical topology**

  * [ ] Map which racks and switches each server connects to.
  * [ ] Document power feeds and any redundancy (A/B feeds, UPS, etc).

### 1.2 MAAS / Bare-Metal Provisioning

* [ ] **Design MAAS/Ironic architecture**

  * [ ] Decide MAAS region(s) and rack controllers per site.
  * [ ] Decide where the MAAS database/API lives and how it’s backed up.
  * [ ] Define access rules to MAAS (who can log in, via what SSO/LDAP/etc).

* [ ] **Standardize provisioning pipeline**

  * [ ] Confirm a **single flow**: power on → PXE → MAAS → preseed/cloud-init → config management.
  * [ ] Remove or document every deviation / legacy path.
  * [ ] Create a flow diagram and store it in the repo.

* [ ] **Set up node classification**

  * [ ] Define MAAS tags / resource pools for: `compute`, `gpu`, `storage`, `infra`, `test`.
  * [ ] Ensure every node has (see the validation sketch after this list):

    * [ ] A role tag.
    * [ ] Site/room/rack metadata.
    * [ ] Any special hardware flags (GPU type, NVMe, etc).
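
A minimal sketch of that metadata check, assuming node records exported as dicts (e.g. from `maas $PROFILE machines read` or a CMDB); the field names here are placeholders to adapt:

```python
# Minimal sketch: verify that every node carries the metadata required above
# (role tag, site/room/rack, hardware flags). Node dicts are hypothetical
# exports; adapt the keys to your actual source.
REQUIRED_ROLES = {"compute", "gpu", "storage", "infra", "test"}

def validate_node(node: dict) -> list[str]:
    problems = []
    if not REQUIRED_ROLES & set(node.get("tags", [])):
        problems.append("missing role tag")
    for key in ("site", "room", "rack"):
        if not node.get(key):
            problems.append(f"missing {key}")
    if "gpu" in node.get("tags", []) and not node.get("gpu_type"):
        problems.append("gpu node without gpu_type flag")
    return problems

nodes = [
    {"hostname": "gpu-01", "tags": ["gpu"], "site": "eu1", "room": "r2", "rack": "a7"},
]
for n in nodes:
    for p in validate_node(n):
        print(f"{n['hostname']}: {p}")
```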

### 1.3 OS Images & Base Configuration

* [ ] **Define image catalogue**

  * [ ] Choose a base OS (e.g. Ubuntu LTS / Debian stable).
  * [ ] Define 3–5 golden images (max), e.g.:

    * [ ] `base-os` (minimal hardened image).
    * [ ] `infra-node` (for MAAS/OpenStack/Proxmox controllers).
    * [ ] `gpu-node` (GPU drivers, CUDA stack).
    * [ ] (Optional) `storage-node`.
  * [ ] Set a naming/versioning convention (e.g. `eu-baseos-2025.01`); a sketch for enforcing one follows this list.
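
A convention only helps if it is enforced. A minimal sketch that validates names against an assumed `<site>-<image>-<YYYY>.<MM>` pattern; adjust the regex to whatever convention you actually pick:

```python
# Minimal sketch: enforce the image naming convention. The exact pattern
# (<site>-<image>-<YYYY>.<MM>) is an assumption derived from the example
# name above.
import re

IMAGE_NAME = re.compile(
    r"^(?P<site>[a-z][a-z0-9]*)-(?P<image>[a-z][a-z0-9]*)-"
    r"(?P<year>20\d{2})\.(?P<month>0[1-9]|1[0-2])$"
)

for name in ("eu-baseos-2025.01", "gpu-node-latest"):
    print(name, "ok" if IMAGE_NAME.match(name) else "REJECTED")
```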

* [ ] **Harden base image**

  * [ ] Baseline SSH config (an audit sketch follows this list):

    * [ ] Key-based auth only (no passwords).
    * [ ] No direct root SSH (use sudo).
  * [ ] Remove obviously unnecessary packages and services.
  * [ ] Bake the standard logging + monitoring agent into the image (or install it via config management).
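
A minimal audit sketch for the SSH baseline, checking the two standard `sshd` directives (`PasswordAuthentication`, `PermitRootLogin`); running it during image builds is an assumed integration point:

```python
# Minimal sketch: audit an sshd_config against the baseline above.
# PasswordAuthentication / PermitRootLogin are standard sshd directives.
from pathlib import Path

BASELINE = {"passwordauthentication": "no", "permitrootlogin": "no"}

def audit_sshd(path: str = "/etc/ssh/sshd_config") -> list[str]:
    found = {}
    for line in Path(path).read_text().splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        key, _, value = line.partition(" ")
        found[key.lower()] = value.strip().lower()
    return [
        f"{key} should be '{want}', got '{found.get(key, '<unset>')}'"
        for key, want in BASELINE.items()
        if found.get(key) != want
    ]

print(audit_sshd() or "baseline OK")
```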

* [ ] **Rebuild confidence**

  * [ ] Confirm you can rebuild *any* node via MAAS + config management:

    * [ ] Pick one node per role and do a full reinstall.
    * [ ] Verify the node comes back in the expected state (role, monitoring, access).

---

## 2. Network Layer Checklist (Segmentation → Access → Security)

### 2.1 Logical Network Design

* [ ] **Define network segments**

  * [ ] List core VLANs/networks:

    * [ ] `mgmt` (infra control plane).
    * [ ] `storage`.
    * [ ] `tenant` (workload traffic).
    * [ ] `public` (north-south).
    * [ ] `backup`.
    * [ ] `oob` (BMC / IPMI).
  * [ ] Assign CIDRs and names to each (see the addressing sketch after this list).
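
The stdlib `ipaddress` module makes the CIDR plan reviewable as code. A minimal sketch, with a placeholder `10.64.0.0/16` supernet and /20 segment size:

```python
# Minimal sketch: carve the core segments out of one supernet. The supernet
# and the /20 size are placeholders -- substitute your own allocation.
import ipaddress

SUPERNET = ipaddress.ip_network("10.64.0.0/16")
SEGMENTS = ["mgmt", "storage", "tenant", "public", "backup", "oob"]

subnets = SUPERNET.subnets(new_prefix=20)
plan = {name: next(subnets) for name in SEGMENTS}
for name, cidr in plan.items():
    print(f"{name:<8} {cidr}")
# mgmt     10.64.0.0/20
# storage  10.64.16.0/20
# ...
```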

* [ ] **Routing & L3 boundaries**

  * [ ] Decide where routing happens (e.g. ToR vs core).
  * [ ] Identify L3 gateways and firewall points.
  * [ ] Document which networks can speak to which (and why).

### 2.2 Admin & Operator Access

* [ ] **Bastion / jump hosts**

  * [ ] Create at least one bastion per site.
  * [ ] Restrict SSH to infra nodes so it must go through the bastion.
  * [ ] Enforce key-based auth and log all logins to bastions.

* [ ] **VPN design**

  * [ ] Choose a VPN solution for remote admins.
  * [ ] Expose admin subnets only via VPN (no full corporate LAN free-for-all).
  * [ ] Document the joining/removal procedure for admin devices.

* [ ] **Out-of-band (OOB)**

  * [ ] Put BMC/IPMI interfaces on a dedicated OOB network.
  * [ ] Restrict OOB access to admin/VPN ranges only.
  * [ ] Document the OOB procedure for when the primary network is down.

### 2.3 Security & Policies

* [ ] **Access control**

  * [ ] Define who can:

    * [ ] Access MAAS.
    * [ ] Access virtualization APIs (OpenStack/Proxmox).
    * [ ] Access network gear.
  * [ ] Implement least privilege (roles/groups).

* [ ] **Network policy baselines**

  * [ ] Default deny for inbound traffic from public.
  * [ ] Clear rules for:

    * [ ] SSH to infra.
    * [ ] DB access (from which networks, via which services).
  * [ ] Document exceptions and their owners (see the sketch after this list).
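
One way to keep “default deny plus explicit, owned exceptions” honest is to express the rules as data that lives in git. A minimal sketch; the rule set below is illustrative, not a recommendation:

```python
# Minimal sketch: default-deny policy with explicit, owned exceptions,
# kept as reviewable data. Rules below are purely illustrative.
ALLOW_RULES = [
    # (src network, dst network, port, owner / justification)
    ("vpn",    "mgmt",    22,   "sre: admin SSH via bastion"),
    ("tenant", "storage", 3306, "dba: app -> DB tier"),
]

def is_allowed(src: str, dst: str, port: int) -> bool:
    """Default deny: a flow passes only if an explicit rule covers it."""
    return any(r[:3] == (src, dst, port) for r in ALLOW_RULES)

print(is_allowed("public", "mgmt", 22))  # False -- no rule, denied
print(is_allowed("vpn", "mgmt", 22))     # True -- explicit exception
```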

* [ ] **Network diagram**

  * [ ] Produce one main L2/L3 + VLAN diagram.
  * [ ] Store it in git and link from platform docs.

---

## 3. Virtualization / Platform Layer Checklist (OpenStack/Proxmox)

### 3.1 Platform Scope & Roles

* [ ] **Choose role split**

  * [ ] Decide what runs on:

    * [ ] OpenStack (multi-tenant workloads).
    * [ ] Proxmox (infra VMs, special cases).
    * [ ] Bare metal only (DBs, specific storage, heavy GPU jobs, etc).
  * [ ] Write this down as “what runs where” guidance.

### 3.2 OpenStack (or primary cloud platform)

* [ ] **Projects & tenancy model**

  * [ ] Decide the project structure:

    * [ ] `project = team + environment` (e.g. `payments-prod`), **or**
    * [ ] `project = team` (environments as labels).
  * [ ] Define naming conventions for:

    * [ ] Projects, security groups, networks, instances.

* [ ] **Flavors & quotas**

  * [ ] Define a small set of standard flavors (a reconciliation sketch follows this list):

    * [ ] `small`, `medium`, `large`, `gpu-small`, `gpu-large`, etc.
  * [ ] Set default quotas per project:

    * [ ] CPU, RAM, disk, number of instances, GPUs.
  * [ ] Document the process for requesting quota increases.
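
A minimal reconciliation sketch using openstacksdk, assuming a `clouds.yaml` entry named `internal`, admin rights, and placeholder sizes:

```python
# Minimal sketch: declare standard flavors as data and reconcile them via
# openstacksdk. Cloud name "internal" and the sizes are placeholders; add
# the gpu-* flavors with the appropriate extra specs for your environment.
import openstack

FLAVORS = {
    #  name      vCPUs  RAM (MiB)  disk (GiB)
    "small":   (2,     4096,      40),
    "medium":  (4,     8192,      80),
    "large":   (8,     16384,     160),
}

conn = openstack.connect(cloud="internal")
for name, (vcpus, ram, disk) in FLAVORS.items():
    if conn.compute.find_flavor(name) is None:
        conn.compute.create_flavor(name=name, vcpus=vcpus, ram=ram, disk=disk)
        print(f"created flavor {name}")
```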

* [ ] **Networks for tenants**

  * [ ] Standard pattern for tenant networks:

    * [ ] One internal network per project.
    * [ ] Optional external/public network attachment rules.
  * [ ] Standard floating IP usage rules (who can get one, how many).

* [ ] **Control plane hardening & HA**

  * [ ] Run key components in HA (if feasible):

    * [ ] API, schedulers, message queue, DB.
  * [ ] Enable TLS where possible for dashboard/API endpoints.
  * [ ] Ensure backups for the OpenStack DB + configs.

### 3.3 Proxmox (or secondary virtualization platform)

* [ ] **Scope definition**

  * [ ] Decide clearly which workloads belong here (infra services? special vendor appliances?).
  * [ ] Avoid overlapping with OpenStack use cases when possible.

* [ ] **Resource & naming policy**

  * [ ] Define naming for Proxmox clusters and VMs.
  * [ ] Decide whether teams get self-service Proxmox or it’s SRE-only.

### 3.4 Configuration Management

* [ ] **IaC coverage**

  * [ ] Ensure configs for the following are stored as code (Ansible, Terraform, etc.), not just in the UI:

    * [ ] OpenStack projects, networks, flavors.
    * [ ] Proxmox clusters and key VMs.

* [ ] **Reproducibility**

  * [ ] Test that you can:

    * [ ] Recreate a project and its associated resources from code (a drift-check sketch follows this list).
    * [ ] Rebuild a critical controller VM (from base image + config management).
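
A minimal drift-check sketch using openstacksdk; `DESIRED` would normally be loaded from the IaC repo rather than inlined, and the `internal` cloud name is a placeholder:

```python
# Minimal sketch: detect drift between projects declared in code and what
# the API reports. DESIRED is inlined here purely for illustration.
import openstack

DESIRED = {"payments-prod", "payments-staging", "ml-research"}

conn = openstack.connect(cloud="internal")
actual = {p.name for p in conn.identity.projects()}

print("missing (in code, not in API):  ", DESIRED - actual)
print("unmanaged (in API, not in code):", actual - DESIRED)
```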

---

## 4. Self-Service Layer Checklist (APIs → CLI → UX)

### 4.1 Define Core Self-Service Use Cases

* [ ] **List top flows**

  * [ ] “Create a new project/environment for a team.”
  * [ ] “Provision compute (a VM) for a service.”
  * [ ] “Request GPU capacity.”
  * [ ] “Onboard a new service to monitoring.”
  * [ ] “See my project’s resource usage.”

* [ ] **For each flow:**

  * [ ] Define the required inputs.
  * [ ] Define the outputs and completion condition.
  * [ ] Identify which platform components are touched (OpenStack, MAAS, observability, etc).

### 4.2 API / CLI Design

* [ ] **Choose primary interface**

  * [ ] Decide: CLI, internal API, or both as the canonical interface.
  * [ ] Document: “All self-service flows must be available via X.”

* [ ] **Implement minimal CLI/API for key flows** (a skeleton follows the Guardrails list)

  * [ ] `create-project` / `create-namespace`.
  * [ ] `request-vm` (or template-based: `create-service`).
  * [ ] `request-gpu` with constraints and limits.
  * [ ] `show-usage` (CPU/RAM/GPU/storage per project).

* [ ] **Guardrails**

  * [ ] Enforce:

    * [ ] Naming standards (team/env in names).
    * [ ] Quotas (fail fast if over).
  * [ ] Log all actions centrally.
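
A minimal CLI skeleton (Python `argparse`) for two of the flows, with the naming and quota guardrails enforced before any backend call; the naming pattern, quota value, and stubbed backend calls are all placeholders:

```python
# Minimal sketch: argparse skeleton for create-project / request-vm with
# guardrails checked up front. Backend calls are stubs to wire to your
# platform APIs; NAME_RE and QUOTA_VMS are illustrative placeholders.
import argparse
import re
import sys

NAME_RE = re.compile(r"^[a-z][a-z0-9]*-(dev|staging|prod)$")  # team-env
QUOTA_VMS = 20  # per-project default; replace with a real quota lookup

def create_project(args):
    if not NAME_RE.match(args.name):
        sys.exit(f"error: '{args.name}' violates naming standard (team-env)")
    print(f"would create project {args.name}")  # stub: call platform API

def request_vm(args):
    if args.count > QUOTA_VMS:
        sys.exit(f"error: {args.count} VMs exceeds quota ({QUOTA_VMS})")
    print(f"would provision {args.count} x {args.flavor} in {args.project}")

parser = argparse.ArgumentParser(prog="platform")
sub = parser.add_subparsers(dest="cmd", required=True)

p = sub.add_parser("create-project")
p.add_argument("name")
p.set_defaults(func=create_project)

p = sub.add_parser("request-vm")
p.add_argument("project")
p.add_argument("--flavor", default="small")
p.add_argument("--count", type=int, default=1)
p.set_defaults(func=request_vm)

args = parser.parse_args()
args.func(args)
```

Central audit logging (the third guardrail) would slot in naturally as a wrapper around each `func` before it touches a backend.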

### 4.3 Golden Paths & Templates

* [ ] **Service templates**

  * [ ] Provide:

    * [ ] Example app repo with CI/CD pipeline ready.
    * [ ] Example deployment manifest (VM/K8s/etc).
    * [ ] Built-in monitoring/logging configuration.

* [ ] **Onboarding checklists**

  * [ ] New service checklist:

    * [ ] Project created.
    * [ ] Monitoring enabled.
    * [ ] Alerts defined.
    * [ ] Dashboards created.
    * [ ] Secret management integrated.

### 4.4 Documentation & Feedback

* [ ] **Platform docs**

  * [ ] “How to get started” guide for:

    * [ ] New engineers.
    * [ ] New services.
  * [ ] FAQ for:

    * [ ] “Where do I run X?”
    * [ ] “How do I get more quota?”

* [ ] **Feedback loop**

  * [ ] Set up a channel (Slack/discussions/form) for platform feedback.
  * [ ] Review and triage feedback monthly into a platform backlog.

---

## 5. Cross-Cutting Checklist (Observability + GitOps + Failure)

### 5.1 Observability

* [ ] **Telemetry baseline**

  * [ ] Every node:

    * [ ] Exposes metrics (node exporter or equivalent).
    * [ ] Sends logs to a central store with site/role tags.
  * [ ] Every platform service (MAAS, OpenStack, Proxmox, VPN, bastion):

    * [ ] Has metrics and basic dashboards.

* [ ] **Platform dashboards** (a query sketch follows this list)

  * [ ] Cluster capacity overview (CPU/RAM/storage/GPU).
  * [ ] Provisioning pipeline health (errors, durations).
  * [ ] Per-project usage dashboards.
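
A minimal capacity-overview sketch against the Prometheus HTTP API (`/api/v1/query` is its standard query endpoint), assuming node exporter metrics are being scraped; the Prometheus URL is a placeholder:

```python
# Minimal sketch: pull a cluster RAM overview from Prometheus. The URL is
# a placeholder; node_memory_* are standard node exporter metric names.
import requests

PROM = "http://prometheus.internal:9090"

def query(expr: str) -> float:
    r = requests.get(f"{PROM}/api/v1/query", params={"query": expr}, timeout=10)
    r.raise_for_status()
    result = r.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0

total = query("sum(node_memory_MemTotal_bytes)")
free = query("sum(node_memory_MemAvailable_bytes)")
print(f"cluster RAM: {free / 2**30:.0f} GiB free of {total / 2**30:.0f} GiB")
```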

### 5.2 GitOps

* [ ] **Repositories**

  * [ ] At least:

    * [ ] `infra-baremetal`.
    * [ ] `infra-network`.
    * [ ] `infra-platform` (OpenStack/Proxmox).
    * [ ] `infra-observability`.
  * [ ] Each has a clear README and ownership.

* [ ] **Change process**

  * [ ] All changes go through PRs.
  * [ ] CI validates syntax, lint, and (where possible) dry runs; a sketch follows this list.
  * [ ] Changes are deployed via pipelines, not ad-hoc scripts.
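
A minimal sketch of that CI validation step; `ansible-lint` and `terraform validate` are the real tools’ invocations, but which checks apply to which repo (and the paths used) are assumptions to tailor per repository:

```python
# Minimal sketch: the validation step a PR pipeline runs. The paths and
# the choice of checks per repo are assumptions.
import subprocess
import sys

CHECKS = [
    ["ansible-lint", "playbooks/"],
    ["terraform", "-chdir=terraform", "validate"],
]

failed = False
for cmd in CHECKS:
    print("::", " ".join(cmd))
    if subprocess.run(cmd).returncode != 0:
        failed = True

sys.exit(1 if failed else 0)
```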

### 5.3 Failure & Recovery

* [ ] **Document failure scenarios**

  * [ ] Single node failure.
  * [ ] Rack switch failure.
  * [ ] Loss of the MAAS / OpenStack API for a period.
  * [ ] Partial network partition (e.g. mgmt vs tenant).

* [ ] **For each scenario:**

  * [ ] Define the expected behavior.
  * [ ] Define manual/automatic recovery steps.
  * [ ] Run at least one game day per quarter to validate.

---