A work-through checklist, from Metal → Network → Virtualization → Self-service.
0. Foundation: Goals, Contracts, and Users
- Define platform goals
  - Write 3-5 bullet points for what the platform must enable (e.g. “fast provisioning”, “reliable GPU workloads”).
  - Define initial SLOs for the platform itself (e.g. “VM provision succeeds within 15 minutes 99% of the time”); see the sketch after this list for one way to measure such an SLO.
- Define platform customers
  - List main user types: backend devs, ML engineers, data engineers, etc.
  - For each, write 3 typical tasks (e.g. “deploy a new microservice”, “run a 3-day GPU training job”).
- Write a one-page Platform Contract
  - What the platform guarantees (uptime, basic security, monitoring defaults).
  - What product teams must do (health checks, logging, deployment pattern, secrets usage).
  - Store this in version control and share it.
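
A minimal sketch of how the example provisioning SLO above could be checked from recorded provision durations; the sample data is hypothetical:

```python
# slo_check.py -- evaluate "VM provision succeeds within 15 minutes 99% of the time"
# against a list of recorded provision durations (hypothetical sample data).

TARGET_MINUTES = 15.0
TARGET_RATIO = 0.99

def slo_met(durations_minutes: list[float]) -> bool:
    """Return True if the share of provisions under the target meets the SLO."""
    if not durations_minutes:
        return False  # no data: treat as not met rather than silently passing
    within = sum(1 for d in durations_minutes if d <= TARGET_MINUTES)
    return within / len(durations_minutes) >= TARGET_RATIO

if __name__ == "__main__":
    sample = [4.2, 7.9, 12.5, 3.1, 44.0]  # hypothetical durations in minutes
    print(f"SLO met: {slo_met(sample)}")
```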
1. Metal Layer Checklist (Racks → MAAS → Images)
1.1 Hardware & Physical Layout
- Inventory physical assets
  - List all servers with: CPU, RAM, disk(s), NICs, GPU(s) where applicable.
  - Identify roles: `compute`, `gpu`, `storage`, `infra`, `control-plane`. A machine-readable inventory (see the sketch after this list) makes the later MAAS tagging step mechanical.
- Define physical topology
  - Map which racks and switches each server connects to.
  - Document power feeds and any redundancy (A/B feeds, UPS, etc).
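
One hedged way to keep the inventory machine-readable is a small typed record per server; the field names here are illustrative, not a required schema:

```python
# inventory.py -- minimal machine-readable server inventory (illustrative schema).
from dataclasses import dataclass, field

@dataclass
class Server:
    hostname: str
    role: str                  # compute / gpu / storage / infra / control-plane
    cpu_cores: int
    ram_gb: int
    disks: list[str] = field(default_factory=list)   # e.g. ["nvme0n1: 1.9TB"]
    nics: list[str] = field(default_factory=list)    # e.g. ["eno1: 25G"]
    gpus: list[str] = field(default_factory=list)    # empty for non-GPU nodes
    rack: str = ""             # filled in by the physical-topology step

VALID_ROLES = {"compute", "gpu", "storage", "infra", "control-plane"}

def validate(servers: list[Server]) -> list[str]:
    """Return human-readable problems instead of raising, so CI can print them all."""
    problems = []
    for s in servers:
        if s.role not in VALID_ROLES:
            problems.append(f"{s.hostname}: unknown role {s.role!r}")
        if not s.rack:
            problems.append(f"{s.hostname}: missing rack assignment")
    return problems
```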
1.2 MAAS / Bare-Metal Provisioning
- Design MAAS/Ironic architecture
  - Decide MAAS region(s) and rack controllers per site.
  - Decide where the MAAS database/API lives and how it's backed up.
  - Define access rules to MAAS (who can log in, via what SSO/LDAP/etc).
- Standardize the provisioning pipeline
  - Confirm a single flow: power on → PXE → MAAS → preseed/cloud-init → config management.
  - Remove or document every deviation / legacy path.
  - Create a flow diagram and store it in the repo.
- Set up node classification
  - Define MAAS tags / resource pools for: `compute`, `gpu`, `storage`, `infra`, `test`.
  - Ensure every node has:
    - A role tag.
    - Site/room/rack metadata.
    - Any special hardware flags (GPU type, NVMe, etc).
  - A small audit script (see the sketch after this list) can enforce this before nodes are handed to the platform layer.
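
A hedged audit sketch using the MAAS CLI; it assumes a logged-in CLI profile named `admin` and that the machine JSON exposes `hostname` and `tag_names` fields:

```python
# audit_tags.py -- check that every MAAS machine carries a role tag.
# Sketch only: assumes the MAAS CLI is logged in under a profile named "admin"
# and that machine JSON exposes "hostname" and "tag_names" fields.
import json
import subprocess

ROLE_TAGS = {"compute", "gpu", "storage", "infra", "test"}

def machines() -> list[dict]:
    out = subprocess.run(
        ["maas", "admin", "machines", "read"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

def main() -> None:
    for m in machines():
        tags = set(m.get("tag_names", []))
        if not tags & ROLE_TAGS:
            print(f"{m.get('hostname', '?')}: no role tag (has: {sorted(tags)})")

if __name__ == "__main__":
    main()
```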
1.3 OS Images & Base Configuration
- Define image catalogue
  - Choose a base OS (e.g. Ubuntu LTS / Debian stable).
  - Define 3-5 golden images (max), e.g.:
    - `base-os` (minimal hardened image).
    - `infra-node` (for MAAS/OpenStack/Proxmox controllers).
    - `gpu-node` (GPU drivers, CUDA stack).
    - (Optional) `storage-node`.
  - Set a naming/versioning convention (e.g. `eu-baseos-2025.01`).
- Harden base image
  - Baseline SSH config:
    - Key-based auth only (no passwords).
    - No direct root SSH (use sudo).
  - Remove obviously unnecessary packages and services.
  - Bake in the standard logging + monitoring agent (or install it via config management).
  - A quick audit of the rendered SSH config (see the sketch after this list) catches regressions when images are rebuilt.
- Rebuild confidence
  - Confirm you can rebuild any node via MAAS + config management:
    - Pick one node per role and do a full reinstall.
    - Verify the node comes back in the expected state (role, monitoring, access).
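
A minimal sketch of the SSH baseline audit mentioned above; it reads `/etc/ssh/sshd_config` directly and deliberately ignores `Include` drop-ins, so treat it as a first-pass check only:

```python
# check_sshd.py -- verify key-based-auth-only and no-root-login in sshd_config.
# Sketch: reads /etc/ssh/sshd_config directly and ignores Include/drop-in files.

REQUIRED = {
    "passwordauthentication": "no",
    "permitrootlogin": "no",
}

def effective_options(path: str = "/etc/ssh/sshd_config") -> dict[str, str]:
    opts: dict[str, str] = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            if len(parts) == 2:
                # first occurrence wins, matching sshd's own behavior
                opts.setdefault(parts[0].lower(), parts[1].strip().lower())
    return opts

def main() -> None:
    opts = effective_options()
    for key, want in REQUIRED.items():
        got = opts.get(key, "<unset>")
        status = "OK" if got == want else "FAIL"
        print(f"{status}: {key} = {got} (want {want})")

if __name__ == "__main__":
    main()
```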
2. Network Layer Checklist (Segmentation → Access → Security)
2.1 Logical Network Design
- Define network segments
  - List core VLANs/networks:
    - `mgmt` (infra control plane).
    - `storage`.
    - `tenant` (workload traffic).
    - `public` (north-south).
    - `backup`.
    - `oob` (BMC / IPMI).
  - Assign CIDRs and names to each; a non-overlap check (see the sketch after this list) is cheap insurance.
- Routing & L3 boundaries
  - Decide where routing happens (e.g. ToR vs core).
  - Identify L3 gateways and firewall points.
  - Document which networks can speak to which (and why).
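
A minimal sketch, using Python's standard `ipaddress` module, of the non-overlap check mentioned above; the CIDR assignments are hypothetical placeholders:

```python
# cidr_check.py -- detect overlapping segment CIDRs before they hit the switches.
# The CIDR values below are hypothetical placeholders, not a recommendation.
import ipaddress
from itertools import combinations

SEGMENTS = {
    "mgmt":    "10.10.0.0/24",
    "storage": "10.20.0.0/24",
    "tenant":  "10.30.0.0/20",
    "public":  "203.0.113.0/24",   # TEST-NET-3, placeholder
    "backup":  "10.40.0.0/24",
    "oob":     "10.50.0.0/24",
}

def main() -> None:
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in SEGMENTS.items()}
    for (a, na), (b, nb) in combinations(nets.items(), 2):
        if na.overlaps(nb):
            print(f"OVERLAP: {a} ({na}) <-> {b} ({nb})")

if __name__ == "__main__":
    main()
```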
2.2 Admin & Operator Access
- Bastion / jump hosts
  - Create at least one bastion per site.
  - Restrict SSH to infra nodes so it must go through the bastion.
  - Enforce key-based auth for bastion logins.
- VPN design
  - Choose a VPN solution for remote admins.
  - Make admin subnets reachable only via the VPN (no full corporate LAN free-for-all).
  - Document the joining/removal procedure for admin devices.
- Out-of-band (OOB)
  - Put BMC/IPMI interfaces on a dedicated OOB network.
  - Restrict OOB access to admin/VPN ranges only.
  - Document the OOB procedure in case the primary network is down.
2.3 Security & Policies
- Access control
  - Define who can:
    - Access MAAS.
    - Access virtualization APIs (OpenStack/Proxmox).
    - Access network gear.
  - Implement least privilege (roles/groups).
- Network policy baselines
  - Default deny for inbound traffic from `public`.
  - Clear rules for:
    - SSH to infra.
    - DB access (from which networks, via which services).
  - Document exceptions and their owners; a tiny rule model (see the sketch after this list) keeps the intent testable.
- Document network diagram
  - Produce one main L2/L3 + VLAN diagram.
  - Store it in git and link it from the platform docs.
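
A hedged sketch of the default-deny baseline as a testable rule table; the segment names, ports, and owners below are illustrative, not prescriptions:

```python
# netpolicy.py -- default-deny rule table for admin and DB flows (illustrative).
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    src: str      # segment name, e.g. "mgmt"
    dst: str
    port: int
    owner: str    # who owns the exception

# Hypothetical baseline: everything not listed here is denied.
ALLOW = [
    Rule("mgmt", "tenant", 22, "sre"),        # SSH to infra via mgmt
    Rule("tenant", "storage", 5432, "dba"),   # app -> Postgres, example only
]

def allowed(src: str, dst: str, port: int) -> bool:
    """Default deny: a flow passes only if an explicit rule covers it."""
    return any(r.src == src and r.dst == dst and r.port == port for r in ALLOW)

if __name__ == "__main__":
    print(allowed("public", "mgmt", 22))   # False: no rule, denied by default
    print(allowed("mgmt", "tenant", 22))   # True: explicit SSH exception
```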
3. Virtualization / Platform Layer Checklist (OpenStack/Proxmox)
3.1 Platform Scope & Roles
- Choose role split
  - Decide what runs on:
    - OpenStack (multi-tenant workloads).
    - Proxmox (infra VMs, special cases).
    - Bare metal only (DBs, specific storage, heavy GPU jobs, etc).
  - Write this down as “What runs where” guidance.
3.2 OpenStack (or primary cloud platform)
- Projects & tenancy model
  - Decide the project structure: `project = team + environment` (e.g. `payments-prod`) or `project = team` (environments as labels).
  - Define naming conventions for projects, security groups, networks, and instances.
- Flavors & quotas
  - Define a small set of standard flavors: `small`, `medium`, `large`, `gpu-small`, `gpu-large`, etc.
  - Set default quotas per project: CPU, RAM, disk, number of instances, GPUs.
  - Document the process for requesting quota increases.
- Networks for tenants
  - Standard pattern for tenant networks:
    - One internal network per project.
    - Optional external/public network attachment rules.
  - Standard floating IP usage rules (who can get one, how many).
  - The sketch after this list shows how flavors and per-project networks can be declared in code instead of clicked together.
- Control plane hardening & HA
  - Run key components in HA (if feasible): API, schedulers, message queue, DB.
  - Enable TLS where possible for dashboard/API endpoints.
  - Ensure backups for the OpenStack DB + configs.
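
A hedged openstacksdk sketch of the flavor and per-project network patterns above; the `platform` clouds.yaml entry, flavor sizes, placeholder CIDR, and the `payments-prod` project are assumptions for illustration:

```python
# tenancy.py -- declare standard flavors and a per-project internal network
# with openstacksdk. Sketch only: assumes a clouds.yaml entry named "platform"
# with admin rights; sizes and names are illustrative.
import openstack

FLAVORS = {
    # name: (vcpus, ram_mb, disk_gb)
    "small":  (2, 4096, 40),
    "medium": (4, 8192, 80),
    "large":  (8, 16384, 160),
}

def ensure_flavors(conn) -> None:
    existing = {f.name for f in conn.compute.flavors()}
    for name, (vcpus, ram, disk) in FLAVORS.items():
        if name not in existing:
            conn.compute.create_flavor(name=name, vcpus=vcpus, ram=ram, disk=disk)

def ensure_project_network(conn, project: str) -> None:
    net_name = f"{project}-internal"
    if conn.network.find_network(net_name) is None:
        net = conn.network.create_network(name=net_name)
        conn.network.create_subnet(
            network_id=net.id, name=f"{net_name}-subnet",
            ip_version=4, cidr="192.168.100.0/24",  # placeholder CIDR
        )

if __name__ == "__main__":
    conn = openstack.connect(cloud="platform")
    ensure_flavors(conn)
    ensure_project_network(conn, "payments-prod")
```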
3.3 Proxmox (or secondary virtualization platform)
- Scope definition
  - Decide clearly which workloads belong here (infra services? special vendor appliances?).
  - Avoid overlapping with OpenStack use cases where possible.
- Resource & naming policy
  - Define naming for Proxmox clusters and VMs.
  - Decide whether teams get self-service Proxmox or it's SRE-only.
3.4 Configuration Management
- IaC coverage
  - Ensure configs for OpenStack projects, networks, and flavors, plus Proxmox clusters and key VMs, are stored as code (Ansible, Terraform, etc.) and not just in the UI.
- Reproducibility
  - Test that you can:
    - Recreate a project and its associated resources from code (see the smoke-test sketch after this list).
    - Rebuild a critical controller VM (from base image + config management).
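
One possible shape for the “recreate a project from code” smoke test, again via openstacksdk; the cloud entry and project name are assumptions, and it imports the hypothetical `tenancy` module from the earlier sketch:

```python
# smoke_recreate.py -- prove a project can be rebuilt from code alone.
# Sketch: assumes a clouds.yaml entry "platform" with identity admin rights,
# and reuses ensure_project_network() from the tenancy sketch above.
import openstack
from tenancy import ensure_project_network

def recreate(conn, name: str) -> None:
    project = conn.identity.find_project(name)
    if project is not None:
        conn.identity.delete_project(project)       # destructive: test env only!
    conn.identity.create_project(name=name, description="recreated by smoke test")
    ensure_project_network(conn, name)

if __name__ == "__main__":
    conn = openstack.connect(cloud="platform")
    recreate(conn, "smoke-test-project")
    assert conn.identity.find_project("smoke-test-project") is not None
    print("project recreated from code")
```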
4. Self-Service Layer Checklist (APIs → CLI → UX)
4.1 Define Core Self-Service Use Cases
- List top flows
  - “Create new project/environment for a team.”
  - “Provision compute (VM) for a service.”
  - “Request GPU capacity.”
  - “Onboard a new service to monitoring.”
  - “See my project's resource usage.”
- For each flow:
  - Define required inputs.
  - Define outputs and the completion condition.
  - Identify which platform components are touched (OpenStack, MAAS, observability, etc); a structured flow spec (see the sketch after this list) makes this reviewable.
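
One way to make flow specs reviewable is a small typed record per flow; the schema and the two entries below are illustrative:

```python
# flows.py -- structured spec for a self-service flow (illustrative schema).
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    name: str
    inputs: tuple[str, ...]        # what the requester must provide
    outputs: tuple[str, ...]       # what "done" produces
    components: tuple[str, ...]    # platform pieces the flow touches

FLOWS = [
    Flow(
        name="request-vm",
        inputs=("project", "flavor", "image", "network"),
        outputs=("instance id", "IP address", "monitoring enrolled"),
        components=("OpenStack", "observability"),
    ),
    Flow(
        name="request-gpu",
        inputs=("project", "gpu flavor", "duration"),
        outputs=("instance id with GPU attached",),
        components=("OpenStack", "MAAS tags", "observability"),
    ),
]
```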
4.2 API / CLI Design
- Choose primary interface
  - Decide: CLI, internal API, or both as the canonical interface.
  - Document: “All self-service flows must be available via X.”
- Implement a minimal CLI/API for key flows
  - `create-project` / `create-namespace`.
  - `request-vm` (or template-based: `create-service`).
  - `request-gpu` with constraints and limits.
  - `show-usage` (CPU/RAM/GPU/storage per project).
- Guardrails
  - Enforce:
    - Naming standards (team/env in names).
    - Quotas (fail fast if over).
  - Log all actions centrally.
  - A CLI skeleton covering these commands and guardrails is sketched after this list.
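
A minimal `argparse` skeleton for the CLI above; the command names follow the list, the naming regex and logging setup are assumptions, and the real dispatch to platform APIs is left as a stub:

```python
# platformctl.py -- skeleton for the self-service CLI (names are illustrative).
# The real commands would call the platform APIs; here they only validate and log.
import argparse
import logging
import re
import sys

NAME_RE = re.compile(r"^[a-z0-9]+-(dev|staging|prod)$")  # team-env naming guardrail
log = logging.getLogger("platformctl")

def check_name(name: str) -> None:
    if not NAME_RE.fullmatch(name):
        sys.exit(f"error: {name!r} does not match <team>-<env> naming standard")

def main() -> None:
    logging.basicConfig(level=logging.INFO)  # stand-in for central action logging
    parser = argparse.ArgumentParser(prog="platformctl")
    sub = parser.add_subparsers(dest="command", required=True)

    p = sub.add_parser("create-project")
    p.add_argument("name")

    v = sub.add_parser("request-vm")
    v.add_argument("project")
    v.add_argument("--flavor", default="small")

    g = sub.add_parser("request-gpu")
    g.add_argument("project")
    g.add_argument("--count", type=int, default=1)

    u = sub.add_parser("show-usage")
    u.add_argument("project")

    args = parser.parse_args()
    if args.command == "create-project":
        check_name(args.name)
    log.info("action=%s args=%s", args.command, vars(args))
    # ... dispatch to the platform APIs here ...

if __name__ == "__main__":
    main()
```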
4.3 Golden Paths & Templates
- Service templates
  - Provide:
    - An example app repo with a CI/CD pipeline ready.
    - An example deployment manifest (VM/K8s/etc).
    - Built-in monitoring/logging configuration.
- Onboarding checklists
  - New service checklist:
    - Project created.
    - Monitoring enabled.
    - Alerts defined.
    - Dashboards created.
    - Secret management integrated.
4.4 Documentation & Feedback
- Platform docs
  - A “How to get started” guide for:
    - New engineers.
    - New services.
  - An FAQ covering:
    - “Where do I run X?”
    - “How do I get more quota?”
- Feedback loop
  - Set up a channel (Slack/discussions/form) for platform feedback.
  - Review and triage feedback monthly into a platform backlog.
5. Cross-Cutting Checklist (Observability + GitOps + Failure)
5.1 Observability
- Telemetry baseline
  - Every node:
    - Exposes metrics (node exporter or equivalent).
    - Sends logs to a central store with site/role tags.
  - Every platform service (MAAS, OpenStack, Proxmox, VPN, bastion):
    - Has metrics and basic dashboards.
  - A probe like the sketch after this list can flag nodes whose metrics endpoint went dark.
- Platform dashboards
  - Cluster capacity overview (CPU/RAM/storage/GPU).
  - Provisioning pipeline health (errors, durations).
  - Per-project usage dashboards.
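
A small probe sketch, assuming node exporter on its default port 9100; the hostnames are placeholders:

```python
# probe_metrics.py -- confirm every node still answers on its metrics port.
# Sketch: assumes node exporter's default port 9100; hostnames are placeholders.
from urllib.request import urlopen
from urllib.error import URLError

NODES = ["compute-01.example.internal", "gpu-01.example.internal"]  # placeholders
PORT = 9100  # node exporter default

def main() -> None:
    for node in NODES:
        url = f"http://{node}:{PORT}/metrics"
        try:
            with urlopen(url, timeout=5) as resp:
                ok = resp.status == 200
        except (URLError, OSError):
            ok = False
        print(f"{'OK' if ok else 'DOWN'}: {url}")

if __name__ == "__main__":
    main()
```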
5.2 GitOps
- Repositories
  - At least:
    - `infra-baremetal`.
    - `infra-network`.
    - `infra-platform` (OpenStack/Proxmox).
    - `infra-observability`.
  - Each has a clear README and ownership.
- Change process
  - All changes go through PRs.
  - CI validates syntax, lint, and (where possible) dry runs.
  - Changes are deployed via pipelines, not ad-hoc scripts.
5.3 Failure & Recovery
- Document failure scenarios
  - Single node failure.
  - Rack switch failure.
  - Loss of MAAS / OpenStack API for a period.
  - Partial network partition (e.g. mgmt vs tenant).
- For each scenario:
  - Define expected behavior.
  - Define manual/automatic recovery steps.
  - Run at least one game day per quarter to validate; a scenario registry like the sketch below keeps the catalogue in code next to the runbooks.
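
A possible scenario registry, keeping the catalogue in code; the schema and sample entries are illustrative:

```python
# scenarios.py -- failure-scenario registry for game days (illustrative schema).
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    name: str
    expected: str       # what the platform should do on its own
    recovery: str       # manual/automatic steps, or a link to the runbook

SCENARIOS = [
    Scenario(
        name="single node failure",
        expected="workloads reschedule; capacity dashboard shows the gap",
        recovery="replace or reinstall the node via MAAS, verify role tag",
    ),
    Scenario(
        name="loss of OpenStack API",
        expected="running VMs unaffected; provisioning paused and alerting fires",
        recovery="restore control plane from backups; see runbook",
    ),
]

if __name__ == "__main__":
    for s in SCENARIOS:
        print(f"- {s.name}: expect '{s.expected}'")
```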