Platform architecture: a work-through checklist, from Metal → Network → Virtualization → Self-service.


0. Foundation: Goals, Contracts, and Users

  • Define platform goals

    • Write 3-5 bullet points for what the platform must enable (e.g. “fast provisioning”, “reliable GPU workloads”).
    • Define initial SLOs for the platform itself (e.g. “VM provision succeeds within 15 minutes 99% of the time”); see the measurement sketch after this list.
  • Define platform customers

    • List main user types: backend devs, ML engineers, data engineers, etc.
    • For each, write 3 typical tasks (e.g. “deploy a new microservice”, “run a 3-day GPU training job”).
  • Write a one-page Platform Contract

    • What the platform guarantees (uptime, basic security, monitoring defaults).
    • What product teams must do (health checks, logging, deployment pattern, secrets usage).
    • Store this in version control and share it.
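
For the SLO bullet above, a minimal sketch of how provisioning-SLO compliance could be measured, assuming provisioning events (outcome plus duration) are exported from MAAS or the pipeline into simple records; the event shape and the 15-minute / 99% numbers are illustrative assumptions, not part of the contract.

```python
from dataclasses import dataclass

# Illustrative assumptions: 15-minute target, 99% objective, and events
# exported from MAAS / the provisioning pipeline as simple records.
TARGET_SECONDS = 15 * 60
OBJECTIVE = 0.99

@dataclass
class ProvisionEvent:
    node: str
    succeeded: bool
    duration_seconds: float

def slo_compliance(events: list[ProvisionEvent]) -> float:
    """Fraction of provisioning attempts that succeeded within the target time."""
    if not events:
        return 1.0  # no data: treat as compliant here, or alert on missing data instead
    good = sum(1 for e in events if e.succeeded and e.duration_seconds <= TARGET_SECONDS)
    return good / len(events)

if __name__ == "__main__":
    sample = [
        ProvisionEvent("node-01", True, 480.0),
        ProvisionEvent("node-02", True, 1100.0),
        ProvisionEvent("node-03", False, 900.0),
    ]
    ratio = slo_compliance(sample)
    print(f"compliance: {ratio:.2%} (objective {OBJECTIVE:.0%})")
```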

1. Metal Layer Checklist (Racks → MAAS → Images)

1.1 Hardware & Physical Layout

  • Inventory physical assets

    • List all servers with: CPU, RAM, disk(s), NICs, GPU(s) where applicable.
    • Identify roles: compute, gpu, storage, infra, control-plane.
  • Define physical topology

    • Map which racks and switches each server connects to.
    • Document power feeds and any redundancy (A/B feeds, UPS, etc).

1.2 MAAS / Bare-Metal Provisioning

  • Design MAAS/Ironic architecture

    • Decide MAAS region(s) and rack controllers per site.
    • Decide where MAAS database/API lives and how it's backed up.
    • Define access rules to MAAS (who can log in, via what SSO/LDAP/etc).
  • Standardize provisioning pipeline

    • Confirm single flow: power on → PXE → MAAS → Preseed/cloud-init → config management.
    • Remove or document every deviation / legacy path.
    • Create a flow diagram and store it in the repo.
  • Set up node classification

    • Define MAAS tags / resource pools for: compute, gpu, storage, infra, test.

    • Ensure every node has:

      • Role tag.
      • Site/room/rack metadata.
      • Any special hardware flags (GPU type, NVMe, etc).
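
A minimal sketch of the “every node has role, location and hardware metadata” check, assuming the machine list is exported from MAAS as JSON and mapped into the shape below; the field names are illustrative, not the exact MAAS schema.

```python
import json
import sys

# Required metadata per node; field names are assumptions for illustration.
REQUIRED_FIELDS = ("role", "site", "rack")
VALID_ROLES = {"compute", "gpu", "storage", "infra", "test"}

def validate(nodes: list[dict]) -> list[str]:
    """Return a list of human-readable problems; an empty list means all nodes pass."""
    problems = []
    for node in nodes:
        name = node.get("hostname", "<unknown>")
        for field in REQUIRED_FIELDS:
            if not node.get(field):
                problems.append(f"{name}: missing '{field}'")
        if node.get("role") and node["role"] not in VALID_ROLES:
            problems.append(f"{name}: unknown role '{node['role']}'")
        if node.get("role") == "gpu" and not node.get("gpu_type"):
            problems.append(f"{name}: gpu node without 'gpu_type' flag")
    return problems

if __name__ == "__main__":
    nodes = json.load(open(sys.argv[1]))  # e.g. nodes.json exported from MAAS
    issues = validate(nodes)
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)
```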

1.3 OS Images & Base Configuration

  • Define image catalogue

    • Choose base OS (e.g. Ubuntu LTS / Debian stable).

    • Define 3-5 golden images (max), e.g.:

      • base-os (minimal hardened image).
      • infra-node (for MAAS/OpenStack/Proxmox controllers).
      • gpu-node (GPU drivers, CUDA stack).
      • (Optional) storage-node.
    • Set naming/versioning convention (e.g. eu-baseos-2025.01).

  • Harden base image

    • Baseline SSH config (see the config check sketch after this list):

      • Key-based auth only (no passwords).
      • No direct root SSH (use sudo).
    • Remove obviously unnecessary packages and services.

    • Standard logging + monitoring agent baked in (or installed by config management).

  • Rebuild confidence

    • Confirm you can rebuild any node via MAAS + config management:

      • Pick one node per role and do a full reinstall.
      • Verify node comes back into expected state (role, monitoring, access).
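
For the SSH baseline above, a minimal sketch of a check that a built image (or reinstalled node) actually carries the hardened sshd settings. It only reads top-level directives from /etc/ssh/sshd_config, so treat it as a smoke test, not a full audit.

```python
import re
import sys

# Expected baseline: key-based auth only, no direct root SSH.
EXPECTED = {
    "passwordauthentication": "no",
    "permitrootlogin": "no",
}

def effective_settings(path: str = "/etc/ssh/sshd_config") -> dict[str, str]:
    """Parse top-level 'Key value' directives, ignoring comments and Match blocks (crudely)."""
    settings: dict[str, str] = {}
    for line in open(path):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.lower().startswith("match "):
            break  # keep the sketch simple: stop at the first Match block
        m = re.match(r"(\S+)\s+(\S+)", line)
        if m:
            # sshd uses first-occurrence-wins, hence setdefault
            settings.setdefault(m.group(1).lower(), m.group(2).lower())
    return settings

if __name__ == "__main__":
    found = effective_settings(sys.argv[1] if len(sys.argv) > 1 else "/etc/ssh/sshd_config")
    failures = [
        f"{key} is '{found.get(key, '<unset>')}', expected '{want}'"
        for key, want in EXPECTED.items()
        if found.get(key) != want
    ]
    for failure in failures:
        print(failure)
    sys.exit(1 if failures else 0)
```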

2. Network Layer Checklist (Segmentation → Access → Security)

2.1 Logical Network Design

  • Define network segments

    • List core VLANs/networks:

      • mgmt (infra control plane).
      • storage.
      • tenant (workload traffic).
      • public (north-south).
      • backup.
      • oob (BMC / IPMI).
    • Assign CIDRs and names to each (see the overlap check sketch after this list).

  • Routing & L3 boundaries

    • Decide where routing happens (e.g. ToR vs core).
    • Identify L3 gateways and firewall points.
    • Document which networks can speak to which (and why).
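
Once segments get CIDRs, a minimal sketch of an overlap check using the standard library; the segment names match the list above, but the example ranges are placeholders, not a recommendation.

```python
import ipaddress
from itertools import combinations

# Placeholder CIDR plan; replace with the real assignments from the network repo/IPAM.
SEGMENTS = {
    "mgmt":    "10.10.0.0/20",
    "storage": "10.10.16.0/20",
    "tenant":  "10.20.0.0/16",
    "public":  "203.0.113.0/24",   # documentation range, illustrative only
    "backup":  "10.10.32.0/20",
    "oob":     "10.10.48.0/22",
}

def overlapping_pairs(segments: dict[str, str]) -> list[tuple[str, str]]:
    """Return every pair of named networks whose CIDRs overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in segments.items()}
    return [
        (a, b)
        for a, b in combinations(nets, 2)
        if nets[a].overlaps(nets[b])
    ]

if __name__ == "__main__":
    clashes = overlapping_pairs(SEGMENTS)
    if clashes:
        for a, b in clashes:
            print(f"overlap: {a} ({SEGMENTS[a]}) <-> {b} ({SEGMENTS[b]})")
        raise SystemExit(1)
    print("no overlapping segments")
```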

2.2 Admin & Operator Access

  • Bastion / jump hosts

    • Create at least one bastion per site.
    • Restrict SSH to infra nodes so that it must go through the bastion (see the ssh_config sketch after this list).
    • Enforce key-based auth for all logins to bastions.
  • VPN design

    • Choose VPN solution for remote admins.
    • Restrict VPN access to the admin subnets only (no full corporate LAN free-for-all).
    • Document joining/removal procedure for admin devices.
  • Out-of-band (OOB)

    • Put BMC/IPMI interfaces on a dedicated OOB network.
    • Restrict OOB access to admin/vpn ranges only.
    • Document OOB procedure in case primary network is down.
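
One way to make “all SSH goes through the bastion” painless for operators is to hand out a client-side ssh_config fragment. A minimal sketch that generates one per site follows; ProxyJump itself is a standard OpenSSH client option, but the hostname and user conventions below are assumptions for illustration.

```python
# Minimal generator for a client-side ~/.ssh/config fragment that forces
# infra access through the per-site bastion via OpenSSH's ProxyJump.
# The hostname conventions below are assumptions for illustration.

SITES = {
    # site name -> (bastion host, pattern matching that site's infra hosts)
    "ams1": ("bastion.ams1.example.internal", "*.ams1.example.internal"),
    "fra1": ("bastion.fra1.example.internal", "*.fra1.example.internal"),
}

TEMPLATE = """\
Host {bastion}
    User {user}
    IdentityFile ~/.ssh/id_ed25519

Host {pattern}
    User {user}
    ProxyJump {bastion}
    IdentityFile ~/.ssh/id_ed25519
"""

def render(user: str) -> str:
    return "\n".join(
        TEMPLATE.format(bastion=bastion, pattern=pattern, user=user)
        for bastion, pattern in SITES.values()
    )

if __name__ == "__main__":
    # Append the output to ~/.ssh/config (or ship it via the admin-device onboarding doc).
    print(render(user="ops"))
```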

2.3 Security & Policies

  • Access control

    • Define who can:

      • Access MAAS.
      • Access virtualization APIs (OpenStack/Proxmox).
      • Access network gear.
    • Implement least privilege (roles/groups).

  • Network policy baselines

    • Default deny for inbound traffic from the public network; allow specific flows explicitly (see the allow-list sketch after this list).

    • Clear rules for:

      • SSH to infra.
      • DB access (from which networks, via which services).
    • Document exceptions and their owners.

  • Document network diagram

    • Produce one main L2/L3 + VLAN diagram.
    • Store it in git and link from platform docs.
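
The “which networks can speak to which” rules stay reviewable if they live in code next to the diagram. A minimal default-deny allow-list sketch is below; the entries are hypothetical examples, and the real list belongs in the network repo with an owner per exception.

```python
# Default-deny between segments: only flows listed here are allowed.
# The entries are hypothetical examples, not a recommended policy.
ALLOWED_FLOWS = {
    # (source segment, destination segment, port)
    ("public", "tenant", 443),    # north-south HTTPS into workloads
    ("mgmt",   "tenant", 22),     # operator SSH from mgmt (via bastion)
    ("tenant", "storage", 3260),  # e.g. iSCSI from workloads to storage
}

def is_allowed(src: str, dst: str, port: int) -> bool:
    """Default deny: a flow is permitted only if explicitly listed."""
    return (src, dst, port) in ALLOWED_FLOWS

if __name__ == "__main__":
    checks = [("public", "tenant", 443), ("public", "mgmt", 22)]
    for src, dst, port in checks:
        verdict = "allow" if is_allowed(src, dst, port) else "deny"
        print(f"{src} -> {dst}:{port}: {verdict}")
```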

3. Virtualization / Platform Layer Checklist (OpenStack/Proxmox)

3.1 Platform Scope & Roles

  • Choose role split

    • Decide what runs on:

      • OpenStack (multi-tenant workloads).
      • Proxmox (infra VMs, special cases).
      • Bare metal only (DBs, specific storage, heavy GPU jobs, etc).
    • Write this down as “What runs where” guidance.

3.2 OpenStack (or primary cloud platform)

  • Projects & tenancy model

    • Decide project structure:

      • project = team + environment (e.g. payments-prod) or
      • project = team (environments as labels).
    • Define naming conventions:

      • Projects, security groups, networks, instances.
  • Flavors & quotas

    • Define a small set of standard flavors:

      • small, medium, large, gpu-small, gpu-large, etc.
    • Set default quotas per project (see the quota sketch after this list):

      • CPU, RAM, disk, number of instances, GPUs.
    • Document process for requesting quota increases.

  • Networks for tenants

    • Standard pattern for tenant networks:

      • One internal network per project.
      • Optional external/public network attachment rules.
    • Standard floating IP usage rules (who can get one, how many).

  • Control plane hardening & HA

    • Run key components in HA (if feasible):

      • API, schedulers, message queue, DB.
    • Enable TLS where possible for dashboard/API endpoints.

    • Ensure backups for OpenStack DB + configs.
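
A minimal sketch of the flavor catalogue and a per-project quota check, kept as plain data so it can live in the platform repo; the flavor sizes and default quotas are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass

# Illustrative flavor catalogue: (vCPUs, RAM in GiB, GPUs).
FLAVORS = {
    "small":     (2, 4, 0),
    "medium":    (4, 16, 0),
    "large":     (8, 32, 0),
    "gpu-small": (8, 64, 1),
    "gpu-large": (16, 128, 4),
}

# Illustrative default quota per project.
@dataclass
class Quota:
    vcpus: int = 32
    ram_gib: int = 128
    gpus: int = 0
    instances: int = 20

def fits_quota(flavor: str, used: tuple[int, int, int, int], quota: Quota) -> bool:
    """Check whether one more instance of a flavor fits the project's quota.

    `used` is (vcpus, ram_gib, gpus, instances) already consumed by the project.
    """
    vcpu, ram, gpu = FLAVORS[flavor]
    u_vcpu, u_ram, u_gpu, u_inst = used
    return (
        u_vcpu + vcpu <= quota.vcpus
        and u_ram + ram <= quota.ram_gib
        and u_gpu + gpu <= quota.gpus
        and u_inst + 1 <= quota.instances
    )

if __name__ == "__main__":
    print(fits_quota("medium", used=(8, 32, 0, 3), quota=Quota()))     # True
    print(fits_quota("gpu-small", used=(8, 32, 0, 3), quota=Quota()))  # False: default quota has 0 GPUs
```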

3.3 Proxmox (or secondary virtualization platform)

  • Scope definition

    • Decide clearly:

      • Which workloads belong here (infra services? special vendor appliances?).
    • Avoid overlapping with OpenStack use cases when possible.

  • Resource & naming policy

    • Define naming for Proxmox clusters and VMs.
    • Decide whether teams get self-service Proxmox or it's SRE-only.

3.4 Configuration Management

  • IaC coverage

    • Ensure configs for:

      • OpenStack projects, networks, flavors.
      • Proxmox clusters and key VMs.
    • Store these as code (Ansible, Terraform, etc.), not just in the UI.

  • Reproducibility

    • Test that you can:

      • Recreate a project and its associated resources from code.
      • Rebuild a critical controller VM (from base image + config management).
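
One cheap reproducibility signal is a scheduled drift check: run a plan against reality and alert when it is not empty. The minimal sketch below shells out to `terraform plan -detailed-exitcode` (exit code 2 means pending changes); if the platform is driven with Ansible instead, the same idea applies with its check/diff mode. The stack paths are assumptions.

```python
import subprocess
import sys

# Directories holding the IaC for each platform area; the paths are assumptions.
STACKS = ["infra-platform/openstack", "infra-platform/proxmox"]

def drift(stack_dir: str) -> bool:
    """True if `terraform plan` reports pending changes (exit code 2)."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=stack_dir,
        capture_output=True,
        text=True,
    )
    if result.returncode == 1:
        raise RuntimeError(f"plan failed in {stack_dir}:\n{result.stderr}")
    return result.returncode == 2

if __name__ == "__main__":
    drifted = [s for s in STACKS if drift(s)]
    for stack in drifted:
        print(f"drift detected: {stack} no longer matches code")
    sys.exit(1 if drifted else 0)
```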

4. Self-Service Layer Checklist (APIs → CLI → UX)

4.1 Define Core Self-Service Use Cases

  • List top flows

    • “Create new project/environment for a team.”
    • “Provision compute (VM) for a service.”
    • “Request GPU capacity.”
    • “Onboard a new service to monitoring.”
    • “See my project's resource usage.”
  • For each flow:

    • Define required inputs.
    • Define outputs and completion condition.
    • Identify which platform components are touched (OpenStack, MAAS, observability, etc).
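
Writing the flows down as structured data makes the “inputs, outputs, components touched” exercise concrete and keeps the catalogue diffable in git. A minimal sketch, with one hypothetical flow filled in:

```python
from dataclasses import dataclass, field

@dataclass
class SelfServiceFlow:
    name: str
    inputs: list[str]
    outputs: list[str]
    done_when: str                        # completion condition, human-readable
    components: list[str] = field(default_factory=list)

# Hypothetical example entry; the real catalogue would cover all top flows.
CREATE_PROJECT = SelfServiceFlow(
    name="create-project",
    inputs=["team", "environment", "owner email", "initial quota tier"],
    outputs=["OpenStack project", "default network", "default security group", "dashboard link"],
    done_when="owner can log in and launch a 'small' instance within the default quota",
    components=["OpenStack", "observability", "secrets management"],
)

if __name__ == "__main__":
    flow = CREATE_PROJECT
    print(f"{flow.name}: touches {', '.join(flow.components)}")
    print(f"done when: {flow.done_when}")
```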

4.2 API / CLI Design

  • Choose primary interface

    • Decide: CLI, internal API, or both as the canonical interface.
    • Document: “All self-service flows must be available via X.”
  • Implement minimal CLI/API for key flows

    • create-project / create-namespace.
    • request-vm (or template-based: create-service).
    • request-gpu with constraints and limits.
    • show-usage (CPU/RAM/GPU/storage per project).
  • Guardrails

    • Enforce:

      • Naming standards (team/env in names).
      • Quotas (fail fast if over).
    • Log all actions centrally.
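
A minimal CLI skeleton for the key flows, showing where the guardrails (naming standard, fail-fast checks, central logging) hook in. Command and option names are illustrative, and the handlers are stubs that would call the platform APIs.

```python
import argparse
import logging
import re
import sys

# Guardrail: project names must encode team and environment, e.g. payments-prod.
NAME_RE = re.compile(r"^[a-z][a-z0-9]+-(dev|staging|prod)$")

log = logging.getLogger("platform-cli")

def create_project(args: argparse.Namespace) -> None:
    if not NAME_RE.match(args.name):
        sys.exit(f"invalid project name '{args.name}': expected <team>-<env>, e.g. payments-prod")
    log.info("create-project name=%s", args.name)  # audit trail
    print(f"(stub) would create project {args.name} via the platform API")

def request_vm(args: argparse.Namespace) -> None:
    log.info("request-vm project=%s flavor=%s", args.project, args.flavor)
    print(f"(stub) would check quota, then boot a '{args.flavor}' VM in {args.project}")

def show_usage(args: argparse.Namespace) -> None:
    log.info("show-usage project=%s", args.project)
    print(f"(stub) would print CPU/RAM/GPU/storage usage for {args.project}")

def main() -> None:
    logging.basicConfig(level=logging.INFO)  # in reality: ship these logs to the central store
    parser = argparse.ArgumentParser(prog="platform")
    sub = parser.add_subparsers(dest="command", required=True)

    p_create = sub.add_parser("create-project")
    p_create.add_argument("name")
    p_create.set_defaults(func=create_project)

    p_vm = sub.add_parser("request-vm")
    p_vm.add_argument("project")
    p_vm.add_argument("--flavor", default="small",
                      choices=["small", "medium", "large", "gpu-small", "gpu-large"])
    p_vm.set_defaults(func=request_vm)

    p_usage = sub.add_parser("show-usage")
    p_usage.add_argument("project")
    p_usage.set_defaults(func=show_usage)

    args = parser.parse_args()
    args.func(args)

if __name__ == "__main__":
    main()
```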

4.3 Golden Paths & Templates

  • Service templates

    • Provide:

      • Example app repo with CI/CD pipeline ready.
      • Example deployment manifest (VM/K8s/etc).
      • Built-in monitoring/logging configuration.
  • Onboarding checklists

    • New service checklist:

      • Project created.
      • Monitoring enabled.
      • Alerts defined.
      • Dashboards created.
      • Secret management integrated.
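
The new-service checklist is easiest to enforce when it is machine-checkable. A minimal sketch that reads per-service status (here a hard-coded dict; in practice answers pulled from the platform APIs) and reports what is still missing:

```python
# Checklist items from this section; the per-service status below stands in for
# answers you would pull from the platform APIs (project store, monitoring, etc.).
CHECKLIST = [
    "project created",
    "monitoring enabled",
    "alerts defined",
    "dashboards created",
    "secret management integrated",
]

SERVICE_STATUS = {
    "payments-api": {
        "project created": True,
        "monitoring enabled": True,
        "alerts defined": False,
        "dashboards created": True,
        "secret management integrated": False,
    },
}

def missing_items(service: str) -> list[str]:
    status = SERVICE_STATUS.get(service, {})
    return [item for item in CHECKLIST if not status.get(item)]

if __name__ == "__main__":
    for service in SERVICE_STATUS:
        gaps = missing_items(service)
        state = "ready" if not gaps else "missing: " + ", ".join(gaps)
        print(f"{service}: {state}")
```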

4.4 Documentation & Feedback

  • Platform docs

    • “How to get started” guide for:

      • New engineers.
      • New services.
    • FAQ for:

      • “Where do I run X?”
      • “How do I get more quota?”
  • Feedback loop

    • Set up a channel (Slack/discussions/form) for platform feedback.
    • Review and triage feedback monthly into a platform backlog.

5. Cross-Cutting Checklist (Observability + GitOps + Failure)

5.1 Observability

  • Telemetry baseline

    • Every node:

      • Exposes metrics (node exporter or equivalent; see the reachability check after this list).
      • Sends logs to a central store with site/role tags.
    • Every platform service (MAAS, OpenStack, Proxmox, VPN, bastion):

      • Has metrics and basic dashboards.
  • Platform dashboards

    • Cluster capacity overview (CPU/RAM/storage/GPU).
    • Provisioning pipeline health (errors, durations).
    • Per-project usage dashboards.
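
A minimal sketch of the “every node exposes metrics” check referenced above: it probes node exporter's conventional endpoint (:9100/metrics) on each host. The host list is a placeholder, and a real check would run from the monitoring network against the full inventory.

```python
import urllib.request
import urllib.error

# Hosts would come from the MAAS/inventory export; these names are placeholders.
HOSTS = ["compute-01.ams1.example.internal", "gpu-01.ams1.example.internal"]

# node_exporter's conventional endpoint; adjust if a different agent is used.
METRICS_URL = "http://{host}:9100/metrics"

def exposes_metrics(host: str, timeout: float = 3.0) -> bool:
    """True if the host answers on the metrics endpoint with HTTP 200."""
    try:
        with urllib.request.urlopen(METRICS_URL.format(host=host), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

if __name__ == "__main__":
    for host in HOSTS:
        ok = exposes_metrics(host)
        print(f"{host}: {'ok' if ok else 'NO METRICS'}")
```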

5.2 GitOps

  • Repositories

    • At least:

      • infra-baremetal.
      • infra-network.
      • infra-platform (OpenStack/Proxmox).
      • infra-observability.
    • Each has clear README and ownership.

  • Change process

    • All changes go through PRs.
    • CI runs syntax checks, linting, and (where possible) dry runs.
    • Changes deployed via pipelines, not ad-hoc scripts.
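
A minimal sketch of the syntax-check step for these repos: parse every YAML file and fail the pipeline on the first broken one (using PyYAML's safe_load_all). Linting and dry runs would run as separate, tool-specific steps.

```python
import pathlib
import sys

import yaml  # PyYAML

def broken_yaml_files(repo_root: str) -> list[str]:
    """Return YAML files under repo_root that fail to parse."""
    broken = []
    for path in pathlib.Path(repo_root).rglob("*.y*ml"):
        try:
            with open(path) as handle:
                list(yaml.safe_load_all(handle))  # handles multi-document files too
        except yaml.YAMLError as exc:
            broken.append(f"{path}: {exc}")
    return broken

if __name__ == "__main__":
    failures = broken_yaml_files(sys.argv[1] if len(sys.argv) > 1 else ".")
    for failure in failures:
        print(failure)
    sys.exit(1 if failures else 0)
```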

5.3 Failure & Recovery

  • Document failure scenarios

    • Single node failure.
    • Rack switch failure.
    • Loss of MAAS / OpenStack API for a period.
    • Partial network partition (e.g. mgmt vs tenant).
  • For each scenario:

    • Define expected behavior.
    • Define manual/automatic recovery steps.
    • Run at least one game day per quarter to validate.