A work-through checklist, from Metal → Network → Virtualization → Self-service.
0. Foundation: Goals, Contracts, and Users
- Define platform goals
  - Write 3-5 bullet points for what the platform must enable (e.g. “fast provisioning”, “reliable GPU workloads”).
  - Define initial SLOs for the platform itself (e.g. “VM provision succeeds within 15 minutes 99% of the time”); see the sketch after this list for one way to measure such an SLO.
- Define platform customers
  - List main user types: backend devs, ML engineers, data engineers, etc.
  - For each, write 3 typical tasks (e.g. “deploy a new microservice”, “run a 3-day GPU training job”).
- Write a one-page Platform Contract
  - What the platform guarantees (uptime, basic security, monitoring defaults).
  - What product teams must do (health checks, logging, deployment pattern, secrets usage).
  - Store this in version control and share it.
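
A minimal sketch of how the example provisioning SLO above could be checked from recorded provision durations; the sample data is hypothetical:

```python
# slo_check.py -- evaluate "VM provision succeeds within 15 minutes 99% of the time"
# against a list of recorded provision durations (hypothetical sample data).

TARGET_MINUTES = 15.0
TARGET_RATIO = 0.99

def slo_met(durations_minutes: list[float]) -> bool:
    """Return True if the share of provisions under the target meets the SLO."""
    if not durations_minutes:
        return False  # no data: treat as not met rather than silently passing
    within = sum(1 for d in durations_minutes if d <= TARGET_MINUTES)
    return within / len(durations_minutes) >= TARGET_RATIO

if __name__ == "__main__":
    sample = [4.2, 7.9, 12.5, 3.1, 44.0]  # hypothetical durations in minutes
    print(f"SLO met: {slo_met(sample)}")
```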
1. Metal Layer Checklist (Racks → MAAS → Images)
1.1 Hardware & Physical Layout
- Inventory physical assets
  - List all servers with: CPU, RAM, disk(s), NICs, GPU(s) where applicable.
  - Identify roles: `compute`, `gpu`, `storage`, `infra`, `control-plane`. A machine-readable inventory (see the sketch after this list) makes the later MAAS tagging step mechanical.
- Define physical topology
  - Map which racks and switches each server connects to.
  - Document power feeds and any redundancy (A/B feeds, UPS, etc).
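
One hedged way to keep the inventory machine-readable is a small typed record per server; the field names here are illustrative, not a required schema:

```python
# inventory.py -- minimal machine-readable server inventory (illustrative schema).
from dataclasses import dataclass, field

@dataclass
class Server:
    hostname: str
    role: str                  # compute / gpu / storage / infra / control-plane
    cpu_cores: int
    ram_gb: int
    disks: list[str] = field(default_factory=list)   # e.g. ["nvme0n1: 1.9TB"]
    nics: list[str] = field(default_factory=list)    # e.g. ["eno1: 25G"]
    gpus: list[str] = field(default_factory=list)    # empty for non-GPU nodes
    rack: str = ""             # filled in by the physical-topology step

VALID_ROLES = {"compute", "gpu", "storage", "infra", "control-plane"}

def validate(servers: list[Server]) -> list[str]:
    """Return human-readable problems instead of raising, so CI can print them all."""
    problems = []
    for s in servers:
        if s.role not in VALID_ROLES:
            problems.append(f"{s.hostname}: unknown role {s.role!r}")
        if not s.rack:
            problems.append(f"{s.hostname}: missing rack assignment")
    return problems
```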
1.2 MAAS / Bare-Metal Provisioning
- Design MAAS/Ironic architecture
  - Decide MAAS region(s) and rack controllers per site.
  - Decide where the MAAS database/API lives and how it's backed up.
  - Define access rules to MAAS (who can log in, via what SSO/LDAP/etc).
- Standardize the provisioning pipeline
  - Confirm a single flow: power on → PXE → MAAS → preseed/cloud-init → config management.
  - Remove or document every deviation / legacy path.
  - Create a flow diagram and store it in the repo.
- Set up node classification
  - Define MAAS tags / resource pools for: `compute`, `gpu`, `storage`, `infra`, `test`.
  - Ensure every node has:
    - A role tag.
    - Site/room/rack metadata.
    - Any special hardware flags (GPU type, NVMe, etc).
  - A small audit script (see the sketch after this list) can enforce this before nodes are handed to the platform layer.
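
A hedged audit sketch using the MAAS CLI; it assumes a logged-in CLI profile named `admin` and that the machine JSON exposes `hostname` and `tag_names` fields:

```python
# audit_tags.py -- check that every MAAS machine carries a role tag.
# Sketch only: assumes the MAAS CLI is logged in under a profile named "admin"
# and that machine JSON exposes "hostname" and "tag_names" fields.
import json
import subprocess

ROLE_TAGS = {"compute", "gpu", "storage", "infra", "test"}

def machines() -> list[dict]:
    out = subprocess.run(
        ["maas", "admin", "machines", "read"],
        check=True, capture_output=True, text=True,
    ).stdout
    return json.loads(out)

def main() -> None:
    for m in machines():
        tags = set(m.get("tag_names", []))
        if not tags & ROLE_TAGS:
            print(f"{m.get('hostname', '?')}: no role tag (has: {sorted(tags)})")

if __name__ == "__main__":
    main()
```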
1.3 OS Images & Base Configuration
- Define image catalogue
  - Choose a base OS (e.g. Ubuntu LTS / Debian stable).
  - Define 3-5 golden images (max), e.g.:
    - `base-os` (minimal hardened image).
    - `infra-node` (for MAAS/OpenStack/Proxmox controllers).
    - `gpu-node` (GPU drivers, CUDA stack).
    - (Optional) `storage-node`.
  - Set a naming/versioning convention (e.g. `eu-baseos-2025.01`).
- Harden base image
  - Baseline SSH config:
    - Key-based auth only (no passwords).
    - No direct root SSH (use sudo).
  - Remove obviously unnecessary packages and services.
  - Bake in the standard logging + monitoring agent (or install it via config management).
  - A quick audit of the rendered SSH config (see the sketch after this list) catches regressions when images are rebuilt.
- Rebuild confidence
  - Confirm you can rebuild any node via MAAS + config management:
    - Pick one node per role and do a full reinstall.
    - Verify the node comes back in the expected state (role, monitoring, access).
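
A minimal sketch of the SSH baseline audit mentioned above; it reads `/etc/ssh/sshd_config` directly and deliberately ignores `Include` drop-ins, so treat it as a first-pass check only:

```python
# check_sshd.py -- verify key-based-auth-only and no-root-login in sshd_config.
# Sketch: reads /etc/ssh/sshd_config directly and ignores Include/drop-in files.

REQUIRED = {
    "passwordauthentication": "no",
    "permitrootlogin": "no",
}

def effective_options(path: str = "/etc/ssh/sshd_config") -> dict[str, str]:
    opts: dict[str, str] = {}
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)
            if len(parts) == 2:
                # first occurrence wins, matching sshd's own behavior
                opts.setdefault(parts[0].lower(), parts[1].strip().lower())
    return opts

def main() -> None:
    opts = effective_options()
    for key, want in REQUIRED.items():
        got = opts.get(key, "<unset>")
        status = "OK" if got == want else "FAIL"
        print(f"{status}: {key} = {got} (want {want})")

if __name__ == "__main__":
    main()
```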
2. Network Layer Checklist (Segmentation → Access → Security)
2.1 Logical Network Design
- Define network segments
  - List core VLANs/networks:
    - `mgmt` (infra control plane).
    - `storage`.
    - `tenant` (workload traffic).
    - `public` (north-south).
    - `backup`.
    - `oob` (BMC / IPMI).
  - Assign CIDRs and names to each; a non-overlap check (see the sketch after this list) is cheap insurance.
- Routing & L3 boundaries
  - Decide where routing happens (e.g. ToR vs core).
  - Identify L3 gateways and firewall points.
  - Document which networks can speak to which (and why).
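
A minimal sketch, using Python's standard `ipaddress` module, of the non-overlap check mentioned above; the CIDR assignments are hypothetical placeholders:

```python
# cidr_check.py -- detect overlapping segment CIDRs before they hit the switches.
# The CIDR values below are hypothetical placeholders, not a recommendation.
import ipaddress
from itertools import combinations

SEGMENTS = {
    "mgmt":    "10.10.0.0/24",
    "storage": "10.20.0.0/24",
    "tenant":  "10.30.0.0/20",
    "public":  "203.0.113.0/24",   # TEST-NET-3, placeholder
    "backup":  "10.40.0.0/24",
    "oob":     "10.50.0.0/24",
}

def main() -> None:
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in SEGMENTS.items()}
    for (a, na), (b, nb) in combinations(nets.items(), 2):
        if na.overlaps(nb):
            print(f"OVERLAP: {a} ({na}) <-> {b} ({nb})")

if __name__ == "__main__":
    main()
```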
2.2 Admin & Operator Access
- Bastion / jump hosts
  - Create at least one bastion per site.
  - Restrict SSH to infra nodes so it must go through the bastion.
  - Enforce key-based auth for bastion logins.
- VPN design
  - Choose a VPN solution for remote admins.
  - Make admin subnets reachable only via the VPN (no full corporate LAN free-for-all).
  - Document the joining/removal procedure for admin devices.
- Out-of-band (OOB)
  - Put BMC/IPMI interfaces on a dedicated OOB network.
  - Restrict OOB access to admin/VPN ranges only.
  - Document the OOB procedure in case the primary network is down.
2.3 Security & Policies
- Access control
  - Define who can:
    - Access MAAS.
    - Access virtualization APIs (OpenStack/Proxmox).
    - Access network gear.
  - Implement least privilege (roles/groups).
- Network policy baselines
  - Default deny for inbound traffic from `public`.
  - Clear rules for:
    - SSH to infra.
    - DB access (from which networks, via which services).
  - Document exceptions and their owners; a tiny rule model (see the sketch after this list) keeps the intent testable.
- Document network diagram
  - Produce one main L2/L3 + VLAN diagram.
  - Store it in git and link it from the platform docs.
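
A hedged sketch of the default-deny baseline as a testable rule table; the segment names, ports, and owners below are illustrative, not prescriptions:

```python
# netpolicy.py -- default-deny rule table for admin and DB flows (illustrative).
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    src: str      # segment name, e.g. "mgmt"
    dst: str
    port: int
    owner: str    # who owns the exception

# Hypothetical baseline: everything not listed here is denied.
ALLOW = [
    Rule("mgmt", "tenant", 22, "sre"),        # SSH to infra via mgmt
    Rule("tenant", "storage", 5432, "dba"),   # app -> Postgres, example only
]

def allowed(src: str, dst: str, port: int) -> bool:
    """Default deny: a flow passes only if an explicit rule covers it."""
    return any(r.src == src and r.dst == dst and r.port == port for r in ALLOW)

if __name__ == "__main__":
    print(allowed("public", "mgmt", 22))   # False: no rule, denied by default
    print(allowed("mgmt", "tenant", 22))   # True: explicit SSH exception
```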
3. Virtualization / Platform Layer Checklist (OpenStack/Proxmox)
3.1 Platform Scope & Roles
- Choose role split
  - Decide what runs on:
    - OpenStack (multi-tenant workloads).
    - Proxmox (infra VMs, special cases).
    - Bare metal only (DBs, specific storage, heavy GPU jobs, etc).
  - Write this down as “What runs where” guidance.
3.2 OpenStack (or primary cloud platform)
- Projects & tenancy model
  - Decide the project structure: `project = team + environment` (e.g. `payments-prod`) or `project = team` (environments as labels).
  - Define naming conventions for projects, security groups, networks, and instances.
- Flavors & quotas
  - Define a small set of standard flavors: `small`, `medium`, `large`, `gpu-small`, `gpu-large`, etc.
  - Set default quotas per project: CPU, RAM, disk, number of instances, GPUs.
  - Document the process for requesting quota increases.
- Networks for tenants
  - Standard pattern for tenant networks:
    - One internal network per project.
    - Optional external/public network attachment rules.
  - Standard floating IP usage rules (who can get one, how many).
  - The sketch after this list shows how flavors and per-project networks can be declared in code instead of clicked together.
- Control plane hardening & HA
  - Run key components in HA (if feasible): API, schedulers, message queue, DB.
  - Enable TLS where possible for dashboard/API endpoints.
  - Ensure backups for the OpenStack DB + configs.
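
A hedged openstacksdk sketch of the flavor and per-project network patterns above; the `platform` clouds.yaml entry, flavor sizes, placeholder CIDR, and the `payments-prod` project are assumptions for illustration:

```python
# tenancy.py -- declare standard flavors and a per-project internal network
# with openstacksdk. Sketch only: assumes a clouds.yaml entry named "platform"
# with admin rights; sizes and names are illustrative.
import openstack

FLAVORS = {
    # name: (vcpus, ram_mb, disk_gb)
    "small":  (2, 4096, 40),
    "medium": (4, 8192, 80),
    "large":  (8, 16384, 160),
}

def ensure_flavors(conn) -> None:
    existing = {f.name for f in conn.compute.flavors()}
    for name, (vcpus, ram, disk) in FLAVORS.items():
        if name not in existing:
            conn.compute.create_flavor(name=name, vcpus=vcpus, ram=ram, disk=disk)

def ensure_project_network(conn, project: str) -> None:
    net_name = f"{project}-internal"
    if conn.network.find_network(net_name) is None:
        net = conn.network.create_network(name=net_name)
        conn.network.create_subnet(
            network_id=net.id, name=f"{net_name}-subnet",
            ip_version=4, cidr="192.168.100.0/24",  # placeholder CIDR
        )

if __name__ == "__main__":
    conn = openstack.connect(cloud="platform")
    ensure_flavors(conn)
    ensure_project_network(conn, "payments-prod")
```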
3.3 Proxmox (or secondary virtualization platform)
- Scope definition
  - Decide clearly which workloads belong here (infra services? special vendor appliances?).
  - Avoid overlapping with OpenStack use cases where possible.
- Resource & naming policy
  - Define naming for Proxmox clusters and VMs.
  - Decide whether teams get self-service Proxmox or it's SRE-only.
3.4 Configuration Management
- IaC coverage
  - Ensure configs for OpenStack projects, networks, and flavors, plus Proxmox clusters and key VMs, are stored as code (Ansible, Terraform, etc.) and not just in the UI.
- Reproducibility
  - Test that you can:
    - Recreate a project and its associated resources from code (see the smoke-test sketch after this list).
    - Rebuild a critical controller VM (from base image + config management).
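
One possible shape for the “recreate a project from code” smoke test, again via openstacksdk; the cloud entry and project name are assumptions, and it imports the hypothetical `tenancy` module from the earlier sketch:

```python
# smoke_recreate.py -- prove a project can be rebuilt from code alone.
# Sketch: assumes a clouds.yaml entry "platform" with identity admin rights,
# and reuses ensure_project_network() from the tenancy sketch above.
import openstack
from tenancy import ensure_project_network

def recreate(conn, name: str) -> None:
    project = conn.identity.find_project(name)
    if project is not None:
        conn.identity.delete_project(project)       # destructive: test env only!
    conn.identity.create_project(name=name, description="recreated by smoke test")
    ensure_project_network(conn, name)

if __name__ == "__main__":
    conn = openstack.connect(cloud="platform")
    recreate(conn, "smoke-test-project")
    assert conn.identity.find_project("smoke-test-project") is not None
    print("project recreated from code")
```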
4. Self-Service Layer Checklist (APIs → CLI → UX)
4.1 Define Core Self-Service Use Cases
- List top flows
  - “Create new project/environment for a team.”
  - “Provision compute (VM) for a service.”
  - “Request GPU capacity.”
  - “Onboard a new service to monitoring.”
  - “See my project's resource usage.”
- For each flow:
  - Define required inputs.
  - Define outputs and the completion condition.
  - Identify which platform components are touched (OpenStack, MAAS, observability, etc); a structured flow spec (see the sketch after this list) makes this reviewable.
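
One way to make flow specs reviewable is a small typed record per flow; the schema and the two entries below are illustrative:

```python
# flows.py -- structured spec for a self-service flow (illustrative schema).
from dataclasses import dataclass

@dataclass(frozen=True)
class Flow:
    name: str
    inputs: tuple[str, ...]        # what the requester must provide
    outputs: tuple[str, ...]       # what "done" produces
    components: tuple[str, ...]    # platform pieces the flow touches

FLOWS = [
    Flow(
        name="request-vm",
        inputs=("project", "flavor", "image", "network"),
        outputs=("instance id", "IP address", "monitoring enrolled"),
        components=("OpenStack", "observability"),
    ),
    Flow(
        name="request-gpu",
        inputs=("project", "gpu flavor", "duration"),
        outputs=("instance id with GPU attached",),
        components=("OpenStack", "MAAS tags", "observability"),
    ),
]
```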
4.2 API / CLI Design
- Choose primary interface
  - Decide: CLI, internal API, or both as the canonical interface.
  - Document: “All self-service flows must be available via X.”
- Implement a minimal CLI/API for key flows
  - `create-project` / `create-namespace`.
  - `request-vm` (or template-based: `create-service`).
  - `request-gpu` with constraints and limits.
  - `show-usage` (CPU/RAM/GPU/storage per project).
- Guardrails
  - Enforce:
    - Naming standards (team/env in names).
    - Quotas (fail fast if over).
  - Log all actions centrally.
  - A CLI skeleton covering these commands and guardrails is sketched after this list.
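
A minimal `argparse` skeleton for the CLI above; the command names follow the list, the naming regex and logging setup are assumptions, and the real dispatch to platform APIs is left as a stub:

```python
# platformctl.py -- skeleton for the self-service CLI (names are illustrative).
# The real commands would call the platform APIs; here they only validate and log.
import argparse
import logging
import re
import sys

NAME_RE = re.compile(r"^[a-z0-9]+-(dev|staging|prod)$")  # team-env naming guardrail
log = logging.getLogger("platformctl")

def check_name(name: str) -> None:
    if not NAME_RE.fullmatch(name):
        sys.exit(f"error: {name!r} does not match <team>-<env> naming standard")

def main() -> None:
    logging.basicConfig(level=logging.INFO)  # stand-in for central action logging
    parser = argparse.ArgumentParser(prog="platformctl")
    sub = parser.add_subparsers(dest="command", required=True)

    p = sub.add_parser("create-project")
    p.add_argument("name")

    v = sub.add_parser("request-vm")
    v.add_argument("project")
    v.add_argument("--flavor", default="small")

    g = sub.add_parser("request-gpu")
    g.add_argument("project")
    g.add_argument("--count", type=int, default=1)

    u = sub.add_parser("show-usage")
    u.add_argument("project")

    args = parser.parse_args()
    if args.command == "create-project":
        check_name(args.name)
    log.info("action=%s args=%s", args.command, vars(args))
    # ... dispatch to the platform APIs here ...

if __name__ == "__main__":
    main()
```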
4.3 Golden Paths & Templates
- Service templates
  - Provide:
    - An example app repo with a CI/CD pipeline ready.
    - An example deployment manifest (VM/K8s/etc).
    - Built-in monitoring/logging configuration.
- Onboarding checklists
  - New service checklist:
    - Project created.
    - Monitoring enabled.
    - Alerts defined.
    - Dashboards created.
    - Secret management integrated.
4.4 Documentation & Feedback
- Platform docs
  - A “How to get started” guide for:
    - New engineers.
    - New services.
  - An FAQ covering:
    - “Where do I run X?”
    - “How do I get more quota?”
- Feedback loop
  - Set up a channel (Slack/discussions/form) for platform feedback.
  - Review and triage feedback monthly into a platform backlog.
5. Cross-Cutting Checklist (Observability + GitOps + Failure)
5.1 Observability
- Telemetry baseline
  - Every node:
    - Exposes metrics (node exporter or equivalent).
    - Sends logs to a central store with site/role tags.
  - Every platform service (MAAS, OpenStack, Proxmox, VPN, bastion):
    - Has metrics and basic dashboards.
  - A probe like the sketch after this list can flag nodes whose metrics endpoint went dark.
- Platform dashboards
  - Cluster capacity overview (CPU/RAM/storage/GPU).
  - Provisioning pipeline health (errors, durations).
  - Per-project usage dashboards.
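
A small probe sketch, assuming node exporter on its default port 9100; the hostnames are placeholders:

```python
# probe_metrics.py -- confirm every node still answers on its metrics port.
# Sketch: assumes node exporter's default port 9100; hostnames are placeholders.
from urllib.request import urlopen
from urllib.error import URLError

NODES = ["compute-01.example.internal", "gpu-01.example.internal"]  # placeholders
PORT = 9100  # node exporter default

def main() -> None:
    for node in NODES:
        url = f"http://{node}:{PORT}/metrics"
        try:
            with urlopen(url, timeout=5) as resp:
                ok = resp.status == 200
        except (URLError, OSError):
            ok = False
        print(f"{'OK' if ok else 'DOWN'}: {url}")

if __name__ == "__main__":
    main()
```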
5.2 GitOps
- Repositories
  - At least:
    - `infra-baremetal`.
    - `infra-network`.
    - `infra-platform` (OpenStack/Proxmox).
    - `infra-observability`.
  - Each has a clear README and ownership.
- Change process
  - All changes go through PRs.
  - CI validates syntax, lint, and (where possible) dry runs.
  - Changes are deployed via pipelines, not ad-hoc scripts.
5.3 Failure & Recovery
- Document failure scenarios
  - Single node failure.
  - Rack switch failure.
  - Loss of MAAS / OpenStack API for a period.
  - Partial network partition (e.g. mgmt vs tenant).
- For each scenario:
  - Define expected behavior.
  - Define manual/automatic recovery steps.
  - Run at least one game day per quarter to validate; a scenario registry like the sketch below keeps the catalogue in code next to the runbooks.
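
A possible scenario registry, keeping the catalogue in code; the schema and sample entries are illustrative:

```python
# scenarios.py -- failure-scenario registry for game days (illustrative schema).
from dataclasses import dataclass

@dataclass(frozen=True)
class Scenario:
    name: str
    expected: str       # what the platform should do on its own
    recovery: str       # manual/automatic steps, or a link to the runbook

SCENARIOS = [
    Scenario(
        name="single node failure",
        expected="workloads reschedule; capacity dashboard shows the gap",
        recovery="replace or reinstall the node via MAAS, verify role tag",
    ),
    Scenario(
        name="loss of OpenStack API",
        expected="running VMs unaffected; provisioning paused and alerting fires",
        recovery="restore control plane from backups; see runbook",
    ),
]

if __name__ == "__main__":
    for s in SCENARIOS:
        print(f"- {s.name}: expect '{s.expected}'")
```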