EUstartup SRE / DevOps Council — Living Design Doc & Playbook
- Status: Draft v0.1
- Scope: All infrastructure, SRE, platform, and DevEx work for EUstartup.
- Owner: Principal SRE (or acting)
- Last Updated:
1. Purpose & North Star
The SRE / DevOps Council is the decision and coordination forum for everything related to:
- Infrastructure (bare-metal, virtualization, cloud, networking).
- Reliability, scalability, and performance of EUstartup’s products.
- Developer & ML engineer experience on the platform.
- Cost, capacity, and operational excellence.
North Star:
"Provide a secure, reliable, and cost-aware platform that enables EUstartup teams to ship and operate products quickly and safely."
We use this document as a living playbook:
- To clarify who decides what, and how.
- To capture current architecture and desired evolution.
- To document standard operating procedures (SOPs) and runbooks.
- To onboard new engineers into how EUstartup runs infrastructure.
2. Council Structure & Seats
The council is a virtual table with the following seats. One person may hold multiple seats; a seat may be empty but is still a clearly defined responsibility.
2.1 Seats at the Table
- Principal SRE & Bare-Metal Architect
  - Owns end-to-end infra architecture & reliability.
  - Focus: MAAS/Ironic, PXE/Preseed/cloud-init, Debian/Ubuntu image lifecycle, multi-site bare-metal design & failure modes.
- Network & Security Architect
  - Owns network and security architecture.
  - Focus: VLANs, L2/L3 routing, VPN/WAN/UniFi, segmentation, SSH/API hardening, remote admin security.
- Platform & Virtualization Engineer
  - Owns virtualization and platform abstraction.
  - Focus: OpenStack (Kolla-Ansible), Proxmox, ESXi, resource quotas, integration of bare-metal with virtualization.
- Automation & GitOps Lead
  - Owns automation standards and GitOps workflows.
  - Focus: Ansible, CI/CD, repos and branching models, secrets, Bash/Python tooling.
- Observability & Incident Analyst
  - Owns telemetry stack and incident analysis.
  - Focus: Prometheus/Grafana, ELK/Loki/Graylog, SLOs/SLIs, alerting, postmortems.
- FinOps & Capacity Planner
  - Owns capacity planning & cost visibility.
  - Focus: compute/GPU tracking, power & licenses, amortization, unit economics.
- Product & Developer Experience Partner
  - Represents product teams and ML/dev users of the platform.
  - Focus: self-service APIs, UX of tooling, docs, templates.
- Talent & Remote-Culture Partner (conditional)
  - Added when hiring or changing roles.
  - Focus: role definitions, interview loops, remote-first norms.
2.2 Current Seat Assignments (EUstartup)
Fill this table with actual names/aliases.
| Seat | Primary | Backup | Notes |
|---|---|---|---|
| Principal SRE & Bare-Metal Architect | TBD | TBD | |
| Network & Security Architect | TBD | TBD | |
| Platform & Virtualization Engineer | TBD | TBD | |
| Automation & GitOps Lead | TBD | TBD | |
| Observability & Incident Analyst | TBD | TBD | |
| FinOps & Capacity Planner | TBD | TBD | |
| Product & DevEx Partner | TBD | TBD | |
| Talent & Remote-Culture Partner | TBD | TBD | Only active when hiring |
3. Council Operating Model
3.1 Cadence & Rituals
Weekly Council Sync (30–45 min)
- Goal: Review changes, risks, incidents, and upcoming work.
- Attendees: All seats or their delegates.
- Agenda template:
  - Quick round: changes shipped last week.
  - Incidents & reliability review (SLOs, recurring alerts).
  - Capacity & cost updates (any red flags?).
  - Upcoming infra changes / RFCs.
  - Developer experience feedback.
Monthly Architecture Review (60–90 min)
- Goal: Step back and review medium/long-term architecture and roadmap.
- Focus: new cluster designs, major migrations, deprecations.
Postmortems (per major incident)
- Goal: Learn and adapt. No blame.
- Outcome: updated runbooks, alerts, docs, or architecture decisions.
3.2 Decision Types & Authority
Type A – Safety & Reliability (e.g., SLOs, incident response, rollback policies)
- Primary: Principal SRE & Observability.
- Others consulted: Platform, Network/Security.
Type B – Architecture & Platform (e.g., choosing OpenStack vs. Proxmox for a new workload)
- Primary: Principal SRE & Platform Engineer.
- Others consulted: Network/Security, Automation/GitOps, DevEx.
Type C – Process & Ways of Working (e.g., Git branching model, incident process)
- Primary: Automation/GitOps, Principal SRE.
- Others consulted: DevEx, Talent & Remote-Culture.
Type D – Cost & Capacity (e.g., when to buy more GPUs, rightsizing strategy)
- Primary: FinOps & Capacity Planner.
- Others consulted: Principal SRE, Platform, Product.
3.3 RACI Overview (Simplified)
Customize as needed.
| Area | R (Responsible) | A (Accountable) | C (Consulted) | I (Informed) |
|---|---|---|---|---|
| Bare-metal architecture | Principal SRE | Principal SRE | Platform, Network/Sec | FinOps, DevEx |
| Network & security baseline | Network/Sec | Network/Sec | Principal SRE | Everyone |
| Virtualization platform choices | Platform | Principal SRE | FinOps, DevEx | All product teams |
| CI/CD & GitOps workflow | Automation/GitOps | Principal SRE | DevEx, Security | All engineers |
| Observability stack | Observability | Principal SRE | Platform, DevEx | All teams |
| SLOs / SLIs | Observability | Principal SRE | Product, DevEx | All |
| Capacity planning & purchases | FinOps | CTO / VP Eng | Principal SRE, Platform | Finance |
| Developer self-service API | DevEx | Principal SRE | Platform, Automation | Product teams |
4. Platform & Architecture Overview
Keep this section updated as the platform evolves.
4.1 High-Level Environment
- Regions / sites:
  - Primary DC: `eu-1` (location: …)
  - Secondary DC / DR: `eu-2` (location: …)
  - Optional: cloud provider (e.g., `eu-central-1`), used for burst / specific services.
- Primary workloads:
  - User-facing web/API services.
  - Data pipelines and batch jobs.
  - ML/AI training and inference with GPUs.
4.2 Bare-Metal Layer (Principal SRE)
- Provisioning stack:
  - MAAS / Ironic for hardware management.
  - PXE boot → Preseed / cloud-init for OS provisioning.
  - Standard OS: Debian/Ubuntu LTS images.
- Image lifecycle:
  - Base image repo and versioning.
  - Hardening baseline (SSH config, packages, security updates).
  - Golden images for specific roles (compute, storage, GPU nodes, control-plane, infra-services).
- Multi-site considerations:
  - Separation of control (per-site MAAS controllers or shared?).
  - Cross-site failover strategy and RTO/RPO targets.
4.3 Network & Security (Network/Sec Architect)
- Network layout:
  - Core VLANs: `mgmt`, `storage`, `public`, `tenant`, `backup`, `out-of-band`.
  - Routing: L3 boundaries and where firewalls apply.
- Security:
  - SSH & bastion host policy.
  - VPN topology for remote admins.
  - Zero-trust-ish: per-service authN/Z where feasible.
4.4 Virtualization / Platform Layer (Platform Engineer)
- Stacks in use:
  - OpenStack (Kolla-Ansible) for multi-tenant compute/network/storage.
  - Proxmox for infra VMs / special workloads.
  - ESXi (if applicable) for legacy / vendor-specific needs.
- Tenancy model:
  - Projects per product team / environment.
  - Resource quotas: CPU, RAM, storage, GPU.
  - Naming, tags/labels, and chargeback/FinOps integration.
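The per-project quota idea above can be sketched as a simple pre-flight check. This is a minimal illustration only: the `Resources` type, the `exceeded` helper, the units, and all numbers are hypothetical, and it does not call any OpenStack or Proxmox API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    """Resource amounts for one project (field names and units are assumptions)."""
    vcpus: int
    ram_gib: int
    storage_gib: int
    gpus: int


def exceeded(quota: Resources, used: Resources, requested: Resources) -> list[str]:
    """Return the resource names that the request would push over quota."""
    over = []
    for field in ("vcpus", "ram_gib", "storage_gib", "gpus"):
        if getattr(used, field) + getattr(requested, field) > getattr(quota, field):
            over.append(field)
    return over


# Illustrative numbers only.
quota = Resources(vcpus=64, ram_gib=256, storage_gib=2000, gpus=2)
used = Resources(vcpus=48, ram_gib=128, storage_gib=500, gpus=2)
request = Resources(vcpus=8, ram_gib=64, storage_gib=100, gpus=1)
print(exceeded(quota, used, request))  # ['gpus']
```

In practice the quota and usage figures would come from the platform itself; the sketch only shows the comparison a self-service request could run before it is submitted.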
4.5 Control Plane & Core Services
List and describe:
- Configuration management (Ansible structure, repos, roles).
- CI/CD tooling (e.g., GitLab CI, GitHub Actions, Jenkins).
- Secrets management (HashiCorp Vault / SOPS / KMS / etc.).
- Artifact/image registry.
- Observability stack.
5. Automation & GitOps Standards
5.1 Repositories & Branching
- Infra as Code repos (examples):
  - `infra-baremetal`: MAAS/Ironic configs, PXE, images.
  - `infra-network`: network definitions, firewall rules (where possible).
  - `infra-platform`: OpenStack/Proxmox configs, Kolla-Ansible.
  - `infra-observability`: dashboards, alert rules.
- Branch naming and policies:
  - `main` is always deployable.
  - Feature branches: `feat/<area>-<shortdesc>`.
  - Hotfix branches: `fix/<incidentid>-<shortdesc>`.
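The branch naming policy could be enforced in CI with a small check like the one below. This is a sketch under assumptions: the playbook fixes only the `feat/` and `fix/` prefixes, so the restriction to lowercase alphanumeric segments joined by hyphens, and the names `BRANCH_RE` and `is_valid_branch`, are illustrative choices.

```python
import re

# Assumption: <area>, <incidentid> and <shortdesc> are lowercase alphanumeric
# segments joined by hyphens; at least two segments must follow the prefix.
BRANCH_RE = re.compile(
    r"(?:main|(?:feat|fix)/[a-z0-9]+-[a-z0-9]+(?:-[a-z0-9]+)*)"
)


def is_valid_branch(name: str) -> bool:
    """Return True if the branch name follows the naming policy sketched above."""
    return BRANCH_RE.fullmatch(name) is not None


for candidate in ("main", "feat/network-vlan-cleanup", "fix/inc123-rollback", "feature/foo"):
    print(candidate, is_valid_branch(candidate))
```

A check like this would typically run as a CI job or a server-side hook, so that non-conforming branches are caught before review.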
5.2 Ansible & Idempotency
- Role structure:
  - Roles grouped by domain (e.g., `base_os`, `k8s_node`, `openstack_compute`).
  - Strict idempotency: plays can be re-run at any time and make no further changes once the target state is reached.
- Inventories:
  - Dynamic inventories from MAAS/OpenStack.
  - Grouping by role and site.
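Grouping by role and site can be sketched as a function that turns machine records into the JSON layout Ansible expects from an inventory script (`--list` output with a `_meta.hostvars` section). The input records, field names, group prefixes, and addresses below are hypothetical; a real script would fetch the records from MAAS or OpenStack rather than from a hard-coded list.

```python
import json


def group_name(prefix: str, value: str) -> str:
    """Build a group name; hyphens become underscores to keep names Ansible-friendly."""
    return f"{prefix}_{value.replace('-', '_')}"


def build_inventory(machines: list[dict]) -> dict:
    """Group hosts by role and by site in the inventory-script JSON layout."""
    inventory: dict = {"_meta": {"hostvars": {}}}
    for machine in machines:
        host = machine["hostname"]
        groups = (group_name("role", machine["role"]), group_name("site", machine["site"]))
        for group in groups:
            inventory.setdefault(group, {"hosts": []})["hosts"].append(host)
        inventory["_meta"]["hostvars"][host] = {"ansible_host": machine["ip"]}
    return inventory


# Hypothetical records; addresses are from the documentation range 192.0.2.0/24.
MACHINES = [
    {"hostname": "gpu-01", "role": "gpu", "site": "eu-1", "ip": "192.0.2.11"},
    {"hostname": "compute-01", "role": "compute", "site": "eu-1", "ip": "192.0.2.12"},
    {"hostname": "compute-02", "role": "compute", "site": "eu-2", "ip": "192.0.2.21"},
]

if __name__ == "__main__":
    print(json.dumps(build_inventory(MACHINES), indent=2, sort_keys=True))
```

With groups such as `role_compute` and `site_eu_1`, a play can target one role in one site using an intersection pattern like `role_compute:&site_eu_1`.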
5.3 GitOps Flow
- All infra changes:
  - PR with description, risk, and rollback plan.
  - Peer review by at least one council seat.
  - CI validation (lint, syntax checks, dry-runs where possible).
  - Merge → automated apply (or controlled pipelines with approval gates).
- Secrets:
  - Never stored in plain text.
  - Clear guidance on rotation and access.
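The requirement that every PR carries a description, risk, and rollback plan can be checked mechanically in CI. The sketch below assumes the PR template uses markdown headings whose text starts with those words; that template format, and the names `REQUIRED_SECTIONS` and `missing_sections`, are assumptions rather than existing tooling.

```python
REQUIRED_SECTIONS = ("description", "risk", "rollback")


def missing_sections(pr_body: str) -> list[str]:
    """Return the required sections that have no matching heading in the PR body."""
    headings = [
        line.lstrip("#").strip().lower()
        for line in pr_body.splitlines()
        if line.startswith("#")
    ]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(heading.startswith(section) for heading in headings)
    ]


example_body = """## Description
Rotate the bastion host SSH keys.

## Risk
Low: keys are rolled one host at a time.
"""
print(missing_sections(example_body))  # ['rollback']
```

A pipeline step could fail the build when the returned list is non-empty, which keeps the rollback plan from being forgotten under time pressure.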
6. Observability & Incident Management
6.1 Telemetry Stack
- Metrics: Prometheus (+ exporters), visualized in Grafana.
- Logs: ELK / Loki / Graylog (pick one and document).
- Traces (if applicable): OpenTelemetry / Jaeger / Tempo, etc.
6.2 SLOs & SLIs
For each important service (API, internal platform, etc.):
- Define SLIs (e.g., availability, latency, error rate).
- Define SLOs (targets measured over 30-day windows).
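An SLO over a 30-day window translates directly into an error budget. The following sketch shows the arithmetic for an availability SLO; the function names and the 99.9% target are illustrative, not agreed values.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent, based on request counts."""
    if total_events == 0:
        return 1.0
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


# Illustrative target: 99.9% availability over 30 days allows about 43.2 minutes.
print(round(error_budget_minutes(0.999), 1))
# 500 failed requests out of 1,000,000 consume about half of that budget.
print(round(budget_remaining(0.999, 999_500, 1_000_000), 2))
```

A negative result from `budget_remaining` means the SLO has already been missed for the window.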
6.3 Alerting Strategy
- Principles:
  - Alerts must be actionable.
  - Tie alerts to SLOs and clear runbooks.
- Categories:
  - Page: wake someone up (critical, user-impacting).
  - Ticket: needs attention during working hours.
  - Dashboard-only: informational.
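The three categories can be read as a small routing rule. The sketch below is one possible interpretation of the principles above, not an agreed policy: the `route_alert` function and its three boolean inputs are hypothetical, and real routing would normally live in the alerting configuration rather than in Python.

```python
def route_alert(actionable: bool, critical: bool, user_impacting: bool) -> str:
    """Map an alert to 'page', 'ticket', or 'dashboard' per the categories above."""
    if not actionable:
        # Nothing a human can do right now: keep it informational.
        return "dashboard"
    if critical and user_impacting:
        # Worth waking someone up.
        return "page"
    # Actionable, but can wait for working hours.
    return "ticket"


print(route_alert(actionable=True, critical=True, user_impacting=True))    # page
print(route_alert(actionable=True, critical=False, user_impacting=True))   # ticket
print(route_alert(actionable=False, critical=True, user_impacting=False))  # dashboard
```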
6.4 Incident Response
- Severity levels: SEV-1, SEV-2, SEV-3 (define examples).
- Roles during incident:
  - Incident Commander.
  - Communications lead (internal & external).
  - Domain experts (network, platform, etc.).
- Timeline:
  - Detection & triage.
  - Containment / mitigation.
  - Root cause analysis.
  - Postmortem within X business days.
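Once the council has chosen a value for X, the postmortem deadline can be computed rather than tracked by hand. The sketch below counts Monday to Friday only and ignores public holidays, which is a simplifying assumption; the `add_business_days` helper is hypothetical and the value 3 in the example is purely illustrative, since X is still undefined.

```python
from datetime import date, timedelta


def add_business_days(start: date, business_days: int) -> date:
    """Return the date `business_days` working days (Mon-Fri) after `start`."""
    if business_days < 0:
        raise ValueError("business_days must be non-negative")
    current = start
    remaining = business_days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return current


# Incident resolved on Thursday 2024-05-02; with an illustrative X of 3,
# the postmortem would be due on Tuesday 2024-05-07.
print(add_business_days(date(2024, 5, 2), 3))
```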
6.5 Postmortems
- Always blameless.
- Template includes:
  - What happened.
  - Timeline.
  - Impact.
  - Contributing factors.
  - What worked, what didn’t.
  - Follow-up actions (with owners & due dates).
7. FinOps & Capacity Planning
7.1 Asset & Usage Tracking
- Compute & GPU inventory (per node, per cluster, per tenant).
- Storage usage by project/team.
- Power consumption and rack utilization where possible.
7.2 Cost Model (Even On-Prem)
- Components:
  - Hardware amortization.
  - Power & cooling.
  - Licenses & support.
  - Staff time (approximate).
- Map costs to:
  - Projects / teams.
  - Environments (prod, staging, dev).
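The cost components above can be combined into a rough monthly cost per node and then split across teams by usage. This is a sketch with made-up numbers: straight-line amortization, a PUE-style multiplier for cooling, 730 hours per month, and usage-proportional allocation are all assumptions, and staff time is left out for brevity.

```python
def monthly_node_cost(
    purchase_price: float,
    amortization_months: int,
    power_kw: float,
    price_per_kwh: float,
    pue: float = 1.0,
    monthly_licenses: float = 0.0,
    hours_per_month: float = 730.0,
) -> float:
    """Rough monthly cost of one node: amortized hardware + power/cooling + licenses."""
    hardware = purchase_price / amortization_months
    power_and_cooling = power_kw * pue * hours_per_month * price_per_kwh
    return hardware + power_and_cooling + monthly_licenses


def allocate(total_cost: float, usage_by_team: dict[str, float]) -> dict[str, float]:
    """Split a cost across teams in proportion to their usage (e.g., GPU-hours)."""
    total_usage = sum(usage_by_team.values())
    if total_usage == 0:
        return {team: 0.0 for team in usage_by_team}
    return {team: total_cost * usage / total_usage for team, usage in usage_by_team.items()}


# Illustrative figures only: a 36,000 node amortized over 36 months, drawing 1 kW.
cost = monthly_node_cost(36_000, 36, power_kw=1.0, price_per_kwh=0.25,
                         pue=1.5, monthly_licenses=50.0)
print(round(cost, 2))
print(allocate(cost, {"ml": 600, "api": 200}))
```

The same allocation function works per environment (prod, staging, dev) if usage is tracked at that granularity.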
7.3 Capacity Planning Cycle
- Monthly/quarterly review:
  - Utilization vs. headroom.
  - Forecast upcoming projects.
  - Decide on new purchases vs cloud burst.
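The utilization-versus-headroom review can be backed by a simple forecast. The sketch below assumes linear growth and an 80% utilization threshold as the trigger for a purchase or cloud-burst decision; both the growth model and the threshold are assumptions, and the helper names are hypothetical.

```python
from typing import Optional


def headroom(capacity: float, used: float) -> float:
    """Fraction of capacity that is still free."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return 1.0 - used / capacity


def months_until_threshold(
    capacity: float, used: float, monthly_growth: float, threshold: float = 0.8
) -> Optional[float]:
    """Months until utilization reaches the threshold, assuming linear growth.

    Returns 0.0 if the threshold is already reached and None if usage is not growing.
    """
    limit = threshold * capacity
    if used >= limit:
        return 0.0
    if monthly_growth <= 0:
        return None
    return (limit - used) / monthly_growth


# Illustrative: 100 GPUs, 60 in use, demand growing by 5 GPUs per month.
print(round(headroom(100, 60), 2))
print(months_until_threshold(100, 60, 5))
```

Comparing the result with hardware lead times gives a rough signal for when an order has to be placed.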
8. Product & Developer Experience
8.1 User Profiles
- Backend engineers (APIs, services).
- Data engineers.
- ML engineers / researchers.
8.2 Self-Service Interfaces
- APIs and/or CLI for:
  - Provisioning compute/storage.
  - Requesting GPUs.
  - Viewing usage and costs.
- Templates:
  - App scaffolding.
  - CI/CD pipelines.
  - Helm charts or deployment manifests.
8.3 Documentation & Onboarding
- Single entrypoint: "How to use the platform" guide.
- Checklists for:
  - New service onboarding.
  - Adding monitoring & alerting.
  - Security reviews.
9. Talent & Remote Culture (When Active)
9.1 Role Design
- Clear levels (e.g., SRE I/II/III, Principal).
- Example responsibilities per level.
9.2 Hiring Process
- Standard loop:
  - Recruiter/intro.
  - Technical screen (practical, scenario-based).
  - Systems design / deep dive.
  - Culture & collaboration interview.
9.3 Remote-First Norms
- Async communication guidelines.
- Incident handling across time zones.
- Documentation as a first-class artifact.
10. Change Management & How to Update This Doc
10.1 Updating the Playbook
- This doc lives in version control (`/docs/platform/sre-devops-council.md` or similar).
- Changes follow the same PR + review flow as infra changes.
- At least one of: Principal SRE, Platform Engineer, or DevEx Partner must approve.
10.2 Versioning & Changelog
Maintain a short changelog here:
- `v0.1`: Initial council-based structure for EUstartup.
- `v0.x`: (add entries as you refine architecture, processes, etc.).
11. Open Questions & TODOs
Use this section as a backlog of design and process questions for the council.
- Choose and document canonical logging stack (ELK vs Loki vs Graylog).
- Finalize SLOs for core user-facing services.
- Define GPU allocation policy per team.
- Decide on single primary IaC repo vs multiple domain repos.
- Document exact incident severity matrix & response SLAs.
Add more as they come up in council meetings.