sbanszky 1002c2ad9d Add sre-devops-council.md

2025-12-04 11:00:17 +00:00

EUstartup SRE / DevOps Council — Living Design Doc & Playbook

Status: Draft v0.1
Scope: All infrastructure, SRE, platform, and DevEx work for EUstartup.
Owner: Principal SRE (or acting)
Last Updated:

1. Purpose & North Star

The SRE / DevOps Council is the decision and coordination forum for everything related to:

Infrastructure (bare-metal, virtualization, cloud, networking).
Reliability, scalability, and performance of EUstartup’s products.
Developer & ML engineer experience on the platform.
Cost, capacity, and operational excellence.

North Star:

"Provide a secure, reliable, and cost-aware platform that enables EUstartup teams to ship and operate products quickly and safely."

We use this document as a living playbook:

To clarify who decides what, and how.
To capture current architecture and desired evolution.
To document standard operating procedures (SOPs) and runbooks.
To onboard new engineers into how EUstartup runs infrastructure.

2. Council Structure & Seats

The council is a virtual table with the following seats. One person may hold multiple seats. A seat may be vacant, but the responsibility it carries remains defined and must still be covered.

2.1 Seats at the Table

Principal SRE & Bare-Metal Architect
- Owns end-to-end infra architecture & reliability.
- Focus: MAAS/Ironic, PXE/Preseed/cloud-init, Debian/Ubuntu image lifecycle, multi-site bare-metal design & failure modes.
Network & Security Architect
- Owns network and security architecture.
- Focus: VLANs, L2/L3 routing, VPN/WAN/UniFi, segmentation, SSH/API hardening, remote admin security.
Platform & Virtualization Engineer
- Owns virtualization and platform abstraction.
- Focus: OpenStack (Kolla-Ansible), Proxmox, ESXi, resource quotas, integration of bare-metal with virtualization.
Automation & GitOps Lead
- Owns automation standards and GitOps workflows.
- Focus: Ansible, CI/CD, repos and branching models, secrets, Bash/Python tooling.
Observability & Incident Analyst
- Owns telemetry stack and incident analysis.
- Focus: Prometheus/Grafana, ELK/Loki/Graylog, SLOs/SLIs, alerting, postmortems.
FinOps & Capacity Planner
- Owns capacity planning & cost visibility.
- Focus: compute/GPU tracking, power & licenses, amortization, unit economics.
Product & Developer Experience Partner
- Represents product teams and ML/dev users of the platform.
- Focus: self-service APIs, UX of tooling, docs, templates.
Talent & Remote-Culture Partner (conditional)
- Added when hiring or changing roles.
- Focus: role definitions, interview loops, remote-first norms.

2.2 Current Seat Assignments (EUstartup)

Fill this table with actual names/aliases.

Seat	Primary	Backup	Notes
Principal SRE & Bare-Metal Architect	TBD	TBD
Network & Security Architect	TBD	TBD
Platform & Virtualization Engineer	TBD	TBD
Automation & GitOps Lead	TBD	TBD
Observability & Incident Analyst	TBD	TBD
FinOps & Capacity Planner	TBD	TBD
Product & DevEx Partner	TBD	TBD
Talent & Remote-Culture Partner	TBD	TBD	Only active when hiring

3. Council Operating Model

3.1 Cadence & Rituals

Weekly Council Sync (30–45 min)

Goal: Review changes, risks, incidents, and upcoming work.
Attendees: All seats or their delegates.
Agenda template:
1. Quick round: changes shipped last week.
2. Incidents & reliability review (SLOs, recurring alerts).
3. Capacity & cost updates (any red flags?).
4. Upcoming infra changes / RFCs.
5. Developer experience feedback.

Monthly Architecture Review (60–90 min)

Goal: Step back and review medium/long-term architecture and roadmap.
Focus: new cluster designs, major migrations, deprecations.

Postmortems (per major incident)

Goal: Learn and adapt. No blame.
Outcome: updated runbooks, alerts, docs, or architecture decisions.

3.2 Decision Types & Authority

Type A – Safety & Reliability (e.g., SLOs, incident response, rollback policies)

Primary: Principal SRE & Observability.
Others consulted: Platform, Network/Security.

Type B – Architecture & Platform (e.g., choosing OpenStack vs. Proxmox for a new workload)

Primary: Principal SRE & Platform Engineer.
Others consulted: Network/Security, Automation/GitOps, DevEx.

Type C – Process & Ways of Working (e.g., Git branching model, incident process)

Primary: Automation/GitOps, Principal SRE.
Others consulted: DevEx, Talent & Remote-Culture.

Type D – Cost & Capacity (e.g., when to buy more GPUs, rightsizing strategy)

Primary: FinOps & Capacity Planner.
Others consulted: Principal SRE, Platform, Product.

3.3 RACI Overview (Simplified)

Customize as needed.

Area	R (Responsible)	A (Accountable)	C (Consulted)	I (Informed)
Bare-metal architecture	Principal SRE	Principal SRE	Platform, Network/Sec	FinOps, DevEx
Network & security baseline	Network/Sec	Network/Sec	Principal SRE	Everyone
Virtualization platform choices	Platform	Principal SRE	FinOps, DevEx	All product teams
CI/CD & GitOps workflow	Automation/GitOps	Principal SRE	DevEx, Security	All engineers
Observability stack	Observability	Principal SRE	Platform, DevEx	All teams
SLOs / SLIs	Observability	Principal SRE	Product, DevEx	All
Capacity planning & purchases	FinOps	CTO / VP Eng	Principal SRE, Platform	Finance
Developer self-service API	DevEx	Principal SRE	Platform, Automation	Product teams

4. Platform & Architecture Overview

Keep this section updated as the platform evolves.

4.1 High-Level Environment

Regions / sites:
- Primary DC: eu-1 (location: …)
- Secondary DC / DR: eu-2 (location: …)
- Optional: public cloud region (e.g., eu-central-1), used for burst capacity or specific services.
Primary workloads:
- User-facing web/API services.
- Data pipelines and batch jobs.
- ML/AI training and inference with GPUs.

4.2 Bare-Metal Layer (Principal SRE)

Provisioning stack:
- MAAS / Ironic for hardware management.
- PXE boot → Preseed / cloud-init for OS provisioning.
- Standard OS: Debian/Ubuntu LTS images.
Image lifecycle:
- Base image repo and versioning.
- Hardening baseline (SSH config, packages, security updates).
- Golden images for specific roles (compute, storage, GPU nodes, control-plane, infra-services).
Multi-site considerations:
- Separation of control (per-site MAAS controllers or shared?).
- Cross-site failover strategy and RTO/RPO targets.

4.3 Network & Security (Network/Sec Architect)

Network layout:
- Core VLANs: mgmt, storage, public, tenant, backup, out-of-band.
- Routing: L3 boundaries and where firewalls apply.
Security:
- SSH & bastion host policy.
- VPN topology for remote admins.
- Zero-trust-leaning: per-service authentication and authorization (authN/authZ) where feasible.
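
One property of the VLAN layout above that can be checked mechanically is that no two VLAN subnets overlap. The sketch below uses Python's standard `ipaddress` module. The CIDR ranges and the `overlapping_vlans` helper are illustrative assumptions, not EUstartup's actual addressing plan.

```python
import ipaddress
from itertools import combinations

# Hypothetical addressing plan: the VLAN names come from this document,
# the CIDR ranges are placeholders chosen only for illustration.
VLAN_SUBNETS = {
    "mgmt": "10.10.0.0/24",
    "storage": "10.10.1.0/24",
    "public": "203.0.113.0/24",
    "tenant": "10.20.0.0/16",
    "backup": "10.10.2.0/24",
    "out-of-band": "10.10.3.0/24",
}


def overlapping_vlans(subnets):
    """Return sorted pairs of VLAN names whose subnets overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    return [
        (a, b)
        for a, b in combinations(sorted(nets), 2)
        if nets[a].overlaps(nets[b])
    ]


print(overlapping_vlans(VLAN_SUBNETS))  # [] for the placeholder plan above
```

A check like this could run in CI against the infra-network repo so that an overlapping range is caught at review time rather than on the switch.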

4.4 Virtualization / Platform Layer (Platform Engineer)

Stacks in use:
- OpenStack (Kolla-Ansible) for multi-tenant compute/network/storage.
- Proxmox for infra VMs / special workloads.
- ESXi (if applicable) for legacy / vendor-specific needs.
Tenancy model:
- Projects per product team / environment.
- Resource quotas: CPU, RAM, storage, GPU.
- Naming, tags/labels, and chargeback/FinOps integration.
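
To make the quota idea concrete, here is a minimal sketch of an admission check across the four quota dimensions listed above. The `Resources` type, the `exceeded_dimensions` helper, and all numbers are hypothetical. This is not the OpenStack quota API, only an illustration of the rule "used + requested must not exceed quota".

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Resources:
    """Quota dimensions named in the tenancy model (units are assumptions)."""
    cpu: int = 0         # vCPUs
    ram_gb: int = 0      # GiB of RAM
    storage_gb: int = 0  # GiB of block storage
    gpu: int = 0         # whole GPUs


def exceeded_dimensions(quota, used, request):
    """Return the names of dimensions the request would push over quota."""
    return [
        f.name
        for f in fields(Resources)
        if getattr(used, f.name) + getattr(request, f.name) > getattr(quota, f.name)
    ]


# Hypothetical project: 4 GPUs allowed, 3 already in use.
quota = Resources(cpu=64, ram_gb=256, storage_gb=2000, gpu=4)
used = Resources(cpu=48, ram_gb=128, storage_gb=500, gpu=3)

print(exceeded_dimensions(quota, used, Resources(cpu=8, ram_gb=32, gpu=2)))
# ['gpu']
```

Returning the list of offending dimensions, rather than a bare boolean, lets a self-service tool tell the requester exactly which quota to raise.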

4.5 Control Plane & Core Services

List and describe:

Configuration management (Ansible structure, repos, roles).
CI/CD tooling (e.g., GitLab CI, GitHub Actions, Jenkins).
Secrets management (HashiCorp Vault / SOPS / KMS / etc.).
Artifact/image registry.
Observability stack.

5. Automation & GitOps Standards

5.1 Repositories & Branching

Infra as Code repos (examples):
- infra-baremetal — MAAS/Ironic configs, PXE, images.
- infra-network — network definitions, firewall rules (where possible).
- infra-platform — OpenStack/Proxmox configs, Kolla-Ansible.
- infra-observability — dashboards, alert rules.
Branch naming and policies:
- main is always deployable.
- Feature branches: feat/<area>-<shortdesc>.
- Hotfix branches: fix/<incidentid>-<shortdesc>.
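
The naming policy above can be enforced with a small check in CI or a pre-push hook. The sketch below is one possible reading of the convention: it assumes lowercase alphanumeric segments separated by hyphens, and that incident IDs contain no hyphen of their own. Both are assumptions, as is the `is_valid_branch` name.

```python
import re

# main, feat/<area>-<shortdesc>, or fix/<incidentid>-<shortdesc>.
# Assumption: segments are lowercase letters/digits, joined by single hyphens.
BRANCH_RE = re.compile(
    r"(?:main|(?:feat|fix)/[a-z0-9]+-[a-z0-9]+(?:-[a-z0-9]+)*)"
)


def is_valid_branch(name: str) -> bool:
    """Return True if the branch name follows the documented convention."""
    return BRANCH_RE.fullmatch(name) is not None


for branch in ("main", "feat/network-add-vlan", "fix/inc123-rollback-kolla",
               "feature/foo", "feat/network"):
    print(f"{branch}: {is_valid_branch(branch)}")
```

If incident IDs take a form such as `INC-123`, the pattern would need a dedicated group for the ID rather than the generic segment used here.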

5.2 Ansible & Idempotency

Role structure:
- Roles grouped by domain (e.g., base_os, k8s_node, openstack_compute).
- Strict idempotency: re-running a play against an already-converged host makes no further changes and reports none.
Inventories:
- Dynamic inventories from MAAS/OpenStack.
- Grouping by role and site.
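
As an illustration of grouping by role and site, the sketch below builds the JSON structure that Ansible expects from an inventory script (`--list` output with a `_meta.hostvars` section). The host records are hard-coded placeholders here. In practice they would come from the MAAS or OpenStack API, which this sketch does not call. The `build_inventory` helper and the `role_` / `site_` group prefixes are assumptions.

```python
import json
from collections import defaultdict


def build_inventory(hosts):
    """Group hosts by role and by site in Ansible's inventory-script JSON shape."""
    groups = defaultdict(set)
    hostvars = {}
    for host in hosts:
        name, role, site = host["name"], host["role"], host["site"]
        # Group names use underscores; hyphens in group names trigger Ansible warnings.
        groups["role_" + role.replace("-", "_")].add(name)
        groups["site_" + site.replace("-", "_")].add(name)
        hostvars[name] = {"role": role, "site": site}
    inventory = {group: {"hosts": sorted(members)}
                 for group, members in sorted(groups.items())}
    inventory["_meta"] = {"hostvars": hostvars}
    return inventory


# Placeholder records; real data would be fetched from MAAS/OpenStack.
HOSTS = [
    {"name": "gpu-01", "role": "gpu", "site": "eu-1"},
    {"name": "gpu-02", "role": "gpu", "site": "eu-2"},
    {"name": "stor-01", "role": "storage", "site": "eu-1"},
]

print(json.dumps(build_inventory(HOSTS), indent=2))
```

With groups shaped this way, a play can target an intersection such as `role_gpu:&site_eu_1` to limit a change to one role at one site.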

5.3 GitOps Flow

All infra changes:
1. PR with description, risk, and rollback plan.
2. Peer review by at least one council seat.
3. CI validation (lint, syntax checks, dry-runs where possible).
4. Merge → automated apply (or controlled pipelines with approval gates).
Secrets:
- Never stored in plain text.
- Clear guidance on rotation and access.
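
Step 1 of the flow above (a PR with description, risk, and rollback plan) can be validated automatically before a human reviews it. The sketch below assumes a PR template with Markdown headings named "Description", "Risk", and "Rollback plan". The template, the heading names, and the `missing_sections` helper are all hypothetical.

```python
import re

# Assumed PR template headings (compared case-insensitively).
REQUIRED_SECTIONS = ("description", "risk", "rollback plan")


def missing_sections(pr_body: str) -> list:
    """Return the required sections that have no matching heading in the PR body."""
    headings = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#+\s*(.+?)\s*$", pr_body, flags=re.MULTILINE)
    }
    return [section for section in REQUIRED_SECTIONS if section not in headings]


example_body = """\
## Description
Move the storage VLAN gateway to the new firewall pair.

## Risk
Medium: brief storage path flap during cutover.
"""

print(missing_sections(example_body))  # ['rollback plan']
```

A CI job could fail the pipeline when this list is non-empty, so reviewers only see PRs that already state their risk and rollback plan.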

6. Observability & Incident Management

6.1 Telemetry Stack

Metrics: Prometheus (+ exporters), visualized in Grafana.
Logs: ELK / Loki / Graylog (pick one and document).
Traces (if applicable): OpenTelemetry / Jaeger / Tempo, etc.

6.2 SLOs & SLIs

For each important service (API, internal platform, etc.):

Define SLIs (e.g., availability, latency, error rate).
Define SLOs (targets measured over 30-day windows).
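
An SLO target over a 30-day window implies an error budget. The sketch below shows the arithmetic for an availability SLO: a 99.9% target over 30 days allows about 43.2 minutes of unavailability. The function names and the example request counts are assumptions for illustration.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for an availability SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget left; negative once the SLO is breached."""
    if total_events == 0:
        return 1.0  # no traffic, nothing spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


print(round(error_budget_minutes(0.999), 1))                  # 43.2
print(round(budget_remaining(0.999, 999_500, 1_000_000), 2))  # 0.5
```

In the second call, 500 failed requests out of an allowed 1,000 leave half the budget unspent for the rest of the window.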

6.3 Alerting Strategy

Principles:
- Alerts must be actionable.
- Tie alerts to SLOs and clear runbooks.
Categories:
- Page: wake someone up (critical, user-impacting).
- Ticket: needs attention during working hours.
- Dashboard-only: informational.
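
The principles and categories above can be expressed as a small routing rule. This is a sketch of one possible policy, not a Prometheus or Alertmanager configuration: the `Alert` fields, the `route` function, and the choice to reject actionable alerts without a runbook are assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Route(Enum):
    PAGE = "page"
    TICKET = "ticket"
    DASHBOARD_ONLY = "dashboard-only"


@dataclass(frozen=True)
class Alert:
    name: str
    critical: bool
    user_impacting: bool
    actionable: bool
    runbook: Optional[str] = None  # path or link to the runbook, if any


def route(alert: Alert) -> Route:
    """Map an alert to one of the three categories."""
    if not alert.actionable:
        return Route.DASHBOARD_ONLY  # informational only
    if alert.runbook is None:
        # Assumed policy: an actionable alert without a runbook is a definition error.
        raise ValueError(f"alert {alert.name!r} is actionable but has no runbook")
    if alert.critical and alert.user_impacting:
        return Route.PAGE  # wake someone up
    return Route.TICKET    # handle during working hours


print(route(Alert("api-5xx-burn", True, True, True, "runbooks/api-5xx.md")))
# Route.PAGE
```

Raising on a missing runbook turns the "clear runbooks" principle into something a CI check on the alert definitions can enforce.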

6.4 Incident Response

Severity levels: SEV-1, SEV-2, SEV-3 (define examples).
Roles during incident:
- Incident Commander.
- Communications lead (internal & external).
- Domain experts (network, platform, etc.).
Timeline:
1. Detection & triage.
2. Containment / mitigation.
3. Root cause analysis.
4. Postmortem within X business days.
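
Once the value of X is agreed, the postmortem due date can be computed rather than tracked by hand. The sketch below counts business days as Monday to Friday and ignores public holidays, which is a simplifying assumption. The `postmortem_due` name and the example value of 3 days are illustrative only and do not set the policy.

```python
from datetime import date, timedelta


def postmortem_due(resolved_on: date, business_days: int) -> date:
    """Return the date that falls `business_days` working days after resolution."""
    if business_days < 0:
        raise ValueError("business_days must be non-negative")
    due = resolved_on
    remaining = business_days
    while remaining > 0:
        due += timedelta(days=1)
        if due.weekday() < 5:  # 0-4 are Monday to Friday
            remaining -= 1
    return due


# Incident resolved on Thursday 2025-12-04, with an example X of 3 business days.
print(postmortem_due(date(2025, 12, 4), 3))  # 2025-12-09
```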

6.5 Postmortems

Always blameless.
Template includes:
- What happened.
- Timeline.
- Impact.
- Contributing factors.
- What worked, what didn’t.
- Follow-up actions (with owners & due dates).

7. FinOps & Capacity Planning

7.1 Asset & Usage Tracking

Compute & GPU inventory (per node, per cluster, per tenant).
Storage usage by project/team.
Power consumption and rack utilization where possible.

7.2 Cost Model (Even On-Prem)

Components:
- Hardware amortization.
- Power & cooling.
- Licenses & support.
- Staff time (approximate).
Map costs to:
- Projects / teams.
- Environments (prod, staging, dev).
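
The components above combine into a simple monthly cost per node, which can then be split across teams by usage. The sketch below is one way to do the arithmetic. All prices, the 36-month amortization period, the use of a PUE factor to approximate cooling, and the function names are assumptions, not EUstartup figures.

```python
def monthly_node_cost(hardware_eur, amortization_months, avg_power_kw,
                      eur_per_kwh, pue=1.0, licenses_eur=0.0, staff_eur=0.0,
                      hours_per_month=730.0):
    """Monthly cost of one node: amortization + power & cooling + licenses + staff."""
    hardware = hardware_eur / amortization_months
    # PUE (power usage effectiveness) scales IT power to include cooling overhead.
    power_and_cooling = avg_power_kw * hours_per_month * eur_per_kwh * pue
    return hardware + power_and_cooling + licenses_eur + staff_eur


def allocate(total_cost, usage_by_team):
    """Split a cost across teams in proportion to their usage (e.g., GPU-hours)."""
    total_usage = sum(usage_by_team.values())
    if total_usage <= 0:
        raise ValueError("total usage must be positive")
    return {team: total_cost * usage / total_usage
            for team, usage in usage_by_team.items()}


# Hypothetical GPU node: 36,000 EUR over 36 months, 1 kW average draw.
cost = monthly_node_cost(36_000, 36, avg_power_kw=1.0, eur_per_kwh=0.25,
                         pue=1.4, licenses_eur=50)
print(round(cost, 2))                            # 1305.5
print(allocate(1000.0, {"ml": 3.0, "api": 1.0}))  # {'ml': 750.0, 'api': 250.0}
```

The same allocation can be keyed by environment (prod, staging, dev) instead of team by changing what the usage dictionary is grouped on.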

7.3 Capacity Planning Cycle

Monthly/quarterly review:
- Utilization vs. headroom.
- Forecast upcoming projects.
- Decide between new purchases and cloud burst.
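
The review items above reduce to two numbers: current headroom and how long until it runs out. The sketch below assumes linear growth and an 80% utilization ceiling. Both are assumptions chosen for illustration, as are the function names and the GPU counts.

```python
import math
from typing import Optional


def headroom(capacity: float, used: float) -> float:
    """Fraction of capacity that is still free."""
    return 1.0 - used / capacity


def months_until_ceiling(capacity: float, used: float, growth_per_month: float,
                         max_utilization: float = 0.8) -> Optional[int]:
    """Whole months until usage reaches the utilization ceiling under linear growth.

    Returns 0 if the ceiling is already reached and None if usage is not growing.
    """
    limit = capacity * max_utilization
    if used >= limit:
        return 0
    if growth_per_month <= 0:
        return None
    return math.ceil((limit - used) / growth_per_month)


# Hypothetical cluster: 64 GPUs, 40 in use, demand growing by 4 GPUs per month.
print(headroom(64, 40))                 # 0.375
print(months_until_ceiling(64, 40, 4))  # 3
```

Comparing the result with hardware lead time gives a simple trigger: if the months remaining are fewer than the months needed to procure and rack new nodes, the decision shifts toward cloud burst.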

8. Product & Developer Experience

8.1 User Profiles

Backend engineers (APIs, services).
Data engineers.
ML engineers / researchers.

8.2 Self-Service Interfaces

APIs and/or CLI for:
- Provisioning compute/storage.
- Requesting GPUs.
- Viewing usage and costs.
Templates:
- App scaffolding.
- CI/CD pipelines.
- Helm charts or deployment manifests.

8.3 Documentation & Onboarding

Single entrypoint: "How to use the platform" guide.
Checklists for:
- New service onboarding.
- Adding monitoring & alerting.
- Security reviews.

9. Talent & Remote Culture (When Active)

9.1 Role Design

Clear levels (e.g., SRE I/II/III, Principal).
Example responsibilities per level.

9.2 Hiring Process

Standard loop:
- Recruiter/intro.
- Technical screen (practical, scenario-based).
- Systems design / deep dive.
- Culture & collaboration interview.

9.3 Remote-First Norms

Async communication guidelines.
Incident handling across time zones.
Documentation as a first-class artifact.

10. Change Management & How to Update This Doc

10.1 Updating the Playbook

This doc lives in version control (/docs/platform/sre-devops-council.md or similar).
Changes follow the same PR + review flow as infra changes.
At least one of the Principal SRE, Platform Engineer, or DevEx Partner must approve each change.

10.2 Versioning & Changelog

Maintain a short changelog here:

v0.1 – Initial council-based structure for EUstartup.
v0.x – (add entries as you refine architecture, processes, etc.).

11. Open Questions & TODOs

Use this section as a backlog of design and process questions for the council.

Choose and document canonical logging stack (ELK vs Loki vs Graylog).
Finalize SLOs for core user-facing services.
Define GPU allocation policy per team.
Decide on single primary IaC repo vs multiple domain repos.
Document exact incident severity matrix & response SLAs.

Add more as they come up in council meetings.
