From 1611ca9f7d98e3684d4c27cdef642299f444723c Mon Sep 17 00:00:00 2001
From: sbanszky
Date: Thu, 4 Dec 2025 18:26:26 +0000
Subject: [PATCH] update, 14 roles at the round table

---
 sre-devops-council.md | 492 ++++--------------------------------------
 1 file changed, 37 insertions(+), 455 deletions(-)

diff --git a/sre-devops-council.md b/sre-devops-council.md
index 962dd36..40c5e07 100644
--- a/sre-devops-council.md
+++ b/sre-devops-council.md
@@ -1,456 +1,38 @@
-# EUstartup SRE / DevOps Council — Living Design Doc & Playbook
+graph LR
+    Council["AI Council OS — Man in the Middle\n(14-Seat Round Table)"]
+
+    Council --- SRE_Principal
+    Council --- BareMetal
+    Council --- VirtArch
+    Council --- OpenStackArch
+    Council --- NetArch
+    Council --- IaCLead
+    Council --- GitOps
+    Council --- Observability
+    Council --- SRE_Reliability
+    Council --- Security
+    Council --- Sovereign
+    Council --- Facility
+    Council --- CapacityPerf
+    Council --- LifecycleOps
+
+    %% Roles
+
+    SRE_Principal["Principal SRE/DevOps Architect"]
+    BareMetal["Bare-Metal Provisioning Lead\n(MAAS/Ironic/PXE)"]
+    VirtArch["Virtualization Architect\n(Proxmox/ESXi/KVM)"]
+    OpenStackArch["OpenStack Cloud Architect\n(Kolla/Neutron/Nova)"]
+    NetArch["Network Architect\n(Spine/Leaf/BGP/EVPN)"]
+    IaCLead["Automation & IaC Lead\n(Ansible/Terraform/Python SDK)"]
+    GitOps["CI/CD & GitOps Governance Lead"]
+    Observability["Observability & Telemetry Architect"]
+    SRE_Reliability["SRE Reliability Engineering Lead"]
+    Security["Security Architect\n(Zero Trust, Compliance)"]
+    Sovereign["Sovereign Compliance & Sustainability Lead\n(GDPR/EU Green)"]
+    Facility["Physical Infrastructure & Facility Engineering Lead\n(Power/Cooling/EN 50600)"]
+    CapacityPerf["Capacity & Performance Engineer"]
+    LifecycleOps["Platform Lifecycle & Operations Lead"]
+
+    %% Special close collaboration
+    Sovereign <--> Facility
 
-> **Status:** Draft v0.1
-> **Scope:** All infrastructure, SRE, platform, and DevEx work for **EUstartup**.
-> **Owner:** Principal SRE (or acting)
-> **Last Updated:**
-
----
-
-## 1. Purpose & North Star
-
-The **SRE / DevOps Council** is the decision and coordination forum for everything related to:
-
-* Infrastructure (bare-metal, virtualization, cloud, networking).
-* Reliability, scalability, and performance of EUstartup’s products.
-* Developer & ML engineer experience on the platform.
-* Cost, capacity, and operational excellence.
-
-**North Star:**
-
-> "Provide a secure, reliable, and cost-aware platform that enables EUstartup teams to ship and operate products quickly and safely."
-
-We use this document as a **living playbook**:
-
-* To clarify **who decides what**, and **how**.
-* To capture **current architecture** and **desired evolution**.
-* To document **standard operating procedures (SOPs)** and **runbooks**.
-* To onboard new engineers into how EUstartup runs infrastructure.
-
----
-
-## 2. Council Structure & Seats
-
-The council is a **virtual table** with the following seats. One person may hold multiple seats; a seat may be empty but is still a clearly defined responsibility.
-
-### 2.1 Seats at the Table
-
-1. **Principal SRE & Bare-Metal Architect**
-
-   * Owns end-to-end infra architecture & reliability.
-   * Focus: MAAS/Ironic, PXE/Preseed/cloud-init, Debian/Ubuntu image lifecycle, multi-site bare-metal design & failure modes.
-
-2. **Network & Security Architect**
-
-   * Owns network and security architecture.
-   * Focus: VLANs, L2/L3 routing, VPN/WAN/UniFi, segmentation, SSH/API hardening, remote admin security.
-
-3. **Platform & Virtualization Engineer**
-
-   * Owns virtualization and platform abstraction.
-   * Focus: OpenStack (Kolla-Ansible), Proxmox, ESXi, resource quotas, integration of bare-metal with virtualization.
-
-4. **Automation & GitOps Lead**
-
-   * Owns automation standards and GitOps workflows.
-   * Focus: Ansible, CI/CD, repos and branching models, secrets, Bash/Python tooling.
-
-5. **Observability & Incident Analyst**
-
-   * Owns telemetry stack and incident analysis.
-   * Focus: Prometheus/Grafana, ELK/Loki/Graylog, SLOs/SLIs, alerting, postmortems.
-
-6. **FinOps & Capacity Planner**
-
-   * Owns capacity planning & cost visibility.
-   * Focus: compute/GPU tracking, power & licenses, amortization, unit economics.
-
-7. **Product & Developer Experience Partner**
-
-   * Represents product teams and ML/dev users of the platform.
-   * Focus: self-service APIs, UX of tooling, docs, templates.
-
-8. **Talent & Remote-Culture Partner (conditional)**
-
-   * Added when hiring or changing roles.
-   * Focus: role definitions, interview loops, remote-first norms.
-
-### 2.2 Current Seat Assignments (EUstartup)
-
-> **Fill this table with actual names/aliases.**
-
-| Seat                                 | Primary | Backup | Notes                   |
-| ------------------------------------ | ------- | ------ | ----------------------- |
-| Principal SRE & Bare-Metal Architect | TBD     | TBD    |                         |
-| Network & Security Architect         | TBD     | TBD    |                         |
-| Platform & Virtualization Engineer   | TBD     | TBD    |                         |
-| Automation & GitOps Lead             | TBD     | TBD    |                         |
-| Observability & Incident Analyst     | TBD     | TBD    |                         |
-| FinOps & Capacity Planner            | TBD     | TBD    |                         |
-| Product & DevEx Partner              | TBD     | TBD    |                         |
-| Talent & Remote-Culture Partner      | TBD     | TBD    | Only active when hiring |
-
----
-
-## 3. Council Operating Model
-
-### 3.1 Cadence & Rituals
-
-**Weekly Council Sync (30–45 min)**
-
-* Goal: Review changes, risks, incidents, and upcoming work.
-* Attendees: All seats or their delegates.
-* Agenda template:
-
-  1. Quick round: changes shipped last week.
-  2. Incidents & reliability review (SLOs, recurring alerts).
-  3. Capacity & cost updates (any red flags?).
-  4. Upcoming infra changes / RFCs.
-  5. Developer experience feedback.
-
-**Monthly Architecture Review (60–90 min)**
-
-* Goal: Step back and review medium/long-term architecture and roadmap.
-* Focus: new cluster designs, major migrations, deprecations.
-
-**Postmortems (per major incident)**
-
-* Goal: Learn and adapt. No blame.
-* Outcome: updated runbooks, alerts, docs, or architecture decisions.
-
-### 3.2 Decision Types & Authority
-
-**Type A – Safety & Reliability** (e.g., SLOs, incident response, rollback policies)
-
-* Primary: Principal SRE & Observability.
-* Others consulted: Platform, Network/Security.
-
-**Type B – Architecture & Platform** (e.g., choosing OpenStack vs. Proxmox for a new workload)
-
-* Primary: Principal SRE & Platform Engineer.
-* Others consulted: Network/Security, Automation/GitOps, DevEx.
-
-**Type C – Process & Ways of Working** (e.g., Git branching model, incident process)
-
-* Primary: Automation/GitOps, Principal SRE.
-* Others consulted: DevEx, Talent & Remote-Culture.
-
-**Type D – Cost & Capacity** (e.g., when to buy more GPUs, rightsizing strategy)
-
-* Primary: FinOps & Capacity Planner.
-* Others consulted: Principal SRE, Platform, Product.
-
-### 3.3 RACI Overview (Simplified)
-
-> Customize as needed.
-
-| Area                            | R (Responsible)   | A (Accountable) | C (Consulted)           | I (Informed)      |
-| ------------------------------- | ----------------- | --------------- | ----------------------- | ----------------- |
-| Bare-metal architecture         | Principal SRE     | Principal SRE   | Platform, Network/Sec   | FinOps, DevEx     |
-| Network & security baseline     | Network/Sec       | Network/Sec     | Principal SRE           | Everyone          |
-| Virtualization platform choices | Platform          | Principal SRE   | FinOps, DevEx           | All product teams |
-| CI/CD & GitOps workflow         | Automation/GitOps | Principal SRE   | DevEx, Security         | All engineers     |
-| Observability stack             | Observability     | Principal SRE   | Platform, DevEx         | All teams         |
-| SLOs / SLIs                     | Observability     | Principal SRE   | Product, DevEx          | All               |
-| Capacity planning & purchases   | FinOps            | CTO / VP Eng    | Principal SRE, Platform | Finance           |
-| Developer self-service API      | DevEx             | Principal SRE   | Platform, Automation    | Product teams     |
-
----
-
-## 4. Platform & Architecture Overview
-
-> Keep this section updated as the platform evolves.
-
-### 4.1 High-Level Environment
-
-* **Regions / sites:**
-
-  * Primary DC: `eu-1` (location: …)
-  * Secondary DC / DR: `eu-2` (location: …)
-  * Optional: cloud provider (e.g., `eu-central-1`), used for burst / specific services.
-
-* **Primary workloads:**
-
-  * User-facing web/API services.
-  * Data pipelines and batch jobs.
-  * ML/AI training and inference with GPUs.
-
-### 4.2 Bare-Metal Layer (Principal SRE)
-
-* Provisioning stack:
-
-  * MAAS / Ironic for hardware management.
-  * PXE boot → Preseed / cloud-init for OS provisioning.
-  * Standard OS: Debian/Ubuntu LTS images.
-
-* Image lifecycle:
-
-  * Base image repo and versioning.
-  * Hardening baseline (SSH config, packages, security updates).
-  * Golden images for specific roles (compute, storage, GPU nodes, control-plane, infra-services).
-
-* Multi-site considerations:
-
-  * Separation of control (per-site MAAS controllers or shared?).
-  * Cross-site failover strategy and RTO/RPO targets.
-
-### 4.3 Network & Security (Network/Sec Architect)
-
-* Network layout:
-
-  * Core VLANs: `mgmt`, `storage`, `public`, `tenant`, `backup`, `out-of-band`.
-  * Routing: L3 boundaries and where firewalls apply.
-
-* Security:
-
-  * SSH & bastion host policy.
-  * VPN topology for remote admins.
-  * Zero-trust-ish: per-service authN/Z where feasible.
-
-### 4.4 Virtualization / Platform Layer (Platform Engineer)
-
-* Stacks in use:
-
-  * OpenStack (Kolla-Ansible) for multi-tenant compute/network/storage.
-  * Proxmox for infra VMs / special workloads.
-  * ESXi (if applicable) for legacy / vendor-specific needs.
-
-* Tenancy model:
-
-  * Projects per product team / environment.
-  * Resource quotas: CPU, RAM, storage, GPU.
-  * Naming, tags/labels, and chargeback/finops integration.
-
-### 4.5 Control Plane & Core Services
-
-List and describe:
-
-* Configuration management (Ansible structure, repos, roles).
-* CI/CD tooling (e.g., GitLab CI, GitHub Actions, Jenkins).
-* Secrets management (HashiCorp Vault / SOPS / KMS / etc.).
-* Artifact/image registry.
-* Observability stack.
-
----
-
-## 5. Automation & GitOps Standards
-
-### 5.1 Repositories & Branching
-
-* Infra as Code repos (examples):
-
-  * `infra-baremetal` — MAAS/Ironic configs, PXE, images.
-  * `infra-network` — network definitions, firewall rules (where possible).
-  * `infra-platform` — OpenStack/Proxmox configs, Kolla-Ansible.
-  * `infra-observability` — dashboards, alert rules.
-
-* Branch naming and policies:
-
-  * `main` is always deployable.
-  * Feature branches: `feat/-`.
-  * Hotfix branches: `fix/-`.
-
-### 5.2 Ansible & Idempotency
-
-* Role structure:
-
-  * Roles grouped by domain (e.g., `base_os`, `k8s_node`, `openstack_compute`).
-  * Strict idempotency: plays can be run repeatedly without side effects.
-
-* Inventories:
-
-  * Dynamic inventories from MAAS/OpenStack.
-  * Grouping by role and site.
-
-### 5.3 GitOps Flow
-
-* All infra changes:
-
-  1. PR with description, risk, and rollback plan.
-  2. Peer review by at least one council seat.
-  3. CI validation (lint, syntax checks, dry-runs where possible).
-  4. Merge → automated apply (or controlled pipelines with approval gates).
-
-* Secrets:
-
-  * Never stored in plain text.
-  * Clear guidance on rotation and access.
-
----
-
-## 6. Observability & Incident Management
-
-### 6.1 Telemetry Stack
-
-* Metrics: Prometheus (+ exporters), visualized in Grafana.
-* Logs: ELK / Loki / Graylog (pick one and document).
-* Traces (if applicable): OpenTelemetry / Jaeger / Tempo, etc.
-
-### 6.2 SLOs & SLIs
-
-For each important service (API, internal platform, etc.):
-
-* Define **SLIs** (e.g., availability, latency, error rate).
-* Define **SLOs** (targets over 30d windows).
-
-### 6.3 Alerting Strategy
-
-* Principles:
-
-  * Alerts must be actionable.
-  * Tie alerts to SLOs and clear runbooks.
-
-* Categories:
-
-  * Page: wake someone up (critical, user-impacting).
-  * Ticket: needs attention during working hours.
-  * Dashboard-only: informational.
-
-### 6.4 Incident Response
-
-* **Severity levels:** SEV-1, SEV-2, SEV-3 (define examples).
-
-* **Roles during incident:**
-
-  * Incident Commander.
-  * Communications lead (internal & external).
-  * Domain experts (network, platform, etc.).
-
-* **Timeline:**
-
-  1. Detection & triage.
-  2. Containment / mitigation.
-  3. Root cause analysis.
-  4. Postmortem within X business days.
-
-### 6.5 Postmortems
-
-* Always blameless.
-* Template includes:
-
-  * What happened.
-  * Timeline.
-  * Impact.
-  * Contributing factors.
-  * What worked, what didn’t.
-  * Follow-up actions (with owners & due dates).
-
----
-
-## 7. FinOps & Capacity Planning
-
-### 7.1 Asset & Usage Tracking
-
-* Compute & GPU inventory (per node, per cluster, per tenant).
-* Storage usage by project/team.
-* Power consumption and rack utilization where possible.
-
-### 7.2 Cost Model (Even On-Prem)
-
-* Components:
-
-  * Hardware amortization.
-  * Power & cooling.
-  * Licenses & support.
-  * Staff time (approximate).
-
-* Map costs to:
-
-  * Projects / teams.
-  * Environments (prod, staging, dev).
-
-### 7.3 Capacity Planning Cycle
-
-* Monthly/quarterly review:
-
-  * Utilization vs. headroom.
-  * Forecast upcoming projects.
-  * Decide on new purchases vs cloud burst.
-
----
-
-## 8. Product & Developer Experience
-
-### 8.1 User Profiles
-
-* Backend engineers (APIs, services).
-* Data engineers.
-* ML engineers / researchers.
-
-### 8.2 Self-Service Interfaces
-
-* APIs and/or CLI for:
-
-  * Provisioning compute/storage.
-  * Requesting GPUs.
-  * Viewing usage and costs.
-
-* Templates:
-
-  * App scaffolding.
-  * CI/CD pipelines.
-  * Helm charts or deployment manifests.
-
-### 8.3 Documentation & Onboarding
-
-* Single entrypoint: "How to use the platform" guide.
-* Checklists for:
-
-  * New service onboarding.
-  * Adding monitoring & alerting.
-  * Security reviews.
-
----
-
-## 9. Talent & Remote Culture (When Active)
-
-### 9.1 Role Design
-
-* Clear levels (e.g., SRE I/II/III, Principal).
-* Example responsibilities per level.
-
-### 9.2 Hiring Process
-
-* Standard loop:
-
-  * Recruiter/intro.
-  * Technical screen (practical, scenario-based).
-  * Systems design / deep dive.
-  * Culture & collaboration interview.
-
-### 9.3 Remote-First Norms
-
-* Async communication guidelines.
-* Incident handling across time zones.
-* Documentation as a first-class artifact.
-
----
-
-## 10. Change Management & How to Update This Doc
-
-### 10.1 Updating the Playbook
-
-* This doc lives in version control (`/docs/platform/sre-devops-council.md` or similar).
-* Changes follow the same **PR + review** flow as infra changes.
-* At least one of: Principal SRE, Platform Engineer, or DevEx Partner must approve.
-
-### 10.2 Versioning & Changelog
-
-Maintain a short changelog here:
-
-* `v0.1` – Initial council-based structure for EUstartup.
-* `v0.x` – (add entries as you refine architecture, processes, etc.).
-
----
-
-## 11. Open Questions & TODOs
-
-Use this section as a backlog of design and process questions for the council.
-
-* [ ] Choose and document canonical logging stack (ELK vs Loki vs Graylog).
-* [ ] Finalize SLOs for core user-facing services.
-* [ ] Define GPU allocation policy per team.
-* [ ] Decide on single primary IaC repo vs multiple domain repos.
-* [ ] Document exact incident severity matrix & response SLAs.
-
-Add more as they come up in council meetings.