update, 14 roles at the round table

2025-12-04 18:26:26 +00:00
parent cff3d0c3f8
commit 1611ca9f7d
1 changed files with 37 additions and 455 deletions
--- a/sre-devops-council.md
+++ b/sre-devops-council.md
@@ -1,456 +1,38 @@
-# EUstartup SRE / DevOps Council — Living Design Doc & Playbook
+graph LR
    Council["AI Council OS — Man in the Middle\n(14-Seat Round Table)"]
    Council --- SRE_Principal
    Council --- BareMetal
    Council --- VirtArch
    Council --- OpenStackArch
    Council --- NetArch
    Council --- IaCLead
    Council --- GitOps
    Council --- Observability
    Council --- SRE_Reliability
    Council --- Security
    Council --- Sovereign
    Council --- Facility
    Council --- CapacityPerf
    Council --- LifecycleOps
    %% Roles
    SRE_Principal["Principal SRE/DevOps Architect"]
    BareMetal["Bare-Metal Provisioning Lead\n(MAAS/Ironic/PXE)"]
    VirtArch["Virtualization Architect\n(Proxmox/ESXi/KVM)"]
    OpenStackArch["OpenStack Cloud Architect\n(Kolla/Neutron/Nova)"]
    NetArch["Network Architect\n(Spine/Leaf/BGP/EVPN)"]
    IaCLead["Automation & IaC Lead\n(Ansible/Terraform/Python SDK)"]
    GitOps["CI/CD & GitOps Governance Lead"]
    Observability["Observability & Telemetry Architect"]
    SRE_Reliability["SRE Reliability Engineering Lead"]
    Security["Security Architect\n(Zero Trust, Compliance)"]
    Sovereign["Sovereign Compliance & Sustainability Lead\n(GDPR/EU Green)"]
    Facility["Physical Infrastructure & Facility Engineering Lead\n(Power/Cooling/EN 50600)"]
    CapacityPerf["Capacity & Performance Engineer"]
    LifecycleOps["Platform Lifecycle & Operations Lead"]
    %% Special close collaboration
    Sovereign <--> Facility
 > **Status:** Draft v0.1
 > **Scope:** All infrastructure, SRE, platform, and DevEx work for **EUstartup**.
 > **Owner:** Principal SRE (or acting)
 > **Last Updated:** <!-- update date here -->
 ---
 ## 1. Purpose & North Star
 The **SRE / DevOps Council** is the decision and coordination forum for everything related to:
 * Infrastructure (bare-metal, virtualization, cloud, networking).
 * Reliability, scalability, and performance of EUstartup’s products.
 * Developer & ML engineer experience on the platform.
 * Cost, capacity, and operational excellence.
 **North Star:**
 > "Provide a secure, reliable, and cost-aware platform that enables EUstartup teams to ship and operate products quickly and safely."
 We use this document as a **living playbook**:
 * To clarify **who decides what**, and **how**.
 * To capture **current architecture** and **desired evolution**.
 * To document **standard operating procedures (SOPs)** and **runbooks**.
 * To onboard new engineers into how EUstartup runs infrastructure.
 ---
 ## 2. Council Structure & Seats
 The council is a **virtual table** with the following seats. One person may hold multiple seats; a seat may be vacant, but its responsibility remains clearly defined.
 ### 2.1 Seats at the Table
 1. **Principal SRE & Bare-Metal Architect**
   * Owns end-to-end infra architecture & reliability.
   * Focus: MAAS/Ironic, PXE/Preseed/cloud-init, Debian/Ubuntu image lifecycle, multi-site bare-metal design & failure modes.
 2. **Network & Security Architect**
   * Owns network and security architecture.
   * Focus: VLANs, L2/L3 routing, VPN/WAN/UniFi, segmentation, SSH/API hardening, remote admin security.
 3. **Platform & Virtualization Engineer**
   * Owns virtualization and platform abstraction.
   * Focus: OpenStack (Kolla-Ansible), Proxmox, ESXi, resource quotas, integration of bare-metal with virtualization.
 4. **Automation & GitOps Lead**
   * Owns automation standards and GitOps workflows.
   * Focus: Ansible, CI/CD, repos and branching models, secrets, Bash/Python tooling.
 5. **Observability & Incident Analyst**
   * Owns telemetry stack and incident analysis.
   * Focus: Prometheus/Grafana, ELK/Loki/Graylog, SLOs/SLIs, alerting, postmortems.
 6. **FinOps & Capacity Planner**
   * Owns capacity planning & cost visibility.
   * Focus: compute/GPU tracking, power & licenses, amortization, unit economics.
 7. **Product & Developer Experience Partner**
   * Represents product teams and ML/dev users of the platform.
   * Focus: self-service APIs, UX of tooling, docs, templates.
 8. **Talent & Remote-Culture Partner (conditional)**
   * Added when hiring or changing roles.
   * Focus: role definitions, interview loops, remote-first norms.
 ### 2.2 Current Seat Assignments (EUstartup)
 > **Fill this table with actual names/aliases.**
 | Seat                                 | Primary | Backup | Notes                   |
 | ------------------------------------ | ------- | ------ | ----------------------- |
 | Principal SRE & Bare-Metal Architect | TBD     | TBD    |                         |
 | Network & Security Architect         | TBD     | TBD    |                         |
 | Platform & Virtualization Engineer   | TBD     | TBD    |                         |
 | Automation & GitOps Lead             | TBD     | TBD    |                         |
 | Observability & Incident Analyst     | TBD     | TBD    |                         |
 | FinOps & Capacity Planner            | TBD     | TBD    |                         |
 | Product & DevEx Partner              | TBD     | TBD    |                         |
 | Talent & Remote-Culture Partner      | TBD     | TBD    | Only active when hiring |
 ---
 ## 3. Council Operating Model
 ### 3.1 Cadence & Rituals
 **Weekly Council Sync (30–45 min)**
 * Goal: Review changes, risks, incidents, and upcoming work.
 * Attendees: All seats or their delegates.
 * Agenda template:
  1. Quick round: changes shipped last week.
  2. Incidents & reliability review (SLOs, recurring alerts).
  3. Capacity & cost updates (any red flags?).
  4. Upcoming infra changes / RFCs.
  5. Developer experience feedback.
 **Monthly Architecture Review (60–90 min)**
 * Goal: Step back and review medium/long-term architecture and roadmap.
 * Focus: new cluster designs, major migrations, deprecations.
 **Postmortems (per major incident)**
 * Goal: Learn and adapt. No blame.
 * Outcome: updated runbooks, alerts, docs, or architecture decisions.
 ### 3.2 Decision Types & Authority
 **Type A – Safety & Reliability** (e.g., SLOs, incident response, rollback policies)
 * Primary: Principal SRE & Observability.
 * Others consulted: Platform, Network/Security.
 **Type B – Architecture & Platform** (e.g., choosing OpenStack vs. Proxmox for a new workload)
 * Primary: Principal SRE & Platform Engineer.
 * Others consulted: Network/Security, Automation/GitOps, DevEx.
 **Type C – Process & Ways of Working** (e.g., Git branching model, incident process)
 * Primary: Automation/GitOps, Principal SRE.
 * Others consulted: DevEx, Talent & Remote-Culture.
 **Type D – Cost & Capacity** (e.g., when to buy more GPUs, rightsizing strategy)
 * Primary: FinOps & Capacity Planner.
 * Others consulted: Principal SRE, Platform, Product.
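The four decision types above can be kept as data so that tooling (for example a PR bot) can look up who must approve and who should be consulted. The following is a minimal sketch only: the `DecisionType` class and the helper names are hypothetical, and the seat labels are shortened versions of the ones used in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DecisionType:
    """One row of the decision-authority model (hypothetical structure)."""
    name: str
    primary: tuple[str, ...]
    consulted: tuple[str, ...]


# Transcribed from section 3.2; seat labels are abbreviated for illustration.
DECISION_TYPES = {
    "A": DecisionType("Safety & Reliability",
                      ("Principal SRE", "Observability"),
                      ("Platform", "Network/Security")),
    "B": DecisionType("Architecture & Platform",
                      ("Principal SRE", "Platform"),
                      ("Network/Security", "Automation/GitOps", "DevEx")),
    "C": DecisionType("Process & Ways of Working",
                      ("Automation/GitOps", "Principal SRE"),
                      ("DevEx", "Talent & Remote-Culture")),
    "D": DecisionType("Cost & Capacity",
                      ("FinOps",),
                      ("Principal SRE", "Platform", "Product")),
}


def required_approvers(decision_type: str) -> tuple[str, ...]:
    """Seats that hold primary authority for the given decision type."""
    return DECISION_TYPES[decision_type].primary


def seats_to_notify(decision_type: str) -> tuple[str, ...]:
    """Primary seats followed by consulted seats."""
    entry = DECISION_TYPES[decision_type]
    return entry.primary + entry.consulted
```

For instance, `required_approvers("D")` returns only the FinOps seat, while `seats_to_notify("B")` also includes the consulted seats for architecture decisions.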
 ### 3.3 RACI Overview (Simplified)
 > Customize as needed.
 | Area                            | R (Responsible)   | A (Accountable) | C (Consulted)           | I (Informed)      |
 | ------------------------------- | ----------------- | --------------- | ----------------------- | ----------------- |
 | Bare-metal architecture         | Principal SRE     | Principal SRE   | Platform, Network/Sec   | FinOps, DevEx     |
 | Network & security baseline     | Network/Sec       | Network/Sec     | Principal SRE           | Everyone          |
 | Virtualization platform choices | Platform          | Principal SRE   | FinOps, DevEx           | All product teams |
 | CI/CD & GitOps workflow         | Automation/GitOps | Principal SRE   | DevEx, Security         | All engineers     |
 | Observability stack             | Observability     | Principal SRE   | Platform, DevEx         | All teams         |
 | SLOs / SLIs                     | Observability     | Principal SRE   | Product, DevEx          | All teams         |
 | Capacity planning & purchases   | FinOps            | CTO / VP Eng    | Principal SRE, Platform | Finance           |
 | Developer self-service API      | DevEx             | Principal SRE   | Platform, Automation    | Product teams     |
 ---
 ## 4. Platform & Architecture Overview
 > Keep this section updated as the platform evolves.
 ### 4.1 High-Level Environment
 * **Regions / sites:**
  * Primary DC: `eu-1` (location: …)
  * Secondary DC / DR: `eu-2` (location: …)
  * Optional: cloud provider (e.g., `eu-central-1`), used for burst / specific services.
 * **Primary workloads:**
  * User-facing web/API services.
  * Data pipelines and batch jobs.
  * ML/AI training and inference with GPUs.
 ### 4.2 Bare-Metal Layer (Principal SRE)
 * Provisioning stack:
  * MAAS / Ironic for hardware management.
  * PXE boot → Preseed / cloud-init for OS provisioning.
  * Standard OS: Debian/Ubuntu LTS images.
 * Image lifecycle:
  * Base image repo and versioning.
  * Hardening baseline (SSH config, packages, security updates).
  * Golden images for specific roles (compute, storage, GPU nodes, control-plane, infra-services).
 * Multi-site considerations:
  * Separation of control (per-site MAAS controllers or shared?).
  * Cross-site failover strategy and RTO/RPO targets.
 ### 4.3 Network & Security (Network/Sec Architect)
 * Network layout:
  * Core VLANs: `mgmt`, `storage`, `public`, `tenant`, `backup`, `out-of-band`.
  * Routing: L3 boundaries and where firewalls apply.
 * Security:
  * SSH & bastion host policy.
  * VPN topology for remote admins.
  * Zero-trust principles: per-service authN/Z where feasible.
 ### 4.4 Virtualization / Platform Layer (Platform Engineer)
 * Stacks in use:
  * OpenStack (Kolla-Ansible) for multi-tenant compute/network/storage.
  * Proxmox for infra VMs / special workloads.
  * ESXi (if applicable) for legacy / vendor-specific needs.
 * Tenancy model:
  * Projects per product team / environment.
  * Resource quotas: CPU, RAM, storage, GPU.
  * Naming, tags/labels, and chargeback/FinOps integration.
 ### 4.5 Control Plane & Core Services
 List and describe:
 * Configuration management (Ansible structure, repos, roles).
 * CI/CD tooling (e.g., GitLab CI, GitHub Actions, Jenkins).
 * Secrets management (HashiCorp Vault / SOPS / KMS / etc.).
 * Artifact/image registry.
 * Observability stack.
 ---
 ## 5. Automation & GitOps Standards
 ### 5.1 Repositories & Branching
 * Infra as Code repos (examples):
  * `infra-baremetal` — MAAS/Ironic configs, PXE, images.
  * `infra-network` — network definitions, firewall rules (where possible).
  * `infra-platform` — OpenStack/Proxmox configs, Kolla-Ansible.
  * `infra-observability` — dashboards, alert rules.
 * Branch naming and policies:
  * `main` is always deployable.
  * Feature branches: `feat/<area>-<shortdesc>`.
  * Hotfix branches: `fix/<incidentid>-<shortdesc>`.
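The branch naming policy above could be enforced in CI with a small check. This is a sketch under stated assumptions: the character sets are not defined by this document, so the patterns assume `<area>` is a single lowercase alphanumeric token, `<incidentid>` is alphanumeric, and `<shortdesc>` is lowercase with hyphens. The function name `is_valid_branch` is hypothetical.

```python
import re

# Assumed formats (not specified in the playbook):
#   feat/<area>-<shortdesc>        e.g. feat/network-add-vlan
#   fix/<incidentid>-<shortdesc>   e.g. fix/INC123-rollback-kolla
FEATURE_RE = re.compile(r"^feat/[a-z0-9]+-[a-z0-9][a-z0-9-]*$")
HOTFIX_RE = re.compile(r"^fix/[A-Za-z0-9]+-[a-z0-9][a-z0-9-]*$")


def is_valid_branch(name: str) -> bool:
    """Return True if the branch name follows the naming policy."""
    if name == "main":
        return True
    return bool(FEATURE_RE.match(name) or HOTFIX_RE.match(name))
```

A CI job could call `is_valid_branch` on the source branch of each PR and fail early when the name does not match.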
 ### 5.2 Ansible & Idempotency
 * Role structure:
  * Roles grouped by domain (e.g., `base_os`, `k8s_node`, `openstack_compute`).
  * Strict idempotency: plays can be run repeatedly and converge on the same state, reporting no changes after the first successful run.
 * Inventories:
  * Dynamic inventories from MAAS/OpenStack.
  * Grouping by role and site.
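To illustrate what idempotency means in practice, here is a minimal Python sketch of an "ensure" operation of the kind used in Bash/Python tooling alongside Ansible. It is not Ansible code, and the function name `ensure_line` is hypothetical. The function checks the current state first and reports a change only when it actually modified something.

```python
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Ensure `line` is present in the file; return True only if the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        # Desired state already holds: do nothing and report "unchanged".
        return False
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True
```

Running it twice with the same arguments changes the file once: the first call returns `True`, the second returns `False` and leaves the file untouched.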
 ### 5.3 GitOps Flow
 * All infra changes:
  1. PR with description, risk, and rollback plan.
  2. Peer review by at least one council seat.
  3. CI validation (lint, syntax checks, dry-runs where possible).
  4. Merge → automated apply (or controlled pipelines with approval gates).
 * Secrets:
  * Never stored in plain text.
  * Clear guidance on rotation and access.
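Step 1 of the flow above (a PR with description, risk, and rollback plan) can be checked automatically before review. The sketch below assumes PR bodies use Markdown headings with exactly these section titles; both the titles and the function name `missing_sections` are assumptions for illustration, not an existing convention.

```python
import re

# Assumed heading titles for an infra PR template (hypothetical).
REQUIRED_SECTIONS = ("Description", "Risk", "Rollback plan")


def missing_sections(pr_body: str) -> list[str]:
    """Return the required sections that have no Markdown heading in the PR body."""
    found = {
        match.group(1).strip().lower()
        for match in re.finditer(r"^#+\s*(.+?)\s*$", pr_body, re.MULTILINE)
    }
    return [section for section in REQUIRED_SECTIONS if section.lower() not in found]
```

A CI validation step could fail the pipeline when the returned list is non-empty and print the missing section names for the author.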
 ---
 ## 6. Observability & Incident Management
 ### 6.1 Telemetry Stack
 * Metrics: Prometheus (+ exporters), visualized in Grafana.
 * Logs: ELK / Loki / Graylog (pick one and document).
 * Traces (if applicable): OpenTelemetry / Jaeger / Tempo, etc.
 ### 6.2 SLOs & SLIs
 For each important service (API, internal platform, etc.):
 * Define **SLIs** (e.g., availability, latency, error rate).
 * Define **SLOs** (targets for each SLI, measured over 30-day windows).
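An availability SLO over a 30-day window implies an error budget, which is often easier to reason about than the percentage itself. The sketch below assumes a time-based availability SLI measured in minutes; the function names are hypothetical and the 99.9% figure is only an example, not a target set by this document.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed minutes of unavailability for an SLO target such as 0.999."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, bad_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - bad_minutes) / budget
```

With an example target of 99.9% over 30 days the budget is about 43.2 minutes, so an incident with roughly 21.6 minutes of downtime leaves about half of the budget.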
 ### 6.3 Alerting Strategy
 * Principles:
  * Alerts must be actionable.
  * Tie alerts to SLOs and clear runbooks.
 * Categories:
  * Page: wake someone up (critical, user-impacting).
  * Ticket: needs attention during working hours.
  * Dashboard-only: informational.
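The three alert categories above amount to a small routing rule. The following sketch shows one possible encoding; the input flags (`critical`, `user_impacting`, `actionable`) are assumptions about what an alert definition would carry, and the names `Route` and `route_alert` are hypothetical.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"
    TICKET = "ticket"
    DASHBOARD = "dashboard-only"


def route_alert(critical: bool, user_impacting: bool, actionable: bool) -> Route:
    """Pick a category: page for critical user impact, ticket for other actionable work."""
    if not actionable:
        # Non-actionable signals stay informational.
        return Route.DASHBOARD
    if critical and user_impacting:
        return Route.PAGE
    return Route.TICKET
```

Keeping the rule in one place makes it easier to review whether a new alert really deserves to wake someone up.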
 ### 6.4 Incident Response
 * **Severity levels:** SEV-1, SEV-2, SEV-3 (define examples).
 * **Roles during incident:**
  * Incident Commander.
  * Communications lead (internal & external).
  * Domain experts (network, platform, etc.).
 * **Timeline:**
  1. Detection & triage.
  2. Containment / mitigation.
  3. Root cause analysis.
  4. Postmortem within X business days.
 ### 6.5 Postmortems
 * Always blameless.
 * Template includes:
  * What happened.
  * Timeline.
  * Impact.
  * Contributing factors.
  * What worked, what didn’t.
  * Follow-up actions (with owners & due dates).
 ---
 ## 7. FinOps & Capacity Planning
 ### 7.1 Asset & Usage Tracking
 * Compute & GPU inventory (per node, per cluster, per tenant).
 * Storage usage by project/team.
 * Power consumption and rack utilization where possible.
 ### 7.2 Cost Model (Even On-Prem)
 * Components:
  * Hardware amortization.
  * Power & cooling.
  * Licenses & support.
  * Staff time (approximate).
 * Map costs to:
  * Projects / teams.
  * Environments (prod, staging, dev).
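As an illustration of the cost model above, the sketch below computes a monthly cost per node from the listed components and splits it across teams by usage share. It assumes straight-line amortization and a proportional split, neither of which is prescribed by this document; the class and function names and all figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeCost:
    """Monthly cost inputs for one node (illustrative fields)."""
    hardware_price: float
    amortization_months: int
    power_cooling_monthly: float
    licenses_monthly: float
    staff_monthly: float

    def monthly_total(self) -> float:
        # Straight-line amortization: purchase price spread evenly over the period.
        amortized = self.hardware_price / self.amortization_months
        return (amortized + self.power_cooling_monthly
                + self.licenses_monthly + self.staff_monthly)


def allocate(monthly_cost: float, usage_share: dict[str, float]) -> dict[str, float]:
    """Split a monthly cost across teams in proportion to their usage."""
    total = sum(usage_share.values())
    if total <= 0:
        raise ValueError("usage_share must contain positive usage")
    return {team: monthly_cost * share / total for team, share in usage_share.items()}
```

For example, a node bought for 36000 and amortized over 36 months with 500 per month of running costs totals 1500 per month; with usage shares of 3 to 1, the two teams would be charged 1125 and 375.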
 ### 7.3 Capacity Planning Cycle
 * Monthly/quarterly review:
  * Utilization vs. headroom.
  * Forecast upcoming projects.
  * Decide on new purchases vs. cloud burst.
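The utilization-versus-headroom review can be supported by a simple projection. This sketch assumes linear growth in usage, which is a simplification, and uses hypothetical function names; units can be vCPUs, GPUs, or TB as long as they are consistent.

```python
from typing import Optional


def headroom(capacity: float, used: float) -> float:
    """Fraction of capacity that is still free."""
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    return (capacity - used) / capacity


def months_until_exhausted(capacity: float, used: float,
                           monthly_growth: float) -> Optional[float]:
    """Months until usage reaches capacity under linear growth; None if usage is not growing."""
    if monthly_growth <= 0:
        return None
    return max(capacity - used, 0.0) / monthly_growth
```

If the projected runway is shorter than the hardware procurement lead time, that is a signal to raise the purchase-versus-burst decision at the next review.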
 ---
 ## 8. Product & Developer Experience
 ### 8.1 User Profiles
 * Backend engineers (APIs, services).
 * Data engineers.
 * ML engineers / researchers.
 ### 8.2 Self-Service Interfaces
 * APIs and/or CLI for:
  * Provisioning compute/storage.
  * Requesting GPUs.
  * Viewing usage and costs.
 * Templates:
  * App scaffolding.
  * CI/CD pipelines.
  * Helm charts or deployment manifests.
 ### 8.3 Documentation & Onboarding
 * Single entrypoint: "How to use the platform" guide.
 * Checklists for:
  * New service onboarding.
  * Adding monitoring & alerting.
  * Security reviews.
 ---
 ## 9. Talent & Remote Culture (When Active)
 ### 9.1 Role Design
 * Clear levels (e.g., SRE I/II/III, Principal).
 * Example responsibilities per level.
 ### 9.2 Hiring Process
 * Standard loop:
  * Recruiter/intro.
  * Technical screen (practical, scenario-based).
  * Systems design / deep dive.
  * Culture & collaboration interview.
 ### 9.3 Remote-First Norms
 * Async communication guidelines.
 * Incident handling across time zones.
 * Documentation as a first-class artifact.
 ---
 ## 10. Change Management & How to Update This Doc
 ### 10.1 Updating the Playbook
 * This doc lives in version control (`/docs/platform/sre-devops-council.md` or similar).
 * Changes follow the same **PR + review** flow as infra changes.
 * At least one of: Principal SRE, Platform Engineer, or DevEx Partner must approve.
 ### 10.2 Versioning & Changelog
 Maintain a short changelog here:
 * `v0.1` – Initial council-based structure for EUstartup.
 * `v0.x` – (add entries as you refine architecture, processes, etc.).
 ---
 ## 11. Open Questions & TODOs
 Use this section as a backlog of design and process questions for the council.
 * [ ] Choose and document canonical logging stack (ELK vs Loki vs Graylog).
 * [ ] Finalize SLOs for core user-facing services.
 * [ ] Define GPU allocation policy per team.
 * [ ] Decide on single primary IaC repo vs multiple domain repos.
 * [ ] Document exact incident severity matrix & response SLAs.
 Add more as they come up in council meetings.