From 1611ca9f7d98e3684d4c27cdef642299f444723c Mon Sep 17 00:00:00 2001
From: sbanszky <sbanszky@noreply.localhost>
Date: Thu, 4 Dec 2025 18:26:26 +0000
Subject: [PATCH] Replace council playbook with 14-seat round-table graph

---
 sre-devops-council.md | 492 ++++--------------------------------------
 1 file changed, 37 insertions(+), 455 deletions(-)

diff --git a/sre-devops-council.md b/sre-devops-council.md
index 962dd36..40c5e07 100644
--- a/sre-devops-council.md
+++ b/sre-devops-council.md
@@ -1,456 +1,38 @@
-# EUstartup SRE / DevOps Council — Living Design Doc & Playbook
+graph LR
+    Council["AI Council OS — Man in the Middle\n(14-Seat Round Table)"]
+
+    Council --- SRE_Principal
+    Council --- BareMetal
+    Council --- VirtArch
+    Council --- OpenStackArch
+    Council --- NetArch
+    Council --- IaCLead
+    Council --- GitOps
+    Council --- Observability
+    Council --- SRE_Reliability
+    Council --- Security
+    Council --- Sovereign
+    Council --- Facility
+    Council --- CapacityPerf
+    Council --- LifecycleOps
+
+    %% Roles
+
+    SRE_Principal["Principal SRE/DevOps Architect"]
+    BareMetal["Bare-Metal Provisioning Lead\n(MAAS/Ironic/PXE)"]
+    VirtArch["Virtualization Architect\n(Proxmox/ESXi/KVM)"]
+    OpenStackArch["OpenStack Cloud Architect\n(Kolla/Neutron/Nova)"]
+    NetArch["Network Architect\n(Spine/Leaf/BGP/EVPN)"]
+    IaCLead["Automation & IaC Lead\n(Ansible/Terraform/Python SDK)"]
+    GitOps["CI/CD & GitOps Governance Lead"]
+    Observability["Observability & Telemetry Architect"]
+    SRE_Reliability["SRE Reliability Engineering Lead"]
+    Security["Security Architect\n(Zero Trust, Compliance)"]
+    Sovereign["Sovereign Compliance & Sustainability Lead\n(GDPR/EU Green)"]
+    Facility["Physical Infrastructure & Facility Engineering Lead\n(Power/Cooling/EN 50600)"]
+    CapacityPerf["Capacity & Performance Engineer"]
+    LifecycleOps["Platform Lifecycle & Operations Lead"]
+
+    %% Special close collaboration
+    Sovereign <--> Facility
 
-> **Status:** Draft v0.1
-> **Scope:** All infrastructure, SRE, platform, and DevEx work for **EUstartup**.
-> **Owner:** Principal SRE (or acting)
-> **Last Updated:** <!-- update date here -->
-
----
-
-## 1. Purpose & North Star
-
-The **SRE / DevOps Council** is the decision and coordination forum for everything related to:
-
-* Infrastructure (bare-metal, virtualization, cloud, networking).
-* Reliability, scalability, and performance of EUstartup’s products.
-* Developer & ML engineer experience on the platform.
-* Cost, capacity, and operational excellence.
-
-**North Star:**
-
-> "Provide a secure, reliable, and cost-aware platform that enables EUstartup teams to ship and operate products quickly and safely."
-
-We use this document as a **living playbook**:
-
-* To clarify **who decides what**, and **how**.
-* To capture **current architecture** and **desired evolution**.
-* To document **standard operating procedures (SOPs)** and **runbooks**.
-* To onboard new engineers into how EUstartup runs infrastructure.
-
----
-
-## 2. Council Structure & Seats
-
-The council is a **virtual table** with the following seats. One person may hold multiple seats; a seat may be empty but is still a clearly defined responsibility.
-
-### 2.1 Seats at the Table
-
-1. **Principal SRE & Bare-Metal Architect**
-
-   * Owns end-to-end infra architecture & reliability.
-   * Focus: MAAS/Ironic, PXE/Preseed/cloud-init, Debian/Ubuntu image lifecycle, multi-site bare-metal design & failure modes.
-
-2. **Network & Security Architect**
-
-   * Owns network and security architecture.
-   * Focus: VLANs, L2/L3 routing, VPN/WAN/UniFi, segmentation, SSH/API hardening, remote admin security.
-
-3. **Platform & Virtualization Engineer**
-
-   * Owns virtualization and platform abstraction.
-   * Focus: OpenStack (Kolla-Ansible), Proxmox, ESXi, resource quotas, integration of bare-metal with virtualization.
-
-4. **Automation & GitOps Lead**
-
-   * Owns automation standards and GitOps workflows.
-   * Focus: Ansible, CI/CD, repos and branching models, secrets, Bash/Python tooling.
-
-5. **Observability & Incident Analyst**
-
-   * Owns telemetry stack and incident analysis.
-   * Focus: Prometheus/Grafana, ELK/Loki/Graylog, SLOs/SLIs, alerting, postmortems.
-
-6. **FinOps & Capacity Planner**
-
-   * Owns capacity planning & cost visibility.
-   * Focus: compute/GPU tracking, power & licenses, amortization, unit economics.
-
-7. **Product & Developer Experience Partner**
-
-   * Represents product teams and ML/dev users of the platform.
-   * Focus: self-service APIs, UX of tooling, docs, templates.
-
-8. **Talent & Remote-Culture Partner (conditional)**
-
-   * Added when hiring or changing roles.
-   * Focus: role definitions, interview loops, remote-first norms.
-
-### 2.2 Current Seat Assignments (EUstartup)
-
-> **Fill this table with actual names/aliases.**
-
-| Seat                                 | Primary | Backup | Notes                   |
-| ------------------------------------ | ------- | ------ | ----------------------- |
-| Principal SRE & Bare-Metal Architect | TBD     | TBD    |                         |
-| Network & Security Architect         | TBD     | TBD    |                         |
-| Platform & Virtualization Engineer   | TBD     | TBD    |                         |
-| Automation & GitOps Lead             | TBD     | TBD    |                         |
-| Observability & Incident Analyst     | TBD     | TBD    |                         |
-| FinOps & Capacity Planner            | TBD     | TBD    |                         |
-| Product & DevEx Partner              | TBD     | TBD    |                         |
-| Talent & Remote-Culture Partner      | TBD     | TBD    | Only active when hiring |
-
----
-
-## 3. Council Operating Model
-
-### 3.1 Cadence & Rituals
-
-**Weekly Council Sync (30–45 min)**
-
-* Goal: Review changes, risks, incidents, and upcoming work.
-* Attendees: All seats or their delegates.
-* Agenda template:
-
-  1. Quick round: changes shipped last week.
-  2. Incidents & reliability review (SLOs, recurring alerts).
-  3. Capacity & cost updates (any red flags?).
-  4. Upcoming infra changes / RFCs.
-  5. Developer experience feedback.
-
-**Monthly Architecture Review (60–90 min)**
-
-* Goal: Step back and review medium/long-term architecture and roadmap.
-* Focus: new cluster designs, major migrations, deprecations.
-
-**Postmortems (per major incident)**
-
-* Goal: Learn and adapt. No blame.
-* Outcome: updated runbooks, alerts, docs, or architecture decisions.
-
-### 3.2 Decision Types & Authority
-
-**Type A – Safety & Reliability** (e.g., SLOs, incident response, rollback policies)
-
-* Primary: Principal SRE & Observability.
-* Others consulted: Platform, Network/Security.
-
-**Type B – Architecture & Platform** (e.g., choosing OpenStack vs. Proxmox for a new workload)
-
-* Primary: Principal SRE & Platform Engineer.
-* Others consulted: Network/Security, Automation/GitOps, DevEx.
-
-**Type C – Process & Ways of Working** (e.g., Git branching model, incident process)
-
-* Primary: Automation/GitOps, Principal SRE.
-* Others consulted: DevEx, Talent & Remote-Culture.
-
-**Type D – Cost & Capacity** (e.g., when to buy more GPUs, rightsizing strategy)
-
-* Primary: FinOps & Capacity Planner.
-* Others consulted: Principal SRE, Platform, Product.
-
-### 3.3 RACI Overview (Simplified)
-
-> Customize as needed.
-
-| Area                            | R (Responsible)   | A (Accountable) | C (Consulted)           | I (Informed)      |
-| ------------------------------- | ----------------- | --------------- | ----------------------- | ----------------- |
-| Bare-metal architecture         | Principal SRE     | Principal SRE   | Platform, Network/Sec   | FinOps, DevEx     |
-| Network & security baseline     | Network/Sec       | Network/Sec     | Principal SRE           | Everyone          |
-| Virtualization platform choices | Platform          | Principal SRE   | FinOps, DevEx           | All product teams |
-| CI/CD & GitOps workflow         | Automation/GitOps | Principal SRE   | DevEx, Security         | All engineers     |
-| Observability stack             | Observability     | Principal SRE   | Platform, DevEx         | All teams         |
-| SLOs / SLIs                     | Observability     | Principal SRE   | Product, DevEx          | All               |
-| Capacity planning & purchases   | FinOps            | CTO / VP Eng    | Principal SRE, Platform | Finance           |
-| Developer self-service API      | DevEx             | Principal SRE   | Platform, Automation    | Product teams     |
-
----
-
-## 4. Platform & Architecture Overview
-
-> Keep this section updated as the platform evolves.
-
-### 4.1 High-Level Environment
-
-* **Regions / sites:**
-
-  * Primary DC: `eu-1` (location: …)
-  * Secondary DC / DR: `eu-2` (location: …)
-  * Optional: cloud provider (e.g., `eu-central-1`), used for burst / specific services.
-
-* **Primary workloads:**
-
-  * User-facing web/API services.
-  * Data pipelines and batch jobs.
-  * ML/AI training and inference with GPUs.
-
-### 4.2 Bare-Metal Layer (Principal SRE)
-
-* Provisioning stack:
-
-  * MAAS / Ironic for hardware management.
-  * PXE boot → Preseed / cloud-init for OS provisioning.
-  * Standard OS: Debian/Ubuntu LTS images.
-
-* Image lifecycle:
-
-  * Base image repo and versioning.
-  * Hardening baseline (SSH config, packages, security updates).
-  * Golden images for specific roles (compute, storage, GPU nodes, control-plane, infra-services).
-
-* Multi-site considerations:
-
-  * Separation of control (per-site MAAS controllers or shared?).
-  * Cross-site failover strategy and RTO/RPO targets.
-
-### 4.3 Network & Security (Network/Sec Architect)
-
-* Network layout:
-
-  * Core VLANs: `mgmt`, `storage`, `public`, `tenant`, `backup`, `out-of-band`.
-  * Routing: L3 boundaries and where firewalls apply.
-
-* Security:
-
-  * SSH & bastion host policy.
-  * VPN topology for remote admins.
-  * Zero-trust-ish: per-service authN/Z where feasible.
-
-### 4.4 Virtualization / Platform Layer (Platform Engineer)
-
-* Stacks in use:
-
-  * OpenStack (Kolla-Ansible) for multi-tenant compute/network/storage.
-  * Proxmox for infra VMs / special workloads.
-  * ESXi (if applicable) for legacy / vendor-specific needs.
-
-* Tenancy model:
-
-  * Projects per product team / environment.
-  * Resource quotas: CPU, RAM, storage, GPU.
-  * Naming, tags/labels, and chargeback/finops integration.
-
-### 4.5 Control Plane & Core Services
-
-List and describe:
-
-* Configuration management (Ansible structure, repos, roles).
-* CI/CD tooling (e.g., GitLab CI, GitHub Actions, Jenkins).
-* Secrets management (HashiCorp Vault / SOPS / KMS / etc.).
-* Artifact/image registry.
-* Observability stack.
-
----
-
-## 5. Automation & GitOps Standards
-
-### 5.1 Repositories & Branching
-
-* Infra as Code repos (examples):
-
-  * `infra-baremetal` — MAAS/Ironic configs, PXE, images.
-  * `infra-network` — network definitions, firewall rules (where possible).
-  * `infra-platform` — OpenStack/Proxmox configs, Kolla-Ansible.
-  * `infra-observability` — dashboards, alert rules.
-
-* Branch naming and policies:
-
-  * `main` is always deployable.
-  * Feature branches: `feat/<area>-<shortdesc>`.
-  * Hotfix branches: `fix/<incidentid>-<shortdesc>`.
-
-### 5.2 Ansible & Idempotency
-
-* Role structure:
-
-  * Roles grouped by domain (e.g., `base_os`, `k8s_node`, `openstack_compute`).
-  * Strict idempotency: plays can be run repeatedly without side effects.
-
-* Inventories:
-
-  * Dynamic inventories from MAAS/OpenStack.
-  * Grouping by role and site.
-
-### 5.3 GitOps Flow
-
-* All infra changes:
-
-  1. PR with description, risk, and rollback plan.
-  2. Peer review by at least one council seat.
-  3. CI validation (lint, syntax checks, dry-runs where possible).
-  4. Merge → automated apply (or controlled pipelines with approval gates).
-
-* Secrets:
-
-  * Never stored in plain text.
-  * Clear guidance on rotation and access.
-
----
-
-## 6. Observability & Incident Management
-
-### 6.1 Telemetry Stack
-
-* Metrics: Prometheus (+ exporters), visualized in Grafana.
-* Logs: ELK / Loki / Graylog (pick one and document).
-* Traces (if applicable): OpenTelemetry / Jaeger / Tempo, etc.
-
-### 6.2 SLOs & SLIs
-
-For each important service (API, internal platform, etc.):
-
-* Define **SLIs** (e.g., availability, latency, error rate).
-* Define **SLOs** (targets over 30d windows).
-
-### 6.3 Alerting Strategy
-
-* Principles:
-
-  * Alerts must be actionable.
-  * Tie alerts to SLOs and clear runbooks.
-
-* Categories:
-
-  * Page: wake someone up (critical, user-impacting).
-  * Ticket: needs attention during working hours.
-  * Dashboard-only: informational.
-
-### 6.4 Incident Response
-
-* **Severity levels:** SEV-1, SEV-2, SEV-3 (define examples).
-
-* **Roles during incident:**
-
-  * Incident Commander.
-  * Communications lead (internal & external).
-  * Domain experts (network, platform, etc.).
-
-* **Timeline:**
-
-  1. Detection & triage.
-  2. Containment / mitigation.
-  3. Root cause analysis.
-  4. Postmortem within X business days.
-
-### 6.5 Postmortems
-
-* Always blameless.
-* Template includes:
-
-  * What happened.
-  * Timeline.
-  * Impact.
-  * Contributing factors.
-  * What worked, what didn’t.
-  * Follow-up actions (with owners & due dates).
-
----
-
-## 7. FinOps & Capacity Planning
-
-### 7.1 Asset & Usage Tracking
-
-* Compute & GPU inventory (per node, per cluster, per tenant).
-* Storage usage by project/team.
-* Power consumption and rack utilization where possible.
-
-### 7.2 Cost Model (Even On-Prem)
-
-* Components:
-
-  * Hardware amortization.
-  * Power & cooling.
-  * Licenses & support.
-  * Staff time (approximate).
-
-* Map costs to:
-
-  * Projects / teams.
-  * Environments (prod, staging, dev).
-
-### 7.3 Capacity Planning Cycle
-
-* Monthly/quarterly review:
-
-  * Utilization vs. headroom.
-  * Forecast upcoming projects.
-  * Decide on new purchases vs cloud burst.
-
----
-
-## 8. Product & Developer Experience
-
-### 8.1 User Profiles
-
-* Backend engineers (APIs, services).
-* Data engineers.
-* ML engineers / researchers.
-
-### 8.2 Self-Service Interfaces
-
-* APIs and/or CLI for:
-
-  * Provisioning compute/storage.
-  * Requesting GPUs.
-  * Viewing usage and costs.
-
-* Templates:
-
-  * App scaffolding.
-  * CI/CD pipelines.
-  * Helm charts or deployment manifests.
-
-### 8.3 Documentation & Onboarding
-
-* Single entrypoint: "How to use the platform" guide.
-* Checklists for:
-
-  * New service onboarding.
-  * Adding monitoring & alerting.
-  * Security reviews.
-
----
-
-## 9. Talent & Remote Culture (When Active)
-
-### 9.1 Role Design
-
-* Clear levels (e.g., SRE I/II/III, Principal).
-* Example responsibilities per level.
-
-### 9.2 Hiring Process
-
-* Standard loop:
-
-  * Recruiter/intro.
-  * Technical screen (practical, scenario-based).
-  * Systems design / deep dive.
-  * Culture & collaboration interview.
-
-### 9.3 Remote-First Norms
-
-* Async communication guidelines.
-* Incident handling across time zones.
-* Documentation as a first-class artifact.
-
----
-
-## 10. Change Management & How to Update This Doc
-
-### 10.1 Updating the Playbook
-
-* This doc lives in version control (`/docs/platform/sre-devops-council.md` or similar).
-* Changes follow the same **PR + review** flow as infra changes.
-* At least one of: Principal SRE, Platform Engineer, or DevEx Partner must approve.
-
-### 10.2 Versioning & Changelog
-
-Maintain a short changelog here:
-
-* `v0.1` – Initial council-based structure for EUstartup.
-* `v0.x` – (add entries as you refine architecture, processes, etc.).
-
----
-
-## 11. Open Questions & TODOs
-
-Use this section as a backlog of design and process questions for the council.
-
-* [ ] Choose and document canonical logging stack (ELK vs Loki vs Graylog).
-* [ ] Finalize SLOs for core user-facing services.
-* [ ] Define GPU allocation policy per team.
-* [ ] Decide on single primary IaC repo vs multiple domain repos.
-* [ ] Document exact incident severity matrix & response SLAs.
-
-Add more as they come up in council meetings.