update, 14 roles at the round table
# EUstartup SRE / DevOps Council — Living Design Doc & Playbook

graph LR

Council["AI Council OS - Man in the Middle\n(14-Seat Round Table)"]

Council --- SRE_Principal
Council --- BareMetal
Council --- VirtArch
Council --- OpenStackArch
Council --- NetArch
Council --- IaCLead
Council --- GitOps
Council --- Observability
Council --- SRE_Reliability
Council --- Security
Council --- Sovereign
Council --- Facility
Council --- CapacityPerf
Council --- LifecycleOps

%% Roles

SRE_Principal["Principal SRE/DevOps Architect"]
BareMetal["Bare-Metal Provisioning Lead\n(MAAS/Ironic/PXE)"]
VirtArch["Virtualization Architect\n(Proxmox/ESXi/KVM)"]
OpenStackArch["OpenStack Cloud Architect\n(Kolla/Neutron/Nova)"]
NetArch["Network Architect\n(Spine/Leaf/BGP/EVPN)"]
IaCLead["Automation & IaC Lead\n(Ansible/Terraform/Python SDK)"]
GitOps["CI/CD & GitOps Governance Lead"]
Observability["Observability & Telemetry Architect"]
SRE_Reliability["SRE Reliability Engineering Lead"]
Security["Security Architect\n(Zero Trust, Compliance)"]
Sovereign["Sovereign Compliance & Sustainability Lead\n(GDPR/EU Green)"]
Facility["Physical Infrastructure & Facility Engineering Lead\n(Power/Cooling/EN 50600)"]
CapacityPerf["Capacity & Performance Engineer"]
LifecycleOps["Platform Lifecycle & Operations Lead"]

%% Special close collaboration

Sovereign <--> Facility

> **Status:** Draft v0.1
> **Scope:** All infrastructure, SRE, platform, and DevEx work for **EUstartup**.
> **Owner:** Principal SRE (or acting)
> **Last Updated:** <!-- update date here -->

---

## 1. Purpose & North Star

The **SRE / DevOps Council** is the decision and coordination forum for everything related to:

* Infrastructure (bare-metal, virtualization, cloud, networking).
* Reliability, scalability, and performance of EUstartup’s products.
* Developer & ML engineer experience on the platform.
* Cost, capacity, and operational excellence.

**North Star:**

> "Provide a secure, reliable, and cost-aware platform that enables EUstartup teams to ship and operate products quickly and safely."

We use this document as a **living playbook**:

* To clarify **who decides what**, and **how**.
* To capture **current architecture** and **desired evolution**.
* To document **standard operating procedures (SOPs)** and **runbooks**.
* To onboard new engineers into how EUstartup runs infrastructure.

---

## 2. Council Structure & Seats

The council is a **virtual table** with the following seats. One person may hold multiple seats; a seat may be empty but is still a clearly defined responsibility.

### 2.1 Seats at the Table

1. **Principal SRE & Bare-Metal Architect**

   * Owns end-to-end infra architecture & reliability.
   * Focus: MAAS/Ironic, PXE/Preseed/cloud-init, Debian/Ubuntu image lifecycle, multi-site bare-metal design & failure modes.

2. **Network & Security Architect**

   * Owns network and security architecture.
   * Focus: VLANs, L2/L3 routing, VPN/WAN/UniFi, segmentation, SSH/API hardening, remote admin security.

3. **Platform & Virtualization Engineer**

   * Owns virtualization and platform abstraction.
   * Focus: OpenStack (Kolla-Ansible), Proxmox, ESXi, resource quotas, integration of bare-metal with virtualization.

4. **Automation & GitOps Lead**

   * Owns automation standards and GitOps workflows.
   * Focus: Ansible, CI/CD, repos and branching models, secrets, Bash/Python tooling.

5. **Observability & Incident Analyst**

   * Owns telemetry stack and incident analysis.
   * Focus: Prometheus/Grafana, ELK/Loki/Graylog, SLOs/SLIs, alerting, postmortems.

6. **FinOps & Capacity Planner**

   * Owns capacity planning & cost visibility.
   * Focus: compute/GPU tracking, power & licenses, amortization, unit economics.

7. **Product & Developer Experience Partner**

   * Represents product teams and ML/dev users of the platform.
   * Focus: self-service APIs, UX of tooling, docs, templates.

8. **Talent & Remote-Culture Partner (conditional)**

   * Added when hiring or changing roles.
   * Focus: role definitions, interview loops, remote-first norms.

### 2.2 Current Seat Assignments (EUstartup)

> **Fill this table with actual names/aliases.**

| Seat                                 | Primary | Backup | Notes                   |
| ------------------------------------ | ------- | ------ | ----------------------- |
| Principal SRE & Bare-Metal Architect | TBD     | TBD    |                         |
| Network & Security Architect         | TBD     | TBD    |                         |
| Platform & Virtualization Engineer   | TBD     | TBD    |                         |
| Automation & GitOps Lead             | TBD     | TBD    |                         |
| Observability & Incident Analyst     | TBD     | TBD    |                         |
| FinOps & Capacity Planner            | TBD     | TBD    |                         |
| Product & DevEx Partner              | TBD     | TBD    |                         |
| Talent & Remote-Culture Partner      | TBD     | TBD    | Only active when hiring |

---

## 3. Council Operating Model

### 3.1 Cadence & Rituals

**Weekly Council Sync (30–45 min)**

* Goal: Review changes, risks, incidents, and upcoming work.
* Attendees: All seats or their delegates.
* Agenda template:

  1. Quick round: changes shipped last week.
  2. Incidents & reliability review (SLOs, recurring alerts).
  3. Capacity & cost updates (any red flags?).
  4. Upcoming infra changes / RFCs.
  5. Developer experience feedback.

**Monthly Architecture Review (60–90 min)**

* Goal: Step back and review medium/long-term architecture and roadmap.
* Focus: new cluster designs, major migrations, deprecations.

**Postmortems (per major incident)**

* Goal: Learn and adapt. No blame.
* Outcome: updated runbooks, alerts, docs, or architecture decisions.

### 3.2 Decision Types & Authority

**Type A – Safety & Reliability** (e.g., SLOs, incident response, rollback policies)

* Primary: Principal SRE & Observability.
* Others consulted: Platform, Network/Security.

**Type B – Architecture & Platform** (e.g., choosing OpenStack vs. Proxmox for a new workload)

* Primary: Principal SRE & Platform Engineer.
* Others consulted: Network/Security, Automation/GitOps, DevEx.

**Type C – Process & Ways of Working** (e.g., Git branching model, incident process)

* Primary: Automation/GitOps, Principal SRE.
* Others consulted: DevEx, Talent & Remote-Culture.

**Type D – Cost & Capacity** (e.g., when to buy more GPUs, rightsizing strategy)

* Primary: FinOps & Capacity Planner.
* Others consulted: Principal SRE, Platform, Product.

### 3.3 RACI Overview (Simplified)

> Customize as needed.

| Area                            | R (Responsible)   | A (Accountable) | C (Consulted)           | I (Informed)      |
| ------------------------------- | ----------------- | --------------- | ----------------------- | ----------------- |
| Bare-metal architecture         | Principal SRE     | Principal SRE   | Platform, Network/Sec   | FinOps, DevEx     |
| Network & security baseline     | Network/Sec       | Network/Sec     | Principal SRE           | Everyone          |
| Virtualization platform choices | Platform          | Principal SRE   | FinOps, DevEx           | All product teams |
| CI/CD & GitOps workflow         | Automation/GitOps | Principal SRE   | DevEx, Security         | All engineers     |
| Observability stack             | Observability     | Principal SRE   | Platform, DevEx         | All teams         |
| SLOs / SLIs                     | Observability     | Principal SRE   | Product, DevEx          | All               |
| Capacity planning & purchases   | FinOps            | CTO / VP Eng    | Principal SRE, Platform | Finance           |
| Developer self-service API      | DevEx             | Principal SRE   | Platform, Automation    | Product teams     |

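
One way to keep the RACI table honest is to encode it as data and check it automatically. The sketch below is illustrative only: the `RACI` dictionary copies three rows from the table above, and `areas_without_single_owner` and `accountable_for` are hypothetical helper names, not existing tooling. It assumes the common RACI convention that each area has exactly one Accountable party.

```python
# Hypothetical encoding of three rows of the RACI table above.
RACI = {
    "Bare-metal architecture": {
        "R": ["Principal SRE"],
        "A": ["Principal SRE"],
        "C": ["Platform", "Network/Sec"],
        "I": ["FinOps", "DevEx"],
    },
    "Observability stack": {
        "R": ["Observability"],
        "A": ["Principal SRE"],
        "C": ["Platform", "DevEx"],
        "I": ["All teams"],
    },
    "Capacity planning & purchases": {
        "R": ["FinOps"],
        "A": ["CTO / VP Eng"],
        "C": ["Principal SRE", "Platform"],
        "I": ["Finance"],
    },
}


def areas_without_single_owner(raci):
    """Return areas that do not have exactly one Accountable entry.

    Assumption: one Accountable per area, a common RACI convention.
    """
    return [area for area, roles in raci.items() if len(roles.get("A", [])) != 1]


def accountable_for(raci, area):
    """Return the single Accountable party for an area."""
    return raci[area]["A"][0]
```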

---

## 4. Platform & Architecture Overview

> Keep this section updated as the platform evolves.

### 4.1 High-Level Environment

* **Regions / sites:**

  * Primary DC: `eu-1` (location: …)
  * Secondary DC / DR: `eu-2` (location: …)
  * Optional: cloud provider (e.g., `eu-central-1`), used for burst / specific services.

* **Primary workloads:**

  * User-facing web/API services.
  * Data pipelines and batch jobs.
  * ML/AI training and inference with GPUs.

### 4.2 Bare-Metal Layer (Principal SRE)

* Provisioning stack:

  * MAAS / Ironic for hardware management.
  * PXE boot → Preseed / cloud-init for OS provisioning.
  * Standard OS: Debian/Ubuntu LTS images.

* Image lifecycle:

  * Base image repo and versioning.
  * Hardening baseline (SSH config, packages, security updates).
  * Golden images for specific roles (compute, storage, GPU nodes, control-plane, infra-services).

* Multi-site considerations:

  * Separation of control (per-site MAAS controllers or shared?).
  * Cross-site failover strategy and RTO/RPO targets.

### 4.3 Network & Security (Network/Sec Architect)

* Network layout:

  * Core VLANs: `mgmt`, `storage`, `public`, `tenant`, `backup`, `out-of-band`.
  * Routing: L3 boundaries and where firewalls apply.

* Security:

  * SSH & bastion host policy.
  * VPN topology for remote admins.
  * Zero-trust-ish: per-service authN/Z where feasible.

### 4.4 Virtualization / Platform Layer (Platform Engineer)

* Stacks in use:

  * OpenStack (Kolla-Ansible) for multi-tenant compute/network/storage.
  * Proxmox for infra VMs / special workloads.
  * ESXi (if applicable) for legacy / vendor-specific needs.

* Tenancy model:

  * Projects per product team / environment.
  * Resource quotas: CPU, RAM, storage, GPU.
  * Naming, tags/labels, and chargeback/FinOps integration.

### 4.5 Control Plane & Core Services

List and describe:

* Configuration management (Ansible structure, repos, roles).
* CI/CD tooling (e.g., GitLab CI, GitHub Actions, Jenkins).
* Secrets management (HashiCorp Vault / SOPS / KMS / etc.).
* Artifact/image registry.
* Observability stack.

---

## 5. Automation & GitOps Standards

### 5.1 Repositories & Branching

* Infra as Code repos (examples):

  * `infra-baremetal`: MAAS/Ironic configs, PXE, images.
  * `infra-network`: network definitions, firewall rules (where possible).
  * `infra-platform`: OpenStack/Proxmox configs, Kolla-Ansible.
  * `infra-observability`: dashboards, alert rules.

* Branch naming and policies:

  * `main` is always deployable.
  * Feature branches: `feat/<area>-<shortdesc>`.
  * Hotfix branches: `fix/<incidentid>-<shortdesc>`.
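
The branch naming rules above could be enforced in CI or a pre-push hook. The following is a minimal sketch, not an agreed standard: the playbook fixes only the `feat/` and `fix/` prefixes and the `<area>-<shortdesc>` shape, so the allowed character set (lowercase letters, digits, hyphens) and the `classify_branch` helper are assumptions made for illustration.

```python
import re

# Assumption: lowercase letters, digits, and hyphens only. The playbook
# defines the prefixes and overall shape, not the permitted alphabet.
BRANCH_PATTERNS = {
    "main": re.compile(r"main"),
    "feature": re.compile(r"feat/[a-z0-9]+-[a-z0-9-]+"),
    "hotfix": re.compile(r"fix/[a-z0-9]+-[a-z0-9-]+"),
}


def classify_branch(name):
    """Return 'main', 'feature', 'hotfix', or None for a non-conforming name."""
    for kind, pattern in BRANCH_PATTERNS.items():
        # fullmatch requires the whole name to fit the pattern.
        if pattern.fullmatch(name):
            return kind
    return None
```

For example, `classify_branch("feat/network-bgp-peering")` yields `"feature"`, while `classify_branch("feature/foo")` yields `None` and could fail the pipeline.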

### 5.2 Ansible & Idempotency

* Role structure:

  * Roles grouped by domain (e.g., `base_os`, `k8s_node`, `openstack_compute`).
  * Strict idempotency: re-running a play against an already-converged host makes no further changes.

* Inventories:

  * Dynamic inventories from MAAS/OpenStack.
  * Grouping by role and site.
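
To make the idempotency requirement concrete, here is a small Python analogy (not Ansible code) of a task that reports whether it changed anything. The `ensure_line` helper is hypothetical; it is similar in spirit to what Ansible's `lineinfile` module does, but it is only a sketch of the check-then-act pattern.

```python
from pathlib import Path


def ensure_line(path, line):
    """Ensure `line` is present in the file at `path`.

    Returns True if the file was modified, False if it was already in the
    desired state. Calling it twice in a row must return False the second time.
    """
    path = Path(path)
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already converged: do nothing
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True
```

The first call on a fresh file returns `True` (changed); every later call with the same arguments returns `False` and leaves the file untouched, which is the behaviour roles are expected to have.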

### 5.3 GitOps Flow

* All infra changes:

  1. PR with description, risk, and rollback plan.
  2. Peer review by at least one council seat.
  3. CI validation (lint, syntax checks, dry-runs where possible).
  4. Merge → automated apply (or controlled pipelines with approval gates).

* Secrets:

  * Never stored in plain text.
  * Clear guidance on rotation and access.
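
The first three steps of the flow above can be expressed as a merge gate. This is a sketch under stated assumptions: the `PullRequest` shape, the section names, and `merge_blockers` are hypothetical and not tied to any particular Git hosting API. A PR is mergeable when the returned list is empty.

```python
from dataclasses import dataclass, field

# Assumed section keys, derived from "description, risk, and rollback plan".
REQUIRED_SECTIONS = ("description", "risk", "rollback")


@dataclass
class PullRequest:
    sections: dict                      # section name -> text from the PR body
    approvers: set = field(default_factory=set)
    ci_passed: bool = False


def merge_blockers(pr, council_members):
    """Return human-readable reasons why the PR cannot be merged yet."""
    blockers = []
    for name in REQUIRED_SECTIONS:
        if not pr.sections.get(name, "").strip():
            blockers.append(f"missing section: {name}")
    # Step 2: at least one approver must hold a council seat.
    if not (pr.approvers & council_members):
        blockers.append("no approval from a council seat")
    # Step 3: lint, syntax checks, and dry-runs must have passed.
    if not pr.ci_passed:
        blockers.append("CI validation has not passed")
    return blockers
```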

---

## 6. Observability & Incident Management

### 6.1 Telemetry Stack

* Metrics: Prometheus (+ exporters), visualized in Grafana.
* Logs: ELK / Loki / Graylog (pick one and document).
* Traces (if applicable): OpenTelemetry / Jaeger / Tempo, etc.

### 6.2 SLOs & SLIs

For each important service (API, internal platform, etc.):

* Define **SLIs** (e.g., availability, latency, error rate).
* Define **SLOs** (targets over 30d windows).
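
An SLO over a 30-day window translates directly into an error budget. The arithmetic can be sketched as follows; the function names are hypothetical, and the 99.9% figure in the usage note is an example only, not a target this playbook sets.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Allowed 'bad' minutes for an availability SLO over the window.

    slo_target is a fraction strictly between 0 and 1, e.g. 0.999.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be a fraction strictly between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


def budget_remaining(slo_target, bad_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative when overspent)."""
    return 1 - bad_minutes / error_budget_minutes(slo_target, window_days)
```

With an example target of 99.9% over 30 days, the window is 43,200 minutes, so the budget is about 43.2 minutes of unavailability; after 21.6 bad minutes, roughly half the budget remains.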

### 6.3 Alerting Strategy

* Principles:

  * Alerts must be actionable.
  * Tie alerts to SLOs and clear runbooks.

* Categories:

  * Page: wake someone up (critical, user-impacting).
  * Ticket: needs attention during working hours.
  * Dashboard-only: informational.
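
The three categories above can be read as a small routing decision. The sketch below is one possible interpretation: the `Route` enum and `route_alert` function are hypothetical, and sending non-actionable signals to dashboards is an assumption drawn from the "alerts must be actionable" principle rather than a rule stated in this playbook.

```python
from enum import Enum


class Route(Enum):
    PAGE = "page"            # wake someone up
    TICKET = "ticket"        # handle during working hours
    DASHBOARD = "dashboard"  # informational only


def route_alert(critical, user_impacting, actionable=True):
    """Map an alert's properties onto the three categories."""
    if not actionable:
        # Assumption: a signal nobody can act on should not notify anyone.
        return Route.DASHBOARD
    if critical and user_impacting:
        return Route.PAGE
    return Route.TICKET
```

Under this reading, a critical but internal-only problem becomes a ticket, and only alerts that are both critical and user-impacting page the on-call engineer.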

### 6.4 Incident Response

* **Severity levels:** SEV-1, SEV-2, SEV-3 (define examples).

* **Roles during incident:**

  * Incident Commander.
  * Communications lead (internal & external).
  * Domain experts (network, platform, etc.).

* **Timeline:**

  1. Detection & triage.
  2. Containment / mitigation.
  3. Root cause analysis.
  4. Postmortem within X business days.

### 6.5 Postmortems

* Always blameless.
* Template includes:

  * What happened.
  * Timeline.
  * Impact.
  * Contributing factors.
  * What worked, what didn’t.
  * Follow-up actions (with owners & due dates).

---

## 7. FinOps & Capacity Planning

### 7.1 Asset & Usage Tracking

* Compute & GPU inventory (per node, per cluster, per tenant).
* Storage usage by project/team.
* Power consumption and rack utilization where possible.

### 7.2 Cost Model (Even On-Prem)

* Components:

  * Hardware amortization.
  * Power & cooling.
  * Licenses & support.
  * Staff time (approximate).

* Map costs to:

  * Projects / teams.
  * Environments (prod, staging, dev).
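
A first-pass version of this cost model fits in a few lines. The sketch below makes several labeled assumptions that the playbook does not prescribe: straight-line amortization, cooling folded in through a PUE multiplier on IT power, 730 hours per month, and usage-proportional allocation. All function and parameter names are hypothetical.

```python
def monthly_node_cost(hardware_price, amortization_months, power_kw,
                      price_per_kwh, pue=1.0, licenses_per_month=0.0,
                      staff_per_month=0.0, hours_per_month=730):
    """Approximate monthly cost of one node.

    Assumptions: straight-line amortization; cooling overhead modeled as a
    PUE multiplier on the node's average power draw.
    """
    hardware = hardware_price / amortization_months
    energy = power_kw * pue * hours_per_month * price_per_kwh
    return hardware + energy + licenses_per_month + staff_per_month


def allocate(total_cost, usage_by_team):
    """Split a cost across teams in proportion to their usage share."""
    total_usage = sum(usage_by_team.values())
    if total_usage == 0:
        return {team: 0.0 for team in usage_by_team}
    return {team: total_cost * used / total_usage
            for team, used in usage_by_team.items()}
```

As an illustration with made-up numbers, a 36,000 node amortized over 36 months, drawing 0.5 kW at a PUE of 1.5 and 0.20 per kWh, with 50 per month in licenses, costs about 1,159.50 per month; `allocate` then maps that figure to teams or environments by GPU-hours, core-hours, or whatever usage unit is tracked.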

### 7.3 Capacity Planning Cycle

* Monthly/quarterly review:

  * Utilization vs. headroom.
  * Forecast upcoming projects.
  * Decide on new purchases vs cloud burst.
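
The "utilization vs. headroom" check in the review can be reduced to a simple forecast. This is a rough sketch assuming linear growth and an 80% utilization threshold; both are illustrative assumptions, not council policy, and the function names are hypothetical. The result tells the review how many months remain before a purchase or cloud-burst decision is needed.

```python
import math


def headroom(capacity, used):
    """Fraction of capacity still free."""
    return 1 - used / capacity


def months_until_threshold(capacity, used, monthly_growth, threshold=0.8):
    """Months until utilization reaches the threshold under linear growth.

    Returns 0 if already at or above the threshold, None if usage is not
    growing. The 0.8 default is an assumed planning threshold.
    """
    limit = capacity * threshold
    if used >= limit:
        return 0
    if monthly_growth <= 0:
        return None
    return math.ceil((limit - used) / monthly_growth)
```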

---

## 8. Product & Developer Experience

### 8.1 User Profiles

* Backend engineers (APIs, services).
* Data engineers.
* ML engineers / researchers.

### 8.2 Self-Service Interfaces

* APIs and/or CLI for:

  * Provisioning compute/storage.
  * Requesting GPUs.
  * Viewing usage and costs.

* Templates:

  * App scaffolding.
  * CI/CD pipelines.
  * Helm charts or deployment manifests.

### 8.3 Documentation & Onboarding

* Single entrypoint: "How to use the platform" guide.
* Checklists for:

  * New service onboarding.
  * Adding monitoring & alerting.
  * Security reviews.

---

## 9. Talent & Remote Culture (When Active)

### 9.1 Role Design

* Clear levels (e.g., SRE I/II/III, Principal).
* Example responsibilities per level.

### 9.2 Hiring Process

* Standard loop:

  * Recruiter/intro.
  * Technical screen (practical, scenario-based).
  * Systems design / deep dive.
  * Culture & collaboration interview.

### 9.3 Remote-First Norms

* Async communication guidelines.
* Incident handling across time zones.
* Documentation as a first-class artifact.

---

## 10. Change Management & How to Update This Doc

### 10.1 Updating the Playbook

* This doc lives in version control (`/docs/platform/sre-devops-council.md` or similar).
* Changes follow the same **PR + review** flow as infra changes.
* At least one of: Principal SRE, Platform Engineer, or DevEx Partner must approve.

### 10.2 Versioning & Changelog

Maintain a short changelog here:

* `v0.1` – Initial council-based structure for EUstartup.
* `v0.x` – (add entries as you refine architecture, processes, etc.).

---

## 11. Open Questions & TODOs

Use this section as a backlog of design and process questions for the council.

* [ ] Choose and document canonical logging stack (ELK vs Loki vs Graylog).
* [ ] Finalize SLOs for core user-facing services.
* [ ] Define GPU allocation policy per team.
* [ ] Decide on single primary IaC repo vs multiple domain repos.
* [ ] Document exact incident severity matrix & response SLAs.

Add more as they come up in council meetings.