
The Service Ownership Problem Nobody Wants to Talk About
There is a phrase that almost every engineering organization has adopted in some form:
"You build it, you run it."
It sounds empowering. It sounds modern. It sounds like the kind of thing Netflix would say.
And in practice, it fails in most organizations. Not because the principle is wrong, but because the support structures are missing.
The gap between principle and practice
"You build it, you run it" assumes several things:
- teams have the skills and tools to operate their services
- ownership is clearly defined and discoverable
- on-call rotation is sustainable and fairly distributed
- escalation paths exist when problems cross service boundaries
- the platform makes operational work manageable
In reality, most organizations adopt the slogan without building the infrastructure to support it. The result is predictable:
- ownership is unclear or disputed
- some services have no owner at all
- on-call engineers lack the context to debug unfamiliar services
- incidents escalate slowly because nobody knows who to contact
- operational work piles up on the most conscientious teams while others ignore it
- platform gaps force teams to solve the same infrastructure problems independently
This is the service ownership problem, and almost nobody wants to talk about it honestly.
Why ownership breaks down
Ownership does not break down because engineers are irresponsible. It breaks down because organizations change faster than ownership records.
Teams get reorganized. People leave. Services get transferred without context. New services are created for short-term projects and never decommissioned. Shared libraries and infrastructure components have no clear owner because multiple teams depend on them.
Over time, the ownership graph becomes stale. And stale ownership is worse than no ownership, because it creates a false sense of accountability.
I have seen organizations where the ownership spreadsheet says Team A owns a critical payment gateway, but Team A was reorganized six months ago and its members are now on three different teams. When that payment gateway has an incident at 2 AM, the on-call page goes to someone who has never seen the codebase.
The metadata problem
Service ownership is fundamentally a metadata problem.
Every service in your organization should have discoverable answers to these questions:
- Who owns this service? — Which team is responsible for its operation and evolution?
- How do I reach the owner? — What is the on-call rotation? What is the team's communication channel?
- What does this service do? — What is its purpose, who are its consumers, and what are its dependencies?
- What are its operational characteristics? — What is its SLO? How should it be diagnosed? What are its known failure modes?
- What is its lifecycle status? — Is it actively maintained, in maintenance mode, or deprecated?
If these questions cannot be answered within seconds, your ownership model is not working.
The solution is to make ownership metadata a first-class, machine-readable part of your engineering system. Not a wiki page. Not a spreadsheet. A structured, version-controlled, validated record that is part of the service itself.
I typically recommend a service catalog approach where each service declares its ownership metadata in a structured format (such as YAML or TOML) that is:
- co-located with the service code or in a central registry
- validated in CI (missing or stale ownership fails the build)
- consumed by observability, alerting, and incident management tools
- reviewed and updated as part of regular team health checks
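To make the CI validation idea concrete, here is a minimal sketch of the kind of check I have in mind. The field names, the lifecycle states, and the 90-day review window are all assumptions chosen for illustration, not a standard schema; the record is shown as a plain dictionary so the example does not depend on any particular YAML or TOML parser.

```python
from datetime import date

# Hypothetical schema: these field names are illustrative, not a standard.
REQUIRED_FIELDS = (
    "service", "owner_team", "oncall_rotation",
    "channel", "lifecycle", "last_reviewed",
)
LIFECYCLE_STATES = {"active", "maintenance", "deprecated"}
MAX_REVIEW_AGE_DAYS = 90  # assumed quarterly review cadence


def validate_ownership(record, today):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            problems.append(f"missing field: {field}")

    lifecycle = record.get("lifecycle")
    if lifecycle and lifecycle not in LIFECYCLE_STATES:
        problems.append(f"unknown lifecycle state: {lifecycle}")

    reviewed = record.get("last_reviewed")
    if reviewed:
        try:
            age = (today - date.fromisoformat(reviewed)).days
        except ValueError:
            problems.append(f"last_reviewed is not an ISO date: {reviewed}")
        else:
            if age > MAX_REVIEW_AGE_DAYS:
                problems.append(f"ownership last reviewed {age} days ago")
    return problems


if __name__ == "__main__":
    record = {
        "service": "payment-gateway",
        "owner_team": "payments",
        "oncall_rotation": "payments-primary",
        "channel": "#payments-oncall",
        "lifecycle": "active",
        "last_reviewed": "2024-01-01",
    }
    for problem in validate_ownership(record, date.today()):
        print(problem)
```

A CI job could run a check like this against every service's record and fail the build whenever the returned list is non-empty. Returning all problems at once, rather than stopping at the first, keeps the feedback loop short for the team fixing the record.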
The platform support gap
Even with clear ownership, "you build it, you run it" fails if the platform does not support operational work.
Consider what a service-owning team needs to do:
- respond to incidents
- diagnose production issues
- deploy safely
- manage database migrations
- monitor service health
- handle dependency failures
- manage secrets and configuration
- comply with security requirements
If each of these requires bespoke tooling, deep infrastructure knowledge, or manual processes, the operational burden becomes unsustainable. Teams that are supposed to be building features spend most of their time fighting the platform.
This is where platform engineering becomes critical. A good internal platform provides:
- golden path deployments — a standard, safe way to ship code
- built-in observability — structured logging, metrics, and tracing that work out of the box
- incident response tooling — automated escalation, ownership lookup, and runbook access
- self-service infrastructure — databases, queues, caches that can be provisioned without a ticket
- compliance guardrails — security scanning, policy enforcement, and audit trails built into the workflow
The platform does not eliminate operational work. It reduces the cognitive load and tribal knowledge required to do it well.
Escalation paths matter more than you think
One of the most overlooked aspects of service ownership is escalation.
In a microservices architecture, many incidents cross service boundaries. A user-facing error might surface at the API gateway, be caused by a timeout in a downstream service, and that timeout might in turn be caused by database connection pool exhaustion in a third service.
If your escalation model requires an on-call engineer to manually identify the owning team of each involved service, find their contact information, and coordinate a war room, you have already lost critical minutes.
I recommend:
- automated ownership lookup in alerting — when an alert fires, the alert metadata should include the owning team and their contact channel
- dependency-aware escalation — if Service A depends on Service B and both are unhealthy, the incident tool should suggest involving both teams
- clear escalation tiers — L1 responds, L2 has deep domain knowledge, L3 involves architecture or leadership for cross-cutting issues
- blameless incident reviews — ownership should never become a blame vector
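As a rough illustration of dependency-aware escalation, the sketch below walks the dependency graph from the service that raised the alert and collects the owners of every unhealthy service it reaches. The function name, the shape of the inputs, and the service and team names are all hypothetical; in a real setup the dependency and ownership data would come from the service catalog and the health data from monitoring.

```python
from collections import deque


def teams_to_involve(alerting_service, dependencies, owners, unhealthy):
    """Suggest teams for an incident, in breadth-first order from the alert.

    dependencies: service name -> list of services it calls
    owners:       service name -> owning team
    unhealthy:    set of services currently failing health checks
    """
    teams = []
    seen = {alerting_service}
    queue = deque([alerting_service])
    while queue:
        service = queue.popleft()
        # The alerting service's owner is always included; dependencies
        # are included only if they are unhealthy themselves.
        if service == alerting_service or service in unhealthy:
            # A missing owner is surfaced explicitly rather than dropped.
            team = owners.get(service, "unowned:" + service)
            if team not in teams:
                teams.append(team)
        for dep in dependencies.get(service, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return teams


if __name__ == "__main__":
    dependencies = {
        "api-gateway": ["orders"],
        "orders": ["inventory", "pricing"],
    }
    owners = {
        "api-gateway": "edge-team",
        "orders": "orders-team",
        "inventory": "inventory-team",
        "pricing": "pricing-team",
    }
    print(teams_to_involve("api-gateway", dependencies, owners,
                           unhealthy={"orders", "inventory"}))
```

The traversal continues through healthy services on purpose, since a healthy intermediary can still sit between the alert and an unhealthy dependency further down. Flagging missing owners as "unowned" in the output turns a catalog gap into something the incident responder sees immediately.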
Making ownership sustainable
The goal is not to make ownership feel like punishment. It is to make ownership feel manageable.
Here is my recommended approach:
- Start with a service catalog. Define a structured format for ownership metadata and require it for every service. Validate it in CI.
- Integrate ownership with your tools. Alerting, dashboards, incident management, and deployment systems should all consume ownership data. When an engineer sees an alert, they should immediately know who owns the affected service.
- Set ownership review cadence. Ownership should be reviewed quarterly, or whenever a team reorganization occurs. Make it part of your engineering health metrics.
- Reduce operational burden through platform investment. Every hour spent on platform capabilities is an hour saved across every service-owning team.
- Define and enforce lifecycle states. Services that are deprecated should be explicitly marked and have a decommission plan. Orphaned services should be flagged automatically.
- Make on-call sustainable. On-call rotations should be sized appropriately, compensated fairly, and supported with runbooks and diagnostic tooling.
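Automatic flagging of orphaned and deprecated services can be quite simple once the catalog exists. The following is a sketch under assumptions: the catalog entries, the "decommission_plan" field, and the list of active teams are hypothetical stand-ins for whatever your catalog and HR or directory system actually provide.

```python
def find_lifecycle_issues(catalog, active_teams):
    """Map each problematic service to the list of issues found for it."""
    issues = {}
    for entry in catalog:
        found = []
        # Orphaned: the recorded owner is missing or no longer a real team.
        if entry.get("owner_team") not in active_teams:
            found.append("orphaned")
        # Deprecated services are expected to carry a decommission plan.
        if entry.get("lifecycle") == "deprecated" and not entry.get("decommission_plan"):
            found.append("deprecated without decommission plan")
        if found:
            issues[entry["service"]] = found
    return issues


if __name__ == "__main__":
    catalog = [
        {"service": "payment-gateway", "owner_team": "team-a", "lifecycle": "active"},
        {"service": "legacy-reports", "owner_team": "reporting", "lifecycle": "deprecated"},
        {"service": "orders", "owner_team": "orders-team", "lifecycle": "active"},
    ]
    active_teams = {"reporting", "orders-team"}
    for service, found in find_lifecycle_issues(catalog, active_teams).items():
        print(service, "->", ", ".join(found))
```

Running a check like this on a schedule, and again whenever the team list changes, is one way to catch the reorganization scenario described earlier before it turns into a 2 AM page to the wrong person.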
The organizational design angle
Service ownership is ultimately an organizational design problem, not just a technical one.
Conway's Law tells us that a system's architecture tends to mirror the communication structure of the organization that builds it. If your teams are organized in ways that do not align with your service boundaries, ownership will always be awkward.
The best engineering organizations I have worked with align team boundaries with service boundaries and give teams genuine autonomy over their domain. That autonomy comes with responsibility, but it also comes with the platform support and organizational backing to make that responsibility sustainable.
Closing thought
"You build it, you run it" is a good principle. But a principle without support infrastructure is just a slogan.
If your organization has adopted service ownership but struggles with unclear owners, slow escalation, unsustainable on-call, and duplicated operational effort, the problem is not your engineers. The problem is that the system around them was not designed to make ownership work.
That is a solvable problem. And it starts with treating ownership as a first-class engineering concern rather than an assumption.