Imagine your biggest product launch of the year is finally here. Marketing has built the hype, and thousands of eager users are hitting your landing page, but suddenly, everything freezes and the site goes dark.
Effective site operations product management keeps the underlying platform alive while developers push new code into production. In Inspired, Marty Cagan explains that the site operations team is the backbone of any 24/7 internet service.
Without these experts, even the most innovative features are worthless because your customers can't access them. You'll learn why the product manager (PM) must treat site reliability as a core feature rather than an afterthought.
Site operations is the dedicated group responsible for keeping an internet service running smoothly around the clock. In his book Inspired, Marty Cagan notes that while some small startups let engineers handle this as a side task, successful companies realize it requires a distinct, specialized set of skills.
This role is far too important to be a secondary responsibility for someone focused on building new features. It involves managing servers, monitoring performance, and ensuring that the site can scale as the user base grows.
In the real world, this concept matters because technical failure is expensive. A widely cited Gartner report indicates that the average cost of IT downtime is $5,600 per minute. For a PM, this means the operational health of your product is just as vital as its UI.
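To make that figure concrete, here is a minimal sketch of the arithmetic. It assumes the $5,600-per-minute average quoted above applies to your product, which it may not; the function name and the sample outage lengths are illustrative only.

```python
# Assumption: the per-minute figure is the industry average quoted above,
# not a measurement of any particular product.
COST_PER_MINUTE_USD = 5_600


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Return the estimated cost in USD of an outage lasting `minutes`."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


# Illustrative outage lengths (hypothetical, not taken from any incident).
for label, minutes in [("10-minute blip", 10), ("1-hour outage", 60)]:
    print(f"{label}: ${downtime_cost(minutes):,.0f}")
# 10-minute blip: $56,000
# 1-hour outage: $336,000
```

Swapping in your own revenue-per-minute estimate for the default gives a number you can put next to feature work when prioritizing.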
Uptime isn't a side project that happens by accident. Cagan argues that high availability and scalability must be built into the product's DNA from the very first day.
If the platform can't handle a sudden spike in traffic, your brand's reputation can collapse along with your servers. In the book, Cagan recommends that product teams allocate at least 20% of their engineering capacity to what he calls "headroom."
This 20% isn't for new bells and whistles. It's a mandatory investment in infrastructure to ensure you never slam into a technical ceiling that forces a total system shutdown.
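As a rough illustration of what the 20% means in planning terms, the sketch below splits a team's capacity into headroom and feature work. The team size, sprint length, and function name are assumptions made up for the example; only the 20% share comes from the text above.

```python
def split_capacity(total_engineer_days, headroom_share=0.20):
    """Split capacity into (headroom_days, feature_days).

    `headroom_share` defaults to the 20% minimum discussed above.
    """
    if not 0 <= headroom_share <= 1:
        raise ValueError("headroom_share must be between 0 and 1")
    headroom = total_engineer_days * headroom_share
    return headroom, total_engineer_days - headroom


# Hypothetical team: 8 engineers working a 10-day sprint.
headroom_days, feature_days = split_capacity(8 * 10)
print(f"headroom: {headroom_days:.0f} days, features: {feature_days:.0f} days")
# headroom: 16 days, features: 64 days
```

Taking the headroom off the top, before feature planning starts, is what keeps it from being traded away one sprint at a time.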
Decisions shouldn't be made in a vacuum that ignores the people running the servers. Cagan suggests the head of site operations should be a core member of the Product Council to provide a reality check on new initiatives.
When operations leaders are involved early, they can identify if a proposed feature will create a massive database load. This prevents the common "throwing it over the wall" syndrome where PMs ignore infrastructure costs until the site crashes in production.
This collaboration ensures that the product strategy directly supports the business strategy without breaking the pipes. It turns operations into a strategic partner rather than a cleanup crew for poor planning.
Operational stability depends on predictable, rhythmic releases rather than chaotic, last-minute changes. High churn (untested, late-stage shifts in requirements) is the greatest enemy of site reliability.
Cagan advocates for the "train model" of releases, where site operations acts as the conductor. In this model, the software train leaves the station at a set time, and if a feature isn't ready, it simply waits for the next train.
This creates a stable environment where the operations team can maintain high availability without being blindsided by half-baked code. It shifts the focus from "launching at any cost" to "releasing with confidence."
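The train model can be sketched in a few lines. This is a simplified illustration, not Cagan's or any team's actual tooling: the `Feature` type, the `ready` flag, and the feature names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    ready: bool  # hypothetical flag: tested and signed off by site operations


def depart(candidates):
    """Run one scheduled release: ship what is ready, hold the rest.

    Returns (shipped_names, waiting_names). The train never waits for a
    feature; unready features ride the next scheduled release instead.
    """
    shipped = [f.name for f in candidates if f.ready]
    waiting = [f.name for f in candidates if not f.ready]
    return shipped, waiting


# Hypothetical release candidates for this week's train.
candidates = [
    Feature("saved-search", ready=True),
    Feature("new-checkout", ready=False),
    Feature("bulk-upload", ready=True),
]
shipped, waiting = depart(candidates)
print("shipped:", shipped)  # shipped: ['saved-search', 'bulk-upload']
print("waiting:", waiting)  # waiting: ['new-checkout']
```

The key design choice is that the departure time is fixed and the contents are variable, which is the opposite of holding a release open until a favored feature is finished.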
In 1999, eBay faced a crisis that almost collapsed the entire company. The site suffered a massive outage that lasted nearly 22 hours, resulting in millions of dollars in lost fees and a plummeting stock price.
Marty Cagan notes that the company had neglected its infrastructure for too long while chasing new features. They eventually had to rewrite the entire site multiple times to handle the massive scale of their growing community without impacting users.
This serves as a cautionary tale for any PM who thinks they can ignore their operations team. eBay survived because they eventually prioritized "headroom" and architectural integrity, turning a potential disaster into a world-class platform.
Friendster is another example of what happens when operations fail. It was the first social media giant, but its servers couldn't handle the growth, leading to 40-second page load times. Users didn't wait; they migrated to MySpace and Facebook, which offered a more reliable experience.
Invite a site operations leader to every product discovery session. They'll tell you if your "brilliant" new feature will crash the database before you waste a single developer hour on a solution that isn't feasible.
Dedicate at least 20% of your roadmap to infrastructure and scaling. Cagan insists this "tax" is the only way to avoid the dreaded conversation where engineering tells you they have to stop all new development for a total rewrite.
Define success metrics for stability and performance, not just user engagement. If your product has high engagement but only 95% uptime, it is down for more than 18 days a year and you're failing your users. Target "four nines" (99.99%), which allows less than an hour of downtime per year.
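An uptime target is easier to reason about when it is converted into a downtime budget. The sketch below does that conversion for a 365-day year; the function name is an assumption for illustration, and real availability targets often exclude planned maintenance, which this ignores.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent):
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (95.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime -> {minutes:,.1f} min/year ({minutes / 1440:.2f} days)")
```

At 95% the budget works out to roughly 18 days a year, while at 99.99% it is under an hour, which is why the two targets imply very different levels of operational investment.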
Critics of Cagan’s strict 20% rule argue that early-stage startups don't have the luxury of paying taxes on infrastructure they might never use. If a company hasn't found its product-market fit yet, spending 20% of its time on high-scale architecture is arguably a waste of precious capital.
In the pre-revenue phase, speed is often more important than perfect uptime. Some experts also argue that the rise of serverless computing and managed cloud providers like AWS or Google Cloud reduces the need for a dedicated site ops role in smaller organizations.
While these tools automate much of the heavy lifting, someone still needs to own the responsibility. Even with the best cloud tools, a poorly designed query can still take down a service, making the principles of site operations relevant for teams of any size.
Site operations product management is about ensuring your service stays reliable as it grows. Uptime and scalability are non-negotiable features that define your brand's trust with every click. Audit your current roadmap to ensure you're dedicating enough resources to infrastructure today.
DevOps is a cultural philosophy that emphasizes collaboration between developers and operations. Site operations is the specific functional group responsible for the 24/7 health, performance, and scalability of the service. While DevOps is a way of working, Site Ops is the team that ensures the lights stay on and the servers can handle traffic spikes.
Customer experience isn't just about pretty buttons; it's about reliability. If a site is slow or frequently down, users lose trust and leave. Site operations ensures high availability and fast page loads, which are foundational to user satisfaction. As Marty Cagan notes, a product must be valuable, usable, and feasible, and feasibility includes being able to run it reliably.
The 20% 'headroom' rule is a preventive measure against technical debt. By dedicating one-fifth of your resources to infrastructure and scaling, you avoid the 'house of cards' scenario where a system becomes so unstable that all feature development must stop for a total rewrite. This 'tax' ensures the product can grow alongside the user base without collapsing.
In a very early startup, the lead engineer often handles operations. However, as soon as the service gains traction, a dedicated operations lead or site reliability engineer (SRE) should be hired. This ensures that someone is clearly responsible for the service around the clock, allowing the developers to stay focused on the innovation and discovery of new features.