Steve: Developing on the Edge - Operations, Development and the dreaded 3am phone call
Steve: Developing on the Edge
Thoughts on development, Web-services, technology and mountains.
Wed, 9 Aug 2006
Operations, Development and the dreaded 3am phone call

This interesting operations versus development debate is still going on, with various people putting their comments in.

Dare Obasanjo, a PM over at MSN/Windows Live, weighs in first. He likes an ops team because they understand how to spread server load for the best power budget, how to plan ahead for capacity changes, etc., and implies that saying "developers do operations" is like saying "with TDD, you don't need testers". Hmm. The nice thing about TDD is that it makes everyone in the dev team think about testing, leaving only the architects and the WS-* specifications as the sources of untestable content. With good deployment integration you get stuff you can deploy.

But the really interesting post is a trackback from Amazon Land, by Justin Rudd, who lives the dream :).

He makes some excellent points:

  1. The people who say "merge dev and ops" usually don't get the 3am pages, because it's never a PowerPoint outage at that time in the morning.
  2. At 0300, you do whatever the shortest-term fix is, then at 0900 you have the next crisis to deal with, so you rarely get back to a good one.
  3. It's a barrier to recruitment of senior people: welcome to Amazon, here is your pager.

Yeah, I kind of concur. I am glad I ain't on call no more, put it that way. But the hard part is managing to avoid being called, even when you aren't the operations team. And that's where the "throw it over to operations" tactic fails.

Justin makes a good point: working operations on a live system is a very good experience, something everyone should have. We live it on some of the big HPL projects, like the SE3D render farm. But when we do live services as a research lab, we don't give 7x24 support. That doesn't mean we write bad code (I've switched my public deployapi endpoint off for a fortnight after 8 weeks of no-reboot liveness), but that we route the phone calls to voicemail until after our third espresso of the day.

So, to reiterate: I don't think you want to get rid of operations and leave the dev team to lay down CAT5 cabling or field the problems when the FBI impounds every box on your colo site. But I also think that unless you have a process for involving whoever will be supporting your app in the early design phase, you will be creating a lot of pain for the ops/support team. If you are your own support team, you get to share that pain, and maybe come up with ways to avoid it.

The other thing is that dev and ops need to move up a notch with configuring and deploying systems and apps. You want to be left alone? Use tools that let you declare that the policy on a DNS failure to the database is to try restarting the app server with an exponential backoff and mild jitter in the delay between restarts. Don't try doing it at home over an SSH link with only three hours' sleep.
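To make that concrete, here is roughly what such a declared policy could look like if you sketched it in Python. This is only an illustration, not any particular deployment tool's API: the names `RestartPolicy` and `restart_with_backoff`, the default numbers, and the `restart`/`is_healthy` callbacks are all assumptions made up for the example.

```python
import random
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RestartPolicy:
    """Hypothetical declarative restart policy; all defaults are illustrative."""
    base_delay: float = 1.0    # seconds to wait after the first failed restart
    factor: float = 2.0        # exponential growth per failed attempt
    max_delay: float = 300.0   # cap on the un-jittered delay
    jitter: float = 0.1        # "mild" jitter: +/- 10% of the delay
    max_attempts: int = 8      # give up (and page a human) after this many

    def delay(self, attempt, rng=random):
        """Delay after failed attempt number `attempt` (0-based)."""
        raw = min(self.max_delay, self.base_delay * self.factor ** attempt)
        spread = raw * self.jitter
        return raw + rng.uniform(-spread, spread)


def restart_with_backoff(restart, is_healthy, policy, sleep=time.sleep, rng=random):
    """Restart until healthy; return attempts used, or None if all failed."""
    for attempt in range(policy.max_attempts):
        restart()
        if is_healthy():
            return attempt + 1
        # No point waiting after the final attempt: escalate instead.
        if attempt + 1 < policy.max_attempts:
            sleep(policy.delay(attempt, rng))
    return None
```

You would hand it something like `restart_with_backoff(restart_app_server, database_resolves, RestartPolicy())`, where both callbacks are yours to supply, and only wake somebody up when it returns `None`. The jitter is there so that a rack full of app servers hit by the same DNS failure doesn't retry in lockstep.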

Finally, back to the article that started all this: Bill de hOra comes in with a good comment. In SaaS, maintenance *is* development. You don't have an old branch (maintenance) and a new branch (possible future revenue). You have yesterday's, and today's. And when do the two switch over? At 3 am, that's when.

P.S. Notice that the 3:00 phone call is the recurrent enemy of anyone who looks after systems. There's something about the time that means it comes up in every horror story. Maybe it's because for west coast teams it's 0600 in Boston, and the first customers in the US are suddenly complaining. Maybe it's just the worst time to be woken up. Whatever, it is the support-call-nightmare-hour.
