This interesting operations versus development debate is still
going on, with various people putting their comments in.
Dare Obasanjo, PM over at MSN/Windows Live
weighs in first. He likes an ops team because they understand
how to spread server load for the best power budget, how to
plan ahead for capacity changes, etc., and implies that saying
"developers do operations" is like saying "with TDD, you don't
need testers". Hmm. The nice thing about TDD is that it makes
everyone in the dev team think about testing, leaving only the
architects and the WS-* specifications as the sources of untestable
content. With good deployment integration you get stuff you can
deploy.
But the really interesting post is a trackback from Amazon Land,
by Justin Rudd, who
lives the dream :).
He makes some excellent points:
- The people who say "merge dev and ops" usually don't get the 3am
pages, because it's never a PowerPoint outage at that time in the
morning.
- At 0300, you do whatever the most short term solution is, then
at 0900 you have the next crisis to deal with, so you rarely get
back to a good one.
- It's a barrier to recruitment of senior people -- welcome to
Amazon, here is your pager.
Yeah, I kind of concur. I am glad I ain't on call no more, put it
that way. But the hard part is managing to avoid being called, even
when you aren't the operations team. And that's where the "throw it
over to operations" tactic fails.
Justin makes a good point: working operations on a live system
is a very good experience, something everyone should have. We live
it on some of the big HPL projects, like the SE3D render farm. But when we
do live services as a research lab, we don't give 7x24 support.
That doesn't mean we write bad code (I've switched my public
deployapi endpoint off for a fortnight after 8 weeks of no-reboot
liveness), but that we route the phone calls to voicemail till after
our third espresso of the day.
So, to reiterate: I don't think you want to get rid of
operations and leave the dev team to lay down CAT5 cabling or field
the problems when the FBI impounds every box on your colo site. But
I also think that unless you have a process for involving whoever
will be supporting your app in the early design phase, you will
be creating a lot of pain for the ops/support team. If you are your
own support team, you get to share that pain, and maybe come up
with ways to avoid it.
The other thing is that dev and ops need to move up a notch with
configuring and deploying systems and apps. You want to be left
alone? Use tools that let you declare that the policy on a DNS
failure to the database is to try restarting the app server with an
exponential backoff and mild jitter in the delay between restarts.
Don't try doing it at home over an SSH link with only three hours'
sleep.
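To make that concrete, here is a rough sketch of what "declare the policy" could look like. It is not taken from any particular deployment tool: `RestartPolicy`, `apply_policy`, and the `restart` and `is_healthy` callbacks are all hypothetical names made up for illustration, and the default numbers are arbitrary. The point is only that the policy is data, and something other than a sleepy human runs it.

```python
import random
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class RestartPolicy:
    """Hypothetical declarative restart policy; all defaults are arbitrary."""
    base_delay: float = 1.0   # seconds to wait after the first failed restart
    factor: float = 2.0       # exponential growth factor between attempts
    max_delay: float = 60.0   # cap on the un-jittered delay
    jitter: float = 0.1       # "mild" jitter: +/- 10% of each delay
    max_restarts: int = 5     # give up (and page someone) after this many

    def delays(self, rng=random):
        """Yield one delay per restart attempt: exponential, capped, jittered."""
        for attempt in range(self.max_restarts):
            delay = min(self.max_delay, self.base_delay * self.factor ** attempt)
            spread = delay * self.jitter
            yield delay + rng.uniform(-spread, spread)


def apply_policy(policy, restart, is_healthy, sleep=time.sleep):
    """Restart until the health check passes or the policy is exhausted.

    `restart` and `is_healthy` are caller-supplied callables (assumed here,
    e.g. "restart the app server" and "can we resolve and reach the DB?").
    Returns True on recovery, False if every attempt failed.
    """
    for delay in policy.delays():
        restart()
        if is_healthy():
            return True
        sleep(delay)  # back off before the next restart
    return False
```

Usage would be something like `apply_policy(RestartPolicy(), restart_app_server, database_reachable)`, where both callables are yours to supply. The jitter is there so that a rack of app servers which all lost DNS at the same moment don't all come back and hit the database in lockstep.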
Finally, back to the article that started all this, Bill de hOra
comes in with
a good comment: in SaaS, maintenance *is* development. You
don't have an old branch (maintenance) and a new branch (possible
future revenue). You have yesterday's, and today's. And when do the
two switch over? At 3 am, that's when.
P.S. Notice that the 3:00 phone call is the recurrent enemy of
anyone who looks after systems. There's something about the time
that means it comes up in every horror story. Maybe it's because for
west coast teams it's 0600 in Boston, and the first customers in the
US are suddenly complaining. Maybe it's just the worst time to be
woken up. Whatever, it is the support-call-nightmare-hour.