In October my team released a new version of our system, which represents several months of extremely hard work and is the culmination of everything we've learned so far. (It's also the reason I haven't posted for a long time.) We were all very proud of what we'd created, and confident that it was going to solve all of our (and our users') problems.
But only two weeks after going live, we started having a string of major problems - almost every day something horrible happened. It was almost unbelievable, and I'm sure one day we'll look back and laugh, but right now it's an absolute nightmare.
After yet another disastrous week, I spent a weekend thinking about "quality" in software development, and trying to figure out where we went wrong. As I have done before, I made a pilgrimage to Joel Spolsky's site to get some perspective. I read and re-read The Joel Test - a simple list of 12 yes/no questions that works like a scorecard for software development teams. Joel says, "A score of 12 is perfect, 11 is tolerable, but 10 or lower and you've got serious problems."
We scored 3.
It was depressing, sure, but I was already so depressed that I was actually just happy to have a list of things I could do to try and correct the situation. Also, it was nice to know what we were already doing well.
The first thing that jumped out at me was #5 - "Do you fix bugs before writing new code?" I think this is really a fundamental part of where we went wrong. If you have bugs, that means you're not perfect (nobody is!), which means any new code you write will have bugs, too. So by not stopping dead and fixing the bugs you already have, the number of bugs can only increase. Obvious when you think about it like that, but easy to forget when there is a lot of pressure to implement new features. I can only imagine what it must feel like to know that there are no bugs (that your best efforts could find) in the production code. I'm sure it's a lot better than knowing that there are!
So the first action item: Stop all new development. Fix all the bugs we know of, then go & find some more and fix those too.
The next big one was #7 - "Do you have a spec?" To tell the truth, we have almost zero documentation. As Joel says, "Writing specs is like flossing: everybody agrees that it's a good thing, but nobody does it." I would expand that to cover all documentation. So far our team has been operating under a sort of modified XP credo, where anything that even faintly smelled of an old-school waterfall methodology was implicitly rejected. But the reason #7 stood out to me was that in a couple of incidents, the users asked us what the system does in a particular case, and we didn't know either. That's bad. If we had a spec, they wouldn't even have had to ask, and anyway there wouldn't have been a problem, because the system would have done what they wanted (instead of what it was doing!).
The worst thing I found when examining our methods was that we had all but abandoned our sacred Law One anyway. So it wasn't that our philosophy was wrong, it was that we had forgotten it, to our peril. But how do you know when you have enough testing? Code coverage tools are fantastic, if a little fiddly to set up, but you really need a human brain to imagine scenarios in which the code you're looking at might fail.
So the second action item: Build documentation and test coverage reviews into the process that puts code into production.
Speaking of reviews, our code review process was extremely weak. Basically it consisted of just me having a look at the code changes as I packaged them for release. But even then, with so much (perceived) pressure on time, I must admit there are huge swaths of code that have never been seen by anyone other than the developer who wrote them. Uncool.
So the third action item: Build a 2-level code review into the process that puts code into production.
The combination of documentation review, test review, and code review I've called the "critical review", meaning both that the reviewer should try to be as critical as possible, and that it's of critical importance. It's difficult to be critical of the work done by your friend, especially if they are relatively senior to you, but the important thing to remember is that every problem you find is saving them from the crippling shame and panic that grips you when code you wrote has caused a production problem. So I've told everyone to try as hard as they can to find a problem when it's their turn to review.
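To make the "critical review" concrete, here is a minimal Python sketch of a release gate that refuses a change until all three reviews are recorded. It is only an illustration: the `Change` class, its field names, and the `blockers` function are hypothetical, and it assumes that "2-level" code review means two reviewers other than the author, which the post does not spell out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Change:
    """One code change waiting to go to production (hypothetical model)."""
    author: str
    docs_reviewed_by: Optional[str] = None
    tests_reviewed_by: Optional[str] = None
    code_reviewed_by: List[str] = field(default_factory=list)


def blockers(change: Change) -> List[str]:
    """Return the reasons this change may not be released yet (empty = go)."""
    problems = []
    # A review by the author does not count as a review.
    if change.docs_reviewed_by in (None, change.author):
        problems.append("documentation review missing")
    if change.tests_reviewed_by in (None, change.author):
        problems.append("test coverage review missing")
    # Assumption: "2-level" means two distinct reviewers besides the author.
    independent = set(change.code_reviewed_by) - {change.author}
    if len(independent) < 2:
        problems.append("code review needs 2 reviewers other than the author")
    return problems


change = Change(author="ann", docs_reviewed_by="bob", tests_reviewed_by="bob",
                code_reviewed_by=["ann", "bob"])
print(blockers(change))  # the author's own review is ignored, so one is missing
change.code_reviewed_by.append("cho")
print(blockers(change))  # empty list: the change may be released
```

Returning the list of reasons, instead of a bare yes/no, means the developer can see exactly which review is still outstanding.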
The final test that made me shudder was #6 - "Do you have an up-to-date schedule?" I've always hated giving estimates to the users. It's never much more than a wild guess, and yet they really hold you to it. You can't get away with saying, "it'll be done when it's done!" either.
I read Joel's article about Evidence-Based Scheduling (EBS), which quite frankly struck me as a shameless plug for his bug-tracking product, FogBugz, since it's virtually impossible to implement with anything else. I like his previous idea much better, mostly because of its simplicity, but also because it lends itself easily to constructing a burn-down chart to catch potential delays early.
Basically, the big message for me was that we have to generate a new estimate every day. The moment before we start working on something is the moment we know the least about the work involved. Every day we work on it we know a little bit more, so it makes sense to re-invest that knowledge into the quality of the estimate - and if this is done for every in-progress task, the reliability of the whole schedule increases every day too. If we give the users a new schedule every day or week, they don't get any big shocks from huge and sudden delays, and every little slip can be explained in detail.
A key requirement for this is that the estimates are provided by the developer who will do the work. Again this is because nobody knows the work better, but it also prevents unrealistic schedules from being handed down from above.
Another important rule is that estimates and work logged must include all the time doing non-coding stuff - code reviews, talking to people, helping other developers, setting up environments, testing, documentation, picking your nose - everything. In essence the estimate and work logged for a task must be the total time spent in the office not working on a different task. That way everything adds up to a number of calendar days, which in the end is the only thing that matters to the schedule.
So the fourth action item: Estimate everything, and update the estimates every day.
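As a rough illustration of these rules (not the actual JIRA setup), here is a minimal Python sketch of daily re-estimation. The `Task` class, its fields, and `projected_finish` are hypothetical names, and the sketch assumes a single developer working through the tasks one after another on a Monday-to-Friday week, with all estimates in days.

```python
import math
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Task:
    name: str
    original_estimate: float  # days, guessed before any work started
    logged: float = 0.0       # days spent so far, non-coding time included
    remaining: float = field(init=False, default=0.0)  # latest re-estimate

    def __post_init__(self):
        # Before work starts, the remaining work is the original estimate.
        self.remaining = self.original_estimate

    def daily_update(self, worked: float, remaining: float) -> None:
        """Log today's time and replace the remaining estimate with a fresh one."""
        self.logged += worked
        self.remaining = remaining

    @property
    def slip(self) -> float:
        """Days the task is now expected to run past its original estimate."""
        return self.logged + self.remaining - self.original_estimate


def projected_finish(tasks, today: date) -> date:
    """Last working day needed, counting from the next day and skipping weekends."""
    days_left = math.ceil(sum(t.remaining for t in tasks))
    day = today
    while days_left > 0:
        day += timedelta(days=1)
        if day.weekday() < 5:  # Monday=0 .. Friday=4
            days_left -= 1
    return day


tasks = [Task("import", 3.0), Task("report", 2.0)]
# After one day on "import", the developer now thinks 3.5 more days are needed.
tasks[0].daily_update(worked=1.0, remaining=3.5)
print(tasks[0].slip)                              # 1.5
print(projected_finish(tasks, date(2024, 1, 1)))  # 2024-01-09
```

Because each update records both the time logged and a fresh remaining estimate, every slip shows up the day it is discovered, which is the point of handing the users a new schedule so often.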
The actual implementation of all this is in our issue tracking system (JIRA). I've added a few custom fields and worked out a process flow for each issue/task, so it gets estimated and re-estimated, reviewed, and scheduled. I'm sure that if we can stick to it, our quality problems will disappear - the trick is now to make sure the extra burden it places on us is more than repaid by a reduction in production panic.
At least it will get us a score of 6...