Key risk indicators in large-scale software projects (a brief discussion)
A number of times over the past 13 years I have been asked to review an existing (and usually large-scale) IT project, figure out where it stands, and make recommendations as to how to get it back on track. I have found that a small number of questions will usually get me to the heart of the problem quickly. These questions include:
Who is the chief architect? Fred Brooks famously raised the issue and role of the chief (software) architect over 30 years ago in The Mythical Man-Month, and yet many IT organizations still fail to see the critical need for conceptual integrity in a given software project. On one project I reviewed, I asked this question of the 40 or so people involved and got several different answers, the most common of which was “no one”.
Who is the project manager? This is usually easier to answer — but if it’s the same person as the chief architect, watch out. I know from personal experience that it is nearly impossible to do a good job as both a project manager and a chief architect, for at least two reasons. First, for a software project of any significant size, each job is a full-time job, and someone who tries to do both will do neither well. Second, there are inherent conflicts between the two roles — the architect wants to do things perfectly, and the project manager wants to do things quickly. One person carrying out both roles will inevitably (and privately) lean one way or the other.
Who is the chief quality engineer? Remarkably, many IT organizations will not appoint a quality assurance lead for a given project. I was stunned when I reviewed a multi-hundred-million-dollar project and found that there was no separate, formal QA department and no head of QA. It also explained why this project was more than 100% over schedule and over budget. The chief quality engineer is the gatekeeper — s/he stands between the development group and production/shipping. Without that key role, a large IT project can go into an unstable state and never make it into production.
How do you define “quality assurance”? If the answer is simply or mostly “testing”, then you have big, big problems. Testing, while critical, is only a small part of the full cycle of QA activities, including: defining (and enforcing) standards and guidelines; finding project personnel with appropriate expertise; reusing relevant and tested deliverables; setting up and enforcing appropriate configuration management; setting up and using a defect management system, along with an appropriate change control process; gathering useful metrics; conducting appropriate reviews, inspections, and walkthroughs; and having a formal software release process.
Where is your current baseline specification for the project? Years ago, I was brought in as a (consulting) chief architect for a subcontractor in a global project coordinated by Motorola. The same day I showed up at the subcontractor’s offices, Bob Millar also showed up as the new program director from Motorola for the scheduled weekly meeting with the subcontractor. Almost the first question Millar asked was: “Where’s your current baseline specification?” The people around the conference table looked a bit confused, so Millar repeated: “Where is the current baseline specification for your subsystem in this global project?” No one could answer him, so Millar continued: “If you don’t have a current change-controlled baseline specification for your subsystem, then how do you know when you’re done?” Millar then made it clear that he was going to raise this issue each and every week until that baseline specification was sitting in the middle of that conference table. Within about four weeks, the baseline was there, in binders, on the table.
What metrics are you tracking, and how are you tracking them? Accurately predicting the progress of a large-scale IT project is almost as difficult and error-prone as estimating it in the first place. There’s an old saw in IT project management that “the first 90% of a project takes 90% of the time, and the remaining 10% of the project takes the other 90% of the time.” The most common metric — # of lines of source code — is only vaguely informative at best and wildly misleading at worst (particularly in object-oriented projects, where rearchitecting and refactoring is the norm). But most other metrics used tend to be subjective metrics — “I [the developer] think we’re about 70% done with this module” — and so largely useless. To be useful, a metric has to be: objective (anyone collecting it would arrive at the same value); automated (to help with “objective” and to allow it to be collected at any time); relevant (actually related to the completion of the project); and predictive (able to predict with some degree of accuracy when the project will be completed).
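The four properties above can be made concrete with a small sketch. Assume (this scenario and all names are my illustration, not from the article) that the project’s scope is expressed as a planned set of acceptance tests; then “percent done” is simply the fraction of planned tests that currently pass — objective (anyone running the suite gets the same number) and automated (it falls out of any CI run):

```python
# Hypothetical sketch: an objective, automated progress metric.
# Assumes project scope is tracked as a fixed set of planned acceptance
# tests; progress = fraction of those tests that currently pass.
from dataclasses import dataclass

@dataclass
class TestResult:
    name: str
    passed: bool

def progress(results: list[TestResult], planned_total: int) -> float:
    """Fraction (0.0-1.0) of planned acceptance tests that pass.

    Objective: the value doesn't depend on who collects it.
    Automated: it can be computed from any test run, at any time.
    """
    if planned_total <= 0:
        raise ValueError("planned_total must be positive")
    passing = sum(1 for r in results if r.passed)
    return passing / planned_total

results = [
    TestResult("login", True),
    TestResult("monthly-report", True),
    TestResult("csv-export", False),
]
# 2 of 10 planned tests pass -> 20% done, regardless of who asks.
print(f"{progress(results, planned_total=10):.0%}")
```

Whether such a number is also *relevant* and *predictive* depends on the planned tests actually covering the project’s scope — which loops back to having a change-controlled baseline specification in the first place.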
What is driving the planned completion date, and how is it being managed? A completion date given by fiat from above will almost certainly not be met unless there is a corresponding ruthlessness in cutting features and reducing scope. On the other hand, the lack of any planned completion date almost always results in an open-ended project that never comes to completion. Tying into the QA question above, you need to see if there is a regular, effective, and (when necessary) ruthless change-control process in place, both for prioritizing bug fixes and for controlling or even reducing scope (in the form of deferred or abandoned features).
What is the history of the schedule and budget? Some large-scale software projects get into a pattern that I’ve previously described as “the never-ending story” — a series of schedule slips and budget overruns, typically occurring as the project nears its latest deadline, with the project never really getting any closer to going into production. It may seem impossible to you that a firm could spend years and millions (or even hundreds of millions) of dollars and yet never produce something that can actually be used, but it happens all the time.
What do the engineers in the trenches say? My project reviews almost always include a series of confidential, not-for-attribution interviews with the engineers down in the trenches — and while they may or may not have a global picture of the project, they usually have a pretty good idea of how things are going in their specific area, and they’re usually honest about it when asked. Bad news tends to get filtered as it moves up the food chain. Often the result is that at the highest levels the project appears to be doing just fine — until three weeks before completion, it suddenly slips another 3-6 months.
No big secrets or fancy analysis. These few questions, honestly answered, will quickly flush out most of the trouble areas (if any exist) in a large-scale IT project.
I could not agree more, especially on the need for a chief architect. I am not a professional software developer, and I don’t work in IT, but I am a fairly experienced programmer and I do have some intuitive understanding of how software architecture is done. So when I was allocated to the project team for a sizeable IT project, I foolishly looked forward to having stimulating and interesting discussions with the professionals. It took me about a year (!) to figure out that where I had presumed the existence of at least moderate architectural skills, in fact a gaping void was present. Actually, nobody on the project team dared to claim any architectural skills, or had ever designed a system of that scale. We had wasted an impossible amount of time on pointless discussions before we finally hired a consultant to help us. He now has to design an architecture in two months; it will probably be OK as a technological platform, but there will be a gaping void where the conceptual structure ought to be.
But I would like to add one more serious issue which tends to arise, perhaps because of the very nature of the IT business. That is the tendency of software teams to live in a “data world” where everything is in computer memory, and real things have lost their existence. That is perhaps understandable enough for a web application, and in our modern times even for financial systems, which in the end most often do deal with numbers in computer memory. It gets rather more troublesome when applications have to interact with objects in the real world. Symptomatic of this is that the software developers have only ever seen their own office, the IT manager’s office, and of course all the meeting rooms — but never the work floor, production plant, or laboratories. They have identifiers for things and devices in their systems, but they can’t imagine what the actual objects look like. It is as if one tried to write the system software for a car assembly plant without knowing what a car is. It does happen. And it makes it almost impossible to arrive at a “baseline specification”.
Excellent observations.
I hadn’t thought that much about the ‘controlling the real world’ issue because so many of my projects during the first four years after I graduated from college were just that: software that controlled real-world objects. These included: an oscillation-dampening system using a hybrid analog/digital computer (proof of concept for large-space structures); the Space Shuttle Flight Simulators at NASA/JSC; an HP graphics terminal emulator to drive a large-bed (6′) Houston Instruments plotter; and embedded software for various data acquisition devices. So I just take such things for granted. ..bruce..