Stop the Line and Defect Prevention

Of all the activities that your team does, which do you think is the most important? Writing code for new features? Fixing bugs found in testing? Fixing bugs found in production? Speeding up your tests?

Sadly, test maintenance does not come near the top of most software teams’ list of priorities. If the elevators are broken in your office building, you can be sure that someone will be on the phone to the facilities team straightaway. When your tests are slow or brittle, the problem is invisible to everyone but the programmers and testers who rely on them. If you do test maintenance at all, you generally do it when things have gotten so bad that you can’t stand it any longer, or you simply can’t get a release out because the tests are so broken. There just always seems to be something more important to do.

Team members who think in this way about their tests have got it all wrong. The automated tests are the heartbeat of the team that relies on them, and they need meticulous care and attention to keep them healthy.

Stop the Line at Toyota

In Toyota’s manufacturing plants, every shop-floor worker has the authority and the responsibility to stop an entire production line whenever a problem arises. The problem is then given immediate and focused attention by experienced staff, and the line is restarted only once the problem has been resolved. Once the line has restarted, a team is tasked with performing a root-cause analysis on the problem to understand why it happened so that the source of the problem can be resolved.

When Taiichi Ohno[35] first introduced this idea, his managers thought he was crazy. At the time, it was taken for granted within the manufacturing industry that the most important thing you could do was keep your assembly lines running, day and night if necessary.

When Ohno first told his managers to implement this new system, some of them listened and some of them didn’t. At first, the managers who had implemented the policy saw their productivity drop. Stopping to deal with each problem immediately was slowing them down, and when they compared their output numbers with the managers who had ignored their boss, it looked like the boss had got it wrong.

Gradually, however, the managers who stopped their lines to deal with every problem started to see those lines stopping less frequently. Because each problem was being dealt with using defect prevention, they had been investing in continuously improving the quality of the machines and processes that ran the line. That investment started to pay off, and their lines started to get faster and faster. Soon their output was much greater than that of the lines controlled by the managers who had ignored their apparently crazy boss. Those managers’ production lines were still clunking along at the same old rate, suffering the same old problems.

Defect Prevention

Toyota’s counterintuitive but hugely successful policy of stopping the line works because it’s part of a wider process, known as defect prevention, that focuses on continuously improving the manufacturing system. Without this wider process, stopping the line would have very little effect on its own. There are four steps to this process:

  1. Detect the abnormality.

  2. Stop what you’re doing.

  3. Fix or correct the immediate problem.

  4. Investigate the root cause and install a countermeasure.

This fourth step is crucial because it seizes the opportunity offered by the problem at hand to understand something more fundamental about your process. It also means that fixing things becomes a habit, rather than something you put off to do someday later when you’re not in such a hurry.

For example, suppose the build has broken with a failing test. It turns out that the guy who pushed the commit with the failing test didn’t run all the tests before he pushed. Why not? Well, it turns out that he thinks the tests take too long to run, so he ran just the ones he thought covered the change he made and then crossed his fingers and pushed his commit anyway. So, the underlying cause is that the tests are slow. Now that we understand the root cause, we can work to fix it.

Some teams keep a log of build failures, recording the root cause each time. When they have sufficient evidence that a particular root cause is worth tackling, they can put some concentrated effort into tackling it properly.
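A build-failure log like this does not need special tooling. The following is a minimal sketch in Python, not a prescribed format: the `BuildFailure` record, the `causes_worth_tackling` helper, the threshold of three occurrences, and every sample entry are assumptions made up for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BuildFailure:
    """One entry in a team's build-failure log (hypothetical structure)."""
    when: date
    symptom: str
    root_cause: str


def causes_worth_tackling(log, threshold=3):
    """Return (root_cause, count) pairs seen at least `threshold` times,
    most frequent first. The threshold is an arbitrary illustrative choice."""
    counts = Counter(entry.root_cause for entry in log)
    return [(cause, n) for cause, n in counts.most_common() if n >= threshold]


# Invented sample entries; a real log would be filled in after each broken build.
log = [
    BuildFailure(date(2024, 3, 4), "failing test", "slow tests skipped before push"),
    BuildFailure(date(2024, 3, 6), "flickering scenario", "timing-dependent step"),
    BuildFailure(date(2024, 3, 11), "failing test", "slow tests skipped before push"),
    BuildFailure(date(2024, 3, 15), "failing test", "stale test data"),
    BuildFailure(date(2024, 3, 19), "failing test", "slow tests skipped before push"),
]

for cause, count in causes_worth_tackling(log):
    print(f"{count} failures: {cause}")
```

With these sample entries, only the slow-test cause reaches the threshold, which is the kind of evidence a team might use to justify spending concentrated effort on speeding up the test run.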

Imagine your team as a production line, cranking out valuable features for your users. If you spot a problem that’s slowing the production line down, stop the line and fix the problem for good. Implementing stop the line means you’ve decided to make a fast, high-quality, reliable test run your whole team’s top priority, second only to dealing with production issues that are affecting your customers. When there’s a problem with the tests—whether that’s an urgent problem like a failing test or a nagging annoyance like a flickering scenario—put your best people on it, and fix it forever.
