1 INTRODUCTION

Rex Black

I’ve always loved the ocean. When I’m close to an ocean – provided it’s not too cold or the weather too inclement – I like to swim or SCUBA dive in it. One of the things about oceans, though, is that oceans have waves and currents. Some are gentle, some a little more sizeable and some are massive. The massive ones can be dangerous if you don’t know what you’re doing, but serious surfers search these massive waves out and have the time of their lives in them. I’ve always envied those surfers, flying down the face of an enormous wave, though I’ve never learned to do it myself.

The software industry is like an ocean: there are always waves of change coming, of various sizes. The big ones can be exciting if you have the skills to catch them, but they can also swamp your career, as lots of software testing professionals who pooh-poohed the Agile wave have learned in a painful fashion.

Another big wave – which oddly enough has been decades in coming – is artificial intelligence (AI). I took a class in AI as a senior at UCLA, and worked on a proof-of-concept project for a professor to use AI in stock trading. This was in 1988. Slow wave, but now it’s finally here, bringing real change to the real world.

So, in this book, you’re going to read about the skills you need to ride this wave as a test professional. As a test professional already, you probably know that one key part of your job is to ask questions and understand risks. So, what are some of the questions, risks and skills that this book will raise and enhance?

One key question is whether we can trust AI, especially given some of the crucial roles it will play (e.g. self-driving cars). Any time you have objects moving in the physical world – beyond just electrons whirring around in silicon and circuitry – you have the possibility of damage, injury or death. Yes, software has long been involved in making potentially dangerous objects move in the real world – think avionics software or implantable medical devices – but AI promises to make encounters with software-driven moving objects a daily, if not hourly, experience for all but those who choose to live as hermits. As software testing professionals, how can we help ensure that society can trust these systems to be more beneficial than risky?

Of course, we would do that by testing the AI systems; but how can we test AI systems, especially since the most common form, machine learning, will change its behaviours in response to our tests? This is a marked difference from traditional software that, under most circumstances, will give the same output for the same inputs over and over again provided the software is not changed. This book gives some ideas on how to attack this challenge.

The point of running a test is, of course, to learn something, to get a result. We always prefer that result to be definitive, to be able to say the test passed or failed, not to say, ‘Well, maybe that worked.’ But what does ‘passing a test’ even mean for an AI? There may not be a clear specification of correct behaviour, or correct behaviour may change over time, or we may not even know what correct behaviour is beyond what the software tells us. At one time, the solution to this kind of problem was to create a parallel system, an approach that was once favoured for certain high-criticality systems, but is no longer in wide use. In this book, you’ll read about some ways to approach this challenge as well.
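One approach often suggested for this oracle problem – offered here purely as an illustrative sketch, not as this book's prescription – is metamorphic testing: rather than asserting a single correct output, we assert a relation that ought to hold between the outputs of related inputs. The price-prediction model and the specific relation below are hypothetical.

```python
# A minimal, hypothetical sketch of a metamorphic test.
# Assumption: predict_price() estimates a house price from its features.
# We may not know the 'correct' price, but we can still assert a relation:
# adding floor space, all else being equal, should not lower the estimate.

def predict_price(floor_area_m2: float, bedrooms: int) -> float:
    # Stand-in for a real trained model; returns a rough linear estimate.
    return 50_000 + 2_500 * floor_area_m2 + 10_000 * bedrooms

def test_more_space_never_cheaper():
    base = predict_price(floor_area_m2=80, bedrooms=3)
    larger = predict_price(floor_area_m2=100, bedrooms=3)
    # The metamorphic relation: the larger house should not be valued lower.
    assert larger >= base, f"Relation violated: {larger} < {base}"

if __name__ == "__main__":
    test_more_space_never_cheaper()
    print("Metamorphic relation held for this pair of inputs.")
```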

This change in what we get in terms of test results means that test metrics will be different, too. For example, functional testing of traditional systems often involves looking at the percentage of tests that pass versus those that fail. For AI systems, the correct questions for functional tests are likely to be, ‘How often does each test give a result that appears correct?’ or ‘How far from the expected result values are the results of each test?’
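As a purely illustrative sketch – the model, test data and tolerance below are all hypothetical – those two metrics might be computed along the following lines.

```python
# A minimal, hypothetical sketch of probabilistic test metrics for an AI system.
# Assumptions: model() returns a numeric prediction, and we have expected values
# plus a tolerance that defines what 'appears correct' means for this system.

def model(x: float) -> float:
    # Stand-in for a real trained model.
    return 2.0 * x + 0.3

test_cases = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # (input, expected)
TOLERANCE = 0.5

results = [model(x) for x, _ in test_cases]
errors = [abs(r - expected) for r, (_, expected) in zip(results, test_cases)]

# Metric 1: how often does a test give a result that appears correct?
apparently_correct_rate = sum(e <= TOLERANCE for e in errors) / len(errors)

# Metric 2: how far, on average, are the results from the expected values?
mean_absolute_error = sum(errors) / len(errors)

print(f"Apparently-correct rate: {apparently_correct_rate:.0%}")
print(f"Mean absolute error:     {mean_absolute_error:.3f}")
```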

THE CHALLENGES OF TESTING AI

To some extent, testing AI systems will be harder because the problem space is harder than many of us are used to dealing with as test professionals.

AI systems are being used to deal with complex, chaotic, messy realities, where the number of possibilities is huge, even compared to existing software, such as software that plays chess (which has somewhere on the order of 10¹²³ possible games) or Go (which has more than 10³⁶⁰ possible games; see Koch, 2016). As both of these numbers are greater than the number of atoms in the universe, it’s not like we are used to solving only trivial problems. However, in thinking about self-driving cars, consider the number of possible driving routes from any location in the United Kingdom (UK) or Germany or the United States to any other location within the same country. Obviously, that’s a much less constrained set of possibilities than moves on a chess board or a Go board, so the number of possible outcomes is much larger.

Not only are the problems harder, but AI systems are different. AI systems change in response to stimuli, unlike other software, which changes only when it is deliberately updated; the change in behaviour is driven by the stimuli themselves rather than predetermined by a developer. So, the testing itself will influence how the system behaves next.

Historically, we have used computers to automate activities that humans are bad at (doing the same thing the exact same way over and over again) or that take too long to do manually (complex maths or accounting). Now, we are trying to use computers to automate activities that have complexities that don’t easily lend themselves to mathematical formulas but which humans learn to do as children.

People and data have biases, and these can become embedded in AI systems. For example, the number of women working in IT is smaller than the number of men, so an AI trained on historical hiring and performance data can learn to favour men when deciding who is more likely to succeed in an IT role. Broad-based use of AI systems resulting in the calcification and reinforcement of such biases is a significant societal risk that must be addressed, and test professionals must be aware of that risk and play their part in managing it.
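As a hypothetical illustration of how a test professional might begin to probe for such a bias, one simple check is to compare the rate of favourable outcomes across groups; the screening model, candidate data and rule of thumb below are offered only as a sketch.

```python
# A minimal, hypothetical sketch of a bias check on a hiring-screen model.
# Assumption: screen() returns True if the candidate is recommended for interview.
# The candidates and model below are invented purely for illustration.

def screen(candidate: dict) -> bool:
    # Stand-in for a real model; here it keys off years of experience only.
    return candidate["years_experience"] >= 5

candidates = [
    {"group": "men",   "years_experience": 6},
    {"group": "men",   "years_experience": 7},
    {"group": "men",   "years_experience": 3},
    {"group": "women", "years_experience": 6},
    {"group": "women", "years_experience": 2},
    {"group": "women", "years_experience": 4},
]

def selection_rate(group: str) -> float:
    members = [c for c in candidates if c["group"] == group]
    return sum(screen(c) for c in members) / len(members)

rate_men, rate_women = selection_rate("men"), selection_rate("women")
ratio = rate_women / rate_men if rate_men else float("nan")

print(f"Selection rate (men):   {rate_men:.0%}")
print(f"Selection rate (women): {rate_women:.0%}")
# A common rule of thumb flags ratios below 0.8 for further investigation.
print(f"Impact ratio: {ratio:.2f} {'(investigate)' if ratio < 0.8 else ''}")
```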

This risk is compounded by the fact that people trust computers too much. That may seem odd to you and me, because, as test professionals, we have learned to be very sceptical of software. We tend to expect it to fail. However, many people don’t have that outlook, but rather assume, ‘Well, if the computer says so, that must be right.’

Another challenge to the test professional arises because the world constantly changes, which means AI systems will change too. Consider the COVID-19 pandemic. When I first heard of a strange respiratory disease in China, I thought, ‘Yeah, I bet this will be like SARS and MERS, something that will be contained and maybe a little freaky but not a huge deal.’ Well, I was completely wrong about that. When testing AI systems, we’ll need to think about not just small, incremental changes, but big, fast, disruptive changes like pandemics, otherwise we risk missing important tests.

Stepping back a bit to consider the objectives of testing, one typical objective is to reduce risks to the quality of the software to an acceptable level. However, traditional quality models don’t adequately capture quality for AI systems. For example, traditional software either gives the correct answer for a given set of inputs or it doesn’t. AI systems may give correct answers sometimes and not others for the same inputs, or may give an answer that is different from the expected but still correct, or may give different results for inputs that are in the same equivalence partitions and thus would (traditionally) be expected to be handled the same way. This means that we will need to re-think our testing techniques.
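As a hypothetical sketch of that last point, a tester might sample several inputs from what would traditionally be a single equivalence partition and check whether the system really does treat them the same way; the loan-screening model and partition below are invented purely for illustration.

```python
# A minimal, hypothetical sketch: inputs that a tester would place in the same
# equivalence partition may not be handled identically by an AI system.
# classify_loan() is a stand-in for a real, trained model.

def classify_loan(income: float, amount: float) -> str:
    # Stand-in: a real model's decision surface is rarely this simple or visible.
    return "approve" if income >= amount * 0.28 else "refer"

# The tester's partition: 'income is at least a quarter of the amount requested',
# which they expect to be handled one way. Traditionally one representative
# would suffice; here we sample several members of the partition instead.
partition_samples = [
    (40_000, 100_000),
    (45_000, 110_000),
    (50_000, 120_000),
    (41_000, 160_000),  # still in the tester's partition (ratio ~0.26)
]

outcomes = {classify_loan(income, amount) for income, amount in partition_samples}

if len(outcomes) == 1:
    print(f"Partition handled consistently: all samples -> {outcomes.pop()}")
else:
    print(f"Partition NOT handled consistently: observed {sorted(outcomes)}")
```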

We will also need to re-think how we measure test coverage. Just based on what has been said so far, you’ve probably guessed that requirements, design and other specification coverage measurements clearly won’t work, and that will be reinforced in a moment when I get to the probabilistic rather than deterministic behaviour of AI systems. Other common dimensions of coverage, such as risk coverage and supported configurations, may be relevant, but do they take the place of specification coverage? Further, since we aren’t using traditional programming, code coverage – always of limited use for measuring completeness of testing in any case – is even less useful, if not utterly useless. This book will help you to understand how to approach this critical problem for testing AI systems.1

Because AI systems often change in response to inputs, it’s important that such changes be desirable. For example, deliberate malice or simply exposure to the wrong data could result in an AI becoming racist, as happened on one occasion with a Twitter bot (Vincent, 2016). Of course, the very idea of an artificial intelligence becoming racist is surreal, almost Kafkaesque, on multiple levels. ‘Not intelligent enough to pass a Turing test but capable of directing hateful comments at others who could pass a Turing test’ is a description that could perhaps apply not just to racist bots but also to some people one might have the misfortune to meet – though that’s a question outside the scope of this book.

How do we test for ethical and unethical behaviour? In situations where grey areas or known ethical conundrums exist, how should the system behave? For example, a runaway street car might be about to plough into a crowd of people, killing more than a dozen, unless a person (or, in this case, an AI) shunts it onto another track where it will hit only one person, but will certainly kill them. If that seems simple, consider a self-driving car in front of which a dozen bicyclists recklessly swerve.2 Should it avoid hitting them by turning onto a sidewalk where a single law-abiding pedestrian will be struck? What constitutes correct behaviour here? What constitutes ethical behaviour? How do we program such behaviour? How do we test for it?

Over recent years, regulators in the UK, EU and the United States (US) have struggled with issues of data privacy. However, when behaviour can vary from one set of inputs to another, how do we test for compliance with regulations, such as those governing data privacy? For example, in the US, access to patient information is regulated under the Health Insurance Portability and Accountability Act (HIPAA). Testers must be able to test for compliance or non-compliance with such laws, but can we be confident that the results of our tests will not change as the AI evolves?

So, what is our role as quality professionals in social issues? Of course, as individuals, we may choose to donate to one cause or another, or participate in demonstrations for or against something or someone, but those are personal choices. With AI systems, we may find our work thrust into the middle of some very thorny matters. For example, the market for home ownership in the United States has some extremely fraught social history revolving around race.3 If you have ever worked as a software tester, software engineer, business analyst or other software professional in banking, you may be aware of the regulations associated with ensuring the banks no longer perpetuate the damage that was done to racial minorities who were systematically disadvantaged in home loans in the United States. Outside that domain, your professional involvement in this area may have been limited. Now, with AI systems, as this book will explain, to the extent that the systems you are working on can influence social outcomes (for good or evil), you may find yourself professionally engaged in evaluating whether those systems are having malign effects, which may be both inadvertent and quite subtle.

As this book will further explain, to the extent that your work testing AI systems has an intersection with social issues, it will be complicated by various biases. ‘I’m not biased’, you might protest, and you might well be right, but are the data that were used to train your AI biased? Is your AI biased through some other means? In what way? How can you test to ascertain whether such biases exist?

In testing these AI systems, the hard-won test design techniques that we have accumulated over the years, especially in the work of pioneers like Glenford Myers and Boris Beizer, such as equivalence partitioning, boundary value analysis, decision tables, state diagrams and combinatorial testing, may lose some of their power, because of what Beizer referred to as the ‘bug assumption’ behind each technique.4 A technique’s bug assumption describes the types of bugs it is particularly powerful at finding, and those are the types of bugs that occur in traditional procedural and object-oriented programming. In AI systems, other types of bugs exist, and some of the bugs that we find in traditional programs are less likely. In this book, you’ll gain insights into new test design techniques, to augment the traditional ones, for testing AI systems.

We are used to software working the same way (at least functionally) every time it is used to solve the problem with the same set of inputs. However, for non-functional behaviours like reliability and performance, we often see probabilistic behaviour: reliability can be expressed as the percentage likelihood of the system failing under a given level of load, and performance as the percentage of responses received within a given time target under that load. For AI systems, functional behaviours can also be probabilistic, in addition to evolving over time. This is another factor that makes it difficult to find reliable test oracles for functional testing of AI systems.
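As a hypothetical sketch, a functional pass criterion for such a system can be phrased statistically, much as a performance criterion might be (‘95 per cent of responses within two seconds’); the model, tolerance and threshold below are invented for illustration.

```python
# A minimal, hypothetical sketch of a statistical pass/fail criterion for an
# AI system whose output varies run to run (e.g. due to stochastic components).
import random

def model(x: float) -> float:
    # Stand-in for a non-deterministic model: the 'right' answer plus noise.
    return 2.0 * x + random.gauss(0, 0.4)

EXPECTED = 10.0       # expected value for input x = 5.0
TOLERANCE = 1.0       # how far from expected still counts as acceptable
RUNS = 200
REQUIRED_RATE = 0.95  # the statistical pass criterion

hits = sum(abs(model(5.0) - EXPECTED) <= TOLERANCE for _ in range(RUNS))
observed_rate = hits / RUNS

verdict = "PASS" if observed_rate >= REQUIRED_RATE else "FAIL"
print(f"{hits}/{RUNS} runs within tolerance ({observed_rate:.1%}) -> {verdict}")
```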

Of course, one of the bright shiny objects in testing has been, for decades, test automation. Tool vendors have made large amounts of money, often by deploying trendy buzzwords and promising easy success and quick return on investment (ROI), but my clients and I have found that test automation, in the long run, is less likely to succeed than open-heart surgery. Over 80 per cent of otherwise-healthy people 70 years or older who have open-heart surgery are still alive five years later (Khan et al., 2000), but less than half of major test automation efforts I’ve seen with my clients are still achieving a positive ROI, using the same strategy and technologies, after five years. In this book, you will learn how AI will affect test automation. Just as importantly, you’ll learn how it won’t affect test automation, and the obstacles that stand in the way of certain AI benefits for test automation in the short term, so that you are less likely to get snowed by a buzzy sales pitch from a tool vendor.

Automated tests can work at multiple levels and through various interfaces. The level of testing and the interface of automation change the challenges associated with applying AI systems to the test automation problem. However, the fundamental challenges associated with test automation at each level and through a particular interface often do not change just because an AI is being applied, though it is – of course – very much in the interests of the boosters of the test automation tools to assert the contrary. This book will help you to define the right criteria for test automation tool evaluation, which is critical in any test automation project. As always, a strong business case and demonstrable ROI are essential for any major endeavour, and test automation – whether done with AI or not – will almost always be a major endeavour. Remember, too, that return on investment must be measured against clearly defined objectives, and those must be the right objectives.

As should be clear by now, all these challenges and differences associated with testing AI systems have implications for skills. For example, suppose you are testing an AI system that helps make high-frequency stock trades. In addition to needing serious domain expertise in terms of financial systems, financial markets and financial regulations – all skills necessary for testing such systems implemented with traditional technologies – you may also need serious data science and mathematical skills, due to the probabilistic nature of the system.

These skills are necessary to deal not only with the more complicated test oracle problem associated with an AI-driven high-frequency trading system (which is a hard enough problem with a traditional implementation), but also with the thorny problem of test data design. In fact, test data design is complicated enough that, while test professionals should understand the process, it often must be done by a professional data scientist. The test professional’s grasp of the test data design issues serves primarily to let them act as a reviewer of the data scientists’ work, checking for the various biases that could affect its validity.

Just because testing of AI systems abounds with new challenges, new risks and new skill requirements doesn’t mean that the old risks and skills are obsolete. For example, when assembling a system of systems – and, in this Internet-of-Things world, just about everything talks to everything else, and thus just about everything is a system of systems – we must ensure that each component system has been properly tested in a manner reflective of the way in which it will be used. Testing of the data flows across the interfaces is a place where the traditional test design skills of equivalence partitioning, boundary value analysis and combinatorial testing will be necessary.
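As a hypothetical sketch of one of those traditional skills applied at an interface, suppose the interface contract for a field specifies a valid range; boundary value analysis still yields a small, sharp set of values to send across that data flow (the field and its range below are invented).

```python
# A minimal, hypothetical sketch of boundary value analysis for a field passed
# across an interface between two component systems.
# Assumption: the interface contract says 'temperature' is an integer in 0..120.

def boundary_values(minimum: int, maximum: int) -> list[int]:
    # Classic three-point boundary values at each end of the valid range.
    return [minimum - 1, minimum, minimum + 1, maximum - 1, maximum, maximum + 1]

def expected_to_be_accepted(value: int, minimum: int, maximum: int) -> bool:
    return minimum <= value <= maximum

MINIMUM, MAXIMUM = 0, 120

for value in boundary_values(MINIMUM, MAXIMUM):
    verdict = "accept" if expected_to_be_accepted(value, MINIMUM, MAXIMUM) else "reject"
    print(f"send temperature={value:>4} across the interface, expect: {verdict}")
```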

There are other things that won’t change, too. It’s long been established that the longer a defect exists, the more it will cost, both in terms of impact on the project, the system and its users and in terms of cost to remove. These costs are not primarily associated with the discrete act of changing a few lines of code – after all, once the defect is found in the code, the effort to change the code doesn’t change that much – but rather with finding the defect to begin with, and mitigating the consequences associated with the failures the defect causes. To address this, smart software engineers for decades have talked about shift left, meaning finding and removing defects as close as possible to the point of introduction.

However, another way to reduce the time from introduction to discovery of defects can include shifting right. Shifting right has to do with making the discovery of defects that do escape into production as quick, painless and cheap as possible. How can AI systems help us do this? Read on to find out.

SUMMARY

So, you are holding in your hand – or on your favourite electronic reading device – a book that will introduce you to a number of new ideas and new challenges related to testing and quality of AI systems. This book can start you on your journey, but it can’t provide easy solutions to all the challenges, because challenges don’t always have easy answers.

But, as I’ve noted above, there are places where the tried-and-true still apply. Beware the person who comes to you saying, ‘This changes everything’, and listen to their words with healthy scepticism. However, when we’re talking about AI, it is hard to imagine a situation where AI doesn’t prove highly disruptive to all aspects of software engineering, including software testing. After all, some inventions really do change everything, or at least change a lot of things. Many people on this planet would not be here if it weren’t for a breakthrough by Fritz Haber, a man almost no one has heard of but whose idea touches about half the people in the world every day, every time they eat food.5

By reading this book, you’ll gain some insights into the AI-driven disruptions that are headed your way – if not already on top of you – and, if I may modify my metaphor mid-sentence, some ideas about how to ride that wave of disruption rather than being swamped by it. I wish you success as a software test professional as you ride the wave into the world AI will change, possibly as much as Haber’s invention did. Let’s hope that, unlike Haber’s invention, we don’t end up using AI – artificial intelligence, non-human intelligence – to create new, more efficient ways of being inhumane to each other. Part of that is up to each of us.

1 For more on possible ways to think about test coverage, see my presentation ‘Dimensions of Test Coverage’ (Black, 2015).

2 This is discussed by Malcolm Gladwell (2021) in an episode of the podcast Revisionist History.

3 For one account of this history, see Richard Rothstein’s The Color of Law (2017).

4 See Boris Beizer’s book Software Testing Techniques (1990) for more on the bug assumptions behind each technique.

5 Fritz Haber invented a process used to create ammonia, which is essential to the manufacture of chemical fertilisers. Sadly, Haber also was central to the invention of chemical warfare, which makes his legacy of change mixed. Tragically, in spite of Haber’s contributions to the German war effort in the First World War, members of his family died in concentration camps during the Second World War because of his Jewish heritage. A quick summary of Haber’s complex life, career and contributions to science is found at: https://en.wikipedia.org/wiki/Fritz_Haber
