Chapter 2. Leveraging AI for Test Automation

Did you know that autonomous and intelligent agents, commonly referred to as bots, are already running tests on major applications today? That’s right: leveraging AI for software testing is not a thing of the future; AI for test automation is already here. The bots are in the building, and they’re not testing just one app but many apps in various application domains.1 In fact, don’t be surprised if you find out that AI is already testing your own app! Chances are that, if you publish your app in one of the major app stores, AI bots are already testing it. In this chapter, I’ll walk you through how AI tests software and demystify how this technology really works when testing applications at different levels or for various quality attributes.

AI for UI Testing

Just because certain tasks have historically required human effort does not mean that we won’t be able to automate them someday. Once upon a time, we believed that tasks such as voice and image recognition, driving, and musical composition were too difficult for computers to simulate. However, many of these tasks are now being automated using the power of AI and machine learning. In some cases, AI is outperforming human experts in tasks such as medical diagnosis, legal document analysis, and aerial combat tactics, among others. With that in mind, it really shouldn’t surprise you that we’re leveraging AI for functional testing tasks that previously relied on the expertise of human testers. Figure 2-1 illustrates how to train AI bots to perceive, explore, model, test, and learn software functionality. It is important to note that even though learning through feedback is explicitly called out at the end, the bots leverage machine learning at each stage of the process.

Figure 2-1. Training AI to do functional UI testing

Perceive

A foundational step in functional UI test automation is having the ability to interact with the application’s screens, controls, labels, and other widgets. Recall that traditional automation frameworks use the application’s DOM for locating UI elements and that these DOM-based location strategies are highly sensitive to implementation changes. Leveraging AI for identifying UI elements can help to overcome these drawbacks. Just like humans, AI bots recognize what appears on an application screen independently of how it is implemented. In fact, there need not be a DOM at all, as the interaction can be based solely on image recognition. AI, and more specifically a branch of AI known as computer vision, gives us the test automation superpower of being able to perceive anything with a screen. Furthermore, since we train the bots on hundreds and thousands of images, UI design changes do not result in excessive test script maintenance. In many cases, AI-based test automation requires zero maintenance after visual updates and redesigns.
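
To make this concrete, here is a minimal sketch of locating a UI element from pixels alone, using OpenCV template matching instead of the DOM. The screenshot and icon file names are hypothetical, and a production bot would use a detector trained on many labeled screenshots rather than a single template image.

    # A minimal, image-only element locator: no DOM, just pixels.
    import cv2

    def locate_element(screenshot_path, template_path, threshold=0.8):
        """Return the (x, y) center of the best template match, or None if confidence is low."""
        screen = cv2.imread(screenshot_path, cv2.IMREAD_GRAYSCALE)
        template = cv2.imread(template_path, cv2.IMREAD_GRAYSCALE)
        scores = cv2.matchTemplate(screen, template, cv2.TM_CCOEFF_NORMED)
        _, best_score, _, best_top_left = cv2.minMaxLoc(scores)
        if best_score < threshold:
            return None  # the element does not appear on this screen
        height, width = template.shape
        return (best_top_left[0] + width // 2, best_top_left[1] + height // 2)  # click target

    print(locate_element("home_screenshot.png", "cart_icon.png"))  # hypothetical images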

Explore and Model

Testers frequently explore the application’s functionality to discover its behavior, confirm specific facts, and look for bugs. While exploring, they create mental models and refer back to these models to deal with uncertainty when the application or its environment changes. Similarly, AI bots explore and build models of the application under test. You give them a goal and they attempt to reach it by trial and error, a process known as reinforcement learning. An easy way to understand how reinforcement learning works is to think about how you train a pet dog. To start, you decide on the task you want the dog to accomplish—let’s say “stay.” You also need a bag of treats. If by chance you say “stay” and the dog sits in place, you give it a treat. However, if the dog continues to move around, you keep the treat. You could even take it a step further by emphasizing “bad dog” and showing your discontent.

Figure 2-2 provides an illustrative example of how to leverage goal-based reinforcement learning for exploring, modeling, and testing a software application. To start, all the bot needs is the location of the application’s home screen and for us to give it an objective. Let’s task the bot with navigating to the shopping cart. In our initial state (a), the bot is on the HOME screen and has the goal of reaching the checkered flag on the CART screen. Don’t let this visualization deceive you; for now, the bot only knows about the HOME screen and as a result has only this single state in its application model. Recall from the previous subsection that it can recognize the screen and its various widgets. Another prerequisite I introduce here is that the bots must be able to stimulate the application via input actions such as keystrokes, taps, clicks, and swipes. Before the bot takes any input actions, I give it an initial score of zero. Scoring represents the reward system, a bag of treats, so to speak, which can be positive or negative.

Now the bot takes its first action. It randomly clicks a link and transitions to the PRODUCT screen (b). The bot has now seen two screens, HOME and PRODUCT, and updates its application model with this information. My response to the bot’s actions is that this isn’t really what I want—I am really looking for the CART—so I deduct 1 from the bot’s score. From PRODUCT, the bot takes another random action, and this time it lands on the CART (c). Excellent! This is exactly where I want the bot to be, so I reward the bot with 100 points. Following this path, the bot ends up with a score of 0 – 1 + 100 = 99 points and a complete model of the application. Let’s call this exploration scenario Episode 1.

Figure 2-2. Illustrative example of using bots to explore an app using reinforcement learning

Consider a second exploration scenario, where the bot once again starts on the HOME screen (a). However, instead of navigating to PRODUCT, the bot takes an action that takes it directly to the CART (c). Applying the scoring system, the bot earns 0 + 100 = 100 points for Episode 2. The bot essentially finds a path with a higher reward and, moving forward, will follow that path to accomplish its task. In short, the bot learns by combining trial and error with past experiences rather than taking a brute force approach, which as you may recall is computationally infeasible. Goal-based reinforcement learning is practically applicable to a variety of testing tasks, making it an extremely powerful technique for test automation.
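
Here is a toy version of that exploration loop, written as tabular Q-learning. The screens, actions, and scoring mirror the illustration above (a penalty of 1 for a step that misses the goal, 100 points for reaching the CART); the hyperparameters and action names are illustrative assumptions, not any particular tool's implementation.

    # Toy tabular Q-learning over the HOME/PRODUCT/CART example.
    import random
    from collections import defaultdict

    TRANSITIONS = {  # state -> {action: next state}; a real bot discovers these by exploring
        "HOME": {"click_product_link": "PRODUCT", "click_cart_icon": "CART"},
        "PRODUCT": {"click_cart_icon": "CART", "click_home_link": "HOME"},
    }
    GOAL, ALPHA, GAMMA, EPSILON = "CART", 0.5, 0.9, 0.2
    Q = defaultdict(float)  # (state, action) -> learned value of taking that action in that state

    def run_episode():
        state, score = "HOME", 0
        while state != GOAL:
            actions = list(TRANSITIONS[state])
            if random.random() < EPSILON:                      # explore: try something random
                action = random.choice(actions)
            else:                                              # exploit: use past experience
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state = TRANSITIONS[state][action]
            reward = 100 if next_state == GOAL else -1
            best_next = max((Q[(next_state, a)] for a in TRANSITIONS.get(next_state, {})), default=0.0)
            Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
            state, score = next_state, score + reward
        return score

    for episode in range(1, 6):
        print(f"Episode {episode}: score {run_episode()}")  # e.g., 99 at first, 100 once the shortcut is learned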

Test

Now that the bots can perceive, explore, and model the application, these capabilities come together for the greater good of software testing. The bots are trained how to generate specific types of input data and can recognize what expected and unexpected behaviors look like in given contexts. Note that it may be easier for bots to identify some types of issues than others. For example, an HTTP 404 error is an obvious indication that the application has thrown an error. However, it is significantly harder to know that someone’s pay stub is incorrect because their taxes weren’t calculated appropriately. Nonetheless, several researchers and practitioners are applying AI/ML research to automatic functional UI test generation. This work ranges from generating inputs for individual fields to conducting complete end-to-end test cases, including oracles.2 Although we have only scratched the surface in this area, AI-based test generation approaches are slowly narrowing the test automation gap.
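
The gap between easy and hard oracles can be made concrete with two tiny checks: a generic one that flags HTTP error statuses, and a domain-specific one that recomputes a pay stub's tax withholding. The flat tax rate and the pay-stub fields are hypothetical simplifications.

    # Two oracles of very different difficulty.
    def generic_http_oracle(status_code):
        """Easy: any 4xx/5xx status is an obvious failure a bot can flag on its own."""
        return status_code < 400

    def pay_stub_tax_oracle(gross_pay, tax_withheld, tax_rate=0.20):
        """Hard: correctness depends on domain rules the bot must be taught or must learn."""
        expected = round(gross_pay * tax_rate, 2)
        return abs(tax_withheld - expected) < 0.01

    assert generic_http_oracle(200) and not generic_http_oracle(404)
    assert pay_stub_tax_oracle(1000.00, 200.00) and not pay_stub_tax_oracle(1000.00, 150.00)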

Learn

One of the most notable characteristics of AI- and ML-driven applications is the ability of the system to improve how it performs a given task based on feedback. Having humans provide direct feedback on the bots’ actions—for example, recognizing UI elements, generating inputs, or detecting bugs—makes the system better. Feedback mechanisms allow humans to reinforce or rewrite the AI brain that drives the bots’ behavior. As a result, the more feedback your teams provide to the bots on the quality of their testing, the better the bots become at testing the product, and ultimately the more value they provide to the testing team. Typically, feedback is incorporated into the product UI itself. However, depending on the level of testing, feedback may come via mechanisms for updating datasets more directly.
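
One simple way to picture the feedback loop is a log of human corrections that feeds the next training run. The file name and record schema below are illustrative assumptions rather than any product's actual mechanism.

    # Capture human feedback on the bots' element classifications and replay it at training time.
    import json
    from pathlib import Path

    FEEDBACK_FILE = Path("element_feedback.jsonl")  # hypothetical feedback store

    def record_feedback(screenshot, predicted_label, correct_label):
        """Store the human's verdict; matching labels reinforce, mismatches correct."""
        record = {"screenshot": screenshot, "predicted": predicted_label, "actual": correct_label}
        with FEEDBACK_FILE.open("a") as f:
            f.write(json.dumps(record) + "\n")

    def load_training_examples():
        """Yield (screenshot, label) pairs for retraining, always preferring the human's label."""
        with FEEDBACK_FILE.open() as f:
            for line in f:
                record = json.loads(line)
                yield record["screenshot"], record["actual"]

    record_feedback("home.png", "cart_icon", "cart_icon")    # a tester confirms a prediction
    record_feedback("home.png", "text_field", "search_box")  # a tester corrects a prediction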

AI for Service/API Testing

Automated service or API testing validates the way systems communicate via sequences of requests and responses. For example, a typical communication exchange between two services, A and B, could be as follows (a short scripted version of the exchange appears after the list):

  1. Service A sends an HTTP GET request to retrieve some data from Service B.

  2. Service B processes the request and returns an HTTP status of 200, indicating success along with a response body containing the requested data.
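
Here is that short scripted version, written against the requests library; the endpoint URL and the response fields are hypothetical.

    # The two-step exchange above, expressed as an automated check.
    import requests

    def test_service_b_returns_items():
        response = requests.get("https://service-b.example.com/api/items")  # step 1: Service A's GET
        assert response.status_code == 200                                  # step 2: Service B signals success
        assert isinstance(response.json().get("items"), list)               # ...and returns the requested data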

A common approach to automated API testing with AI is to record the service traffic from manual testing and then use that information to train an ML-based test generation system. One of the key goals for this type of testing is to verify that the communication sequences do not result in unhandled errors. This approach stems from the once popular record-and-playback feature found in early functional test automation tools, so you’ll see this pattern in other areas such as performance testing. However, although there are several available open source and commercial API test automation tools on the market,3 few of them offer these new “record, learn, generate, and play” capabilities. Since automated tests, like those for APIs, map naturally to interleaving sequences of inputs and expected outputs, another AI-based approach that applies to API testing involves using long short-term memory (LSTM) networks for generating test sequences.4 Under this approach, you would train the network on string sequences of abstract test cases that contain example service requests and their respective responses. Figure 2-3 depicts the workflow for developing and validating such a test flow generator using neural networks.

Figure 2-3. Automatic test case generation using neural network models

These are the major steps of the workflow in the context of API testing:

  1. Model the API testing flow as a sequence problem.

  2. Develop an abstract test language to support the API test flow model.

  3. Create a test set to validate the adequacy of the language in describing API tests.

  4. Curate and/or gather example handcrafted API test flows.

  5. Train a neural network to generate valid test flow sentences that belong to the grammar of the abstract test language (a minimal sketch of this step follows the list).
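
Here is the minimal sketch promised in step 5: it trains a small LSTM on toy request/response token sequences and asks it to propose the next step. The vocabulary, the training sequences, and the model size are illustrative assumptions, not a production setup.

    # Train a small LSTM to suggest the next step in an abstract API test sequence.
    import numpy as np
    import tensorflow as tf

    VOCAB = ["<pad>", "GET /items", "200 OK", "POST /cart", "201 Created", "GET /cart"]
    TOKEN = {t: i for i, t in enumerate(VOCAB)}
    MAX_LEN = 5

    # Tiny stand-in for recorded "request, response, request, ..." traffic.
    sequences = [
        ["GET /items", "200 OK", "POST /cart", "201 Created", "GET /cart", "200 OK"],
        ["GET /items", "200 OK", "GET /cart", "200 OK"],
    ]

    # Build (prefix -> next token) training pairs, left-padded to a fixed length.
    X, y = [], []
    for seq in sequences:
        ids = [TOKEN[t] for t in seq]
        for i in range(1, len(ids)):
            prefix = ids[max(0, i - MAX_LEN):i]
            X.append([0] * (MAX_LEN - len(prefix)) + prefix)
            y.append(ids[i])
    X, y = np.array(X), np.array(y)

    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(len(VOCAB), 16),
        tf.keras.layers.LSTM(32),
        tf.keras.layers.Dense(len(VOCAB), activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    model.fit(X, y, epochs=100, verbose=0)

    # What usually follows "GET /items" -> "200 OK"?
    prefix = [0] * (MAX_LEN - 2) + [TOKEN["GET /items"], TOKEN["200 OK"]]
    probabilities = model.predict(np.array([prefix]), verbose=0)[0]
    print("Suggested next step:", VOCAB[int(np.argmax(probabilities))])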

With an abstract test case generator for your APIs in place, you can now develop an engine that transforms those abstract, platform-independent tests into platform-specific tests for execution using any given communication protocol or technology.
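
As a simple illustration of that last step, the sketch below walks an abstract, platform-independent test and replays it as concrete HTTP calls with the requests library. The abstract step grammar, the base URL, and the expected statuses are hypothetical.

    # Translate abstract test steps into concrete HTTP calls and verify the responses.
    import requests

    BASE_URL = "https://api.example.com"  # hypothetical target service

    ABSTRACT_TEST = [
        {"method": "GET", "resource": "/items", "expect_status": 200},
        {"method": "POST", "resource": "/cart", "body": {"item_id": 42}, "expect_status": 201},
    ]

    def execute(abstract_test):
        for step in abstract_test:
            response = requests.request(step["method"], BASE_URL + step["resource"], json=step.get("body"))
            assert response.status_code == step["expect_status"], (
                f"{step['method']} {step['resource']} returned {response.status_code}"
            )

    execute(ABSTRACT_TEST)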

AI for Unit Testing

Researchers and practitioners are training AI to automatically write unit tests in high-level programming languages like Java.5, 6 Much of the work in this area builds on advances in AI for generating text and natural language. Initiatives like OpenAI’s GPT-3 combine natural language processing (NLP) and deep learning models to generate text that is difficult to distinguish from human-written text.7 Here are two popular features of AI-based unit-testing tools and frameworks:

  • Automatic source code analysis to generate unit tests that reflect program behavior and help to reduce gaps in coverage.

  • Integration with version control systems to monitor source code changes and keep tests up-to-date.

Of the three automation levels—UI, service, and unit—the unit level appears to be getting the least amount of attention. In fact, the attention level seems to be the inverse of what you would expect from a community following automation best practices like the testing pyramid.8 For example, after taking a quick survey of product offerings in the space, I found only one commercial product that uses AI for unit test generation. This pales in comparison to more than 10 for functional UI test automation and 5 for API testing. However, AI technology to support improvements to automated unit testing is definitely available, and I hope to see more progress in this area in the near future.
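
The tools cited in this section target Java; purely for consistency with the chapter's other sketches, here is a Python illustration of the idea: the kind of branch-covering test a generator might emit for a simple function, plus a quick check that the generated test actually passes before it joins the suite. The function and the generated test are both hypothetical.

    # Illustrative production code under test.
    def apply_discount(price, percent):
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        return round(price * (1 - percent / 100), 2)

    # The kind of unit test an AI generator might emit after analyzing the function's branches.
    def test_apply_discount_generated():
        assert apply_discount(100.0, 25) == 75.0      # happy path
        assert apply_discount(100.0, 0) == 100.0      # boundary: no discount
        try:
            apply_discount(100.0, 150)                # error branch
        except ValueError:
            pass
        else:
            raise AssertionError("expected ValueError for percent > 100")

    # A simple acceptance gate: only keep generated tests that pass against the current code.
    test_apply_discount_generated()
    print("generated test passes; keep it in the suite")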

AI for Performance Testing

Tooling for performance test automation has been relatively stable for many years. Several organizations still follow legacy load-testing practices that have a steep learning curve and involve a lot of manual steps. AI and ML are propelling us into a future where you can rapidly gather and correlate performance test metrics and even generate complete end-to-end performance tests that normally would require human experts. In this section, I summarize two promising directions in the use of AI for performance testing.

Application Performance Benchmarking

With an AI-driven framework for functional UI automation in place, you can extend the bots’ capabilities to track key application performance metrics. While the bots are exploring and testing, they collect data on the number of steps taken, load times, CPU utilization, and more. With AI, not only can you see how your app performs in key scenarios, but you can also compare its performance to similar applications from your competitors once the app is available, for example, in an app store. This is possible because AI-driven tests are highly reusable across different apps within the same domain.

Figure 2-4 provides a sample report that compares the test results of a retail application with those of other applications in the domain. The way this works is that the bots run a set of goal-based tests on key use case scenarios for each app within a given category. The bots compute a percentile rank for each performance result so that you can compare the results for your app with all other apps in your category.

Figure 2-4. Sample application benchmarking test report for a major app in the retail category
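
A percentile rank like the ones in such a report is straightforward to compute once the bots have gathered the same metric for every app in the category. The sketch below uses made-up load times; lower is better, so the rank counts the apps this one performs at least as well as.

    # Percentile rank of one app's metric against its app-store category.
    def percentile_rank(category_values, app_value, lower_is_better=True):
        """Share of apps in the category that this app performs at least as well as."""
        if lower_is_better:
            beaten_or_tied = sum(1 for v in category_values if app_value <= v)
        else:
            beaten_or_tied = sum(1 for v in category_values if app_value >= v)
        return 100.0 * beaten_or_tied / len(category_values)

    # Hypothetical login-to-cart load times (seconds) for apps in a retail category.
    category_load_times = [1.2, 1.8, 2.1, 2.4, 2.9, 3.3, 3.8, 4.5, 5.0, 6.2]
    print(f"Load-time percentile rank: {percentile_rank(category_load_times, 2.1):.0f}")  # 80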

Toward End-to-End Performance Testing with AI

Although practically useful, application performance benchmarking is not the same as end-to-end performance testing. Recall from “Nonfunctional Test Automation” that system-level performance testing emulates production traffic using hundreds and thousands of virtual concurrent users conducting real-world transactions. In a presentation at the STAREAST 2020 testing conference, performance testing expert Kaushal Dalvi shared his team’s experience building a tool to generate end-to-end performance tests.9 Figure 2-5 depicts a vision of their ambitious goal of developing a self-service system that automatically produces LoadRunner scripts complete with parameterization, correlation, and logic.

Figure 2-5. A vision of automated end-to-end performance testing using ML10

The internal tool eliminates the need for manual rescripting, and now there is ongoing work that uses ML to drive smart rules for parameterization and correlation. Several application performance-monitoring vendors are also touting features that include AI-based performance, scalability, and resiliency testing.
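
Correlation is the part of that rescripting work that is easiest to show in miniature: find a value the server generated in an earlier response and parameterize the later requests that reuse it. The recorded traffic and token name below are hypothetical, and the rule is handwritten; the ML work described above aims to learn such rules automatically.

    # Hand-rolled correlation: capture a dynamic value, then parameterize later requests.
    import re

    recorded_response_body = '{"status": "ok", "session_token": "abc123xyz"}'
    recorded_next_request = "GET /cart?session_token=abc123xyz HTTP/1.1"

    match = re.search(r'"session_token":\s*"([^"]+)"', recorded_response_body)
    captured_token = match.group(1)  # the server-issued value that changes every run

    parameterized = recorded_next_request.replace(captured_token, "{{session_token}}")
    print(parameterized)  # GET /cart?session_token={{session_token}} HTTP/1.1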

AI for Design Testing

A quick internet search on the topics of manual and automated testing is likely to return some blog posts, articles, and presentations describing why test automation will never replace manual testing. One of the frequently used points to support this argument is that it is not possible to automate things that require human judgment. Such qualitative assessments rely on people’s experiences, perceptions, opinions, or feelings. UI design quality attributes like usability, accessibility, and trustworthiness all fall under this category. However, advances in ML are demonstrating that it is possible for machines to simulate human judgment for specific tasks, including UI design testing.

AI for Mobile Design Testing

Google and Apple publish guidelines to help developers ensure that Android and iOS mobile apps are well designed and deliver a consistent user experience. Here is an example guideline:

When Possible, Present Choices

Make data entry as efficient as possible. Consider using a picker or table instead of a text field, for example, because it’s easier to choose from a list of predefined options than to type a response.

You’ve probably already noticed that, although these guidelines are written with good intentions, parts of them are vague and open to interpretation. For example, how efficient is “as efficient as possible”? Is there a rule to know when you can stop? What about widgets other than text fields? While this may seem like nitpicking, the subjective nature of the guidance and the resulting designs is what makes this problem so difficult. Furthermore, even when the guidelines are clear and precise, there are so many variants to check that doing so in an application-specific way is extremely tedious.

AI is a great way to catch these issues because you can train the bots to examine the screen just like a designer, customer, or reviewer. They don’t look at code or have app-specific checks but instead check all the visual elements on the screen against an AI trained on example guideline violations labeled by humans. They find these issues almost instantly and in a repeatable manner that avoids the error of human memory and interpretation. With AI enabling the automatic validation of UI design guidelines, there really is little reason for humans to look for the issues that machines can now identify.
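
Conceptually, the check boils down to running each screenshot through a model trained on human-labeled guideline violations. The sketch below assumes such a model already exists on disk; the file name, input size, and label set are illustrative, and the point is simply that the check looks only at pixels, never at the app's code.

    # Screen a screenshot against a pretrained (hypothetical) guideline-violation classifier.
    import numpy as np
    import tensorflow as tf
    from PIL import Image

    LABELS = ["ok", "free_text_where_picker_expected", "low_contrast_text"]  # illustrative labels
    model = tf.keras.models.load_model("design_guideline_classifier.h5")     # hypothetical model file

    def check_screen(screenshot_path):
        image = Image.open(screenshot_path).convert("RGB").resize((224, 224))
        batch = np.asarray(image, dtype=np.float32)[np.newaxis] / 255.0
        probabilities = model.predict(batch, verbose=0)[0]
        return LABELS[int(np.argmax(probabilities))], float(np.max(probabilities))

    print(check_screen("checkout_screen.png"))  # e.g., ('free_text_where_picker_expected', 0.87)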

AI for Web Accessibility Testing

In an effort to promote universal access to web technologies, the World Wide Web Consortium (W3C) developed a set of Web Content Accessibility Guidelines (WCAG). These guidelines provide criteria to make software accessible to people with physical, visual, cognitive, and learning disabilities. Not only do web development companies have a moral obligation to construct web applications that provide universal access, but in most countries they have a legal obligation. Although several tools support static WCAG web page analysis, present tools fall short in evaluating an entire application for accessibility. Furthermore, current test automation techniques are capable of discovering only about 30% of WCAG Level A and Level AA conformance issues.11

AI is proving to be an effective way to extend the capabilities of current accessibility-testing tools. By combining an AI-driven testing platform with open source tools, you can train the bots to explore a website and evaluate its WCAG compliance. As the bots explore the site, they conduct static accessibility checks using the open source tools and generate dynamic tests that mimic users with disabilities. An interesting project, code-named Agent A11y,12 that employs this approach appears in the 2019 proceedings of the Pacific Northwest Software Quality Conference. A notable feature of Agent A11y is that, due to the large set of WCAG checks the bots perform, the authors even use ML to correlate and coalesce the accessibility test results. Talk about turning a problem on itself!
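
A stripped-down version of that idea fits in a few lines: crawl same-site pages and run a static WCAG check on each one. The sketch below checks only that images carry alternative text (WCAG 1.1.1) and uses a hypothetical start URL; a real system layers on full rule sets such as axe-core plus dynamic tests that mimic assistive-technology users.

    # Crawl a site and flag images without alt text on each page the bot can reach.
    from urllib.parse import urljoin, urlparse
    import requests
    from bs4 import BeautifulSoup

    def crawl_and_check(start_url, max_pages=20):
        site, to_visit, seen, findings = urlparse(start_url).netloc, [start_url], set(), []
        while to_visit and len(seen) < max_pages:
            url = to_visit.pop()
            if url in seen:
                continue
            seen.add(url)
            soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
            for img in soup.find_all("img"):
                if not (img.get("alt") or "").strip():
                    findings.append((url, "WCAG 1.1.1: image is missing alt text"))
            for link in soup.find_all("a", href=True):
                target = urljoin(url, link["href"])
                if urlparse(target).netloc == site:
                    to_visit.append(target)
        return findings

    for page, issue in crawl_and_check("https://www.example.com"):  # hypothetical site
        print(page, "->", issue)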

AI for UI Trustworthiness Testing

While coteaching a testing workshop, I had the opportunity to engage the participants in playing an intriguing game of human testers versus AI bots. If you think about it, AI can beat humans in games like chess, Jeopardy, and Go, so why not software testing? Let’s take a look at this game of testing and how it played out.

Envision 70 testers in a classroom. However, these testers are no ordinary testers. These are professional, technical testers whose companies have chosen to send them for a week of training at an international testing conference held in the United States. These testers are confident enough to brave a full day of learning about AI and ML algorithms. This room is full of great testers. Their opponent is a neural network—AI bots trained on data related to the questions that are about to come.

Now let’s go a step further and pretend that you are one of those testers and see if AI can beat you at your own game. We ask you the following qualitative testing question: If you were looking at an application’s login screen, how would you know if you could trust it or not? In other words, solely by looking at the user interface, could you rate an app’s trustworthiness? Take a moment to think about it and then look at some of the example mobile login screens in Figure 2-6.

Figure 2-6. Rating the trustworthiness of an application based on its UI design

In Figure 2-6, the screens on the left are data samples of some of the least trusted apps, while those on the right are some of the most trusted. Any thoughts? If it’s any consolation, the other 69 testers in the room are taking quite some time to think about it too. There are no quick answers. A woman in the front row exclaims, “Foreign languages!” She explains that if the primary region of the app store is the US, but the app is written in a foreign language, she wouldn’t trust it because she wouldn’t understand what it was saying. Not a bad start, but we’ve already spent three minutes with 70 human minds thinking about this problem in parallel. Now granted, it is not typical that you would engage 70 testers on this one problem, and there may be several biases at play here. However, even if only one highly skilled tester produced the same result, it may not be a good use of their time.

A couple more minutes go by, and then a hand goes up. A gentleman suggests that if there is a recognizable brand or logo on the screen, he would probably trust the app more than apps without these features. So now, 70 people have spent five minutes of their time, and we have two ideas for how to measure UI trustworthiness. That’s progress, but the room quickly becomes quiet again. This is the point where the law of diminishing returns sets in. There are no more ideas past the 10-minute mark, and in fact there are no more ideas until that group has to move on.

But how did the AI bots perform in the challenge? Prior to the class, back at my company’s headquarters, ML engineers trained a neural network using trustworthiness data from real users. The data was the result of asking individuals to rate the trustworthiness of a large set of login screens on a scale of 1 to 10. Once trained, the bots had the ability to simulate the human raters while working on previously unseen samples. As part of the experiment, the engineers inspected the neural network to understand the AI’s answer to the same question I asked the human testers. Here is the AI’s explanation:

Foreign language

If the screen has foreign words/characters, it’s less trustworthy.

Brand recognition

If the screen has a popular brand image, it’s more trustworthy.

Number of elements

If the screen has a high number of elements, it’s less trustworthy.

Interestingly, for this task, the AI appears to be “smarter” than any one person. The bots produced three UI design factors that relate to UI trustworthiness, whereas no single person came up with more than one factor. Furthermore, not only did the AI discover an additional aspect of the application UI that correlates to trustworthiness, but it also gave a precise score, on a scale of 1 to 10, of how trustworthy it judged each screen to be.
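
To show how the bots arrive at a number rather than a hunch, here is a toy model that learns a 1-to-10 trustworthiness score from the three factors above. The handful of feature vectors, the ratings, and the model choice are illustrative assumptions, not the actual experiment's data or architecture.

    # Learn a 1-10 trustworthiness score from simple UI features, then rate an unseen screen.
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    # Features per login screen: [has_foreign_text, has_known_brand_logo, element_count]
    X = np.array([
        [0, 1, 12], [0, 1, 9], [0, 0, 15], [1, 0, 30],
        [1, 0, 42], [0, 0, 35], [1, 1, 20], [0, 1, 18],
    ])
    y = np.array([9.0, 8.5, 6.5, 2.5, 1.5, 4.0, 6.0, 7.5])  # human ratings on a 1-10 scale

    model = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000, random_state=0)
    model.fit(X, y)

    # Unseen screen: no foreign text, a known brand logo, 14 elements on screen.
    print(f"Predicted trustworthiness: {model.predict([[0, 1, 14]])[0]:.1f} / 10")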

With machine learning, the bots truly learn to simulate the judgment of the humans that provide training data for a given task. In this testing experiment, the bots directly reflect how real users view trustworthiness, while as human testers we have to pretend and emulate that empathy indirectly. How much better is it to have the oracle be real-world users versus testers trying to reverse engineer and guess what the end user will think or feel?

Conclusion

AI-driven test automation is causing quite a stir in the software community due to its applicability to multiple levels and dimensions of software testing. AI is testing user interfaces, services, and lower-level components and evaluating the functionality, performance, design, accessibility, and trustworthiness of applications. With all of the activity and buzz around AI for software testing, it feels like the beginning of a new era of test automation. AI is giving testing some much-needed superpowers to help tackle challenges like automatic test generation.

1 Test.ai, “Case Study App Store Provider,” 2020.

2 Dionny Santiago, “A Model-Based AI-Driven Test Generation System” (master’s thesis, Florida International University, September 9, 2018).

3 Joe Colantonio, “Top API Testing Tools for 2020,” Test Guild, May 16, 2017.

4 Dionny Santiago, “A Model-Based AI-Driven Test Generation System.”

5 Laurence Saes, “Unit Test Generation Using Machine Learning” (master’s thesis, Universiteit van Amsterdam, August 18, 2018).

6 Diffblue

7 “GPT-3 Powers the Next Generation of Apps,” OpenAI, March 25, 2021.

8 Mike Cohn, “The Forgotten Layer of the Test Automation Pyramid.”

9 Kaushal Dalvi, “End to End Performance Testing—Automated!” (paper presented at the STAREAST 2020 Conference, Orlando, Florida, May 2020).

10 Kaushal Dalvi, “End to End Performance Testing—Automated!”

11 Aleksander Bai, Heidi Mork, and Viktoria Stray, “A Cost-Benefit Analysis of Accessibility Testing in Agile Software Development Results from a Multiple Case Study,” International Journal on Advances in Software 10, nos. 1–2 (2017): 96–107.

12 Keith Briggs et al., “Semi-Autonomous, Site-Wide A11Y Testing Using an Intelligent Agent,” PNSQC Proceedings, 2019.
