Additional Stress-Testing Goals

Once you have made the initial investment in a stress test, and have the infrastructure and scripts in place to accurately simulate how your production environment will behave after Go-Live, it’s very tempting to take this investment one step further and begin doing some additional value-add or just plain out-of-scope stress testing. In my experience, the most common additional test runs performed outside of the core goal-driven testing include the following:

  • Playing “what if,” such as losing an application server to see the impact that its absence has on load balancing.

  • Verifying that your failover solution works as advertised, and ensuring that specific failover scenarios operate as expected.

  • Ramping up the system to reflect excessive loads; this might include increasing the number of online users, reducing the think time of these online users, increasing the number or intensity of batch processes, and so on.

Certainly other scenarios can be tested. I suggest that these be discussed well before test week if possible, in case any prior preparation or research would prove beneficial in setting up or executing these test runs.

Playing “What If”

Taking a “what if” approach to additional stress testing is a lot of fun. If your company is considering an acquisition, is in the middle of bringing more users into the system, or is simply interested in the impact an MRP run has on the daily production load, these tests are valuable. I believe the key to a good “what if” test is guidance and some level of assistance or preparation from the business units. In other words, you want to understand exactly what might be valuable to script, simulate, or test manually. This implies working with the functional or business teams to understand their business processes. You can then add “layers” of testing by folding in some of the tests discussed in the next few sections. For example, understanding the impact of an MRP run during a typical day might just be the beginning for you. To really gain valuable insight, you might further want to see the system fail over from one database node to another, or from the primary data center site to the DR site, while in the middle of all of this processing activity. Additional hardware-centric or high-availability-specific “what if” scenarios are discussed next.

Verify System Redundancy and Failover Perform as Expected

I have had a lot of personal experience with this type of value-add stress testing. The possibilities are endless, but the following scenarios are typical:

  • Knock out power to a database cluster node, redundant application server, no-single-point-of-failure disk subsystem, or a dual-redundant network switch, to verify that the system remains up and available, or reacts as expected.

  • Simulate a failed disk drive. Speak to your hardware partner for supported ways to perform this test, keeping in mind that unplugging an actively running disk drive, even if it’s redundant or protected via RAID, may not be the best or supported method of testing disk drive failures.

  • Actively fail your system over to the DR site, and then back to the primary site.

  • If supported by your solution stack, hot-replace a “failed” component like a power supply or network card to ensure that the process is both flawless and clearly documented.

Finally, test the impact of losing an entire application server, which reflects how well your logon load balancing scheme works, how long users take to reconnect to the remaining application servers, and other performance and availability metrics.

Ramp Up to Excessive Loads

Ramping up your workload to a point beyond what you “expect” to see in production after Go-Live is a common goal of stress testing after the core goals of Go-Live viability have been met. Many of my customers look at this in terms of month-end, quarter-end, and year-end processing. In each case, additional online users and/or batch processes are added to the test’s distribution mix.

Measuring specific technology metrics like disk queue lengths or average CPU utilization can be useful too. For one of my customers, for example, we determined that the average database CPU load would be between 20% and 40%. Our goal was then to see how many users the system could host before average CPU utilization across the production landscape exceeded 80%. Other tests were crafted, focused specifically on driving the load on the database server, and then the central instance (which ran on a dedicated server by itself), then specific application servers, and finally a pair of dedicated batch servers.
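The 80% question above can be approximated from a handful of ramp-up runs. The following is a minimal sketch, assuming you have recorded the average CPU utilization at several user counts; the function name and the sample figures are hypothetical and are not taken from the customer engagement described here. It interpolates linearly between the two measured points that bracket the threshold, which is only a rough estimate because CPU load rarely scales perfectly linearly with user count.

```python
from typing import List, Optional, Tuple


def users_at_cpu_threshold(
    samples: List[Tuple[int, float]], threshold: float = 80.0
) -> Optional[float]:
    """Estimate the user count at which average CPU reaches `threshold` percent.

    `samples` holds (concurrent_users, average_cpu_percent) pairs, one per
    test run. Returns None when no pair of adjacent runs brackets the
    threshold (for example, when every run stayed below it).
    """
    ordered = sorted(samples)
    for (users_lo, cpu_lo), (users_hi, cpu_hi) in zip(ordered, ordered[1:]):
        if cpu_lo < threshold <= cpu_hi:
            # Linear interpolation between the two bracketing measurements.
            fraction = (threshold - cpu_lo) / (cpu_hi - cpu_lo)
            return users_lo + fraction * (users_hi - users_lo)
    return None


# Hypothetical ramp-up results: (users, average CPU %).
runs = [(500, 30.0), (1000, 50.0), (1500, 70.0), (2000, 90.0)]
estimate = users_at_cpu_threshold(runs)
print(estimate)  # 1750.0 for these illustrative figures
```

An estimate like this is best treated as the starting point for the next test run, which should then be executed at or near that user count to confirm it.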

In another large-scale SAP stress test, I monitored the disk queue lengths observed with different disk subsystem designs. The customer’s legacy FCAL disk subsystem served as the baseline. A new SAN-based disk subsystem located at a different customer site acted as a target of sorts. Finally, after they received their new virtual array, we configured it consistent with different recommendations pulled from SAP Notes and from white papers published by the hardware vendor’s SAP Competency Center. All of this resulted in three different configurations that we finally tested. We settled on the design and configuration that resulted in the lowest average disk queue lengths while still meeting the customer’s high-availability and budget requirements.
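The selection rule described above (lowest average disk queue length among the configurations that still satisfy the high-availability and budget requirements) can be sketched as follows. This is an illustration only: the class, the configuration names, and the queue-length samples are hypothetical placeholders, not the data from the engagement.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class DiskConfig:
    name: str                    # hypothetical label for a tested configuration
    queue_samples: List[float]   # disk queue lengths sampled during the run
    meets_ha: bool               # satisfies the high-availability requirement
    within_budget: bool          # satisfies the budget requirement


def pick_config(configs: List[DiskConfig]) -> Optional[DiskConfig]:
    """Return the eligible configuration with the lowest average queue length."""
    eligible = [c for c in configs if c.meets_ha and c.within_budget]
    if not eligible:
        return None
    return min(eligible, key=lambda c: mean(c.queue_samples))


# Hypothetical results for three tested configurations.
candidates = [
    DiskConfig("config_a", [4.0, 5.0, 6.0], meets_ha=True, within_budget=True),
    DiskConfig("config_b", [2.0, 3.0, 4.0], meets_ha=True, within_budget=True),
    DiskConfig("config_c", [1.0, 1.5, 2.0], meets_ha=True, within_budget=False),
]
best = pick_config(candidates)
print(best.name if best else "no eligible configuration")
```

In this made-up data set, the configuration with the shortest queues is excluded on budget grounds, so the rule falls back to the best of the remaining candidates. That mirrors the point that raw performance is only one of the selection criteria.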

In yet another case, I executed a set of stress test runs that pushed the customer’s pre-production environment almost 180% beyond what would be observed in their typical day-to-day system load. Specifically, their stress test target was to hit 80 concurrent processes. After we achieved this goal, I reduced the think times in the functional scripts and then increased the number of virtual users executing these scripts. By the time I was done, SM66 revealed that nearly all (220 of 240) of their dialog work processes were actually being used (an amazing sight, by the way), and the quantity of work being performed was tremendous by all measurements.
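The figures in this example can be checked with simple arithmetic. The sketch below assumes that the 80-process target approximates the typical day-to-day load, which the text implies but does not state outright; the function names are hypothetical.

```python
def percent_beyond(baseline: float, observed: float) -> float:
    """Percentage by which `observed` exceeds `baseline`."""
    return (observed - baseline) / baseline * 100.0


def utilization(busy: int, total: int) -> float:
    """Share of work processes in use, as a percentage."""
    return busy / total * 100.0


# 220 busy dialog work processes against a target of 80 concurrent processes.
print(percent_beyond(80, 220))          # 175.0, i.e. "almost 180% beyond"
# 220 of the 240 configured dialog work processes were in use.
print(round(utilization(220, 240), 1))  # 91.7
```

Under that assumption, the run pushed the system to 2.75 times its target concurrency while leaving only about 8% of the dialog work processes idle.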

The business value provided in each of these scenarios was the same, in that I proved that each customer’s SAP Solution Stack was not only optimally configured (through iterations of tuning and testing), but also that each system exhibited a certain amount of scalability, or head room. Further, the testing revealed which subsystems or layers of the technology stack enjoyed the head room, and which didn’t. By doing so, I also identified future performance bottlenecks that might one day become problematic. Certainly, as funds became available to perform upgrades over the next few years, each client would be in a position to more judiciously allocate those funds to reduce or eliminate future performance bottlenecks.
