About 98 Percent Done: The Future of Load Testing

First of all, I won't pretend I actually know what the future of Load Testing will look like, but I want to describe some of the different ideas I have seen and tried. Some of these I have not seen or heard of from anyone else, certainly not on the internet. So hopefully they will expand your thinking around Load Testing.

What is the Purpose of Load Testing?

Functional testing is designed to exercise the functionality of the system. Frequently people talk about using Selenium or QTP to exercise a particular piece of functionality, often in the UI. The Test Pyramid suggests that these sorts of tests should try to hit below the UI level for various reasons. No matter whether you test an API, a UI, a console app or some other hook below the UI, if your concern is around the Test Pyramid, it is likely you are trying to test the functionality. You are often interested in large sets of behaviors and the way the system responds. When you do Load Testing, in most cases, that breadth matters much less. You are not trying to see if the system will handle all the corner cases, and often Load Tests don't even check anything besides a response code. The raison d'être for a Load Test often has less to do with how the system functions and more to do with how the system reacts to many different inputs occurring in quick succession. Granted, some Load Tests are less about the number of inputs and more about the style of the input (e.g. large files) or other types of constraints (e.g. less RAM). To sum it up in a general statement, Wikipedia describes it this way:

Load testing is the process of putting demand on a system or device and measuring its response. - http://en.wikipedia.org/wiki/Load_testing

However, the majority of Load Tests simply want to understand how many inputs a system can handle given a certain profile of inputs.

All of that sounds rather abstract, but if you go reading Microsoft's very handy guide on types of performance tests, you will see that the underlying purpose of this sort of testing varies. They use terms like load test, performance test and stress test with different meanings. I think all of this data is really useful and valid. I, however, am going to use Load Test (in capitals) to cover any of these purposes. What I want to look at instead is how we can organize our testing to make for a better long-term experience. These ideas could be applied to many of the various purposes of Load Testing, so assuming you understand your Load Testing goals, you can tailor these ideas to your organization.

35-50% of the Internet Traffic

The first idea I think worth exploring is the Netflix model of Load Testing. Effectively, trying to have a production-like QA system is silly for Netflix because that would be like having a second set of the internet for QA. In fact, if you were like Netflix, you would then have to have an insane number of additional systems to generate a load anything like your customers... or you could just have production traffic mirrored between the two systems. Having a second prod-like environment is of course not going to work. So they came up with a radical set of strategies, but I think this sums it up nicely:

We have found that the best defense against major unexpected failures is to fail often. By frequently causing failures, we force our services to be built in a way that is more resilient. - http://techblog.netflix.com/2012/07/chaos-monkey-released-into-wild.html

The best method they found was basically to attack production and see how it would respond. This, however, still leaves the question of how they specifically test that their code will handle new loads. While they do some Load Testing, their biggest defenses are the scaled-down production traffic mirrors they run, the ability to go back to old code and the fact that AWS allows you to spin up new instances all the time. In effect, they turn the problem on its head, but this is really only helpful if you can use AWS to grow quickly and if your traffic is relatively uniform. Also, at Netflix scale, you can hire a lot of engineering talent to build this up.

The Limits of Load Test Systems

When trying to leverage user traffic doesn't work, you have to start looking for other options. Using something like JMeter is interesting. I call out JMeter because that is what Netflix uses above and beyond their production traffic mirror. A tool like JMeter takes traffic from a proxy and feeds it into a script (I guess you could hand code it if you are crazy). Then you edit the script and parameterize it. You run the script over and over again using multiple threads and try to create load. These tools might instrument the systems under test or you might have to do that yourself. In either case, the data gathered is outputted and left for some poor soul to try to understand. After having been in this position several times, let me say that it truly is difficult to understand these results, in particular because you had to use record-playback from a proxy. Just as it is a bad idea to use record-playback in automating your functional tests, I think it is a bad idea to do so with such tools.

In one of my former companies, they had one specialist whose only job was to deal with these proxy-recorded scripts. They are a mess and I'm not going to pretend I know how to fix them. However, I do have a few ideas, and all of them involve friends you have already created: the functional tests.

Functional Tests == Load Tests: With or Without UI

When you have a functional test, you might have a complex setup and tear down. However, the test itself is often fairly simple. There are two major ways you can create functional tests: one is using a UI and the other is to hit just below the UI, perhaps at an API level. So logically you can do two things to create load. One is to scale up your UI tests. I have seen this done and know of others who have tried to do it. It is a fairly big engineering feat to create a Load Test using Selenium, but I know it can be done. Be warned, this can be very expensive, as it requires one OS per handful of threads you want to run, plus overhead for Selenium hub nodes. The other option is to use your existing API tests and create a Load Test on top of them. You might have to simplify the data creation and you might have to remove the validation if those take too long, but this is a very easy method of Load Testing the system. I have personally built several Load Tests around this idea.
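As a minimal sketch of the second option, assume a hypothetical functional test called check_create_order that exercises an API and raises when something is wrong. The names run_load, timed_call and check_create_order, along with the worker and iteration counts, are made up for illustration and do not come from any particular tool. The stub sleeps instead of making a real HTTP call so that the sketch runs anywhere.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def check_create_order():
    """Hypothetical functional test: stand-in for a real API call plus assertions."""
    time.sleep(0.01)  # placeholder for the request/response round trip


def timed_call(test_fn):
    """Run one test and return (elapsed_seconds, passed)."""
    start = time.perf_counter()
    try:
        test_fn()
        passed = True
    except Exception:
        # Any failure (assertion, HTTP error, timeout) counts as a failed call.
        passed = False
    return time.perf_counter() - start, passed


def run_load(test_fn, workers=5, iterations=20):
    """Run the same functional test `iterations` times across `workers` threads."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: timed_call(test_fn), range(iterations)))
    durations = sorted(elapsed for elapsed, _ in results)
    failures = sum(1 for _, ok in results if not ok)
    return {
        "count": len(results),
        "failures": failures,
        "median_s": durations[len(durations) // 2],
        "max_s": durations[-1],
    }


print(run_load(check_create_order))
```

The point of the design is that the load harness knows nothing about the system under test. It only knows how to call a test function many times, so a fix to the functional test is picked up by the Load Test for free.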

Now we have talked for years about how functional tests should be run in a CI environment. If you are writing your Load Tests like you write your functional tests, using the same systems, then why not run your Load Tests nightly? Obviously there are some questions you want to ask up front, like if someone will be alerted because of it or what impact that has on your functional tests. Another question you might consider is what sort of load do you want? If you need to actively watch to make sure the system stays up, a nightly Load Test would not make much sense. On the other hand, what if you did a small load for a short period of time? You could capture the resulting data and plot it on a chart. This way as you gathered more nightly data, you would have a rough understanding of what you expect. Now, you aren't running a Load Test once a sprint with little idea of what changes might cause impact, with specialized scripts that take a lot of effort to maintain. Instead, you have a trend line and will notice changes. Now this won't tell you when you will fall over or some other data points, but it does give you a change detector. Furthermore, when you have to fix your functional test, your fix automatically goes into the Load Test.
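A change detector of this kind can be very simple. The sketch below assumes you store one number per night (here a median response time in milliseconds) and flags a night that is noticeably worse than the historical median. The function name is_regression, the 25% tolerance, the five-night minimum and the sample numbers are all assumptions for illustration, not recommendations.

```python
from statistics import median


def is_regression(history, tonight, tolerance=0.25):
    """Return True when tonight's value exceeds the historical median
    by more than `tolerance` (expressed as a fraction, 0.25 == 25%)."""
    if len(history) < 5:
        return False  # too little data to call anything a trend
    baseline = median(history)
    return tonight > baseline * (1 + tolerance)


# Made-up nightly medians in milliseconds; the historical median is 212.
nightly_median_ms = [212, 205, 220, 208, 215, 210, 218]

print(is_regression(nightly_median_ms, 214))  # False: within normal variation
print(is_regression(nightly_median_ms, 300))  # True: above 212 * 1.25 = 265
```

A CI job could run this after the nightly load and fail the build, or just post a message, when it returns True.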

The next piece is you can run multiple 'threads' at once, not all of which are load related. If your Load Test can't do validation, while running the Load Test you can run some functional tests to see if the system still appears to function. Since this is a job in your CI manager, it should be easy to kick off. You can even manually test while your Load Test runs.

Eventually, you might realize that your functional and load tests only vary in how much load there is and how complex the setup/validation is. You might realize that, like the Load Test, you can add metrics to the functional tests so that you can see if a functional test starts taking longer.
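One way to collect those timings, sketched here under the assumption that your functional tests are plain Python functions, is a small decorator that records how long each test took. The names timed, TIMINGS and test_login_page_loads are hypothetical, and a real setup would write the numbers to whatever store feeds your charts rather than to an in-memory dictionary.

```python
import functools
import time

TIMINGS = {}  # test name -> list of durations in seconds


def timed(test_fn):
    """Record the wall-clock duration of every call to the wrapped test."""
    @functools.wraps(test_fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return test_fn(*args, **kwargs)
        finally:
            # Recorded whether the test passes or raises.
            elapsed = time.perf_counter() - start
            TIMINGS.setdefault(test_fn.__name__, []).append(elapsed)
    return wrapper


@timed
def test_login_page_loads():
    """Hypothetical functional test; the sleep stands in for real work."""
    time.sleep(0.01)
    assert True


test_login_page_loads()
print(TIMINGS)
```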

Load Test? What is that?

If you look at my description, my efforts are to make the code base easier to maintain and to have as much data as possible. This ultimately makes your Load Testing efforts and your functional efforts look very similar. They live in the same code base and call the same functions; effectively, the concepts merge. The only differences are in the design, setup, cleanup and how heavy the validation is. I do think there is value in having different words to describe the intent, but merging the code allows you to get more done with fewer resources. One of the biggest advantages I have personally seen is that I actually understand what the test does, whereas when I was using tools like JMeter, I often had no idea how it worked. I have learned a ton about HTTP and HTTPS just by building my own tools. Now, not everyone has time for that, and I think ultimately we will want tooling to help make this easier. However, I am not sure the costs of having a different tech stack and code base are worth the value the current tools provide, so you might have to make your own for now.

If you have found your Load Testing tools are working well today, then feel free to ignore this. I know not everyone has the same needs as we have. I know that the tools we have today do serve some purposes, but my experience hints that they frequently are as much trouble as the value they add.

"In the Year 2000"

- Conan O'Brien, et al.

In trying to predict the future it is really difficult to say what will or will not happen. Conan O'Brien has been predicting what will happen in the year 2000 for more than 15 years, but unlike him, I have not had the benefit of seeing the future. With that said, I suspect that in the future we will see more machine learning style systems that take our Load Test data and create load profiles based upon real time data. Such a system would be able to adaptively adjust based upon metrics in the systems under test and would also detect what changes happened and what code appears to have caused a particular slowdown. It will correlate this and help find what is causing systems to fail.

I also suspect that we will have a better set of load profiles that can push or break systems. Load Test systems of the future might go through those profiles on a daily or build-basis and inform you when a build/day has uncharacteristic results, again based upon some form of machine learning. It will start looking a lot more like functional tests which you only examine when something strange occurs or when you are auditing your tests.

Obviously all of this takes a fair amount of work and effort to produce. We presently don't have the tooling to do this, and while bits and pieces have been worked on, I have heard of no one actually doing this.

What sorts of things would you like to see in future load testing systems? What areas have you struggled with? I have found very little experiential data around load testing. I'd love to hear from others on this topic, even if you just dump a link to your blog post.


Thursday, April 9, 2015
