The analysis presented in this post is mostly suited to analysing the performance of web pages and services (especially web services), but it can also give a hint on how to approach performance report analysis for other application types.
If you have configured your performance tests well, you will end up with a rich report containing many useful graphs and numerical data. This post explains how to work with such a report. It also shows what data a report should contain for good application analysis, so you can use it to decide which extra widgets to add to your own reports.
In the last post I described how to create decent performance tests that result in a pretty report. Let's use this Loadosophia report as an example of a good performance test report and describe the parts I find most useful in everyday work.
(In the screenshots below I am using the old Loadosophia report layout, as it has all of the graphs on one tab, which makes the legend below easier to follow. All graphs described here are also present in the new layout and can be found easily.)
Although the whole report is filled with interesting and important statistical data worth analysing, there are five things that should be analysed first (numbered in the picture above).
The summary, and why it is not enough for a reliable SLA
The test summary information (1) is the data you normally end up with after running tests with typical test tools, and it should only be the start of a well configured report. This information is mainly useful for comparing results with previous runs. It gives an overview of the system, but it doesn't explain why the values are what they are, so it is not sufficient for creating an SLA. Why? Such a document should be a contract stating what conditions we guarantee to the client and with what probability.
For example, an SLA should state that we ensure 99% of requests will finish in at most 100 ms. Saying that the average response time will be 100 ms says nothing, because it can mean that 50% of requests take 190 ms and 50% take 10 ms. If the client then sets a timeout of 110 ms, they will have problems in 50% of cases, which is usually unacceptable and against the SLA we agreed to.
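To illustrate the point, here is a minimal Python sketch on a made-up bimodal sample (the 50/50 split of 10 ms and 190 ms responses is invented for the example):

```python
import numpy as np

# Hypothetical bimodal sample: half of the requests take 10 ms, half take 190 ms.
response_times_ms = np.array([10] * 500 + [190] * 500)

print("average:", response_times_ms.mean())                      # 100.0 ms - looks fine
print("99th percentile:", np.percentile(response_times_ms, 99))  # 190.0 ms - breaks a 110 ms timeout
```

The average looks perfectly healthy, while the percentile immediately shows that a 110 ms timeout would fail for half of the traffic.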
Response time diagnosis
To create a meaningful SLA, or to check application performance in terms of response time, we need to see how those values behaved over the course of the test and what their distribution looks like.
The first big input for creating an SLA is the Overall Response Time Distribution & Quantiles section (2). This graph can literally fill in the "X% of requests will finish in Y ms" part of such a document.
For example, based on our test, Google could say that 90% of requests will finish within 400 ms. Or that 87% of requests will finish within 300 ms. Or even that 92% of requests will finish within 500 ms, but 50% will finish in less than 250 ms.
How exactly to phrase this in the document depends on the client profile. This data lets you adjust the SLA to the client's needs, but it also gives you important information about your application's performance.
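If you want these numbers outside of the report, they are easy to recompute from the raw results. A minimal sketch, assuming the JMeter results were saved as a CSV JTL file with the default "elapsed" column ("results.jtl" is a placeholder path):

```python
import pandas as pd

# Load a JMeter results (JTL) file saved in CSV format; "elapsed" is the response time in ms.
results = pd.read_csv("results.jtl")

# Quantiles that can go straight into the "X% will finish in Y ms" section of an SLA.
for q in (0.50, 0.87, 0.90, 0.92, 0.99):
    print(f"{q:.0%} of requests finished within {results['elapsed'].quantile(q):.0f} ms")
```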
In this case one could ask why 7.5% of requests are so far from the general distribution. Most requests finish in 200-400 ms here, but the other group suddenly takes 500-600 ms. Such a big deviation can mean application, hardware or even network problems. It is not rare that diagnosing such an issue improves general performance, not just the performance of those 7.5% of requests.
I'm not saying that you must fight for the performance of those 7.5% of requests; maybe even 600 ms is perfectly acceptable for your application. However, you should always be able to answer why those 7.5% deviate in response time.
Additional input for response time analysis comes from the Response Times Distribution Over Time graph (3). It also shows when (in test time) the response time deviations happened. In our example we can see that the long response times appeared at the start of the test and then disappeared. That is an important hint: whatever caused those response times either went away over time or only appears occasionally.
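You can see the same thing without the graph by bucketing the raw samples by time. A sketch under the same assumptions as above (CSV JTL with "timeStamp" in epoch milliseconds and "elapsed" in ms):

```python
import pandas as pd

results = pd.read_csv("results.jtl")
results["time"] = pd.to_datetime(results["timeStamp"], unit="ms")

# Median and 95th percentile response time per 10-second window:
# a slow period at the start of the test stands out immediately.
windows = results.set_index("time")["elapsed"].resample("10s")
summary = pd.DataFrame({
    "median_ms": windows.median(),
    "p95_ms": windows.quantile(0.95),
})
print(summary)
```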
Response codes to the rescue
The Response Codes Over Time graph (4) is a simple but powerful tool for diagnosing web applications. First, it tells you whether your application is working correctly (response code 200). If error pages appear, it tells you when. It should also be compared with transactions per second (TPS), which can be really helpful.
For example, we might find that errors (say, code 500) appear once the test exceeds 15 TPS and overloads our system (for example through backend timeouts).
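That correlation is also easy to check from the raw data. A sketch under the same CSV JTL assumptions ("timeStamp" in epoch ms, one "responseCode" per sample):

```python
import pandas as pd

results = pd.read_csv("results.jtl")
results["time"] = pd.to_datetime(results["timeStamp"], unit="ms")
results["is_error"] = ~results["responseCode"].astype(str).str.startswith("2")

# Per-second throughput and error rate: do errors start once we pass ~15 TPS?
per_second = results.groupby(results["time"].dt.floor("1s")).agg(
    tps=("responseCode", "size"),
    error_rate=("is_error", "mean"),
)
print(per_second[per_second["error_rate"] > 0])
```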
In other cases the application may experience occasional error codes. These can represent temporary backend problems or network instability. Sometimes this is perfectly acceptable, but sometimes it is an early sign of future trouble. So again, it is really important to know why those occasional errors appear, in order to predict whether they will grow and at what scale.
If we expect that some percentage of response codes will be errors (for example when a cyclic backend operation makes our services unavailable for a couple of hours), we need to write that into our SLA as well.
Measuring response codes is extremely important in web application performance tests and unfortunately often ignored. More than once I've seen developers measure the performance of an application and be extremely happy about an unexpected performance gain. They didn't notice that it was caused by a large percentage of errors due to a backend malfunction. The simple report they were using (average TPS and response time) gave no indication of response code problems.
The story of TPS
The Transactions Per Second graph (5) is the cherry on top.
In most cases an SLA covers not only response times but also states how many concurrent requests our application is able to handle at any given time.
For such an SLA, based on this graph, we need to pick a safe number. In this example 8-10 TPS would probably be a safe pick (keeping in mind that the drop at the end is just caused by threads finishing the test).
Remember not to over-advertise your service. If you guarantee 9 TPS and the customer gets 15, they will be positively surprised and may hold your quality of service in high regard. On the other hand, if you promise 15 TPS and deliver 9, you will probably be dealing with lots of bug reports and customer dissatisfaction.
In our example, Response Count By Type (5) additionally gives you the number of Virtual Users at any given time. You can use that together with a long ramp-up time in your test: for example, you can set the test to run 64 threads with a ramp-up time of 640 seconds, so a new virtual user is added every 10 seconds, showing you how each additional concurrent user affects TPS.
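If your JTL file also records the number of active threads (JMeter's "allThreads" column in the default CSV output), you can put virtual users and throughput side by side with a sketch like this:

```python
import pandas as pd

results = pd.read_csv("results.jtl")
results["time"] = pd.to_datetime(results["timeStamp"], unit="ms")

# Average active virtual users and achieved throughput per 10-second window:
# with 64 threads ramped up over 640 s this shows how each added user moves TPS.
per_window = results.groupby(results["time"].dt.floor("10s")).agg(
    virtual_users=("allThreads", "mean"),
    requests=("timeStamp", "size"),
)
per_window["tps"] = per_window["requests"] / 10.0
print(per_window)
```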
This graph should also be used to determine TPS stability. If the VU number is stable, is TPS stable as well? Obviously TPS will fluctuate within some range (say +/- 5%). But if we notice more drastic jumps, like running at a pace of 100 TPS and suddenly dropping to 5 TPS, we can be sure we are dealing with something serious (network issues, a software problem, some process killing the backend from time to time). A simple check along these lines is sketched below.
Of course, no matter how wide the TPS range is, you need to ask yourself the same simple question again: why?
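Here is that stability check as a minimal sketch, again against a CSV JTL file; the 50%-of-median threshold is an arbitrary choice for illustration:

```python
import pandas as pd

results = pd.read_csv("results.jtl")
results["time"] = pd.to_datetime(results["timeStamp"], unit="ms")

# Requests per second, then flag windows where throughput collapses far below the median.
tps = results.groupby(results["time"].dt.floor("1s")).size()
drops = tps[tps < 0.5 * tps.median()]
print("seconds with a drastic TPS drop:")
print(drops)
```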
The real treasure
The real fun begins when you create such reports systematically (for example every week at the same hour). Such tests should also be run after every big (or not so big) software or hardware change. Having this historical data opens the door to brand new possibilities: observing how changes in the application, the infrastructure, the data stored in the backend and even the user load affect your application's performance.
This report gives you great tools to detect abnormalities in your application. It also gives you realistic data for creating an SLA for your clients.
This short article is only the tip of the iceberg of possibilities that good performance test reports bring. I hope it will be a good start for you and an inspiration to dig further.