[Webinar Recording] Applying a Methodical Approach to Website Performance

by Iveta Moldavcuk on 03 Nov 2016

When addressing website performance issues, developers typically jump to conclusions, focusing on the perceived causes rather than uncovering the real causes through research.

Mitchel Sellers will show you how to approach website performance issues with a level of consistency that ensures they're properly identified and resolved so you'll avoid jumping to conclusions in the future.

Watch the webinar to learn:

  • What aspects of website performance are commonly overlooked

  • What metrics & standards are needed to validate performance

  • What tools & tips are helpful in assisting with diagnostics

  • Where you can find additional resources and learning

Applying a methodical approach to website performance on Vimeo.

You can find the slide deck here: https://www.slideshare.net/sharpcrafters/applying-a-methodical-approach-to-website-performance

Video Content

  1. Why Do We Care About Performance? (3:20)
  2. What Indicates Successful Performance (6:06)
  3. Quick Fixes & Tools (27:27)
  4. Diagnosing Issues (37:10)
  5. Applying Load (40:49)
  6. Adopting Change (50:30)
  7. Q&A (52:14)

 

Webinar Transcript

Mitchel Sellers:

Hello, everybody. My name's Mitchel Sellers and I'm here with Gael from PostSharp.

Gael: Hi.

Mitchel:

And we're here to talk about website performance, looking at taking a little bit of a unique look at performance optimization to get a better understanding of tips and tricks to be able to quickly diagnose and address problems using a methodical process. As I mentioned, my name's Mitchel Sellers. I run a consulting company here in the Des Moines, Iowa area.

My contact information's here. I'll put it up at the very end for anybody that has questions afterwards. During the session today, feel free to raise questions using the question functionality and we will try to be able to address as many questions as possible throughout the webinar today, however if we're not able to get to your question today, we will ensure that we are able to get those questions answered and we'll post them along with the recording link once we are complete with the webinar and the recording is available.

Before we get into all of the specifics around the talk today, I did want to briefly mention that I will be talking about various third party tools and components. The things that I am talking about today are all items that I have found beneficial in all of the time working with our customers. We'll be sure to try to mention as many alternative solutions and not necessarily tie it to any one particular tool other than the ones that have worked well for us in the past. With that said, we're not being compensated by any of the tools or providers that we're going to discuss today.

If you have questions about tooling and things afterwards, please again feel free to reach out to me after the session today and we'll make sure that we get that addressed for you. What are we going to talk about? Well, we're going to talk a little bit about how and why we care about performance. Why is it something that is top of mind and why should it be top of mind in our organizations, but then we'll transition a little bit into what are identifiers of successful performance. I'm trying to get around and away from some of the common things that will limit developers.

We will start after we have an understanding of what we're trying to achieve. We'll start with an understanding of how web pages work. Although fundamental, there's a point to the reason why we start here and then work down. Then what we'll do is we'll start working through the diagnostic process and a methodical way to look at items, address the changes and continue moving forward as necessary to get our performance improvements. We'll finish up with a discussion around load testing and a little bit around adopting change within your particular organization to help make things better, make performance a first class citizen rather than that thing that you have to do whenever things are going horribly wrong.

Why Do We Care About Performance?

One of the biggest things that we get from developers is that sometimes we have a little bit of a disconnect between developers and the marketing people or the business people, anybody that is in charge of deciding how important performance is to our organization. One of the things that we always like to cover is why do we care? One of the biggest reasons, as we look at trends in technology, is that for publicly facing websites, Google and other search engines are starting to take performance into consideration in search placement. We don't necessarily have exact standards or exact thresholds for what is good versus what is bad, but we have some educated guesses that we can make.

We want to make sure that our site performs as well as possible to help with search engine optimization. The other reason that it is important to take a look at performance is user perception. There are a lot of studies that have been completed over the years that link user perception of quality and/or security to the performance of a website. If something takes too long, or takes longer than they're expecting, it starts to distract them and get them to think, "Oh, you know what, I'll go find somebody else," and then they start asking questions about the organization: should I really trust this business to provide whatever it is that we're looking for?

One of the last things that is very important, given the shifts in user trends within the last 24 months or so, is device traffic. We have a lot of customers and work with a lot of individuals where their traffic percentages for mobile are as high as 60 to 80%, depending on their audience. What that means is we're no longer targeting people with a good desktop internet connection and large, wide-screen monitors. We are actually working with people that have varying network capabilities, anywhere from still being attached to that high-speed home internet all the way down to a low-end cellular EDGE network that isn't fast.

We have to make sure that we take into consideration what happens with throttled connections, page sizes, all of those things that we used to really emphasize ten years ago; they become front and center with what we're doing today.

What Indicates Successful Performance 

With this said, one of the things we find in a lot of cases is that nobody can put a tangible definition on what a site that performs well looks like. That's the hardest thing. Most of the time, when we work with development groups, marketing people or anyone really to take an existing site and resolve a performance issue, the initial contact is "our website performance is horrible."

Although it gives us an idea of where their mind is at, it does not necessarily give us a quantitative answer because if we know it's horrible and we want to make it better, how can we verify that we really made it better? This is where we have to get into the process of deciding what makes sense for our organization. It's not an exact science and it's not necessarily the same for every organization. Almost all of the attendees on the webinar today, I'm going to guess that each of you will have slightly different answers to this indication and it's important to understand what makes sense for your organization.

There are a few metrics that we can utilize to help us shape this opinion. Google provides tools to analyze page speed and will give you scores around how your site does in relation to performance. If your site responds in more than 250 milliseconds, you would start to receive warnings from Google's page speed tools along the lines of "your website is performing slowly; improve performance to increase your score." That starts at 250 milliseconds, and there's another threshold that appears to be somewhere around the 500 millisecond range.

We can take another look and start looking at user dissatisfaction surveys. Usability studies have been completed by all different organizations, and a fairly consistent trend across the studies we've researched shows that user dissatisfaction starts at the two-to-three-second mark for total page load. What we're talking about here as a metric would be the click of a link or the typing in of a URL, through to the complete page being rendered and interactive for the visiting user.

We can also look at abandonment studies. There are various companies that track e-commerce abandonment, and we can see that abandonment rates start to increase by as much as 25 to 30% after six seconds. This gives us a general idea that we want initial page loads to be really fast to keep Google happy and we want the total page load to be fairly fast to keep our users happy. There are exceptions to these rules, things such as logins, checkouts, et cetera, where the user threshold for acceptable performance is going to be slightly different, but it's all about finding a starting point and something that we can all utilize to communicate effectively: this is what we're looking for in a performant website.

Gael:

Mitchel, you mentioned that Google will not be happy. What does it mean? What would Google do with my website if the page is too slow?

Mitchel:

Right now there is no concrete answer as to what ramifications are brought on a website if you exceed any of these warning levels within the Google page speed tools. It's a lot like your other Google best practices and things. We know that we need to do them. We know that if we do them, we will have a better outcome, better placement, but we don't necessarily have concrete numbers: if we do this, it's a 10% hit, or if we don't meet this threshold, we're no longer going to be in the top five, or anything like that. We don't have anything other than knowing that it is a factor that plays into your search rank.

Gael: Okay, thanks.

Mitchel:

Which really gives us a few options. Google looks at individual page assets, so they look at how long it takes your HTML to come back. Another way to look at this that may make sense within an organization is to focus not necessarily on true raw metrics per se, but focus on the user experience. This is something that is much harder to analyze, but it allows you to utilize some different programming practices to be able to optimize your website, to make things work well. Examples of this would be things like Expedia.com, Kayak.com, the travel sites.

Oftentimes, at various events, I've asked people how long they think Expedia.com or Kayak.com takes to execute a search, whether you're searching for hotels, whether you're searching for airfare, anything in between. The general answer is that people think it takes only a few seconds when in all reality, from when you click on search to when the full results are available and all of the work is completed, it's oftentimes 20 to 35 seconds before the whole operation's been completed.

However, these sites employ tactics to keep the user engaged, whether it is going from when you click that search to showing a, "Hey we're working hard to find you the best deal," and then taking you to another page, which is the way that Expedia handles it, basically distracts the user during the process that takes some time. Or Kayak will utilize techniques to simply show you the results as they come in. You get an empty results page, we're working on this for you, and then the items start to come in.

These are ways to handle situations where we can't optimize for whatever reason. In the case of Expedia or Kayak, we have to ensure that we have appropriate availability of our resources. If we have flights available, we have to make sure that we still have those seats; we can't necessarily cache it, and this is going to take some time. It's a great way to still allow the user experience to be acceptable while we optimize in other manners.

The last thing from an indication perspective, one that is fairly universal for everybody, is that we can focus on reducing the number of requests needed to render a website. That is the one metric where, in all circumstances, if we reduce the number of things that we need to load, we will improve the performance. Regardless of whether that is a target for your organization, we really recommend that you track it, and the reason for that is, as we step into understanding how web pages work, we'll see that the impact here can be fairly exponential. Regardless of whether you're a developer, a designer, a front-end person, a back-end person or even a business user, it's important to understand the high-level process of how a web page works before we start actually trying to optimize it.

The reason for this is that we have oftentimes encountered situations where users are trying to optimize something based on an assumption that they have. It has to be the database. It has to be our web code. It may not even be that. It may simply be your server. It may be user-supplied content or something else. When we look at a webpage, the process is fairly logical and top down in terms of how the web browsers will render a page. We start out by requesting our HTML document. That is the initial request that the user has initiated. The server, after doing whatever processing it needs to do, will return an HTML document. The web browser then processes that document and it looks for individual assets that it needs to load.

This will be your CSS, your JavaScript, any additional resources, including images, that need to be downloaded to be able to properly render. Those items are all cataloged and they start to download. We may then, in the case of certain development practices, have a repeated chain. If we have an HTML document that links to a CSS file, once that CSS file is downloaded, if we find out that CSS file was referencing another CSS file, we continue that chain. That's when we start to see some of our processing take additional time. The other aspect that's important to understand here is that we are faced with limitations in our web browsers: we can only request between four and ten items per domain at the same time.

Most of the modern web browsers in the last year or so will get you all the way to that ten-item limit, but what that means is that if we have an HTML document that then references 30 JavaScript and CSS files to render a page, we have at least three full sets of round-trip processing utilizing those ten concurrent requests, which will push out our page load. If we could take those 30 and get them down to ten resources, we could go out and grab one batch of responses and be ready to render our pages, and that's something that will really help with overall performance.
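To make the bundling idea concrete, here is a minimal sketch, assuming an ASP.NET MVC project with the System.Web.Optimization (ASP.NET Web Optimization) package installed; the bundle names and file paths are illustrative assumptions, not something from the webinar.

```csharp
using System.Web.Optimization;

public class BundleConfig
{
    // Called once at application startup, e.g. from Application_Start in Global.asax.
    public static void RegisterBundles(BundleCollection bundles)
    {
        // Several script files become one combined, minified request.
        bundles.Add(new ScriptBundle("~/bundles/site").Include(
            "~/Scripts/jquery-{version}.js",
            "~/Scripts/site-menu.js",
            "~/Scripts/site-forms.js"));

        // Stylesheets are combined the same way.
        bundles.Add(new StyleBundle("~/Content/css").Include(
            "~/Content/normalize.css",
            "~/Content/site.css"));

        // Normally tied to <compilation debug="false" />; this forces bundling on.
        BundleTable.EnableOptimizations = true;
    }
}
```

A Razor layout would then reference @Scripts.Render("~/bundles/site") and @Styles.Render("~/Content/css"), each of which emits a single combined, minified request when optimizations are enabled.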

To help illustrate this, I have taken some screenshots of the network view of a local news station's website. This was taken from their website about a month ago or maybe a little bit more. They've since redesigned their website, so it's not quite as extreme as it is in this example, but what we see here is two screenshots showing the HTTP requests to load this web page. In all reality, there would be 18 screenshots if I were to show you the entire timeline view to load their home page. In the end, a total of 317 HTTP requests were necessary to render this web page. Now, obviously this is not something that we want to be doing with our webpages, but it's something that allows us to see what happens as we work to optimize.

We can see in this trend here that we start out by retrieving the main KCCI website, which then redirects us to the WWW. This case, user typed KCCI.com and we do a redirect. We can see here that it took 117 milliseconds to get to the redirect and then 133 milliseconds to actually get the HTML document. We've already consumed 250 milliseconds just retrieving the HTML document. Then we start to see the groups of information being processed. We can see here that we have a first group of resources up through this point being processed. We then see the green bars indicate the resources that are waiting to be able to complete. We'll see that these get pushed out, so the additional resources along the way, keep getting pushed out as the rest of the items are working.

Then the timeline expands and expands. We can see here, even by our second section, we're downloading a bunch of small images, things maybe even as small as 600 bytes or two bytes, those types of things, and they push out our timeline. We get all the way out here and it keeps going. You can get these types of views for your own websites, regardless of which toolset you want to use, by utilizing the web developer tools in your browser of choice. On Windows machines, the F12 key brings up the developer tools and you get a network tab. That network tab will show you this exact same timeline, so you'll be able to see this in your own applications.

What we want to do is we want to be able to reduce this. If we simply cut out half of the items that we're utilizing, we can then simply move on and not have to do anything else specific to optimize, simply by reducing the requests. Moving forward from this, let's take a look at what a content-management-system-driven website in a fairly default configuration ends up looking like.

In this case, we see that we have a much longer initial page response time here, at 233 milliseconds, but we have a far smaller number of HTTP requests and the blue line here indicates the time at which the webpage was visible to the requesting user. It's at that point in time that sure, there may be additional processing, but that's not typically stopping somebody from being able to see the content that they want to be able to see.

As we look at optimization, this is one of those things that we want to keep a close eye on to make sure that we have improvement. So, metrics: how can we validate this quickly to be able to do a compare and contrast? There are a couple of key metrics that will help you characterize your HTTP requests and maybe give you an idea of whether this is something you want to do. We can look at the total page load. We compared the KCCI site that we talked about, we compared an out-of-the-box content management system solution, and then we took that same content management solution and simply compressed our images and minified and bundled together our CSS and JavaScript. What we were able to see is improvements in total page size and the total number of HTTP requests.

For KCCI, we're sitting at an almost five-and-a-half-megabyte download. This could cause problems for certain users because if I'm on a mobile network, that's a larger download, so we definitely would want to look at ways to potentially minimize that. Same thing goes here with the out-of-the-box content management system solution, where we're at almost two megabytes. With the same content, just with properly sized and compressed images and CSS, we were able to drop that all the way down to 817 kilobytes, making it so that we have less information to transfer over the wire.

Same thing goes from an HTTP request perspective. 26 is way better than 59 and 59 is still ridiculously better than 317. We see a fairly decent trend. The one thing that's a little bit misleading is sometimes this page load speed, if you utilize third party tools, that number may not always be exactly what you want it to be because it factors in things such as network latency and some of the tools even add an arbitrary factor to the number. Well, at this point-

Gael:

I'm wondering what is the process, what did you actually do to go from column one to column three. How do I know what I need to do or what is step one, two, three?

Mitchel:

What we use in a lot of these cases is one of two tools or sometimes both to help us identify more quickly some of the things. There's Google Page Speed Insights. Sorry, they renamed it and the Google Page Speed Insights tool is one that allows us to point it at any website and it will give us a score out of 100. That score out of 100 is going to then look at things such as how many HTTP requests do you have? Are your images properly sized? Are your images properly compressed? They talk a little bit about bundling and minification of JavaScript files as well.

What we can do with that tool specifically is it gives us a number we can use to help identify some of those quick hit common problems. It goes as far as if you have images that are not properly scaled or compressed for the web, they actually give you a link that you can click on and you can download properly scaled and sized images. Then you can simply replace the images on your own site with those optimized ones that come from Google.

Gael: Okay, that's cool.

Mitchel:

It's a fantastic tool, but one of the things that we find is that its score is not necessarily indicative on its own. One of the things that developers are often led to believe is that if there's a score out of 100, I have to get 100. With the Google PageSpeed Insights tool, it's not exactly that clear cut, and the reasoning behind that is, just looking at these three examples, it gave a very slow news site a score of 90 out of 100, and the reason for that is that they checked all of those little boxes. They checked the boxes to keep Google happy, but maybe they didn't actually keep their users happy.

We typically cross check this tool with another tool provided by Yahoo called YSlow. This tool takes a slightly different approach to scoring, but it looks at similar metrics and similar things with your website. The KCCI site received an E, at 58%. Our CMS that wasn't horrible was at 70%, but after slight modifications on our side, we were able to get it all the way up to 92%, or an A. We find that a lot of times you want to utilize these two different tools, or two other tools of your choice, to help aggregate results and come to an overall conclusion about what may need to change.

Furthermore, with Google's PageSpeed Insights, anybody utilizing Google Analytics will actually never be able to get above a score of 98, because the Google Analytics JavaScript violates one of the PageSpeed Insights rules. I call that out just because I don't want to see people get tunnel vision and lunge forward with, "Oh, we have to do this because we have to get 100."

Quick Fixes & Tools

To help with some of this, one of the things that we often use is a site called GTmetrix (gtmetrix.com). It is a free tool, and it is actually the tool that was utilized to grab all of the metrics in that table I just showed. What's nice about it is you can generate a PDF report, and you can even do some comparing and contrasting between your results before and after a test, which makes it a lot easier.

What does this give us? At this point, we've looked at how a webpage is structured and at what it takes to load a webpage, and one of the things that we get from this is a lot of quick fixes with which we may be able to improve our websites dramatically. Images that aren't compressed. JavaScript files that aren't combined. These two add to our content weight quite a bit. Images that aren't compressed or aren't properly sized are going to bloat your page load. That's why we may have a five-megabyte page instead of a two-megabyte one.

A couple of quick examples of this: in the responsive era, we have a known maximum for how big our images are going to be. Uploading anything larger than that size is simply a waste of server space and a waste of bandwidth, because we don't need to work with it. We worked with a customer whose initial statement was, "Our website performance is horrible." We went to look at their site and we agreed. It was. The pages were loading incredibly slowly and it was consistent across all of the pages. What we found is that for the overall look, feel and design of the website, the design firm never prepared web-ready imagery. The homepage was somewhere along the lines of 75 to 80 megabytes to download, and interior pages were 30 to 40.
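The webinar doesn't prescribe a specific fix for oversized imagery, but as a rough sketch of the idea, assuming classic .NET with System.Drawing available, a server-side step like the following can force uploaded or design images down to a known web-ready maximum before they ever reach a page. The 1200-pixel width and 80% JPEG quality are illustrative assumptions.

```csharp
using System;
using System.Drawing;
using System.Drawing.Drawing2D;
using System.Drawing.Imaging;
using System.Linq;

public static class WebImageHelper
{
    // Scales an image down to maxWidth (preserving aspect ratio) and saves it
    // as a compressed JPEG so oversized originals never reach the page.
    public static void SaveWebReady(string sourcePath, string targetPath,
                                    int maxWidth = 1200, long jpegQuality = 80L)
    {
        using (var original = Image.FromFile(sourcePath))
        {
            int width = Math.Min(maxWidth, original.Width);
            int height = (int)(original.Height * (width / (double)original.Width));

            using (var resized = new Bitmap(width, height))
            using (var graphics = Graphics.FromImage(resized))
            {
                graphics.InterpolationMode = InterpolationMode.HighQualityBicubic;
                graphics.DrawImage(original, 0, 0, width, height);

                // Save through the JPEG encoder with an explicit quality setting.
                var jpegEncoder = ImageCodecInfo.GetImageEncoders()
                    .First(c => c.FormatID == ImageFormat.Jpeg.Guid);
                var encoderParameters = new EncoderParameters(1);
                encoderParameters.Param[0] =
                    new EncoderParameter(Encoder.Quality, jpegQuality);

                resized.Save(targetPath, jpegEncoder, encoderParameters);
            }
        }
    }
}
```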

Well, the reason why we start here is that this would have been a lot harder to catch on the server side, because there's nothing wrong with the code, there's nothing wrong with the database, and our web server looked really good performance-wise. When we start here, we can catch those kinds of things. This one is really important in terms of image sizes if you allow user-uploaded content into your web properties. The other thing that we can look at is static images, hidden images or hidden HTML elements; for those of you who are ASP.NET developers, older Web Forms projects that are utilizing ViewState can often add some overhead. The other big thing that is pointed out by PageSpeed Insights as well as by YSlow is a lack of static file caching.

There's a common misconception with ASP.NET projects that static files will automatically be cached. That's not necessarily the case, so you may need to tweak the cache header value that gets sent in the response, all things that both of these tools will point out and direct you on how to fix. The goal here is to take care of the low-hanging fruit, take care of the things that may be stressing an environment in ways we don't know about, and see what we can do. We've typically found, with websites that have never had a focused approach to making these kinds of changes, a 40 to 80% improvement from these changes alone.
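As one hedged illustration of fixing that cache header, this is roughly what explicit client caching for static files looks like in an ASP.NET Core Startup class; the seven-day lifetime is an assumption, and a classic ASP.NET/IIS site would typically achieve the same thing with a clientCache element under staticContent in web.config rather than code.

```csharp
using Microsoft.AspNetCore.Builder;

public class Startup
{
    public void Configure(IApplicationBuilder app)
    {
        // Serve wwwroot content with an explicit client-side cache lifetime
        // so browsers and proxies stop re-requesting unchanged static files.
        app.UseStaticFiles(new StaticFileOptions
        {
            OnPrepareResponse = ctx =>
            {
                // Seven days, expressed in seconds; adjust to your release cadence.
                ctx.Context.Response.Headers["Cache-Control"] = "public,max-age=604800";
            }
        });
    }
}
```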

For those of you who don't have public-facing websites, GTmetrix and PageSpeed Insights will not work for you, because they have to have access to reach the website. Yahoo's YSlow is available for anybody to use because it's a browser-based plugin as well, so you'll be able to utilize that on internal networks. One of the things that we hear a lot of times is, okay, we've optimized, but we still have a lot of static files and they're adding a lot of activity to our server. Another quick fix would be to implement a content delivery network. A content delivery network's purpose is to reduce the load on your server and to serve content more locally to the user if at all possible.

The situations that we're looking for here would be things where maybe we can take the content, load it to a CDN and when we have a customer visiting our website from California, they can get the content from California. When they are in Virginia, maybe the content delivery network has something closer to them regionally. These types of systems would definitely help. It's a little bit of a moving the cheese kind of game where we've moved potential problems to leverage something else. The difference is the content delivery networks are designed to provide incredibly high throughput for static content.

They're a great way to start improving things. I'll talk a little bit more about a couple implementations specifics in just a second. Another thing that often gets overlooked is we've reduced our JavaScript files. We've reduced our CSS. We've got everything good, but now we have these pesky images where we have 25 little images that are being used as part of our design. Most commonly we see them in things such as Facebook, Twitter, LinkedIn, Instagram. All of them as little one kilobyte images.

Some people opt to use font libraries to handle that type of thing, but one of the things we don't want to lose sight of is that we can utilize an image sprite: the simple combination of multiple images into one larger image, then using CSS to display the appropriate piece, is definitely an option to help minimize the number of HTTP requests. It does require developer implementation, but it's something that can be made to work fairly easily.

If we've made these changes and we want to go roll out a content delivery network solution, we have primarily two options. First, we have pull-type CDNs, which don't require us to make a bunch of changes within our organization. Service providers such as CloudFlare and Incapsula offer this, and I believe that Amazon also has support to do this. Basically, you point your DNS to the CDN provider, and the CDN provider knows where your web server is; they become the middle man to your application.

What's nice about it is you can make the change pretty quickly, simple DNS propagation and they usually give you on/off buttons, so you can turn on the CDN, everything is good. If you notice a problem, you can turn it off until you're able to make changes to resolve things. They're great, however, unusual situations typically will arise at least at some point in your application because you have introduced something, so if it accidentally thinks something is static content when it's not, you may need to add rules to change behaviors. The implementation benefits are massive. We'll go through an example here with some experience I have with Incapsula that makes things easy.

The selling point here, though, is that you can make this change in an afternoon, barring any unusual situations. Another option would be to integrate a more traditional CDN, which basically involves, rather than serving static content directly from your web server, utilizing a CDN and storing that information elsewhere. Rather than linking to an images folder inside of your application, you would link to an images directory on a CDN, or something of that nature. It's definitely more granular and easier in some ways to manage, but it requires manual configuration. You have to actually put the content where you want it to be, and that's the only way it'll work out as expected.
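To make that "link to the CDN instead of a local folder" idea concrete, here is a hypothetical helper, not something from the talk, that centralizes the CDN host name so markup can ask for Cdn.Asset("images/logo.png") and fall back to local paths when no CDN is configured; the appSettings key and host name are assumptions.

```csharp
using System.Configuration;

public static class Cdn
{
    // Hypothetical setting, e.g. <add key="Cdn.BaseUrl" value="https://cdn.example.com" />.
    // An empty value means "serve everything locally".
    private static readonly string BaseUrl =
        ConfigurationManager.AppSettings["Cdn.BaseUrl"] ?? string.Empty;

    // Builds the URL for a static asset, preferring the CDN when one is configured.
    public static string Asset(string relativePath)
    {
        string path = relativePath.TrimStart('~', '/');
        return string.IsNullOrEmpty(BaseUrl)
            ? "/" + path
            : BaseUrl.TrimEnd('/') + "/" + path;
    }
}
```

A view could then emit an image tag whose src is @Cdn.Asset("images/logo.png"), which turns switching the CDN on or off into a configuration change rather than a markup change.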

Diagnosing Issues 

We've talked about quick fixes. We've talked about things such as images and tools to validate and quantify things, but that just didn't get us enough. We need to now dig into the server side. To dig into the server side, we need to make sure that prior to us starting anything, we want to make sure that we're monitoring the health of our environments. What we'll want to make sure of is that at all times, whenever possible, we want to have information on our web server and database server, CPU and memory usage. We want to know how many web transactions are actually happening, how long are they taking, what kind of SQL activity is going on?

On top of that, we want to know exactly what our users are doing in our app, if at all possible. Regardless of whether you're diagnosing a problem today or just supporting a web property where you want to keep an eye on performance overall, you want to make sure that you are collecting these metrics. The reason why this is so important is that sometimes you may not be able to jump on a problem the second it comes up, but by having metrics, we'll be able to go back to a point in time and understand exactly what happened, what was changed, what was going on yesterday at 2:00 in the afternoon when the website was slow.

How you go about doing that, there's a number of different tools out there. I personally am a fan of NewRelic, but we've worked with CopperEgg. We've worked with Application Insights and many other vendors. The key is to collect this information as much as humanly possible to make sure that you really truly have a picture of what's going on with your environment. Most of these tools operate either with minimal or almost no overhead to your server. We typically see a half percent to one percent of server capacity being consumed by these monitoring tools.

Gael:

Mitchel, so how long do you recommend to keep data? How much history should we keep?

Mitchel:

My personal recommendation is to keep as much history as humanly possible, understanding that that's not exactly a practical answer. Really, what I like to see is at minimum 30 days, but ideally six to 12 months is preferred, simply because sometimes you have things that only happen once or twice a year. We encounter a lot of situations such as: annual registration for our members happens on October 25th, and our website crashes every October 25th, but we're fine the other 364 days of the year.

In situations like those where having some historical information to be able to track definitely makes things a little bit easier because we can try to recreate what happened and then validate that, "Oh, well, we recreated this and now we're not using as much CPU. We're not using as much memory as we do our testing."

Gael: Okay, thanks.

Applying Load

With this, we now know that we maybe need to start taking a look at our website under load. One developer clicking around on your dev site just isn't recreating the problems that we're seeing in production, so how do we go about doing this? Well, we need to take all of the information that we were talking about earlier and apply it to the way that we do load testing. What we mean by that is that the load examples need to not be static content examples. In other words, we can't go after just our HTML file.

We have to make sure that whatever our load test does recreates real traffic behaviors, including making all of the HTTP requests, waiting between going from one page to the next, and making sure that everything the user's browser does, we do in our load test. We also need to make sure that we're being realistic. If we know that we get 200 people logging in within a five-hour window and that's the most we ever see, we don't necessarily want to test our system with 500 people logging in within ten minutes, because we aren't necessarily going to be able to be successful at that regardless. The other key, and this is the hardest part for most organizations, is that our hardware environments need to be similar.

If we're testing against a production load, we need to be utilizing production grade hardware. There are certain scenarios where we can scale, but it doesn't always work that way and the general recommendation is that we want to make sure that we really are working in a similar environment. What should happen when we do a load test?

When we do a load test, we should see that everything continues in a roughly linear fashion. If we do a load test and we add more users, the number of requests that we process should go up. The amount of bandwidth that we use should go up, and those really go up in lockstep with each other, because more users equals more traffic and more traffic equals more bandwidth. What tools can we utilize for this? Well, there's a large number of tools out there. If you Google load testing tools, you're going to come up with a whole bunch of them.

In my situation, I've found that LoadStorm.com is a great tool. The reason why I like it is that it does what browsers do. You actually record the script that you want to run using Google Chrome, Internet Explorer, et cetera, and the other thing is it gives me some data center options, so I can choose with LoadStorm that I want my traffic to come from the central part of the US, I want it to come from Europe, or I want it to split between the two. The benefit here is that some of these load testing tools will send you European traffic when you have no European traffic at all, so your numbers are going to look a little bit different because of internet latency.

Really, what you need to do is just make sure that whatever tool you're utilizing matches your expectations of what the user should be doing and what your users are doing. When we do load testing, we also want to track some additional metrics. In addition to all of the things that we should be doing, our load testing tool or our server monitoring tool should be giving us some additional things. How many requests per second were we issuing? How many concurrent users were on the site? This is a metric that's very important to understand. A concurrent user may not mean two users clicking on a button at the same time. A concurrent user typically refers to a user that's visited the site and whose session is still active, so they could be as much as ten to 20 minutes apart.

We want to make sure we understand that metric so we can get a gauge as to what's going on with our users. We want to look at the average server response time and we want to look at that from the client side. We want to look at the percentage of failed requests and we want to look at how many requests are being queued. The goal here in terms of what number do you have to hit or how do you hit it is going to depend on your environment. Each server configuration, each application is going to be capable of performing to a different level than something else.

Now, with that in mind, one of the things that we do want to make sure of is that our load testing tools, as part of being realistic, need to be realistically located. If we have our data center in our office, doing a load test from our office to that data center is not necessarily going to give us a realistic example, because we factor out the entire public internet. Keep that in mind as you're configuring things. With this, we've got about ten minutes left. We'll get through a quick scenario here and then we can take some time for a couple of quick questions.

One of our success stories was working with a customer that had a large environment: six virtual web servers with 18 gigs of RAM apiece and a SQL Server cluster with 64 gigs of RAM and 32 cores, and load tests basically resulted in failures at 600 to 700 concurrent users, or about 100 users per server. We had a goal: we needed to reach 75,000 concurrent users and we needed to do it in two weeks. We did exactly what we've been talking about here today. We started with optimizing images and CSS, minified things, bundled things. We implemented a CDN, and then we have a before-and-after load test scenario.

Really, the actual numbers don't matter, so I've even taken the scales off of these images. The difference is the behavior that we see. In both of these charts, the top one is the before and the bottom one is the after. The green and purple lines are requests per second and throughput. The light blue line that goes up in a solid line and then levels out is the number of concurrent users, the blue line that is bouncing around in the top chart is maximum page response time, and the yellow one that's a little bit hard to see is average page response time.

What we saw with the initial test in this situation was that the maximum response time was bouncing around quite a bit. As load was added, we'd see a spike, then it stabilized; then we'd add more and it would spike again. Our responses weren't necessarily where we wanted them to be, and it just wasn't behaving in as linear or smooth a manner. In the bottom here, we get a much more realistic example of what we'd like to see, and that is: as our users go up, throughput and bandwidth go up, and they continue to exceed the line of users. They should never dip below that line. As they dip below the line of users in the top example, that's where we start to see that the server is being stressed and not able to handle the load as we add more users, so throughput should always be moving further and further away from that line.

We can see at the very bottom that average response time was great and error rates were totally acceptable, all the way until this magical failure at 53 minutes. The reason why I bring this up, and why we utilize this as an example, is to showcase why metrics are so important. In this case, we're doing a load test with 75,000 concurrent users. We are using over 21 gigabytes of bandwidth going in and out of the data center, and all of a sudden, things start to fail.

Well, we have monitoring on each web server. We have monitoring on the database server. Everything is saying they're healthy. We only have seven minutes left of this load test to really be able to identify a root cause. We're able to remote desktop into the server and validate that we can browse the website from the server. In the end, we were able to find that we had a network failure of a load balancer. It had been overloaded and overheated and stopped serving traffic.

With the metrics, we were able to confirm that at the load point right before failure, we were at 30% CPU usage on our web nodes, which meant we had tons of room to go yet. We were at 20% on the database server, which meant we had tons of room to grow yet. The problem was that our pipe just wasn't able to deliver the users to us, which allowed us to determine what makes sense as our next step. Now, we got lucky in this case. It was a completely implemented solution and we had to make a bunch of changes.

Adopting Change 

As we start working in organizations, I encourage people to try to promote a proactive approach to performance in your applications.

Bring it up as a metric for any new project. Set a standard that your new projects will have X page load time or you're going to try to keep your HTTP requests under a certain number. If you can start there and validate it through your development process, it'll be a lot easier to make things work rather than having to go back and make a bunch of changes later.
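One way to make such a standard enforceable, offered purely as a sketch rather than something prescribed in the webinar, is an automated test that fails the build when a key page blows its response-time budget; the staging URL, the 500-millisecond budget, and the use of xUnit here are all assumptions.

```csharp
using System;
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;
using Xunit;

public class PerformanceBudgetTests
{
    // Hypothetical target environment and the budget agreed with the business.
    private const string HomePageUrl = "https://staging.example.com/";
    private static readonly TimeSpan Budget = TimeSpan.FromMilliseconds(500);

    [Fact]
    public async Task Home_page_html_responds_within_budget()
    {
        using (var client = new HttpClient())
        {
            var stopwatch = Stopwatch.StartNew();
            using (var response = await client.GetAsync(HomePageUrl))
            {
                stopwatch.Stop();

                Assert.True(response.IsSuccessStatusCode,
                    "Home page did not return a success status code.");
                Assert.True(stopwatch.Elapsed <= Budget,
                    $"Home page HTML took {stopwatch.ElapsedMilliseconds} ms; " +
                    $"the budget is {Budget.TotalMilliseconds} ms.");
            }
        }
    }
}
```

Note that this only times the HTML document itself; pairing it with a request-count or page-weight check against the agreed budget covers the other metrics discussed earlier.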

If you do have to work in that reactive environment, it's important to remember the scientific method that we learned in elementary and middle school, which is: change one thing, validate that the change really did something, and then re-evaluate what you're going to do next. If you change several things at once, you're going to run into situations where it's not as easy to resolve an issue because you don't know what actually fixed it. It was one of those ten things we changed on Friday, but which of those ten things was really the one that made it work?

The whole goal here is if you start with a process, if you start with a purpose, you should be able to get through these situations in a manner that makes it easier to get the complete job taken care of. I'll put my contact information back up for everybody and Gael, do we have a couple questions? We've got about five minutes or so here.

 

Q&A

Q: Will GTmetrix or similar tools work within an authenticated area of a website?

A: With regards to authenticated traffic and validating performance, there are a couple of options. Google PageSpeed Insights is really only for public-facing stuff. YSlow, as a browser plugin, will work on any page, so authenticated, unauthenticated, internal to your network, public facing, it does not matter. That's probably going to be your best choice of tooling for anything that requires a login to get the client-side view, along with the browser developer tools themselves to get the timeline view.

Q: What is a normal/acceptable request per second (RPS) for a CMS site?

A: The acceptable number really depends on the hardware. From my experience working with the DotNetNuke content management system on an Azure A1 server, we start to see failures at about 102 - 105 requests per second. However, it really varies and will depend on what your application is doing (number of requests, number of static files it has, etc.). While there's not necessarily a gold standard, if you see anything under 100, it's typically a sign of an application configuration or other issue rather than a sign that you do not have large enough hardware.

Q: When would you start getting tools like MiniProfiler from Stack Overflow or similar tools?

A: Typically what happens is the progression here is at this point in time, what we're trying to do is factor out other environmental issues, so from here, this is where we would typically jump into okay, our HTTP requests look good. Our file sizes look good. Our content looks good. Now we know that we have high CPU usage on the web server and the database server isn't busy. That's where I'm going to then want to dive in with MiniProfiler or any of the other profiling tools and start to look. If I notice that the web server looks good, but my database server is spiking, that's where we're going to want to start looking more at SQL Profiler and other things.

The goal here is to take one piece out of the puzzle, "am I making IIS work too hard because it's serving static content and everything else?", and focus on getting that optimized first, so we can then dive in and really get our hands dirty with our .NET code or whatever server-side code we're working with.
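As a small illustration of what that deeper dive can look like, assuming the StackExchange MiniProfiler package is already wired into the site, wrapping suspect server-side sections in named steps shows where the time actually goes; the domain types, repository, and step names below are hypothetical.

```csharp
using System.Collections.Generic;
using StackExchange.Profiling;

// Hypothetical domain types, included only to keep the sample self-contained.
public class Product { public int Id { get; set; } public string Name { get; set; } }

public interface IProductRepository
{
    Product GetById(int id);
    IList<Product> GetRelated(int id);
}

public class ProductPageModel
{
    public Product Product { get; set; }
    public IList<Product> Related { get; set; }
}

public class ProductService
{
    private readonly IProductRepository _repository;

    public ProductService(IProductRepository repository)
    {
        _repository = repository;
    }

    public ProductPageModel BuildProductPage(int productId)
    {
        var profiler = MiniProfiler.Current; // null when profiling is disabled

        var model = new ProductPageModel();

        // Each named step shows up on the MiniProfiler timeline with its duration.
        using (profiler?.Step("Load product from database"))
        {
            model.Product = _repository.GetById(productId);
        }

        using (profiler?.Step("Build related-product list"))
        {
            model.Related = _repository.GetRelated(productId);
        }

        return model;
    }
}
```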

 

 

 

About the speaker, Mitchel Sellers

Mitchel Sellers

Mitchel Sellers is a Microsoft C# MVP, ASPInsider and CEO of IowaComputerGurus Inc, focused on custom app solutions built upon the Microsoft .NET Technology stack with an emphasis on web technologies. Mitchel is a prolific public speaker, presenting topics at user groups and conferences globally, and the author of "Professional DotNetNuke Module Programming" and co-author of "Visual Studio 2010 & .NET 4.0 Six-in-One".
Mitchel's blog.