[Webinar Recording] Performance is a Feature

by Iveta Moldavcuk on 08 Dec 2016

Starting with the premise that "Performance is a Feature", Matt Warren will show you how to measure, what to measure and how to get the best performance from your .NET code.

We will look at real-world examples from the Roslyn code-base and StackOverflow (the product), including how the .NET Garbage Collector needs to be tamed!

Watch the webinar to learn:

  • Why we should care about performance
  • Pitfalls to avoid when measuring performance
  • How the .NET Garbage Collector can hurt performance
  • Real-world performance lessons from Open-source code

Performance is a Feature on Vimeo.

You can find the slide deck here: https://www.slideshare.net/sharpcrafters/adsa-69947835

Video Content

  1. Why Does Performance Matter? (4:36)
  2. What to Measure? (11:48)
  3. When to Measure? (21:06)
  4. How to Identify Performance Issues? (25:31)
  5. Benchmark.net Alternatives (36:41)
  6. StringConcat versus StringBuilder (41:49)
  7. Garbage Collection (46:21)
  8. Stack Overflow Performance Lessons (50:04)
  9. Roslyn Performance Lessons (50:59)
  10. Q&A (56:52)

Webinar Transcript

Hi. Good afternoon. My name is Matt Warren and this is a webinar I'm doing alongside PostSharp. We have Tony from PostSharp on the webinar as well who will handle answering your questions. Let's make a start. Just to start things off ... these are my details and I'm on Twitter, like most people. I have a blog where I blog about similar things to this talk, certainly around the idea of performance and a bunch of things around the internals of .NET and that type of thing. That's the kind of thing I talk about a lot.

I do have currently on my Twitter account a little poll, if people want to just take a look at that at some point in the next half an hour. The poll is around ... I would like to get some idea of how many of people's current projects have performance requirements and those sorts of things. If you have the chance and want to go to my Twitter account and see the poll there and answer it, it would be good to get an idea of how those things work out for different people's projects. That's me. Let's get into the main part of the presentation.

I have to put this upfront to really say that, unfortunately, I'm not eloquent enough to come up with this really nice title of Performance is a Feature. Mostly because you can type this into Google and this is the first result returned. If you're not familiar with the Coding Horror blog, it's Jeff Atwood's, and he's one of the founders of StackOverflow; that's where I first heard of this term, and his post is probably one of the most popular recent uses of this idea. It's where we're going with this talk. He talks about it in his blog post and this talk is covering the same ideas, in that we treat security as a feature, we treat usability as a feature, we treat, obviously, functionality as a feature, otherwise there's nothing much left. But do we treat performance as a feature? Should we treat performance as a feature? What does it look like if we do treat performance as a feature? That's where we're going with this talk today.

Just to give a little bit more context as well, there's obviously a whole range of areas within general .NET applications, web applications, client or other ones. Many different levels are involved: the UI, whether that's on a phone or in an app or your web UI; you obviously have a database and caching layer quite a lot of the time as well; and then the .NET CLR itself.

This talk is looking at performance within the .NET CLR, the specifics of that and where that is. There's a lot of resources out there on some of the other things: for front-end stuff there's great books around getting better performance, and database and caching is fairly standard stuff as well, so we'll only touch on those sides of things. The bottom box, if people aren't familiar with this idea of mechanical sympathy, it actually comes from motor racing originally. I don't know if any people are into their cars or are petrol-heads. This guy here, Guy Martin, popularized this quote. He's saying, basically, you've got to have a level of mechanical sympathy, don't you, or otherwise you're just a bull in a china shop.

The basic idea is, for motor racing, to be a good driver you have to understand the mechanics of the car, you have to have sympathy for the mechanics of the car to get the best out of it. A guy called Martin Thompson co-opted this term. He's mostly from the Java space and his blog is called Mechanical Sympathy. If you want to find out more about performance at the level below the CLR, things like CPU caches and stuff that's almost outside of the CLR, then mechanical sympathy is a good term, and that blog is a good place to start to find out what's going on there.

Onto the bits, the agenda, that we'll be covering today. Initially starting with why does performance matter? Why should we take performance seriously? Why should we care about performance? What do we need to measure as part of that and how can we fix the issues and some real-world examples of how some of these issues are fixed, what can be done, where we need to worry about these types of performance issues, so why, what and how.

Why Does Performance Matter?

Why, why should we care about performance? Why do we need to take performance seriously? I think there's a few reasons. One is that actually in this day and age of everything being cloud hosted or ... not everything, sorry, a lot of things being cloud hosted ... Actually there's a monetary saving to be made. If you were able to improve the performance of your application by 20%, 30%, that might mean that you could go into your boss on Monday morning and say, "We can save on our ... your hosting bill or AWS or whatever it might be."

I don't know what sort of relationship you have to your boss and whether you saying that is going to get any money passed back to you, I don't know how that works, but potentially there's savings for the companies anyway. Even if you're not in a hosted situation like that, where you're looking to spec machines yourself, you can still spec lower-cost machines or things like that. There's an idea of saving money. 

I think another one is saving power, particularly around constrained devices, phones, tablets, these sorts of things. There's a level where actually saving power is very useful for our users. It makes for happy users, if you like. I don't know how many of you have installed an app on your phone and within a week got rid of it, basically, because you realized it is draining your battery 10 times faster than without the app, so there's that idea.

The other end of the scale is, I guess, people like Google and Amazon and those where they're hosting data centers and for them, every amount of power they can save is more vital. We're probably not the extremes of that, but somewhere in the middle. Again, the idea of good performance equals a saving of power.

I think one of the main ones actually for a lot of our users, bad performance, bad perf basically equals broken. We might, as a developer understand that in reality that page is just loading slow or that button click takes a long time to render the response, whatever it might be, because of bad performance, but users don't really think in those terms. They just think this site's running slowly, this site's not responsive, this app takes too long every time I click a button I get frustrated and I kick it 10 times. For them, bad performance equals broken. The worst end of that is that they're customers who don't come back or they're customers who never buy our products or maybe they'd just be unhappy customers. Either way, it's not a good experience for our customers. Bad performance, at some level, equals broken for our customers.

A real classic example of this is ... Google did a study and they artificially introduced a half a second delay and for them that caused a 20% drop off in traffic. Obviously, that's an extreme end, but for us maybe there's a level at which that's a problem. Maybe the customer demo goes badly wrong because of bad performance and the customer never buys your product, or maybe a fairly influential person on Twitter has a bad experience with your product and tweets about it and gives you some bad press, whatever it might be. We're probably not going to see the same level of drop off in traffic as Google, but there's some level, I think, where we're going to have lost customers or unhappy customers. Customers aren't going to buy our products, aren't going to buy again, that type of thing.

I think there's a few reasons there. Maybe all of them apply to the sorts of products you work on, maybe just some of them, but there's some reasons why we should be taking performance seriously. I think another one, as well, is almost a matter of pride for us as software developers. This quote is from a guy called Henry Petroski, who is an engineer and has written a lot of books about engineering, mechanical engineering and civil engineering, a professor at Duke University. It says that basically we, the software industry, and I imagine a lot of you on this webinar are part of it, are doing our level best to cancel out the steady gains of the hardware industry. We're probably not being that deliberate about it, but the idea applies. Hardware has generally been getting faster. We're not seeing the same level of CPU clock speed increases any more, but it's multi-core now and what hardware can do is generally increasing at a fairly large rate.

But potentially, software is treading water or causing that to slow down. We sort of know that ourselves, don't we? If we get annoyed with Word, the latest version of Word and we say, "Oh, on my old 386 PC, Word 98 was lightning fast."

The new version of Word on my quad-core PC with SSDs and stuff is running ridiculously slow. We kind of know. There's that idea as well. It wouldn't be a talk about performance without the famous Donald Knuth quote that premature optimization is the root of all evil. That may be true, but a lot of the time that quote is actually misquoted, as you can probably guess where this is going. The entire quote looks like this, "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

There's this idea that actually yes, there are times where the premature optimization warning is valid, where we're doing it for the wrong reasons, we're doing it because we just want to optimize for the sake of it or whatever it might be. But it is also saying that there are times where there are opportunities, the critical 3%. It's interesting because it implies, in a way, that we need to measure this. What is the critical 3%? What is the 97% we can ignore? We need to know these things. We can't guess at these things. At least if people are going to quote that premature optimization is the root of all evil, it's nice to know the full quote that goes around it and where it fits in and what he was saying a bit more.

One bit to sum this all up for me: there's a developer named Rico Mariani, an architect at Microsoft, who did a lot of work on, I believe, the version of Visual Studio where they added a lot of WPF capability into it and lots of nice UI, but it slowed things down a lot. I believe he had a hand in all of that as well, but particularly a hand in making the performance of that better. He sums it up like this, "Never give up your performance accidentally."


This idea that there's always going to be a trade-off. We don't always have the time to make everything perform as well as possible, and that's not always a worthwhile endeavor. There's a point where, actually, the performance is good enough for business reasons, for customer reasons, whatever it might be. But at least let's not give up our performance accidentally. Let's know where these places are. Have measurements in place and make sure we understand: yes, this bit of the code is not as fast as it could be, but we've measured it and we understand that it's as fast as we need it to be for our situation, and any extra optimizations, we believe, are going to take too much time or not be worth it. We're not treating this blindly, we're saying we're going to understand where we might have performance issues. We're going to be deliberate about the places we do and don't fix them.

What Do We Need to Measure?

So onto the what side of things. I have a little section in this talk where I want to talk about how averages are bad. I don't want to just flash up something in red in our webinar and leave it there, so I'm going to explain it in a bit more detail, but generally, when we're measuring, averages aren't bad as such, they just don't give us the whole picture. I'm going to demonstrate that now and give a little bit more context to it.

If you remember back to your math days at school, if that's the last time you've done math, or maybe for some of you this is more familiar: this is what's often known as a normal distribution or a nice bell curve, and here the average is sitting right in the middle. We'd say the average is the peak of the curve. Because it's a normal distribution, we know that 95 out of 100 people, 95% of people, are going to fall within the dark blue area right in the middle. We know that only four out of 100 people are going to fall further out, towards the extremes, plus or minus. If you're wondering, that strange circle symbol is the standard deviation. We know that only four out of 100 people are going to fall more than two standard deviations out, plus or minus, and we know that only three out of 1,000 will fall into the very end pieces, the pink pieces right at the end.

So given an average value in this sort of scenario, the average, in effect, just sets where the middle of the curve is. We know that if it's normally distributed, we know what the tails look like. We know that we're not going to get extreme outliers, whether we're talking about measuring response times on a webpage or rendering times in an application, whatever it might be we're measuring. If it fits a normal distribution, we know there aren't going to be outliers, which are the problem, because if there are, some people are going to have a really bad experience. Given the average value, we have a rough idea of where all the values might fall. But, unfortunately, this doesn't cover all scenarios. To put it a different way, I always like some good quotes and this guy, Hans Rosling ... I'll talk a bit more about him in a moment, but he came up with this fantastic quote, "Most people have more than the average number of legs."

I'll give you a little while to process that in your minds and see if you can figure it out. I'm not going to do the math, I haven't got a whiteboard or anything, but the rough math is basically: in the population of any country, whatever, there's a lot of people with two legs. Some smaller amount of people have fewer than two legs, for a variety of reasons I won't go into, but you can imagine the reasons. If we then calculate the average, the total number of legs divided by the number of people, we're going to get a number less than two. 1.99, 1.998, whatever it might be, but it's going to be less than two. But we've said that by far the majority, most people, have two legs. Most people have more than the average number of legs. It's just a way of showing that actually sometimes, in certain situations, averages can be misleading, not really give us the whole picture.

Hans Rosling, just as a very short aside if you're into stats or not even at all into stats, but want to learn a bit more about stats, he has some amazing Ted talks and links there at the bottom of the screen or you can search for Hans Rosling Ted talks. He has a fantastic way of bringing statistics alive in a way that few people do. If you've seen a talk with a guy jumping around, pointing at bubbles on the screen, you've seen some of his talks and I'm sure they'll be familiar to you.

But we're not generally measuring numbers of legs. That's a nice quote, it's a nice aside, it shows the point, but actually, for some of us anyway, we're probably measuring something like this. This is response times of a webpage, but it could apply to a variety of other scenarios. We're just going to focus on this one. Again, if you don't remember your math because you haven't done it since school: histograms are, in effect, buckets. The very left-hand bar is the bucket from zero to five milliseconds, in this case. We know that we have 21 responses that fell in that bucket. We don't know where they fell within the zero to five milliseconds, we just know there's 21. The next bucket is five to 10, 10 to 15 and so on and so on across the scale along the bottom, and the height of the bars tells us how many fell in each bucket. We know in this case most items fell between 20 and 25, if I'm reading it right; that bar is 31 high.

This is actually quite a classic scenario for response times. The reason we have the large amount of response times on the left that are happening in under 40 milliseconds, that blob of bars on the left-hand side, is because they are hitting the cache, in effect. They are fetching the value very quickly out of an in-memory cache, or wherever it might be, some level of caching. Most of our responses hit the cache. I think in this scenario it's around five out of every six or maybe six out of every seven, something like that. Anyway, a majority of them. We can see that from the graph.

We can actually see quite clearly in this case that our cache is working. The little group of ones on the right-hand side take around 100 to 140 milliseconds; those are the ones that don't hit the cache because the value isn't in the cache, in effect. We have to then do a network call, for instance, to go and get the value from a backend service or a database call or something. That's why there's nothing really going on in the middle: the majority of them hit the cache very quickly, without a network call or with only a quick lookup, and the other ones have to make that call and take longer to do.

Now that I've explained it, or even from the point I showed the slide, a lot of you can understand this. You can see whether this is acceptable for your users or not. You say, "Well actually, yeah, definitely the caching is working. The majority, more than the majority, of our hits are going to the cache, that's what we want."

We can see as well that actually no one's getting a response of more than 130 milliseconds, the final bar. By the time you get down there, there's very few users. Only one, I think, fell in the bucket above 125. We know that's a kind of worst case and that's often what we care about, the worst case scenario. I've done it backwards deliberately. This is the histogram. If you were to try and imagine ... I'm not going to ask for a question and answer on this, but if you were to try and imagine what the average would be, it may be hard to work it out backwards, so I'll help you out. The average value of this is 38.3. We've gone from the more detailed and more informative histogram to the average, that's fine, we can see that. If I was to work the other way around, if I was to have given you the average first and only the average, not the histogram, you might not have imagined that this was the way the response times panned out. You might not have imagined that there were some people getting response times of over 100 milliseconds when I told you the average is 38.3.

You might have imagined the nice bell curve we talked about in a few previous slides where they were trailing off nicely and by far the majority of them were clustered around 38. Actually, the story is a lot more complex than that and for those into your math, this is known as bimodal. There's two modes to this. It's not a normal distribution. Normal distributions only happen in things like height of a population, weight of a population. Response times of applications don't often fit the normal distribution. That's the main reason why if we're measuring performance and particularly when we're measuring things like websites or response times or applications or whatever it might be, when we're measuring these sorts of things, we need to look at things like histograms and stuff like that.

Histograms are all well and good, but they take a bit of space to plot out. They're not always ... they don't help for things over time. What are some other things? Well, this is from the Application Insights Analytics tool from Microsoft Azure and, actually, you can use it outside of Azure as well. It uses what are called percentiles. Again, not to get too much into the math side of things, I know that's not everyone's cup of tea, but basically, very simply, percentiles: if you took all the responses in a certain period of time and ranked them from lowest to highest, the 95th percentile, if you only had 100 responses, would be the 95th highest one in that ranking, so that's what it means.

Again, it tells you what people are experiencing at the worst end of things. Responses can never be lower than zero, that's kind of fixed, but they can tail off at the top. 95th percentile, 99th percentile, that's linked to the idea of five nines and four nines and all this sort of stuff. This graph is quite nice to show this, because the huge peak that we see about a third of the way in from the right-hand side only shows up at the high percentile; it's completely lost in the other percentiles and would be lost in the average. Whether that matters or not is another discussion, but we can certainly see here that some amount of customers had a very bad experience at that particular time. If I was only showing the average, we'd see, in effect, the straight bar across the bottom, pretty much, which is the 50th percentile, which is very similar to the average. A lot of tools display the idea behind histograms in this way, as percentiles, because we can track them over time.
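To make the percentile idea concrete, here is a minimal sketch of a nearest-rank percentile calculation; the method name and rounding rule are illustrative rather than taken from Application Insights or any other specific tool.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Percentiles
{
    // Nearest-rank percentile: the value below which roughly 'percentile' percent
    // of the samples fall, e.g. Percentile(timings, 95) for the 95th percentile.
    public static double Percentile(IEnumerable<double> timings, double percentile)
    {
        var sorted = timings.OrderBy(t => t).ToArray();
        if (sorted.Length == 0)
            throw new ArgumentException("No timings supplied", nameof(timings));

        int rank = (int)Math.Ceiling(percentile / 100.0 * sorted.Length);
        return sorted[Math.Max(rank - 1, 0)];
    }
}
```

Tracking the 95th or 99th percentile of response times over time shows what the worst-served users are seeing, which the average (roughly the 50th percentile in that chart) hides.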

When Should We Measure Performance?

This leads us on to when. When should we be measuring this? When should we be looking at this sort of thing? Hopefully you've had the chance to answer my Twitter poll, and I'll see the responses later on for how that matches up in terms of performance requirements, but I would argue that for a lot of this we need to be doing it in production. I guess, if you've done web apps or apps on phones or these sorts of things, you can do all the testing you want with all the different handsets, all the different browsers, but there's always that one person that has the one you couldn't possibly have imagined, and all your fantastic test team and all the work you did before production wouldn't have shown it up, or you'd have had to try 10 times harder and the costs would have been prohibitive.

As much as it's really useful to be measuring this stuff before production, there's a level where you want to use some type of tool, and there's lots of tools out there that allow this, to see this in production. The other way to argue it is that your users are seeing this. If there's a performance problem in production, a user or several users are seeing it, and you'd like to know before they tell you, because they possibly won't. They might just never come back. Some level of monitoring in production, wherever possible, is a good thing to have.

I would also say that you're unlikely to see any perf issues or very few of them in unit testing. This is not to knock unit testing in any way. I think it's a fantastic tool for what it does, which is allowing you to test a unit of your program and make sure the functionality works, but you can use unit testing frameworks to give you some idea about performance, but you'd want to be writing a different type of test. The basic reason is a lot of time in unit tests we put some mock data and that mock data is just enough to make the test do what we want it to do, to exercise the path. It's unlikely to be the same amount of data that we might put through our production systems or have running through our production systems. An algorithm that works fine for 10 items in a list might fall over and have horrible performance when there's 1000 items on a list. That's why generally you won't see any or very few performance issues during unit testing.
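As a hypothetical illustration of that point, the duplicate-finding code below is functionally correct and would pass a unit test with 10 items, but the List.Contains call makes it O(n²), which only becomes painful at production-sized inputs; the HashSet version behaves the same but scales linearly.

```csharp
using System.Collections.Generic;

static class DuplicateFinder
{
    // Fine for the 10-item mock data in a unit test, painful for 100,000 items.
    public static List<int> FindDuplicatesSlow(IReadOnlyList<int> items)
    {
        var seen = new List<int>();
        var duplicates = new List<int>();
        foreach (var item in items)
        {
            if (seen.Contains(item)) duplicates.Add(item); // linear scan on every iteration
            else seen.Add(item);
        }
        return duplicates;
    }

    // Same behaviour, but HashSet lookups are effectively constant time.
    public static List<int> FindDuplicatesFast(IReadOnlyList<int> items)
    {
        var seen = new HashSet<int>();
        var duplicates = new List<int>();
        foreach (var item in items)
        {
            if (!seen.Add(item)) duplicates.Add(item);
        }
        return duplicates;
    }
}
```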

Also, back to my first point, I don't think you'll see all performance issues in development, or you'll have to try very hard to get to that level. There's always going to be ones that come up outside of development, in production. You can have a great testing team, they can test out a lot of things, and I've seen examples where it works, but there are always times where things come out in production that you couldn't have imagined otherwise.

Tony: Excuse me, Matt, I have a question here.

Matt: Sure.

Tony:

Can you show us some real-world examples of when unit tests do not catch the performance issue and this performance issue will be only seen in production using some performance test or performance measurement?

Matt:

Yeah, I had one on a previous project we worked on. We were using an off the shelf IoC container, but we were customizing it a bit for our needs, and it worked absolutely fine for our unit testing, it worked fine when we were testing it, a single person testing the application, but as soon as we put any load through the system, it fell over because we were using it in the wrong way, in effect. It looked absolutely fine in all stages of our development until we did our real long-term perf tests over multiple days, and it showed up over that time as, in effect, a huge memory leak and actually caused a pretty bad knock-on effect for our response times and stuff like that. I've definitely seen that happen and that's a classic example. Everything looked fine under small load, because the app hadn't been running long enough, and the usage of the IoC framework wasn't getting exercised in unit tests because it was starting up with a clean one each time. But when it had been running for a while and we had a more realistic load through it, we definitely saw a big difference. Fortunately, in that case, we caught it in our pre-production.

Tony: Okay. Thank you.

How to Identify Performance Issues?

Okay, so how can we go about this? How can we identify performance issues? Measure, measure, measure, measure. Measure once, measure twice, however you want to think about it. You really need to be measuring this sort of stuff, but even more than that, I would say you want to measure to identify the bottlenecks in the first place. We'll talk about some tools that can help you with that in a moment. Also, equally important, you want to measure to verify the optimization works. A lot of knowledge has built up over the years of things that are or aren't more performant, in .NET particularly or in other frameworks as well, and some of those things that were true five years ago, in that version of the framework, aren't true nowadays.

You don't want to be just blindly applying what we think is an optimization; we want to measure. Measure in the beginning, measure during and certainly measure afterwards to verify our optimizations work. Just get away from the idea of blindly applying stuff that we may have read elsewhere. Some of the tools we can use to do this ... one of the best ones I've come across is a tool called MiniProfiler. The development team at Stack Overflow developed this for themselves and then, fortunately for the rest of us, made it available. It's a great tool. Initially when you run the tool, you don't get this whole popup, you just get the little red section in the top right-hand corner. It integrates with ASP.NET MVC web applications; it integrates with a whole range, actually. There are versions for Ruby, and you can run it in console applications. It's quite a wide ranging tool, but the initial or the main use case is for ASP.NET.

It puts this little render into the top right-hand corner of your pages when you have it turned on, or just for certain users, like if you want your developers to see it but not your customers, however you want to set that up. That gives you the page rendering time. The quote at the bottom really sums up to me why it's so useful: this idea of having that number in the top right-hand corner when you're in development is pretty useful for developers. I know I'd much rather see straight away that I've made a certain page on the website slower by something I've just changed. I do like to see it before anyone else sees it, but certainly I'd like to see it before it goes to production. This idea of having these numbers up front and not in some log that developers need to go and look at; every time the page is rendered, it's there.

It gives you more than just the total time, it gives you a great drill down into ... we can see here, SQL calls, page render times. It gives you quite detailed information about parts of the MVC pipeline, rendering pages, the actual controller action. You can insert your own timings if there's bits of your code that you particularly want to have a number for, and it will tell you the time for database calls. It integrates into things like Entity Framework and other ones as well, I believe. It has some wrappers to give you this. One other thing, which is not completely obvious, is down in the bottom right-hand corner, you've got this "sql" in red, and that's telling you when you have things like select N+1 queries or duplicated queries. It's a pretty informative tool.
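For the custom timings mentioned above, this is roughly what it looks like with MiniProfiler's Step API; the controller and step names here are made up for illustration.

```csharp
using StackExchange.Profiling;

public class ProductController  // hypothetical ASP.NET MVC controller
{
    public string Index()
    {
        var profiler = MiniProfiler.Current; // Step is an extension method and typically tolerates a null profiler

        using (profiler.Step("Load products"))
        {
            // database call you want a separate timing for
        }
        using (profiler.Step("Build view model"))
        {
            // mapping / rendering work
        }
        return "done";
    }
}
```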

I know for a fact that at Stack Overflow, they run this in production. We don't get to see this when we visit the site, but their developers get to see this for any page when they visit the site. They also store the numbers that are used to create this rendering in aggregate, so they can then query that and come back to look at it later. I believe it's not for every request, it's for some sample of the requests, but certainly they're happy having this running in production. I agree. I think it's quite a useful tool. There's a lot of information you can get from there. So check out MiniProfiler; a search for MiniProfiler on their site explains more of the features in detail.

Again from StackOverflow, there's another tool they made available, their Opserver monitoring tool. There are lots of monitoring tools available that give you this sort of dashboard, but I just picked this one because it's open source and, again, it's one that StackOverflow, which I believe is a top 50 website, certainly a very high website in terms of page views and stuff like that, wrote themselves because existing ones maybe didn't fit their needs. This tool is at least a bit more … for the type of scenarios they're in, and this tool runs on a busy website. You can go and see more about Opserver there. It integrates quite nicely with MiniProfiler. What I quite like is this screenshot shows they actually use MiniProfiler, as you can see in the top left-hand corner; they run MiniProfiler on their own Opserver tool to make sure their Opserver pages are rendering reasonably quickly, which I guess makes some sense. If the page is rendering slowly because of performance issues, it's not going to help you as a dashboard; you want it updating reasonably quickly or reasonably frequently.

At some point, particularly as this talk really focuses on stuff inside the CLR, that level of performance, and we'll come to the real-world examples in a moment, you get to this topic of micro-benchmarks. I would always say that with these, you want to be profiling first to identify the places that are an issue and then do your micro-benchmarks. The problem with micro-benchmarks is you get into a situation ... you pick some bit of code that you think is running slowly, you run a micro-benchmark and say, "Oh yeah, that's running in 20 milliseconds, or whatever it might be."

You then improve that bit of code to run 10 times faster or whatever it might be, but you lose the context of where that fits in the application. You lose any acknowledgment of whether that is a part of my application that runs repeatedly or a part of my application that runs just once a day. Does that speed improvement, the optimization I've made, have any effect on the real production system, or is it just quicker when I'm testing in a micro-benchmark?

Whilst they are useful tools, micro-benchmarks, by their definition, lose the context of where the code runs in the whole system. I would always say you want to be starting with the profiling first.

Tony:

Excuse me, can I have one question about profiling here? When we use the profiler, it will certainly show us a lot of issues, and there's also this 80/20 rule in software development telling us that by modifying 20% of code, we usually solve 80% of the trouble. Does it also apply in this case? Which problems would you recommend solving?

Matt:

It's a good question. Generally, you should always be starting with the most expensive thing, the thing at the top of the profile. You should be fixing that first and the reason for that is quite simple. It's that actually when you ... if you fix that one first, if it's one you can fix, if you fix that one first, it might make some of the other ones go away, because they might have been dependent on the first one. If at all possible, you should always be starting with the most expensive thing, which fits in with if you like the 80/20 rule, the thing that's taking the time. Generally, a lot of times ... I've seen performance issues, there's often one thing that stands out as being a worst case and you should make your best effort to fix that first, if at all possible. Then when you've made that optimized, then run your tests again and see.

A similar sort of idea applies. You shouldn't just be ... if you're going to bother profiling, you shouldn't pick the thing that you want to fix the most. Try and start with the thing that's taking the most time at the top and get rid of that first, and then see what the profile looks like after that.

Tony: Okay, thanks.

Matt:

Okay, I just briefly flashed this up. It's a little bit tongue in cheek, but the idea behind this is when I show code samples ... a lot of presentations will show code samples and expect you to go away and use them. In this presentation, it's almost the opposite. I will show you some code samples of performance issues and the before and after and the changes that were made, but actually hopefully what I've got across previously is actually I really don't want people to go away and blindly change their code because it was in my presentation. I'm more showing you the tools and some of the areas of code to look at that can be performance bottlenecks, but you should certainly not be changing any of your code based on things you see on the slides coming up unless you've identified that as a bottleneck for your particular application.

The reason for that mostly is that a lot of the time the high performance code is harder to read, harder to understand and less intuitive. If it wasn't, you probably would have written it that way in the first place. It means that you're potentially making the code base, as a general thing, worse for the sake of performance, and if that's needless, for the sake of performance, it's not a good thing to be doing. That's what I, half tongue in cheek but half seriously, hope people take away, particularly when you see some of the things later on: you shouldn't be changing it just because I told you to.

With micro-benchmarking, I worked on a library called Benchmark.net and there's a nice writeup on Scott Hanselman's blog about it. Working on it with a guy called Andrey and a guy called Adam, we have attempted to make a library that will make micro-benchmarking as easy as possible for you. I'm going to click on the next slide to show an example. We're going to look at this one. This is a benchmark of reflection. As with most of these tools, and there are other tools available, what Benchmark.net has you do is write, in the functions, what you want to benchmark. You put the Benchmark attribute onto them, and some other ones as well, like Baseline, which is set to true in this case, and then the last bit of code is really just asking benchmark.net to run the benchmark. That's as simple as you want it to be. Then it does the work of giving you accurate numbers, giving you numbers in a nice format and things we'll show later on, trying to get an accurate representation of the functions that you're asking it to run.
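A sketch of what that reflection benchmark might look like with Benchmark.net's attribute-based API; the property being measured (Uri.Host) matches the example discussed later, but the exact code on the slide may differ.

```csharp
using System;
using System.Reflection;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

public class ReflectionVsDirectCall
{
    private readonly Uri uri = new Uri("http://example.com/some/path");
    private readonly PropertyInfo hostProperty = typeof(Uri).GetProperty("Host");

    [Benchmark(Baseline = true)]
    public string RegularPropertyCall() => uri.Host;

    [Benchmark]
    public string PropertyViaReflection() => (string)hostProperty.GetValue(uri);
}

public class Program
{
    // Asks benchmark.net to run the benchmarks, warm them up and report the results.
    public static void Main() => BenchmarkRunner.Run<ReflectionVsDirectCall>();
}
```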

Benchmark.net Alternatives

Tony:

Could you please mention some other tools similar to benchmark.net and tell us briefly how benchmark.net compares to those?

Matt:

Yeah, sure. A few that I've come across ... there's one called NBench, which is from the people who made Akka.NET, and it's interesting actually because, as far as I understand the story, they had a particular release of their software that had a performance regression and they wanted to make sure that didn't happen again, so they devised NBench, which is focused around writing performance tests ... they look like unit tests and they run as part of a unit test runner, but they are tailored, or they are designed, to be performance tests. As I said before, you're probably not going to catch performance issues by accident with a unit test, but you can if you craft a specific unit/performance test, and that's what they've done.

That runs on every build and they have assertions in there that say, "Did this particular code take longer than x amount of milliseconds? Does this bit of code allocate more than this much memory?"

I believe you can do, "Does this bit of code run faster than this other bit of code?"

Anyway, they have those sort of ideas and those tests will then fail if they run too slow. The idea is to pick up performance regression. Benchmark.net is not focused around that, it's more of a console running tool. It could be extended, but we don't have that at the moment. That's the main difference with NBench.

There's also another tool called xunit.performance, that's used by some of the Microsoft projects like Roslyn and CoreFX, and they have a similar idea: it runs on every build and they've crafted performance tests for specific parts of that code, and they just want to spot regressions, so xunit.performance gives them tracking over time so they can see these bits of code are running this fast in this build and that fast in that build, and whether over time they are regressing, getting slower or faster. The main issue with these sorts of tools is you need to make sure you're running on the same hardware if you're going to compare runs like that.

Another tool I should mention is etimo.benchmarks. I just learned about this the other day; it's very similar, much more aligned to benchmark.net and similar tools. There are other tools out there. Most of them, all the ones I've come across anyway, do the accuracy bit ... it's not straightforward, but they all do that, otherwise they wouldn't be worth using, and you want accurate results. They just vary in terms of their focus, whether it's focused on performance tests that fail the build, or focused on running in the console, where you want to try things out and get results like that; benchmark.net has a different focus.

Tony: Okay, thank you.

Matt:

Yeah, no problem. I picked this example because reflection is often talked about and we say, "Reflection is slow."

We're just looking here at the System.Uri object and we're doing a regular property call, so object dot Host, in the first benchmark, and in the second one we're going to get the same thing via reflection, get the same value and see the difference. Just a really quick recap: certainly in benchmark.net we go all the way down to reporting results in nanoseconds, which is a billionth of a second. There's microseconds and milliseconds as well, just as a quick refresher for people, if you're not familiar with those and the terminology. A lot of the time when you're talking about stuff just in the CLR, you can get down very easily into nanoseconds; it's not inconceivable to have stuff running in the nanosecond range, and hopefully the tools will report that.

What does this look like for the benchmark we talked about before? This just shows quickly the output of benchmark.net. As I said, we mostly focus on console output, although we provide tables and stuff you can paste into other places. These are the numbers. The regular property call comes in at 13 nanoseconds and reflection comes in at 230, so reflection is clearly slower, there's no argument about that. It's roughly 18 times slower. In this case, the regular property on Uri is not the simplest of properties, there's a bit more going on, so the regular property call is a bit slower than what you'd expect if it was just directly getting a backing field, which skews things a little bit. Anyway, you get a rough idea of the timings here. That's why in benchmark.net we like to report the scaled number and the absolute timings, to give you an idea. So yes, reflection is slower, definitely. How much that matters depends on how often you're doing it and in what ways you're doing it. This is just a simple property access. If you're doing more complex reflection, that's going to add up. If you're doing it lots of times a second, that's going to add up. The idea is not to say that the advice is wrong, reflection definitely is slower, but how much slower is worth figuring out for your example, for your scenario.

StringConcat vs. StringBuilder 

Onto another one. StringConcat versus StringBuilder. It's always sort of said that you should be using StringBuilder, and that's a great general rule, we're not going to argue against that, but in terms of performance, what's the difference? Interestingly enough, there's a link to the Roslyn issue at the bottom to try and introduce StringBuilder in more places, and you can read the issue if you're interested to see where that went, because generally StringBuilder is better. You're doing fewer allocations. The issue, if you see what we're doing with string concat in this case, is we're taking the string and creating a new string, but each time we're throwing away the previous string, because next time around the loop we're adding another string to it. Strings in .NET are immutable. If you can concat them all in one go, great, but if you concat like this in a loop, you're basically taking a string that contains the number zero, adding to that a string that contains the number one, making a new string which is zero one. Next time around it's taking zero one and adding two, and so on and so on. Basically there's a lot of waste in this particular example of string concat.
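The two variants being compared look roughly like this (the loop count is arbitrary here):

```csharp
using System.Text;

static class StringBuilding
{
    public static string WithConcat(int count)
    {
        string result = string.Empty;
        for (int i = 0; i < count; i++)
        {
            result += i.ToString(); // every += allocates a brand-new string and discards the old one
        }
        return result;
    }

    public static string WithStringBuilder(int count)
    {
        var builder = new StringBuilder();
        for (int i = 0; i < count; i++)
        {
            builder.Append(i);     // appends into an internal buffer, no throwaway strings
        }
        return builder.ToString(); // one final string at the end
    }
}
```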

StringBuilder doesn't have that issue; we only build the string fully up at the end, we only do the ToString at the end. Anyway, what does the actual difference look like? This is an output from benchmark.net. It shows you a lot of detailed information around the allocations, and I realize this is hard to look at, I'm just going to show it briefly, this is the raw stuff we give you, but much more useful is some graphs that show this in a better way. Basically, the long and short of it is that for a lot of cases the performance isn't hugely different, but there comes a point, depending on how many times you're concatenating strings, where the performance of StringBuilder is way better. That's mostly related to the fact that it's doing fewer temporary allocations, so less work for the garbage collector to do. The actual difference in raw speed is not huge, but it's all about those temporary strings.

This is one example, and it's potentially a contrived example because we're concatenating a lot of very small strings. If you were to concat a small number of larger strings, your results would be different. The point is not to take this as a general thing, but to measure this sort of stuff when it matters and try and get some truth around these ideas of whether StringBuilder is better than string concat.

Tony:

Since this kind of performance issue is visible in source code directly, should we consider them when doing code reviews?

Matt:

Yes. I think the general rule of using StringBuilder over string concat is a great one and I think it should always be applied, basically because, if we flick back to the code, you're not making the code more complex. You're not using some handwritten class to get absolute performance; StringBuilder is built into the .NET runtime. It's designed for this sort of thing. It's tailor made for it. This sort of thing is a great example of this whole idea of premature optimization or not. Actually, we want to be writing the best code we can from the start, based on our knowledge and our best practice and all that sort of stuff. I think this is a good one, but at the same time, I would say, be careful in code reviews going beyond this. I think StringBuilder and string concat are on quite good ground there, but particularly back to the reflection example, to say blindly we shouldn't ever use reflection ... actually sometimes you have no choice, so that kind of balances that one, but in other situations it's like: how much slower is reflection? Is it a worthwhile trade off in our case? To say at the code review stage, don't use reflection… I think, like in a lot of these things, it is a balance. But with StringBuilder and string concat, going to your actual question, there's no real downside to changing the code to StringBuilder: it's no more complex, you're not writing code that won't be understandable by someone else, things like that. I think it's a worthwhile thing there. I guess the thing to do is make sure that you have someone who's taken the time to just check these things and understand what's going on in different scenarios, so there's a bit more knowledge and data behind it, would be my recommendation.

Tony: Okay. Thank you.

Garbage Collection 

Onto some other things. Basically, we touched upon it in the benchmark just there, but with the .NET Garbage Collector ... it's a fantastic aid to programming on the .NET runtime. It takes away so many issues that you have to worry about in languages or runtimes that don't have it, and it's true that allocating is very cheap, but the main issue is the cleaning up afterwards. To make allocations cheap, the .NET GC has to do work in the background. It has to compact, it has to search for objects that are no longer referenced, this sort of stuff. It's actually sometimes difficult to measure the impact, because some of those tasks are, in effect, asynchronous: when you write a bit of code, it's not at that point that the Garbage Collector kicks in. The Garbage Collector kicks in when it feels like it needs to, and at that point you might get GC pauses. That's the main issue.

There's a few tools that can help understand when there's an excessive amount of GC. Very simply, with the perf counter tools, the Sysinternals tools, there's a time in GC counter. It's hard to put an exact number on it, but I would say once you get above 50% of the time in GC, there's a problem, because it's spending more time doing garbage collection than it is running your program. I've heard numbers that say 10% or above in GC is another cause for concern. But certainly seeing a sustained high amount of time in GC is a red flag and you want to investigate that more.
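As a crude, illustrative way of seeing how much garbage a piece of code creates, you can compare gen 0 collection counts before and after running it; this is no substitute for the perf counters just mentioned or a proper profiler, and the helper below is hypothetical.

```csharp
using System;

static class GcProbe
{
    public static void Measure(Action action)
    {
        long bytesBefore = GC.GetTotalMemory(forceFullCollection: true);
        int gen0Before = GC.CollectionCount(0);

        action();

        int gen0After = GC.CollectionCount(0);
        long bytesAfter = GC.GetTotalMemory(forceFullCollection: false);

        Console.WriteLine($"Gen 0 collections: {gen0After - gen0Before}, " +
                          $"approx. heap growth: {bytesAfter - bytesBefore} bytes");
    }
}
```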

Another tool that allows you to investigate that more is PerfView. I always say that PerfView wins the prize for being the most useful but possibly ugliest looking tool. It's on that end of the scale. I'm sure we've all used tools that look amazing but give you no functionality, or no useful functionality. PerfView is the complete opposite. Don't be put off by what it looks like. It's a functional UI. It does exactly what it needs to, but it can give you some very useful, low-level information. It works on top of ETW events, Event Tracing for Windows events. It's designed to be very fast. They do say that it can be used on production apps for short periods of time with minimal impact. That's not saying you'd want to turn it on all the time, but you can turn it on for a while for an investigation. Please test it out before turning it on in your production app.

In terms of GC, what it gives us in the chart at the bottom is this max pause time, and a GC pause time equals time when your application wasn't running. The ones here have quite a small pause time of eight milliseconds, but it does vary a bit with which GC mode you're in, whether it's workstation or server, and background or foreground. At certain points in time, the GC kicks in and does stop the world, or kind of stops the world, and when it's doing that, none of your code can run. If that pause takes 100 milliseconds and your SLA is 100 milliseconds, you've lost it because of GC pauses, because any responses or button clicks that were in flight at that time will be paused until the GC is finished.

Fortunately, over the last releases of .NET, the GC has gained more and more features, so it does this more and more in the background, and the GC server mode has a background mode now in recent .NET versions. So the times when your entire application is paused are becoming less and less, but it's still definitely a possibility.

StackOverflow saw these huge spikes in GC pauses ... at least over one second up to four seconds and they would render their pages generally in under 100 milliseconds. So for them, this is a bad experience for users and you can see a link there at the bottom for the full details of what happened there and how they fixed it. 

Stack Overflow Performance Lessons 

There are also some nice performance lessons from StackOverflow. They, controversially, say use static classes: for them, the performance benefit of having static classes versus […] classes all the time was found to be a measurable impact on their application.

I'm not, again, saying that's a general thing, but for them they found that it worked well. They're also not afraid to write their own tools when off the shelf tools don't give them what they need or don't give them the performance they need. Dapper is their micro ORM, which has very high performance; Jil is a JSON serializer, again tuned for high performance; and MiniProfiler we talked about before. For a lot of this, you need to understand the platform; the CLR is not a black box. There's stuff going on in there, particularly around the garbage collector and things like that. If you want to get the most performance out of .NET, you need to try to understand what's going on there.
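For context, using a micro ORM like Dapper looks roughly like this: a Query<T> extension method on the connection maps rows straight onto your type with very little overhead. The Post type and the SQL here are invented for illustration.

```csharp
using System.Collections.Generic;
using System.Data.SqlClient;
using Dapper;

public class Post
{
    public int Id { get; set; }
    public string Title { get; set; }
}

public static class PostRepository
{
    public static IEnumerable<Post> TopPosts(string connectionString)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            // Dapper opens the connection if needed and maps each row to a Post.
            return connection.Query<Post>(
                "select top 10 Id, Title from Posts order by CreationDate desc");
        }
    }
}
```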

Roslyn Performance Lessons 

Again, onto the code samples, and to finish with this last section, some examples from the Roslyn codebase. There's an entire talk on this; this is just a small sample. You can see the link at the bottom. There are some places ... the thing I find most interesting about this is that these are the people who write C#, the C# compiler team. Some of the stuff they came up against is interesting because some of it then fed back into the language: they were seeing these as performance issues in the product they were writing, the Roslyn C# compiler, and so they then, where applicable, fed it back into the language.

With all these examples, you can assume it's a bit of code that's running a lot. This is a logger class, and the fix in this case, what they changed, is that they had all this boxing and they added the ToString calls there. There's some details in the pull request at the bottom to explain what's going on there and give a bit more context, and how it, in some ways, has been added back into the compiler but can't be in all cases because ... the link explains it a bit better. But what's interesting with this is that if you use ReSharper or other tools, they tell you to remove the ToString calls, because they are technically redundant, but not if you care about the boxing.

Really, this isn't one you should be applying unless, as I said, you've profiled. It makes the code, I guess, uglier. It's not intuitive why you're doing that. Unless that code's being called a lot, the overhead of boxing won't be noticeable. But it is one they came across in Roslyn.
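A hedged sketch of the kind of change being described: a logging method that takes object parameters boxes every value type passed to it, and calling ToString() at the call site avoids the box at the cost of looking redundant, which is exactly what tools like ReSharper will flag. The names below are illustrative, not the actual Roslyn logger.

```csharp
public static class Logger
{
    public static void Log(string message, object arg)
    {
        // formatting and writing to some sink
        System.Console.WriteLine(message + ": " + arg);
    }
}

public static class Example
{
    public static void Before(int requestId)
    {
        Logger.Log("Handling request", requestId);            // the int is boxed to object
    }

    public static void After(int requestId)
    {
        Logger.Log("Handling request", requestId.ToString()); // no box; the string was needed anyway
    }
}
```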

Another one they came across as a performance issue. This is finding a matching symbol in the compiler. You can assume that's being called a lot of times, a lot of calls to this over a period of time. The Roslyn compiler is not just running when we build our projects in Visual Studio; it's constantly running in the background of Visual Studio to power IntelliSense, to power syntax highlighting. So there are bits of Roslyn, in effect, running continuously in the background whilst we develop in Visual Studio as well.

Interestingly, their fix in this case was to not use LINQ. This is really the one I would hate for anyone to go away and take out LINQ over. I think LINQ is a fantastic feature. It makes code that is much more understandable, more concise. Anyone, almost, could read LINQ. It takes a bit more understanding, potentially. But there's an overhead to LINQ. It doesn't come for free. There's stuff going on with the compiler to make that possible, stuff going on in the background. It's basically, again, an extra allocation with LINQ. They found that the old iterative way, a simple foreach loop doing the same thing, worked for them in this case.

And to give you an idea, the actual difference in timing between the iterative version and LINQ isn't huge; it isn't even double in terms of raw speed. But the main issue is the number of gen zero collections, basically caused by the number of allocations. The iterative one will almost never allocate; the LINQ one almost always will allocate, basically, and it's those allocations that cause the garbage collector to do more work, and if you call this sort of code enough, that causes the difference in times.
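Roughly the shape of that change (the symbol-lookup names are invented): the LINQ version allocates an enumerator and, because the lambda captures a variable, a closure on every call, while the plain foreach typically allocates nothing.

```csharp
using System.Collections.Generic;
using System.Linq;

static class SymbolLookup
{
    // Concise, but each call allocates (a closure plus a boxed enumerator).
    public static string FindWithLinq(List<string> symbols, string name)
    {
        return symbols.FirstOrDefault(s => s == name);
    }

    // Same behaviour, no hidden allocations.
    public static string FindIteratively(List<string> symbols, string name)
    {
        foreach (var s in symbols)
        {
            if (s == name) return s;
        }
        return null;
    }
}
```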

The final one from Roslyn, again, is a bit of code that's running a lot. It's working with generic types in Roslyn and it's already using StringBuilder, which we discussed before is actually good best practice. It's not allocating lots of temporary strings; it's building the string up bit by bit and then creating it at the end. So in terms of string processing, this is about as efficient as you can currently get in .NET. Actually, they found that in this case they needed object pooling. They found that the allocation of a new StringBuilder object was costing them every time, if you do it a lot. StringBuilder, by default, I think pre-sizes to around 16 characters, so there's a certain amount of bytes allocated every time you make a StringBuilder, even before you've added something to it. They made a simple object pooling cache. I'll show you the code they used to do that. The pool is a thread static field: when they Acquire, they look for the one on the current thread. If it's not there because it's not been created yet, they create a new one; otherwise they clear it out and then return it. When they've finished with it, they call GetStringAndRelease and put it, in effect, back in the cache. It's not a pool in the classic sense, because it's per thread and there's a single one.
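A simplified sketch of that per-thread cache, using the Acquire and GetStringAndRelease names from the talk; the real Roslyn/.NET implementation has more to it (a size limit, for example).

```csharp
using System;
using System.Text;

internal static class StringBuilderCache
{
    [ThreadStatic]
    private static StringBuilder cachedInstance;

    public static StringBuilder Acquire()
    {
        var sb = cachedInstance;
        if (sb != null)
        {
            cachedInstance = null; // taken; nothing cached on this thread until it's released
            sb.Clear();
            return sb;
        }
        return new StringBuilder();
    }

    public static string GetStringAndRelease(StringBuilder sb)
    {
        string result = sb.ToString();
        cachedInstance = sb;       // put it back for the next caller on this thread
        return result;
    }
}
```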

There's a couple of things just to call out before people even consider object pooling. One is that if you're going to pool objects, you need to make sure you return them to a clean state when you're finished with them, because when we allocate new objects, normally that's done for us behind the scenes and we get an object in a known state, with things zeroed out, nulled out if you like. With StringBuilder, that's nice and easy because you can call Clear; in other cases it's not. This is a simple cache because it's thread static, so it's per thread. The downside is it stays on that thread for the lifetime of that thread; the StringBuilder is still there, and if you have lots of threads, you have lots of StringBuilders. Often the other solution used is a global shared pool, but that requires locking and that's much more complex. There's a trade-off in both these scenarios, and you need to worry about your object pool growing too large and storing more than you need, because there's a cost to storing it. It has to be tuned and thought about before being implemented.

That brings me to the end of the talk. Hopefully that's been useful for people. I don't know if there are any questions that have come up as I've been going along.

Q&A

Q: Can benchmark.net be used in continuous integration and fail a build?

A: It's not something we currently support, if you like, out of the box with the main product. It's something we keep thinking about adding, but with time... there have been a few community contributions getting us there, but we're not there yet, though it's certainly something we plan to have. You can take the raw output benchmark.net gives you and certainly build that yourself, but it's not something we provide as yet. The main issue with providing it is that if you're going to do the whole thing, you need to worry about storing your results, running on consistent hardware and a lot of other stuff that's outside the core ideas of benchmark.net. It may be something we get in the future, but it's not something you can do straight away with benchmark.net. You can certainly use benchmark.net to give you the raw numbers and then build that tooling on top of it, yes.

Q: How is benchmarking different from finding the difference of time span between start and end of a function?

A: How does benchmark.net do it differently? One of the main things is we use Stopwatch, which is a bit more accurate than just relying on a time span. We also run the code multiple times. There aren't lots of things you have to do, but you certainly want to call the function once first to let it be jitted, because you pay the cost of jitting the function the first time it's called. You don't want to measure that as part of the real performance, because it only happens once; you want to get it out of the way. Then you generally want to run the function multiple times in batches and time the batches, because there's a limit to what you can measure: if we're talking about something that takes nanoseconds, you can't just measure that with a single before and after.

You need to run it multiple times in a batch, until you can actually record the length of the batch, and then work out the per-iteration time. So it's doing a bit more, but it basically boils down to running the function multiple times, jitting it first of all. The one other thing we do concerns the JIT compiler in .NET: if it sees that you're calling a function but not doing anything with the result, it might remove that call, reasoning that there's no need to do it because the result doesn't go into anything. We make sure that doesn't happen; we prevent it, so that the code you think you're benchmarking is actually benchmarked. So those are the main sorts of things we do.
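
As a rough sketch of those ideas (warm-up, batching, and consuming the result so the JIT can't remove the call), here is what a naive hand-rolled version might look like; BenchmarkDotNet does considerably more than this:

```csharp
using System;
using System.Diagnostics;

static class NaiveBenchmark
{
    public static void Run(Func<int> action, int iterationsPerBatch = 1_000_000, int batches = 10)
    {
        int sink = action();                       // warm-up call, so the one-off JIT cost isn't measured

        for (int b = 0; b < batches; b++)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterationsPerBatch; i++)
                sink ^= action();                  // consume the result so the call can't be optimised away
            sw.Stop();

            // Convert the batch time into a per-iteration figure in nanoseconds.
            double nsPerOp = sw.Elapsed.TotalMilliseconds * 1_000_000 / iterationsPerBatch;
            Console.WriteLine($"Batch {b}: {nsPerOp:F2} ns/op");
        }

        GC.KeepAlive(sink);
    }
}
```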

Q: Does the LINQ performance problem arise only in Roslyn, or does it also occur on older .NET platforms?

A: With regards to LINQ, there are no major differences in the Roslyn compiler compared to the older one. So performance issues can exist in LINQ across all .NET compiler versions. The example was included because it was a change made to the Roslyn code base itself, not to the code it produces when compiling LINQ.

Q: What is the best setting for GC on Windows Server 2012 R2 which hosts 250 ASP.NET MVC apps on IIS? I am talking about gcConcurrent and gcServer in the aspnet.config file.

A: Generally, server mode is best when you are running on a server, but you actually don't need to do anything, because it's the default mode for ASP.NET apps. From https://msdn.microsoft.com/en-us/library/ee787088(v=vs.110).aspx: "You can also specify server garbage collection with unmanaged hosting interfaces. Note that ASP.NET and SQL Server enable server garbage collection automatically if your application is hosted inside one of these environments."
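
For completeness, and assuming a .NET Framework application where you control the configuration file yourself, the relevant elements look like this (as noted above, ASP.NET-hosted applications already get server GC by default):

```xml
<!-- A minimal sketch: both elements go under <runtime> in the relevant
     configuration file (app.config / web.config / aspnet.config). -->
<configuration>
  <runtime>
    <gcServer enabled="true" />
    <gcConcurrent enabled="true" />
  </runtime>
</configuration>
```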

Q: Do we get hints on how to measure, or which tools are recommended for measuring performance issues?

A: I really like PerfView; it takes a while to find your way around it, but it's worth it. This post by Ben Watson will help you get started: https://www.philosophicalgeek.com/2012/07/16/how-to-debug-gc-issues-using-perfview/, plus these tutorials: https://channel9.msdn.com/Series/PerfView-Tutorial. I've also used the JetBrains and RedGate profiling tools and they are all very good.

Q: What other techniques can be applied for embedded .NET, single-use applications, to avoid unpredictable GC hesitation, considering the embedded device is not memory constrained?

A: Cut down your unnecessary allocations, take a look with a tool like PerfView (or any other .NET profiler) to see what’s being allocated and if you can remove it. This post by Ben Watson will help you get started with PerfView https://www.philosophicalgeek.com/2012/07/16/how-to-debug-gc-issues-using-perfview/. PerfView will also tell you how long the GC pauses are, so you can confirm if they are really impacting your application or not.

Q: How is benchmarking different from finding the difference of timespan between start and end of a function?

A: BenchmarkDotNet does a few things to make its timings as accurate as they can be (see the sketch after this list):

  1. It uses Stopwatch rather than TimeSpan, as it's more accurate and has less overhead.
  2. It calls the [Benchmark] method once, outside the timer, so that the one-time cost of JITting the method is not included in the timings.
  3. It calls the [Benchmark] method several times, in a loop. Even Stopwatch has limited granularity, so that has to be accounted for when the method only takes a few nanoseconds to execute.
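
As a quick illustration of how that looks from the user's side, here is a minimal BenchmarkDotNet example; the string-concat-versus-StringBuilder comparison mirrors the one discussed in the talk, and [MemoryDiagnoser] additionally reports allocations and Gen 0 collections:

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;

[MemoryDiagnoser]   // also report allocated bytes and Gen 0/1/2 collections
public class StringConcatVsStringBuilder
{
    [Benchmark]
    public string Concat()
    {
        var s = string.Empty;
        for (int i = 0; i < 100; i++)
            s += "x";               // allocates a new string on every iteration
        return s;
    }

    [Benchmark]
    public string Builder()
    {
        var sb = new System.Text.StringBuilder();
        for (int i = 0; i < 100; i++)
            sb.Append("x");         // appends into the builder's internal buffer
        return sb.ToString();
    }
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<StringConcatVsStringBuilder>();
}
```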

Q: This is not a question, but an answer to the question about examples of unit tests not showing performance issues: we need to load data from Azure / SQL Server in production, but in unit tests we have a mock service that responds immediately.

A: Thanks, that’s another great example, a mock service is going to perform much quicker than a real service!

Q: What measure could point me in the right direction to know that I could benefit from object pooling?

A: See Tip 3 and Tip 4 in this blog post by Ben Watson https://www.philosophicalgeek.com/2012/06/04/4-essential-tips-for-high-performance-garbage-collection-on-servers/ for a really great discussion of object pooling.

Q: How does benchmarking work on asynchronous code?

A: Currently we don’t do anything special in BenchmarkDotNet to help you benchmark asynchronous code. It’s actually a really hard thing to do accurately and so we are waiting till after the 1.0 release before we tackle it, sorry!

 

About the speaker, Matt Warren


Matt is a C# dev who loves nothing more than finding and fixing performance issues. He's worked with Azure, ASP.NET MVC and WinForms on projects such as a web-site for storing government weather data, medical monitoring devices and an inspection system that ensured kegs of beer didn't leak! He’s an Open Source contributor to BenchmarkDotNet and RavenDB. Matt currently works on the C# production profiler at ca and blogs at www.mattwarren.org.