[[This is Chapter 16(a) from “beta” Volume V of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides a free e-copy of the “release” book to those who help with improving it; for further details see “Book Beta Testing”. All the content published during Beta Testing is subject to change before the book is published. To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]
[[TODO: cross-platform C++, with a real-world example going across PC/Mac/NDK]]
Using C++ in the context of games implies quite a few considerations, many of them performance-related. As usual with performance, many things are subjective and/or depend on the specifics, so it is important to take everything you hear about it (this book included) with a good pinch of salt. Still, I’ll try to provide a few pointers and my personal feelings in this regard.
On Performance in General

One thing which is very important when we’re speaking about performance, is the so-called 90/10 law:
90% of the time is spent in 10% of the code, and only 10% of the time in the remaining 90% of the code
This law tends to hold for games too (ok, for some games the ratio might be closer to 70-30, but it is still quite uneven, as there are corner cases, and corner cases of corner cases, and so on).
This observation means that
Not all the code needs to be absolutely optimized
In particular, the main question we face during development is: “is this code bad enough performance-wise that it needs to be rewritten right away, or can it wait until we actually run into performance problems and can use a profiler to identify the culprit instead of guessing?”
My approach in this regard usually goes along the following lines:

- Rewrite right away only those things which are outright stupid performance-wise (such as using an O(N^2) algorithm where an O(N) one is readily available);
- Also rewrite right away anything which qualifies as a “Resource Hog” (more on Resource Hogs below);
- For everything else – leave the code alone until you run into actual performance problems, and then let the profiler (rather than guesswork) tell you what to optimize.
As you can see, despite providing some (hopefully useful) hints, the approach outlined above still leaves LOTS of room for interpretation. Which, in turn, brings us to the most important statement of all:
Finding “what to optimize and what to leave as is for now” doesn’t really qualify as science. It is an art.
On “noisy neighbor” code and “Resource Hogs”

One further thing to keep in mind with regards to the 90-10 (or 70-30) rule is that even if only the performance of 10% of the code matters,
the rest of the code can still affect performance of critical 10% in a Pretty Bad Way 🙁
It is related to the fact that LOTS of CPU (and more generally – computer) resources are de-facto shared between different pieces of code. Moreover, this effect will be observed even if we dedicate 100% of one of our cores to our critical code1 :-(. Such sharing happens because at least some of the CPU caches are shared between different pieces of code (and if we don’t dedicate 100% of a given core to a single task, even per-core caches will be effectively shared due to context switching), memory is shared too, and even some blocks of the CPU are often shared between threads concurrently-run-on-the-same-core due to hyper-threading a.k.a. simultaneous multithreading a.k.a. SMT.
And as soon as we have a resource which is shared between a non-critical piece of code and a critical one – well, it can happen that the non-critical piece of code will overuse this resource to the point where it starts affecting the critical code 🙁 . For example, if our non-critical code reads LOTS of RAM – it can easily cause the CPU to start evicting data-which-belongs-to-critical-code from the L3 cache (an effect known as cache pollution), which in turn will affect the performance of our critical code.
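To make the cache pollution effect tangible, here is a minimal sketch (the buffer sizes and all the names are my own arbitrary assumptions, and the exact numbers will vary wildly between CPUs): a non-critical “hog” thread merely streams through a large buffer, and the very same critical loop becomes measurably slower, even though the two run on different cores:

```
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

static std::atomic<bool> g_stop{false};

// Non-critical code: merely streams through ~64MB over and over, evicting
// everybody else's data from the shared L3 cache along the way
void resource_hog() {
    std::vector<char> big(64u * 1024 * 1024, 1);
    long long sink = 0;
    while (!g_stop.load(std::memory_order_relaxed)) {
        for (std::size_t i = 0; i < big.size(); i += 64)  // one touch per cache line
            sink += big[i];
    }
    std::printf("(hog sink=%lld)\n", sink);  // keeps the loop from being optimized away
}

// Critical code: repeatedly walks a ~4MB working set, which fits into a
// typical L3 cache -- as long as nobody pollutes that cache
double critical_pass_us() {
    std::vector<int> data(1024 * 1024, 2);
    long long sink = 0;
    auto start = std::chrono::steady_clock::now();
    for (int pass = 0; pass < 100; ++pass)
        for (int v : data) sink += v;
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("(sink=%lld)\n", sink);
    return us / 100.0;  // microseconds per pass
}

int main() {
    std::printf("alone:    %.1f us/pass\n", critical_pass_us());
    std::thread hog(resource_hog);
    std::printf("with hog: %.1f us/pass\n", critical_pass_us());  // expect noticeably worse
    g_stop = true;
    hog.join();
}
```

On a typical desktop CPU with a shared L3, the second measurement tends to come out noticeably worse than the first – which is exactly the “noisy neighbor” effect in action.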
These observations are usually not that bad in practice, but they lead us to one important consequence:
Even being non-critical doesn’t give a right to be a Resource Hog
1 for example, via affinity masks
On Server Performance

In certain circles of gamedevs (especially some of those developers coming from the client side) there is a belief that performance only matters on the client. In extreme cases, it may result in statements along the lines of “[performance is] not an issue on the backend, where scaling up a service can mean nothing more than dragging a slider to the right” [GlazerMadhav].
Well, there is only one objection to this statement, and it is that the statement is dead wrong. First of all, linear scalability should not be taken for granted (hint: scaling a DB in a linear manner is very difficult at the very least, see also Chapter [[TODO]]). Second, even when you do have linear scalability,
“dragging a slider to the right” is going to cost your company.
Let’s consider a hypothetical example first. The current industry standard for multi-player games revolves around 1000 players/2S-workhorse-server (or 100 players/core, which is more or less the same thing). Now let’s consider a scenario where you didn’t care about server performance at all, and wrote your game in a way that it runs only 10 players/server. And then your game got Really Successful and grew to a million simultaneous players, which means that you need to run 100’000 servers. Not only will finding 100’000 servers to rent be quite a task by itself, but even at $250/server/month you’ll need $25M/month just in server costs (effectively requiring each of your players to pay at least $25/month just for servers, and that’s without any profit or even money to pay your salary). Things become even worse when we’re speaking about costs for Free-to-Play (a.k.a. F2P) games, where most of the players are not paying; for F2P even 100 players/server MAY easily put your game at risk (and each and every bit of extra cost will put you at a competitive disadvantage compared to your better-performing competition).
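If you want to double-check the arithmetic, here is a throwaway snippet (the $250/server/month and 1M-players figures are the same illustrative assumptions as in the text, not real quotes):

```
#include <cstdio>

int main() {
    const double players = 1000000;          // 1M simultaneous players
    const double server_month_usd = 250.0;   // illustrative rent per server per month
    const double options[] = {10, 100, 1000, 8000};  // players/server scenarios
    for (double per_server : options) {
        double servers = players / per_server;
        double monthly = servers * server_month_usd;
        std::printf("%6.0f players/server -> %7.0f servers, $%11.0f/month ($%6.2f/player/month)\n",
                    per_server, servers, monthly, monthly / players);
    }
}
```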
And depending on many factors, performance differences between different implementations CAN be REALLY HUGE. Coming from hypothetical clouds back to the real world, let’s consider another example. Once upon a time, I’ve seen two games. One had 400K simultaneous players, the other one – around 100K, and from the player’s perspective the games were 99%-the-same. The former company, however, was running its game off a mere 50 servers, while the latter one took 400 servers to run. It means that the first one managed to get 8’000 players/server, while the second one was at 250 – a whopping 32x difference in the number of players per server (and therefore in server costs per player). And the difference was pretty much mirrored by the size of the admin teams, too, ouch. Now you’re in a good position to make a guess – which of these two companies had more money for advertising?2
One last note in this regard – I do NOT mean that you should aim to absolutely-optimize the performance of your Servers3 (and I do NOT rule out languages-other-than-C++ for the server side either). All I want to say is that
Server-Side Performance Does Matter Too
And when going beyond this statement – see above about optimization being an art rather than science; which level of inefficiency to consider as “unacceptable from the very beginning”, is one of those choices which you’ll need to make for yourself.
For the rest of this Chapter, whenever speaking about best practices performance-wise, I’ll mean that they apply BOTH to Client-Side and to Server-Side (that is, unless it is explicitly stated otherwise).
2 Yep, you guessed it right
3 Nor should you aim to absolutely-optimize the performance of your Clients. While this may sound like heresy to some of the Client-Side gamedevs out there, I stand by this statement
Allocations are Evil. Even More So Because of Locality

In the context of performance, first of all, let’s discuss one thing which IMHO tends to affect the performance of C++ programs the-third-most4 – it is allocations.
The first important thing to remember for all allocations (whether it is C malloc() or C++ new) is that
Allocations are Expensive
Sure, there have been quite a few improvements to allocators over the last 10 years (including so-called thread-caching in tcmalloc and per-thread arenas in ptmalloc2), but well – allocation is still a rather slow operation (give or take, we’re speaking about the order of magnitude of 200 CPU clocks per allocation/deallocation pair, and more if you’re not so lucky with your allocator).
BTW, when it comes to discussions of “smart pointers are evil” – actually it is not smart pointers per se, but rather the allocations-inevitably-arising-from-smart-pointers, which affect the performance of your code; on the other hand, if you replace a smart pointer with a naked-pointer-allocated-via-malloc – it is very unlikely that you’ll see any performance gain. More on this in the [[TODO]] section below.
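To illustrate this point, here is a sketch (the Bullet type and all the names are purely hypothetical): replacing a unique_ptr with a naked new’ed pointer keeps exactly the same one-allocation-per-object cost, while storing objects by value is what actually removes the per-object allocations (and improves locality as a bonus):

```
#include <memory>
#include <vector>

struct Bullet { float x, y, vx, vy; };  // purely hypothetical game object

std::vector<std::unique_ptr<Bullet>> bullets_smart;  // 1 heap allocation per Bullet
std::vector<Bullet*> bullets_naked;                  // STILL 1 heap allocation per Bullet
std::vector<Bullet> bullets_by_value;                // amortized ~0 allocations per Bullet

void spawn(float x, float y) {
    bullets_smart.push_back(std::make_unique<Bullet>(Bullet{x, y, 0, 0}));
    bullets_naked.push_back(new Bullet{x, y, 0, 0});  // same malloc cost as the line above
                                                      // (and leaks unless deleted manually)
    bullets_by_value.push_back(Bullet{x, y, 0, 0});   // no per-Bullet heap allocation,
                                                      // and contiguous storage as a bonus
}
```

And if you know the number of elements in advance, a reserve() call on the by-value vector removes even the occasional reallocation-on-growth.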
Allocations and Data Locality

In addition to allocations being relatively slow, even worse is the fact that
Allocations tend to reduce data locality
In many (most?) cases, the poor data locality caused by allocations incurs a slowdown which is worse than the cost of the allocation itself :-(. However, the concept of data locality is not that obvious (nor can it be easily separated from other effects), so let’s see what it is all about.
During the times of Ancient Gamers (also known as the times of Atari, ZX Spectrum, and Commodore64), CPUs were running at clock speeds of single MHz (and one operation usually took several clocks). At the same time, memory ran at a speed comparable to that of the CPU. For example, on the Z80 CPU5 one register-register op took 4 CPU clocks, while the same operation register-memory took 6 clocks.
Since that time, however, things have changed considerably. A register-register operation currently routinely takes less than 1 CPU clock (don’t ask me how it is possible 😉6), but reading from (main) memory takes 100+ CPU clocks. In other words, while memory latencies have been reduced since ancient times by a factor of 10x,7 over the same time frame the CPU’s pure number-crunching speed has improved by a factor of 1000x+. This discrepancy was addressed by all kinds of caches (the Z80 didn’t have any cache at all, while in recent years it is common to have at least three levels of CPU caches, known as L1-L3, with an occasional L4). Not really surprisingly, these caches and their mechanics tend to have a profound impact on performance.
I don’t want to go into too much discussion of caches etc. (for this, you can read [Drepper07], though beware of some outdated concepts such as Northbridge and Front-Side Bus, which have by now been replaced with modern NUMA-related stuff, discussed, for example, in [Lameter13] and [GaudEtAl15] [[TODO: if you know an up-to-date all-around reference covering BOTH usual RAM stuff and NUMA, please LMK]]). However, I will make a few observations which are applicable to caches and programming:8

- Data is moved between memory and caches in whole cache lines (64 bytes on x64), so touching a single byte pulls the whole line in – and makes the data next to it essentially free to access;
- Access latencies grow steeply along the hierarchy: give or take, L1 costs around 4 clocks, L2 around 10-12, L3 around 30-50, and main RAM 100+ clocks;
- CPUs prefetch quite well over sequential access patterns, but pointer-chasing over randomly-scattered allocations defeats both caching and prefetching.
Armed with this information, let’s take a look at a classical in-memory binary tree structure (the one used by STL’s map<> etc.); let’s assume that our tree/map has a million elements. Now, let’s see what happens if we want to look something up within the tree: even if our tree is reasonably well-balanced (which it is with STL), we’ll need to go through 20 nodes of the tree (as 2^20 ~= 1e6). And if each of our elements is allocated on the heap, it means that we’ll need to make about 20 jumps over memory. In practice, some of these tree nodes will happen to be cached, but some won’t (causing a so-called cache miss when we’re trying to access them), so from a certain point in our lookup we’ll start incurring that 100+-clock penalty for each of these jumps(!).
Which leads us to a very important conclusion:
Data Locality DOES matter.9
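As a quick illustration of the 20-jumps-vs-2-jumps point (see also footnote 11 below), here is a toy microbenchmark along the lines of the example above; as with any microbenchmark, treat the numbers as a ballpark only, as real results depend heavily on access patterns and working-set sizes:

```
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <map>
#include <random>
#include <unordered_map>
#include <vector>

int main() {
    const int N = 1000000;  // the 1e6-element example from the text
    std::map<int, int> tree;           // each node is a separate heap allocation
    std::unordered_map<int, int> hash; // lookup is a couple of jumps at most
    for (int i = 0; i < N; ++i) { tree[i] = i; hash[i] = i; }

    // query in shuffled order, so that caches can't ride on sequential access
    std::vector<int> queries(N);
    for (int i = 0; i < N; ++i) queries[i] = i;
    std::shuffle(queries.begin(), queries.end(), std::mt19937{12345});

    auto bench = [&](const char* name, auto&& lookup) {
        auto start = std::chrono::steady_clock::now();
        long long sink = 0;
        for (int q : queries) sink += lookup(q);
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%-15s %8lld us (sink=%lld)\n", name, (long long)us, sink);
    };

    bench("std::map:", [&](int q) { return tree.find(q)->second; });       // ~20 pointer jumps
    bench("unordered_map:", [&](int q) { return hash.find(q)->second; });  // ~2 jumps
}
```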
In this regard, let’s first note that some of the allocators10 use some (though usually limited) trickery to improve data locality. This is a Good Idea, and in quite a few scenarios having such trickery MIGHT more than compensate for the allocator formally exhibiting a bit less performance than the competition. Moral of the story: do NOT trust 3rd-party measurements of allocator performance, and compare such performance exactly within the context of your game.
Let’s further note that while the data locality problem with STL-like trees can usually be addressed by switching to hash tables11 a.k.a. std::unordered_*<> containers (though they do have their drawbacks in worst-case scenarios, so you need to be careful), most of our programs have their own ad-hoc trees. One example is a list-of-players consisting of Player objects, each Player containing a player-name, a list of Friends, and so on, and so forth; an even more obvious example is a classical Scene Graph routinely used for 3D rendering (see, for example, Chapter 16 from [McShaffryGraham]). For such ad-hoc tree-like structures, the solution is not THAT easy :-(, though if you run into problems, at least some mitigating approaches DO exist 🙂 .
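As one example of such a mitigating approach (a sketch of a common technique, with all the names being hypothetical, not something prescribed by this book): keep all the Player objects in one contiguous pool and link them by index rather than by pointer, so that traversals jump within one compact block of memory:

```
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using PlayerIdx = std::uint32_t;  // index into the pool, used instead of Player*

struct Player {
    std::string name;
    std::vector<PlayerIdx> friends;  // indices into the pool, not pointers
};

class PlayerPool {
public:
    PlayerIdx add(std::string name) {
        players_.push_back(Player{std::move(name), {}});
        return static_cast<PlayerIdx>(players_.size() - 1);
    }
    Player& operator[](PlayerIdx i) { return players_[i]; }

    // Example traversal: count friends-of-friends; all the node-to-node jumps
    // stay inside the single contiguous players_ block
    std::size_t friends_of_friends(PlayerIdx p) {
        std::size_t n = 0;
        for (PlayerIdx f : players_[p].friends)
            n += players_[f].friends.size();
        return n;
    }

private:
    std::vector<Player> players_;  // one contiguous allocation (amortized)
};
```

Note that the std::string name and the friends vector still allocate on their own, so this is a mitigation rather than a complete cure – but the node-to-node jumps, which dominate tree traversals, now stay within one contiguous block.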
4 after (a) using an O(N^2) algorithm where an O(N) one is readily available, and probably after (b) using blocking code where non-blocking code would do
5 or did this hold only for Z80’s predecessor, the i8080, and was it improved a bit for Z80? It doesn’t matter much for our purposes anyway
7 Note that here we’re speaking about “RAM access as seen from the CPU”, not about “CAS latency”; the two, while being of the same order of magnitude, happen to be quite different
8 the numbers below are for modern-as-of-2016 x64
9 actually, both spatial locality and temporal locality matter, but we’ll speak mostly about spatial locality here
11 with a hash table and no collisions, you have at most 2 jumps over RAM, but with 1e6-element tree you can get up to 20, and this can make LOTS of difference