[[This is Chapter 16(a) from “beta” Volume V of the upcoming book “Development&Deployment of Multiplayer Online Games”, which is currently being beta-tested. Beta-testing is intended to improve the quality of the book, and provides a free e-copy of the “release” book to those who help with improving it; for further details see “Book Beta Testing”. All the content published during Beta Testing is subject to change before the book is published. To navigate through the book, you may want to use Development&Deployment of MOG: Table of Contents.]]
[[TODO: cross-platform C++, with a real-world example going across PC/Mac/NDK]]
Using C++ in the context of games implies quite a few considerations, many of them performance-related. As usual with performance, many things are subjective and/or depend on the specifics, so it is important to take everything you hear about it (this book included) with a good pinch of salt. Still, I’ll try to provide a few pointers and my personal feelings in this regard.
On Performance in General

One thing which is very important when we’re speaking about performance, is the so-called 90/10 law:
90% of the time is spent in 10% of the code, and only 10% of the time in the remaining 90% of the code
This law tends to hold for games too (ok, for some games the ratio might be closer to 70-30, but it is still quite uneven, as there are corner cases, and corner cases of corner cases, and so on).
This observation means that
Not all the code needs to be absolutely optimized
In particular, the main question we face during development is: “is this code bad enough performance-wise that it needs to be rewritten right away, or can it wait until we actually run into performance problems and can use a profiler to identify the culprit instead of guessing?”
My approach in this regard usually goes along the following lines:

- Rewrite right away only those things which are outright stupid performance-wise (such as using an O(N^2) algorithm where an O(N) one is readily available);
- Also rewrite right away anything which qualifies as a “Resource Hog” (more on Resource Hogs below);
- For everything else – leave the code alone until you run into actual performance problems, and then let the profiler (rather than guesswork) tell you what to optimize.
As you can see, despite providing some (hopefully useful) hints, the approach outlined above still leaves LOTS of room for interpretation. Which, in turn, brings us to the most important statement of all:
Finding “what to optimize and what to leave as is for now” doesn’t really qualify as science. It is an art.
On “noisy neighbor” code and “Resource Hogs”

One further thing to keep in mind with regards to the 90-10 (or 70-30) rule is that even if only the performance of 10% of the code matters,
the rest of the code can still affect performance of critical 10% in a Pretty Bad Way 🙁
It is related to the fact that LOTS of CPU (and more generally – computer) resources are de-facto shared between different pieces of code. Moreover, this effect will be observed even if we dedicate 100% of one of our cores to our critical code1 :-(. Such sharing happens because at least some of the CPU caches are shared between different pieces of code (and if we don’t dedicate 100% of a given core to a single task, even per-core caches will be effectively shared due to context switching), memory is shared too, and even some blocks of the CPU are often shared between threads concurrently-run-on-the-same-core due to hyper-threading a.k.a. simultaneous multithreading a.k.a. SMT.
And as soon as we have a resource which is shared between a non-critical piece of code and a critical one – well, it can happen that the non-critical piece of code will overuse this resource to the point where it starts affecting the critical code 🙁 . For example, if our non-critical code reads LOTS of RAM – it can easily cause the CPU to start evicting data-which-belongs-to-critical-code from the L3 cache (an effect known as cache pollution), which in turn will affect the performance of our critical code.
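To make the cache pollution effect tangible, here is a minimal sketch (the buffer sizes and all the names are my own arbitrary assumptions, and the exact numbers will vary wildly between CPUs): a non-critical “hog” thread merely streams through a large buffer, and the very same critical loop becomes measurably slower, even though the two run on different cores:

```
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

static std::atomic<bool> g_stop{false};

// Non-critical code: merely streams through ~64MB over and over, evicting
// everybody else's data from the shared L3 cache along the way
void resource_hog() {
    std::vector<char> big(64u * 1024 * 1024, 1);
    long long sink = 0;
    while (!g_stop.load(std::memory_order_relaxed)) {
        for (std::size_t i = 0; i < big.size(); i += 64)  // one touch per cache line
            sink += big[i];
    }
    std::printf("(hog sink=%lld)\n", sink);  // keeps the loop from being optimized away
}

// Critical code: repeatedly walks a ~4MB working set, which fits into a
// typical L3 cache -- as long as nobody pollutes that cache
double critical_pass_us() {
    std::vector<int> data(1024 * 1024, 2);
    long long sink = 0;
    auto start = std::chrono::steady_clock::now();
    for (int pass = 0; pass < 100; ++pass)
        for (int v : data) sink += v;
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                  std::chrono::steady_clock::now() - start).count();
    std::printf("(sink=%lld)\n", sink);
    return us / 100.0;  // microseconds per pass
}

int main() {
    std::printf("alone:    %.1f us/pass\n", critical_pass_us());
    std::thread hog(resource_hog);
    std::printf("with hog: %.1f us/pass\n", critical_pass_us());  // expect noticeably worse
    g_stop = true;
    hog.join();
}
```

On a typical desktop CPU with a shared L3, the second measurement tends to come out noticeably worse than the first – which is exactly the “noisy neighbor” effect in action.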
These observations are usually not that bad in practice, but they lead us to one important consequence:
Even being non-critical doesn’t give a right to be a Resource Hog
1 for example, via affinity masks
On Server Performance

In certain circles of gamedevs (especially some of those developers coming from the client side) there is a belief that performance only matters on the client. In extreme cases, it may result in statements along the lines of “[performance is] not an issue on the backend, where scaling up a service can mean nothing more than dragging a slider to the right” [GlazerMadhav].
Well, there is only one objection to this statement, and it is that the statement is dead wrong. First of all, linear scalability should not be taken for granted (hint: scaling a DB in a linear manner is very difficult at the very least, see also Chapter [[TODO]]). Second, even when you do have linear scalability,
“dragging a slider to the right” is going to cost your company.
Let’s consider a hypothetical example first. The current industry standard for multi-player games revolves around 1000 players/2S-workhorse-server (or 100 players/core, which is more or less the same thing). Now let’s consider a scenario where you didn’t care about server performance at all, and wrote your game in a way that it runs only 10 players/server. And then your game got Really Successful and grew to a million simultaneous players, which means that you need to run 100’000 servers. Not only will finding 100’000 servers to rent be quite a task by itself, but even at $250/server/month you’ll need $25M/month just in server costs (effectively requiring each of your players to pay at least $25/month just for servers, and that’s without any profit or even money to pay your salary). Things become even worse when we’re speaking about costs for Free-to-Play (a.k.a. F2P) games, where most of the players are not paying; for F2P even 100 players/server MAY easily put your game at risk (and each and every bit of extra cost will put you at a competitive disadvantage compared to your better-performing competition).
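If you want to double-check the arithmetic, here is a throwaway snippet (the $250/server/month and 1M-players figures are the same illustrative assumptions as in the text, not real quotes):

```
#include <cstdio>

int main() {
    const double players = 1000000;          // 1M simultaneous players
    const double server_month_usd = 250.0;   // illustrative rent per server per month
    const double options[] = {10, 100, 1000, 8000};  // players/server scenarios
    for (double per_server : options) {
        double servers = players / per_server;
        double monthly = servers * server_month_usd;
        std::printf("%6.0f players/server -> %7.0f servers, $%11.0f/month ($%6.2f/player/month)\n",
                    per_server, servers, monthly, monthly / players);
    }
}
```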
And depending on many factors, performance differences between different implementations CAN be REALLY HUGE. Coming from hypothetical clouds back to the real world, let’s consider another example. Once upon a time, I’ve seen two games. One had 400K simultaneous players, the other one – around 100K, and from the player’s perspective the games were 99%-the-same. The former company, however, was running its game off a mere 50 servers, while the latter one took 400 servers to run. It means that the first one managed to get 8’000 players/server, while the second one was at 250 – a whopping 32x difference in the number of players per server (and therefore in server costs per player). And the difference was pretty much mirrored by the size of the admin teams, too, ouch. Now you’re in a good position to make a guess – which of these two companies had more money for advertising?2
One last note in this regard – I do NOT mean that you should aim to absolutely-optimize the performance of your Servers3 (and I do NOT rule out languages-other-than-C++ for the server side either). All I want to say is that
Server-Side Performance Does Matter Too
And when going beyond this statement – see above about optimization being an art rather than science; which level of inefficiency to consider as “unacceptable from the very beginning”, is one of those choices which you’ll need to make for yourself.
For the rest of this Chapter, whenever speaking about best practices performance-wise, I’ll mean that they apply BOTH to Client-Side and to Server-Side (that is, unless it is explicitly stated otherwise).
2 Yep, you guessed it right
3 Nor should you aim to absolutely-optimize the performance of your Clients. While this may sound like heresy to some of the Client-Side gamedevs out there, I stand by this statement
Allocations are Evil. Even More So Because of Locality

In the context of performance, first of all, let’s discuss one thing which IMHO tends to affect the performance of C++ programs the-third-most4 – it is allocations.
The first important thing to remember for all allocations (whether it is C malloc() or C++ new) is that
Allocations are Expensive
Sure, there have been quite a few improvements to allocators over the last 10 years (including so-called thread-caching in tcmalloc and per-thread arenas in ptmalloc2), but well – allocation is still a rather slow operation (give or take, we’re speaking about the order of magnitude of 200 CPU clocks per allocation/deallocation pair, and more if you’re not so lucky with your allocator).
BTW, when it comes to discussions of “smart pointers are evil” – actually it is not smart pointers per se, but rather the allocations-inevitably-arising-from-smart-pointers, which affect the performance of your code; on the other hand, if you replace a smart pointer with a naked-pointer-allocated-via-malloc – it is very unlikely that you’ll see any performance gain. More on this in the [[TODO]] section below.
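To illustrate this point, here is a sketch (the Bullet type and all the names are purely hypothetical): replacing a unique_ptr with a naked new’ed pointer keeps exactly the same one-allocation-per-object cost, while storing objects by value is what actually removes the per-object allocations (and improves locality as a bonus):

```
#include <memory>
#include <vector>

struct Bullet { float x, y, vx, vy; };  // purely hypothetical game object

std::vector<std::unique_ptr<Bullet>> bullets_smart;  // 1 heap allocation per Bullet
std::vector<Bullet*> bullets_naked;                  // STILL 1 heap allocation per Bullet
std::vector<Bullet> bullets_by_value;                // amortized ~0 allocations per Bullet

void spawn(float x, float y) {
    bullets_smart.push_back(std::make_unique<Bullet>(Bullet{x, y, 0, 0}));
    bullets_naked.push_back(new Bullet{x, y, 0, 0});  // same malloc cost as the line above
                                                      // (and leaks unless deleted manually)
    bullets_by_value.push_back(Bullet{x, y, 0, 0});   // no per-Bullet heap allocation,
                                                      // and contiguous storage as a bonus
}
```

And if you know the number of elements in advance, a reserve() call on the by-value vector removes even the occasional reallocation-on-growth.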
Allocations and Data Locality

In addition to allocations being relatively slow, even worse is the fact that
Allocations tend to reduce data locality
In many (most?) cases, the poor data locality caused by allocations incurs a slowdown which is worse than the cost of the allocation itself :-(. However, the concept of data locality is not that obvious (nor can it be easily separated from other effects), so let’s see what it is all about.
During the times of Ancient Gamers (also known as the times of Atari, ZX Spectrum, and Commodore64), CPUs were running at clock speeds of single MHz (and one operation usually took several clocks). At the same time, memory ran at a speed comparable to that of the CPU. For example, on the Z80 CPU5 one register-register op took 4 CPU clocks, while the same operation register-memory took 6 clocks.
Since that time, however, things have changed considerably. A register-register operation currently routinely takes less than 1 CPU clock (don’t ask me how it is possible 😉6), but reading from (main) memory takes 100+ CPU clocks. In other words, while memory latencies have been reduced since ancient times by a factor of 10x,7 over the same time frame the CPU’s pure number-crunching speed has improved by a factor of 1000x+. This discrepancy was addressed by all kinds of caches (the Z80 didn’t have any cache at all, while in recent years it is common to have at least three levels of CPU caches, known as L1-L3, with an occasional L4). Not really surprisingly, these caches and their mechanics tend to have a profound impact on performance.
I don’t want to go into too much discussion of caches etc. (for this, you can read [Drepper07], though beware of some outdated concepts such as Northbridge and Front-Side Bus, which have by now been replaced with modern NUMA-related stuff, discussed, for example, in [Lameter13] and [GaudEtAl15] [[TODO: if you know an up-to-date all-around reference covering BOTH usual RAM stuff and NUMA, please LMK]]). However, I will make a few observations which are applicable to caches and programming:8

- Data is moved between memory and caches in whole cache lines (64 bytes on x64), so touching a single byte pulls the whole line in – and makes the data next to it essentially free to access;
- Access latencies grow steeply along the hierarchy: give or take, L1 costs around 4 clocks, L2 around 10-12, L3 around 30-50, and main RAM 100+ clocks;
- CPUs prefetch quite well over sequential access patterns, but pointer-chasing over randomly-scattered allocations defeats both caching and prefetching.
Armed with this information, let’s take a look at a classical in-memory binary tree structure (the one used by STL’s map<> etc.); let’s assume that our tree/map has a million elements. Now, let’s see what happens if we want to look something up within the tree: even if our tree is reasonably well-balanced (which it is with STL), we’ll need to go through 20 nodes of the tree (as 2^20 ~= 1e6). And if each of our elements is allocated on the heap, it means that we’ll need to make about 20 jumps over memory. In practice, some of these tree nodes will happen to be cached, but some won’t (causing a so-called cache miss when we’re trying to access them), so from a certain point in our lookup we’ll start incurring that 100+-clock penalty for each of these jumps(!).
Which leads us to a very important conclusion:
Data Locality DOES matter.9
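As a quick illustration of the 20-jumps-vs-2-jumps point (see also footnote 11 below), here is a toy microbenchmark along the lines of the example above; as with any microbenchmark, treat the numbers as a ballpark only, as real results depend heavily on access patterns and working-set sizes:

```
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <map>
#include <random>
#include <unordered_map>
#include <vector>

int main() {
    const int N = 1000000;  // the 1e6-element example from the text
    std::map<int, int> tree;           // each node is a separate heap allocation
    std::unordered_map<int, int> hash; // lookup is a couple of jumps at most
    for (int i = 0; i < N; ++i) { tree[i] = i; hash[i] = i; }

    // query in shuffled order, so that caches can't ride on sequential access
    std::vector<int> queries(N);
    for (int i = 0; i < N; ++i) queries[i] = i;
    std::shuffle(queries.begin(), queries.end(), std::mt19937{12345});

    auto bench = [&](const char* name, auto&& lookup) {
        auto start = std::chrono::steady_clock::now();
        long long sink = 0;
        for (int q : queries) sink += lookup(q);
        auto us = std::chrono::duration_cast<std::chrono::microseconds>(
                      std::chrono::steady_clock::now() - start).count();
        std::printf("%-15s %8lld us (sink=%lld)\n", name, (long long)us, sink);
    };

    bench("std::map:", [&](int q) { return tree.find(q)->second; });       // ~20 pointer jumps
    bench("unordered_map:", [&](int q) { return hash.find(q)->second; });  // ~2 jumps
}
```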
In this regard, let’s first note that some of the allocators10 use some (though usually limited) trickery to improve data locality. This is a Good Idea, and in quite a few scenarios having such trickery MIGHT more than compensate for the allocator formally exhibiting a bit less performance than the competition. Moral of the story: do NOT trust 3rd-party measurements of allocator performance, and compare such performance exactly within the context of your game.
Let’s further note that while the data locality problem with STL-like trees can usually be addressed by switching to hash tables11 a.k.a. std::unordered_*<> containers (though they do have their drawbacks in worst-case scenarios, so you need to be careful), most of our programs have their own ad-hoc trees. One example is a list-of-players consisting of Player objects, each Player containing a player-name, a list of Friends, and so on, and so forth; an even more obvious example is a classical Scene Graph routinely used for 3D rendering (see, for example, Chapter 16 from [McShaffryGraham]). For such ad-hoc tree-like structures, the solution is not THAT easy :-(, though if you run into problems, at least some mitigating approaches DO exist 🙂 .
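As one example of such a mitigating approach (a sketch of a common technique, with all the names being hypothetical, not something prescribed by this book): keep all the Player objects in one contiguous pool and link them by index rather than by pointer, so that traversals jump within one compact block of memory:

```
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

using PlayerIdx = std::uint32_t;  // index into the pool, used instead of Player*

struct Player {
    std::string name;
    std::vector<PlayerIdx> friends;  // indices into the pool, not pointers
};

class PlayerPool {
public:
    PlayerIdx add(std::string name) {
        players_.push_back(Player{std::move(name), {}});
        return static_cast<PlayerIdx>(players_.size() - 1);
    }
    Player& operator[](PlayerIdx i) { return players_[i]; }

    // Example traversal: count friends-of-friends; all the node-to-node jumps
    // stay inside the single contiguous players_ block
    std::size_t friends_of_friends(PlayerIdx p) {
        std::size_t n = 0;
        for (PlayerIdx f : players_[p].friends)
            n += players_[f].friends.size();
        return n;
    }

private:
    std::vector<Player> players_;  // one contiguous allocation (amortized)
};
```

Note that the std::string name and the friends vector still allocate on their own, so this is a mitigation rather than a complete cure – but the node-to-node jumps, which dominate tree traversals, now stay within one contiguous block.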
4 after (a) using an O(N^2) algorithm where an O(N) one is readily available, and probably after (b) using blocking code where non-blocking code would do
5 or did this hold only for Z80’s predecessor, the i8080, and was it improved a bit for Z80? It doesn’t matter much for our purposes anyway
7 Note that here we’re speaking about “RAM access as seen from the CPU”, not about “CAS latency”; the two, while being of the same order of magnitude, happen to be quite different
8 the numbers below are for modern-as-of-2016 x64
9 actually, both spatial locality and temporal locality matter, but we’ll speak mostly about spatial locality here
11 with a hash table and no collisions, you have at most 2 jumps over RAM, but with 1e6-element tree you can get up to 20, and this can make LOTS of difference