The performance optimization month: results

May 2015 was the performance optimization month for the ReHLDS project: hundreds of profiler runs and thousands of changed lines of code led to a performance boost of more than 2x. In this article I’m going to share the performance test results, but before that I’ll dive into the technical background and tell you about the ReHLDS demo recorder and player, the feature that makes it possible to test ReHLDS code and run benchmarks.

ReHLDS demo recorder/player

The ReHLDS demo recorder and player are parts of the ReHLDS test suite which, in a nutshell, is a ‘black box’ testing tool. To understand how it works, we should treat ReHLDS as a black box that consumes data from external services, does some processing and sends data back. The services are well-known APIs: the Win32 API, the standard C library and the Steam API.

Before we can do ‘black box’ testing, we have to intercept the data flow between ReHLDS and the external services and write it to a file, which is called a ‘ReHLDS test demo’:

Now we can run ReHLDS in test mode and feed it the data from the file we recorded in the previous step. We also have to make sure that ReHLDS produces the same output as (Re)HLDS produced while the test demo was being recorded:

In this way we can replay the recorded scenario as many times as we need. We can also modify the code and make sure that it still produces the same output as the original (unmodified) version; this is how the existing integration tests work. Another nice property of this test suite is that we no longer depend on external things like OS timers and the network: when ReHLDS calls recvfrom() in test mode, the interceptor just reads the next recorded packet from the file; when ReHLDS calls Sleep(), nothing actually happens (the interceptor just checks that Sleep is the next function that was called by the original app). This means that ReHLDS always consumes 100% of one CPU core in ‘test demo play’ mode, which, in turn, makes it suitable for benchmarking.
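The record and replay idea can be illustrated with a toy model. This is only a minimal sketch in Python, not the ReHLDS implementation: the class names, the JSON log format, the FakeNetwork stand-in and the toy run_server() loop are all hypothetical and exist only to show the two modes side by side.

```python
import json


class RecordingServices:
    """Record mode: forward each call to the real service and log the result."""

    def __init__(self, real, log):
        self.real = real  # object providing the real recvfrom()/sleep()
        self.log = log    # plain list standing in for the demo file

    def recvfrom(self):
        packet = self.real.recvfrom()
        self.log.append(json.dumps({"call": "recvfrom", "result": packet}))
        return packet

    def sleep(self, ms):
        self.real.sleep(ms)
        self.log.append(json.dumps({"call": "sleep", "result": None}))


class ReplayServices:
    """Test mode: answer every call from the log, never touching OS or network."""

    def __init__(self, log):
        self.entries = [json.loads(line) for line in log]
        self.pos = 0

    def _next(self, name):
        entry = self.entries[self.pos]
        if entry["call"] != name:
            raise AssertionError(f"demo expected {entry['call']}, got {name}")
        self.pos += 1
        return entry["result"]

    def recvfrom(self):
        return self._next("recvfrom")  # next recorded packet

    def sleep(self, ms):
        self._next("sleep")  # call-order check only, no real waiting


class FakeNetwork:
    """Hypothetical stand-in for the real network and timer."""

    def __init__(self, packets):
        self.packets = list(packets)

    def recvfrom(self):
        return self.packets.pop(0)

    def sleep(self, ms):
        pass


def run_server(services, frames):
    """Toy 'server': reads one packet per frame and produces some output."""
    output = []
    for _ in range(frames):
        packet = services.recvfrom()
        output.append(packet.upper())
        services.sleep(10)
    return output


demo = []
recorded = run_server(RecordingServices(FakeNetwork(["move", "shoot"]), demo), 2)
replayed = run_server(ReplayServices(demo), 2)
print(recorded == replayed)
```

Because the replayed run never waits and never touches the network, its duration depends only on how fast the code under test processes the recorded input.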

However, this test suite is not a silver bullet. Since it requires the outgoing data flow to be bit-for-bit identical to the flow recorded in the demo file, it cannot be used to test FPU => SSE optimizations: SSE instructions have lower precision than FPU ones (the x87 FPU typically keeps intermediate results at a higher precision than single-precision SSE operations) and may therefore produce different results for almost all operations. The results will be very close to each other (e.g. 5.01234 vs 5.01233), but they won’t be bit-for-bit equal.
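The effect can be reproduced without any FPU or SSE code. The sketch below only simulates it in Python: the "narrow" path rounds to a 32-bit float after every operation (roughly what single-precision SSE does), while the "wide" path keeps the intermediate in a 64-bit float and rounds once at the end (a stand-in for higher-precision FPU intermediates). The helper names and the chosen input value are assumptions made for illustration.

```python
import math
import struct


def to_f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def bits_f32(x):
    """Raw bit pattern of x when stored as a 32-bit float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


start = 16777216.0  # 2**24; above this, 32-bit floats are spaced 2 apart

# Narrow path: round to 32 bits after every single operation.
narrow = to_f32(to_f32(start + 1.0) + 1.0)

# Wide path: keep the intermediate at higher precision, round once on store.
wide = to_f32(start + 1.0 + 1.0)

print(narrow, wide)
print(bits_f32(narrow) == bits_f32(wide))        # bit-for-bit comparison
print(math.isclose(narrow, wide, rel_tol=1e-6))  # tolerance-based comparison
```

The two results differ by about one part in ten million, so a tolerance-based check passes while the bit-for-bit check that the demo player relies on fails.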

Benchmark configuration

For benchmarking purposes 9 ReHLDS demos were recorded:

3 in stock engine and stock gamedll
3 in optimized engine and stock gamedll
3 in optimized engine and optimized gamedll

“Optimized” refers to optimizations that break binary compatibility of the outgoing data flow with the stock versions of the gamedll/engine.

The demos were recorded in the following environment:

32 bots (controlled by FakePlayer v1.11) playing on de_aztec
OS=Windows, Mod=Counter-Strike, mp_timelimit=20, sys_ticrate=100
HLDS build 6153 with Metamod 1.21p37 but without AmxModX
ReHLDS v0.2

A benchmarking session consists of playing the demos in each of the following environment configurations:

Engine	GameDLL	Metamod
Stock	Stock	Stock
Stock	Stock	Optimized
Pedantic optimizations	Stock	Stock
Pedantic optimizations	Stock	Optimized
Optimized	Stock	Optimized
Optimized	Optimized	Optimized

Now let's go through the configuration elements:

Engine's pedantic optimizations are optimizations that don’t break binary compatibility of the outgoing data flow with the stock version of the engine.
Engine's optimizations consist of the pedantic optimizations plus some algorithm changes and the use of SSE in several functions.
Metamod's optimizations consist of bypassing the interceptors for the following functions: AddToFullPack, ModelIndex, IndexOfEdict, CheckVisibility, GetCurrentPlayer, DeltaUnsetFieldByIndex. This means that Metamod plugins are not able to intercept calls to these functions.
GameDLL's optimization is the AngleQuaternion function rewritten using SSE instructions.
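The trade-off behind the Metamod optimization can be pictured with a toy dispatcher. This is a hypothetical sketch and not Metamod's actual code: HookDispatcher, register_hook() and the dummy ModelIndex implementation are invented for illustration. It only shows that a bypassed function goes straight to the engine, so plugin hooks registered for it are never invoked.

```python
class HookDispatcher:
    """Hypothetical dispatcher: runs plugin hooks before the engine function."""

    def __init__(self, engine_funcs, bypassed=()):
        self.engine_funcs = engine_funcs  # function name -> engine implementation
        self.bypassed = set(bypassed)     # names that skip plugin hooks entirely
        self.hooks = {}                   # function name -> list of plugin callbacks
        self.hook_calls = 0

    def register_hook(self, name, callback):
        self.hooks.setdefault(name, []).append(callback)

    def call(self, name, *args):
        if name not in self.bypassed:
            for hook in self.hooks.get(name, []):
                self.hook_calls += 1
                hook(*args)
        return self.engine_funcs[name](*args)


engine = {"ModelIndex": lambda model: 7}  # dummy engine function
seen = []

stock = HookDispatcher(engine)
stock.register_hook("ModelIndex", seen.append)

fast = HookDispatcher(engine, bypassed={"ModelIndex"})
fast.register_hook("ModelIndex", seen.append)

stock_result = stock.call("ModelIndex", "player.mdl")
fast_result = fast.call("ModelIndex", "player.mdl")
print(stock_result, fast_result, stock.hook_calls, fast.hook_calls)
```

Both dispatchers return the same engine result, but only the stock one ever calls the plugin hook.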

Benchmark results

The demos in each configuration were played 3 times, and the average duration was used as the result. 6 different systems were used to run the benchmark. The raw results are available here.

To visualize the raw results we should do two things:

Normalize the duration of each demo as if it had 120K frames by solving the simple equation x/120 = duration/num_frames (with num_frames counted in thousands of frames)
Calculate the average duration of the 3 test demos for each configuration on each CPU
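The two steps above can be sketched in a few lines. The numbers below are made up for illustration and are not taken from the raw results; the sketch also assumes that durations are in seconds and that frame counts are given as plain numbers, so it uses 120000 rather than 120 as the target.

```python
from statistics import mean

TARGET_FRAMES = 120_000


def normalize(duration_s, num_frames):
    """Scale a measured duration to what it would be for TARGET_FRAMES frames."""
    return duration_s * TARGET_FRAMES / num_frames


# Hypothetical measurements for one configuration on one CPU:
# (duration in seconds, number of frames in the demo).
demos = [(50.0, 100_000), (66.0, 132_000), (54.0, 120_000)]

normalized = [normalize(duration, frames) for duration, frames in demos]
config_score = mean(normalized)
print(normalized, config_score)
```

Normalizing first keeps a longer demo from dominating the average just because it contains more frames.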

And here is the chart with all results:

Charts for each CPU:

Analysis

The charts clearly show that the fully optimized ReHLDS configuration (E:Opt, G:Opt, M:Opt) is much faster (2.5 to 3 times) than the stock configuration on all CPUs.

Now we’ll go through each configuration component and examine its impact on performance.

Metamod: stock vs optimized

Bypassing plugin invocation for 6 functions (which are hooked very rarely) gives a 20% to 30% performance gain.

Engine: stock HLDS vs ReHLDS with pedantic optimizations

The pack of pedantic ReHLDS optimizations gives a 65% to 110% (usually around 90%) performance gain.

Engine: ReHLDS with pedantic optimizations vs ReHLDS with all optimizations

Using SSE instead of the FPU in several places gives an 11% performance gain.

GameDLL: stock vs optimized

One function (AngleQuaternion) rewritten using SSE gives a 6% performance gain.
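As a rough sanity check, the per-component gains can be combined into one overall factor. The sketch below rests on two assumptions that the article does not state: that an "N% gain" means the configuration runs (1 + N/100) times faster, and that the gains are independent and multiply. The 25% Metamod figure in the "typical" case is simply the midpoint of the reported range.

```python
def combined_speedup(gains_percent):
    """Multiply individual gains, assuming they are independent of each other."""
    factor = 1.0
    for gain in gains_percent:
        factor *= 1.0 + gain / 100.0
    return factor


# Gains in percent: Metamod, engine (pedantic), engine (SSE), GameDLL.
low = combined_speedup([20, 65, 11, 6])
typical = combined_speedup([25, 90, 11, 6])
high = combined_speedup([30, 110, 11, 6])
print(f"{low:.2f}x {typical:.2f}x {high:.2f}x")
```

Under these assumptions the combined factor comes out at roughly 2.3x to 3.2x, which brackets the 2.5x to 3x measured for the fully optimized configuration.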

Conclusion

I don't know what to say, actually, since the numbers speak for themselves.
