Posted 17 Jan 2014



Not built in a day - lessons learned on Total War: ROME II

This case study details how the game takes best advantage of low-power systems, while still scaling up to look and run fantastic on more robust systems.

This article is a sponsored article. Articles such as these are intended to provide you with information on products and services that we consider useful and of value to developers.

Image 1


The developers at Creative Assembly had a challenge: How could they get Total War: ROME* II to play well across a wide range of Intel® systems without compromising the game aesthetics? This case study details how the game takes best advantage of low-power systems, while still scaling up to look and run fantastic on more robust systems. 

High-fidelity landscapes are an essential part of the game’s rich historical environments, and lush foliage is key to those landscapes. The foliage was optimized with Adaptive Order Independent Transparency (AOIT), giving it a rich look with low performance overhead. The team added a game benchmark for an easy way to measure performance. An in-game battery meter allows players to monitor power when playing on the go. They also tuned for several different systems at once to ensure that any optimizations were balanced across systems, and added detection code to automatically set the right options for each system.

The team made improvements to many other areas, including LODs, shadows, landscape generation, particles, CPU tasks, and sound. They also optimized for memory bandwidth.

Together, these gave the game great performance across a wide range of Intel systems.

The challenge

When building Total War: ROME II, Creative Assembly challenged Intel. They wanted to deliver the most immersive experience of any Total War game to date, on all types of systems, with no compromise. But today’s systems have a variety of features and come in a number of form factors. How could they deliver great gaming at fluid frame rates across all of these systems? They turned to Intel’s engineers. Together, we delivered a fantastic game that players can enjoy across the full range of the latest Intel systems, from the most power-thrifty Ultrabook™ up through full power laptop, all-in-one, and desktop systems.

Image 2

Figure 1. Typical scene

The team put their requirements on a sliding scale: more capable machines deliver a faster frame rate at higher resolution and quality settings. On those systems, the extra performance headroom is used to enable advanced features, so the game renders with even higher quality. This includes AOIT built on top of the Intel® Iris™ graphics extension for pixel synchronization, which gives systems with Intel graphics much faster transparency calculations.

To make sure that the game is playable across a wide range of systems, it automatically configures itself to match each system. When the system is running on battery, the game displays a battery meter. ROME II also includes a built-in benchmark mode so anyone can see the game’s performance.

The team studied the GPU and CPU performance of the game with Intel® Graphics Performance Analyzers (Intel® GPA), Microsoft GPUView, and Intel® VTune™ Amplifier XE.

As we walk through this case study, you should see similarities with your game development. We hope this case study helps you implement similar features in your game.

Detecting the platform

With code like the GPU Detect code sample, ROME II detects the system’s graphics device. With this information, the game configures itself for the best visual fidelity on each system. On systems with Intel® HD Graphics 4200/4400/4600 (typically systems that use 15W of power), the game defaults to 1366x768 resolution and Medium quality.

Image 3

Figure 2. Sample scene on Intel® HD Graphics 4600 system

For systems with Intel HD Graphics 5000 and Iris Graphics 5100, the game sets itself to 1600x900 with Medium quality, with increased shadow fidelity.

On systems with Iris™ Pro Graphics 5200, the game defaults to 1600x900 with High quality, using AOIT for even better visual quality.

Image 4

Figure 3. Sample scene on Iris™ Pro Graphics 5200
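The per-tier defaults above amount to a simple lookup once the GPU is identified. A minimal sketch follows, with detection reduced to an illustrative string match and a hypothetical function name; real detection would match the DXGI adapter's vendor and device IDs, as in the GPU Detect sample.

```cpp
#include <string>

// Default settings for a detected graphics tier, as described above.
struct GraphicsDefaults {
    int width, height;
    std::string quality;
    bool useAOIT;
};

// Map a detected Intel GPU (simplified here to a name string) to the
// defaults this article describes for each tier.
GraphicsDefaults defaultsForGpu(const std::string& gpu) {
    if (gpu == "Iris Pro 5200")
        return {1600, 900, "High", true};     // AOIT enabled on this tier
    if (gpu == "HD 5000" || gpu == "Iris 5100")
        return {1600, 900, "Medium", false};  // increased shadow fidelity
    // HD 4200/4400/4600 and anything unrecognised: conservative defaults.
    return {1366, 768, "Medium", false};
}
```

The fall-through default means an unknown adapter gets the safest settings, which the player can always raise by hand.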

Extensive benchmarking proves that the game has great frame rates (>=30 FPS for at least 95% of typical game play) on all these configurations.
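A criterion like "at least 30 FPS for 95% of gameplay" is easy to check from a benchmark's per-frame timing log. A small sketch, with the threshold and percentile taken from the text and the function name ours:

```cpp
#include <algorithm>
#include <vector>

// Returns true if at least `fraction` of the frames met `minFps`,
// given per-frame times in milliseconds.
bool meetsFpsTarget(std::vector<double> frameTimesMs,
                    double minFps = 30.0, double fraction = 0.95) {
    if (frameTimesMs.empty()) return false;
    const double maxMs = 1000.0 / minFps;  // ~33.3 ms for 30 FPS
    std::sort(frameTimesMs.begin(), frameTimesMs.end());
    // Frame at the requested percentile of the sorted times.
    size_t idx = static_cast<size_t>(fraction * frameTimesMs.size());
    if (idx >= frameTimesMs.size()) idx = frameTimesMs.size() - 1;
    return frameTimesMs[idx] <= maxMs;
}
```

Percentile targets are more honest than averages: a handful of long frames can hide inside a good mean but will fail the 95th percentile.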

Setting the bar with a benchmark

To make it easier to measure the game’s performance across a variety of systems, ROME II includes a benchmark to showcase typical performance across a campaign scenario. Go to the advanced graphics options screen and select “run benchmark”. For simpler benchmarking, it can also be started from the command line. Although the game ships with a single benchmark, it includes a benchmark selection screen, so additional benchmarks can be added later.

The benchmark doesn’t specifically measure power. If you want to study power during gameplay, run the benchmark and a power-monitoring tool (see Intel® Graphics Performance Analyzers (Intel® GPA) System Analyzer for real-time power measurement).

We recommend you build a benchmark like this (plus a benchmark-running tool), to show your game’s typical performance.

We’ve got the power

To showcase how long the game can last on battery, it includes a battery meter. The meter is hidden if the battery is fully charged and plugged into AC power. Anytime an Ultrabook or laptop system is running on battery or charging from AC power, the battery meter appears so the player knows how long they can play. The battery meter has been integrated into the screen so that it doesn’t hide any essential information.
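The visibility rule described above boils down to one condition. A minimal sketch; on Windows the two flags would come from `GetSystemPowerStatus()`, but here they are passed in directly for illustration:

```cpp
// Decide whether to show the in-game battery meter, per the behaviour
// described above: hide it only when the system is on AC power AND the
// battery is fully charged; show it while discharging or charging.
bool showBatteryMeter(bool onAcPower, bool batteryFullyCharged) {
    return !(onAcPower && batteryFullyCharged);
}
```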

Image 5

Figure 4. Battery meter at the top of the screen

Some games adapt to battery power by reducing resolution and quality settings, or reducing frame rate. These strategies did not give a satisfying gaming experience in ROME II, so the team did not include any specific power optimizations.

As you work on your game, study how it’s using power, and see if some of those optimizations might be right for you.

AOIT makes the vegetation look good

The Total War games are known for their immersive environments, with realistic foliage. This aesthetic requires transparency, and lots of it.

For the game to look its best on Intel graphics hardware, it uses an Intel Iris graphics extension to the DirectX* API. The pixel synchronization extension gives a low-overhead way to synchronize pixel writes via the graphics driver, which accelerates transparency.

The game originally used an Alpha-to-Coverage solution for transparency. The team planned to supplement this with a k-buffer solution in ROME II. They found that the k-buffer works fine for small areas of the screen with a fixed amount of transparency. However, there were problems on a full screen with the levels of overdraw seen in ROME II. It quickly ran out of GPU memory, so it ran far too slowly. AOIT doesn’t suffer from that problem, and is about 5x faster than k-buffer. AOIT also lets the player see more alpha around the edges of the leaves. This gives a better appearance of depth, especially at resolutions well below 1080p.

While a general AOIT algorithm was published a few years ago, a recent code sample details how to accelerate AOIT with pixel synchronization. With pixel synchronization, shaders write colour and depth into an Unordered Access View (UAV) buffer. The farthest colours are blended as they are written. Then, a visibility function (VF) combines the colours with a refactored alpha-blending equation. Together, this gives a deterministic and fast way to compute transparency.
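The blend that the visibility function makes order-independent is ordinary front-to-back "over" compositing. A sketch of that underlying operation (single colour channel for brevity; this illustrates the math, not the shader code):

```cpp
#include <vector>

struct Fragment { double colour; double alpha; };  // one channel for brevity

// Front-to-back compositing of depth-sorted transparent fragments: each
// fragment contributes colour * alpha, attenuated by the transmittance of
// everything in front of it. AOIT approximates this transmittance (the
// visibility function) so fragments need not arrive in sorted order.
double compositeFrontToBack(const std::vector<Fragment>& sorted) {
    double result = 0.0, transmittance = 1.0;
    for (const Fragment& f : sorted) {
        result += f.colour * f.alpha * transmittance;
        transmittance *= (1.0 - f.alpha);
    }
    return result;
}
```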

The AOIT pixel synchronization sample was literally “dropped in” to the ROME II code base with few changes. The ROME II version of AOIT has a pre-multiplied alpha and does some culling. It was easy to add lighting to the AOIT pass and let the tree foliage cast its own shadows.

For higher-end systems, AOIT is now the default for transparency. All other configurations use Alpha-to-Coverage.

Image 6

Figure 5. Vegetation looks great, thanks to AOIT

AOIT operates on the foreground and middle distance vegetation, with great results. While it’s made this screenshot look great, the actual movement in the game is even better. AOIT should be just as easy to include in your game.

Bandwidth optimizations

After studying the game, we felt that reducing bandwidth at the GPU would increase overall performance. The game’s multiple render targets originally all used a single output format, at the same size. Since each render target can have its own format and bit width, we changed the game to pick an appropriate format and size for each.

The landscape engine generates textures. It stores surface colour and normal as RGBA8 textures, even though the alpha channel is unused. A more efficient format can reduce the bandwidth.

The terrain sampler used manual bilinear filtering. The team studied the vertex shader’s profile in Intel GPA Frame Analyzer, and found the shader was too slow. It read 4 heights with a gather4, and then performed a manual bilinear filter of the 4 values. The vertex shader now has a sampler with appropriate filtering, so it’s quicker.
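For context, the interpolation the old shader computed by hand looks roughly like this (shown on the CPU for illustration); a sampler with hardware linear filtering performs the same interpolation with no shader-side arithmetic:

```cpp
// Manual bilinear filter of four height samples, as the old vertex shader
// did after its gather4. h00..h11 are the four neighbouring texels and
// fx, fy are the fractional sample coordinates in [0, 1).
double bilinear(double h00, double h10, double h01, double h11,
                double fx, double fy) {
    double top    = h00 + (h10 - h00) * fx;   // lerp along the top row
    double bottom = h01 + (h11 - h01) * fx;   // lerp along the bottom row
    return top + (bottom - top) * fy;         // lerp between the rows
}
```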

Image 7

Figure 6. Profiled times for manual bilinear filtering in the terrain vertex shader

We discovered some LOD models were not optimal. For example, the torso animation model is instanced throughout a campaigning army. But Intel GPA revealed those vertex shaders were invoked too often. This is a common issue so we usually check for it. Instancing can be effective when rendering a whole group of objects with the same model using the same mesh. However, when the objects are dispersed throughout the scene with very different Z depths, there are often a large number of objects in the distance. They render to very small parts of the screen, giving lots of sub-pixel polygons. Even though the instancing reduces the number of draw calls, the number of vertex shader calls on those sub-pixel polygons can become very inefficient.

To avoid this, be cautious with Z depth when instancing. For complex models, only instance them if the Z depth of the instance is a fairly close match with the Z depth of the original.
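One way to apply that caution is to estimate an instance's projected screen size before committing it to the full mesh. A sketch under a simple pinhole-camera assumption; the threshold and function names are illustrative, not from the game:

```cpp
#include <cmath>

// Rough screen-space height (in pixels) of an object of worldHeight units
// at distance z, for a pinhole camera with the given vertical FOV (radians)
// and viewport height in pixels.
double projectedPixels(double worldHeight, double z,
                       double fovY, int viewportHeight) {
    return worldHeight / (2.0 * z * std::tan(fovY / 2.0)) * viewportHeight;
}

// If a distant instance would cover only a few pixels, it should use a
// simpler LOD (or an impostor) rather than the full instanced mesh.
bool useFullMesh(double worldHeight, double z,
                 double fovY, int viewportHeight,
                 double minPixels = 8.0) {
    return projectedPixels(worldHeight, z, fovY, viewportHeight) >= minPixels;
}
```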

Image 8

Figure 7. Vertex shader invoked as often as pixel shader, a clear sign of trouble on this instanced mesh

In general, the vertex shader should be invoked much less often than the pixel shader, and any large number of primitives, post-filter texels and reads may indicate the same issue. In this example, the vertex shader was invoked as often as the pixel shader. To fix this, the game switched to a simpler LOD model for distant objects.


Shadows

Shadow rendering revealed a number of issues, all of which were ultimately improved.

We had an issue with the number of cascades. ROME II had a cascaded shadow map, which allows great detail in close-up areas while still maintaining a wide area of shadow detail. At first, the game had three cascades whose depth ranges needed tuning for the best effect. By carefully placing the division between cascades, we reduced this to two cascades. The shadow effect was still good, but the game ran faster.
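The article doesn't say how the team placed the division, but a common way to compute cascade splits is to blend the uniform and logarithmic schemes. A sketch of that standard technique:

```cpp
#include <cmath>
#include <vector>

// Compute the split depths between near and far planes for a cascaded
// shadow map, blending the uniform and logarithmic split schemes:
// lambda = 0 gives uniform splits, lambda = 1 fully logarithmic ones.
std::vector<double> cascadeSplits(double nearZ, double farZ,
                                  int cascades, double lambda = 0.5) {
    std::vector<double> splits;
    for (int i = 1; i < cascades; ++i) {
        double t = static_cast<double>(i) / cascades;
        double uniform     = nearZ + (farZ - nearZ) * t;
        double logarithmic = nearZ * std::pow(farZ / nearZ, t);
        splits.push_back(lambda * logarithmic + (1.0 - lambda) * uniform);
    }
    return splits;
}
```

Tuning `lambda` trades close-up shadow resolution against coverage in the distance, which is exactly the balance careful split placement buys.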

Shadow generation and the main rendering pass both had the same vertex signature (11 inputs). This was inefficient, since shadow generation didn’t need a large part of that vertex signature. We couldn’t simply drop those inputs, however: many components of the scene need alpha or punch-through areas of their textures, so texture processing was still necessary for shadow creation. Later in the project, the two input signatures were separated and reduced (by 3 float4s), giving shadows a small performance gain.

Shadow maps suffered from the same sub-pixel geometry issue created by distant instanced objects. When the LODs of the main scene were tuned (see above), the shadow map creation improved similarly. The game got a small speedup by changing to LOD models, with impostors for distant models.


Landscape

Originally, the landscape was tessellated in screen space. This resulted in high polygon counts, so we replaced it with a tiled renderer. The tiled renderer had a higher vertex shader cost, but rendered much faster overall (from 5.4 ms to 2.4 ms on the same scene).

ROME II uses very large landscape textures. The visible area is generated in real time. Height and terrain information is composited on the GPU and stored in a texture atlas, but required careful tuning. Each frame renders enough tiles to display newly visible areas, while limiting its work so that it doesn’t take too much GPU and interfere with the rest of rendering.
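The per-frame limiting described above can be sketched as a budgeted work queue; the budget value and function names here are illustrative, not from the game:

```cpp
#include <deque>

// Process at most `budget` pending landscape-tile updates this frame, so
// atlas compositing never takes so much GPU time that it interferes with
// the rest of rendering. Returns how many tiles were composited.
int updateTiles(std::deque<int>& pendingTiles, int budget) {
    int done = 0;
    while (!pendingTiles.empty() && done < budget) {
        // compositeTileOnGpu(pendingTiles.front());  // hypothetical GPU work
        pendingTiles.pop_front();
        ++done;
    }
    return done;
}
```

Newly visible tiles simply wait in the queue; over a few frames the atlas catches up without ever spiking a single frame.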


Particles

Looking more closely at a typical frame capture, there was an expensive section of work that had little effect on the frame.

Image 9

Figure 8. Particles were expensive but had little effect

This was due to particles, which were each drawn as a separate polygon. These particles had no visible impact on the scene, so they could be removed for a large speedup.

Although it seemed that particles might not interact well with AOIT, the new particle engine developed for the game worked well and had no problems with AOIT.

Tuning the CPU-side code

Although graphics received a lot of attention during this project, the team also wanted to optimize the CPU side of ROME II. With Intel® VTune™ Amplifier XE, the team studied the game for CPU bottlenecks, and several areas yielded impressive gains.

The sound engine took more CPU time than expected. While it’s a powerful sound engine, capable of sophisticated mixing and blending on the CPU, it ran at its highest detail level even when set to “normal.” Fixing this sped up frames by up to 1.1x. Lower-power systems benefited from this, yielding a better frame rate and longer battery life.

The game includes a task-based threading system. The tasks vary in size, and some of them didn’t interact well with the automatic task scheduling from the task pool. This caused “bubbles” in the schedule and slowed the frame. Problematic tasks were removed from the task list and manually scheduled on their own thread, separate from everything else. This gave an optimal thread schedule.
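The fix described above can be sketched as follows: run the known-problematic long task on its own dedicated thread while the short tasks go through the pool (simulated here by inline execution, with toy integer "tasks"; the structure, not the workloads, is the point):

```cpp
#include <numeric>
#include <thread>
#include <vector>

// Run one known-problematic long task on a dedicated thread, concurrently
// with the short tasks that stay on the (here simulated) task pool. This
// avoids the scheduling "bubbles" a long task causes inside the pool.
int runFrame(const std::vector<int>& shortTasks, int longTask) {
    int longResult = 0;
    std::thread dedicated([&] { longResult = longTask * 2; });  // toy work
    // Pool work proceeds concurrently with the dedicated thread.
    int poolResult = std::accumulate(shortTasks.begin(), shortTasks.end(), 0);
    dedicated.join();  // synchronise before combining results
    return poolResult + longResult;
}
```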


Conclusion

Working together, Creative Assembly and Intel carefully studied ROME II during development. Using Intel GPA, Intel VTune Amplifier, GPUView and deep analysis, the team found many issues to improve. Together, the team tuned many parts of the game and integrated industry-leading algorithms.

Since the game automatically configures itself for each system, it’s faster in many cases than it would have been. The battery meter and in-game benchmark make it possible to study the game during gameplay or for benchmarking.

Foliage benefited from AOIT, LOD models were tuned to the right depths and replaced with impostors in the distance, shadows are much faster, the landscape and particle systems work better and faster, and the sound and task systems use the right workloads at the right times.

Together, this all lets Total War: ROME II look and run great on Intel platforms. We hope this case study helps you do the same with your next game! Let us know what you think.

Huge thanks to Creative Assembly for building a great franchise and extending ROME II to shine on Intel platforms. Special thanks also to Steve Hughes of Intel, who has worked with Creative Assembly on a number of Total War games, delivering the best results to date in ROME II.

Intel, the Intel logo, Iris, Ultrabook, and VTune are trademarks of Intel Corporation in the U.S. and/or other countries.
Copyright © 2013 Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Paul Lindberg
United States United States
Paul Lindberg is a Senior Software Engineer in Developer Relations at Intel. He helps game developers all over the world to ship kick-ass games and other apps that shine on Intel platforms.
