Performance Considerations for Resource Binding in Microsoft DirectX 12

Intel

2.00/5 (1 vote)

Oct 23, 2015

CPOL

13 min read

27595

This article describes how to pick different resource binding mechanisms to run an application efficiently on specific Intel’s GPUs.

With the release of Windows* 10 on July 29 and the release of the 6th generation Intel® Core™ processor family (code-name Skylake), we can now look closer into resource binding specifically for Intel® platforms.

The previous article "Introduction to Resource Binding in Microsoft DirectX* 12" introduced the new resource binding methods in DirectX 12 and concluded that with all these choices, the challenge is to pick the most desirable binding mechanism for the target GPU, types of resources, and their frequency of update.

This article describes how to pick different resource binding mechanisms to run an application efficiently on specific Intel’s GPUs.

Tools of the Trade

To develop games with DirectX 12, you need the following tools:

Windows 10
Visual Studio* 2013 or higher
DirectX 12 SDK comes with Visual Studio
DirectX 12-capable GPU and drivers

Overview

A descriptor is a block of data that describes an object to the GPU, in a GPU-specific opaque format. DirectX 12 offers the following descriptors, previously named "resource views" in DirectX 11:

Constant buffer view (CBV)
Shader resource view (SRV)
Unordered access view (UAV)
Sampler view (SV)
Render target view (RTV)
Depth stencil view (DSV)
and others

These descriptors or resource views can be considered a structure (also called a block) that is consumed by the GPU front end. The descriptors are roughly 32–64 bytes in size and hold information like texture dimensions, format, and layout.

Descriptors are stored in a descriptor heap, which represents a sequence of structures in memory.

A descriptor table holds offsets into this descriptor heap. It maps a continuous range of descriptors to shader slots by making them available through a root signature. This root signature can also hold root constants, root descriptors, and static samplers.

Figure 1. Descriptors, descriptor heap, descriptor tables, root signature.

Figure 1 shows the relationship between descriptors, a descriptor heap, descriptor tables, and the root signature.

The code that Figure 1 describes looks like this:

// the init function sets the shader registers
// parameters: type of descriptor, num of descriptors, base shader register
// the first descriptor table entry in the root signature in
// image 1 sets shader registers t1, b1, t4, t5
// performance: order from most frequent to least frequent used
D3D12_DESCRIPTOR_RANGE Param0Ranges[3]; 
Param0Ranges[0].Init(D3D12_DESCRIPTOR_RANGE_SRV, 1, 1); // t1 Param0Ranges[1].Init(D3D12_DESCRIPTOR_RANGE_CBV, 1, 1); // b1 Param0Ranges[2].Init(D3D12_DESCRIPTOR_RANGE_SRV, 2, 4); // t4-t5  

// the second descriptor table entry in the root signature
// in image 1 sets shader registers u0 and b2
D3D12_DESCRIPTOR_RANGE Param1Ranges[2]; Param1Ranges[0].Init(D3D12_DESCRIPTOR_RANGE_UAV, 1, 0); // u0 Param1Ranges[1].Init(D3D12_DESCRIPTOR_RANGE_CBV, 1, 2); // b2  

// set the descriptor tables in the root signature
// parameters: number of descriptor ranges, descriptor ranges, visibility
// visibility to all stages allows sharing binding tables
// with all types of shaders
D3D12_ROOT_PARAMETER Param[4]; 
Param[0].InitAsDescriptorTable(3, Param0Ranges, D3D12_SHADER_VISIBILITY_ALL); 
Param[1].InitAsDescriptorTable(2, Param1Ranges, D3D12_SHADER_VISIBILITY_ALL); // root descriptor
Param[2].InitAsShaderResourceView(1, 0); // t0
// root constants
Param[3].InitAsConstants(4, 0); // b0 (4x32-bit constants)

// writing into the command list
cmdList->SetGraphicsRootDescriptorTable(0, [srvGPUHandle]); 
cmdList->SetGraphicsRootDescriptorTable(1, [uavGPUHandle]);
cmdList->SetGraphicsRootConstantBufferView(2, [srvCPUHandle]);
cmdList->SetGraphicsRoot32BitConstants(3, {1,3,3,7}, 0, 4);

The source code above sets up a root signature that has two descriptor tables, one root descriptor, and one root constant. The code also shows that root constants have no indirection and are directly provided with the SetGraphicsRoot32bitConstants call. They are routed directly into the shader registers; there is no actual constant buffer, constant buffer descriptor, or binding happening. Root descriptors have only one level of indirection, because they store a pointer to memory (descriptor->memory), and descriptor tables have two levels of indirection (descriptor table -> descriptor-> memory).

Descriptors live in different heaps depending on their types, such as SV and CBV/SRV/UAV. This is due to wildly inconsistent sizes of descriptor types on different hardware platforms. For each type of descriptor heap, there should be only one heap allocated because changing heaps could be expensive.

In general DirectX 12 offers an allocation of more than one million descriptors upfront, enough for a whole game level. While previous DirectX versions dealt with allocations in the driver on their own terms, with DirectX 12 it is possible to avoid any allocations during runtime. That means any initial allocation of a descriptor can be taken out of the performance "equation."

Note: With 3rd generation Intel® Core™ processors (code-name Ivy Bridge)/4th generation Intel® Core™ processor family (code-name Haswell) and DirectX 11 and the Windows Display Driver Model (WDDM) version 1.x, resources were dynamically mapped into memory based on the resources referenced in the command buffer with a page table mapping operation. This way copying data was avoided. The dynamic mapping was important because those architectures only offer 2 GB of memory to the GPU (Intel® Xeon® processor E3-1200 v4 product family (code-name Broadwell) offers more).
With DirectX 12 and WDDM version 2.x, it is no longer possible to remap resources into the GPU virtual address space as necessary, because resources have to be assigned a static virtual address when created and therefore the virtual address of resources cannot change after creation. Even if a resource is "evicted" from GPU memory, it maintains its virtual address for later when it is made resident again.
Therefore the overall available memory of 2 GB in Ivy Bridge/Haswell can become a limiting factor.

As stated in the previous article, a perfectly reasonable outcome for an application might be a combination of all types of bindings: root constants, root descriptors, descriptor tables for descriptors gathered on-the-fly as draw calls are issued, and dynamic indexing of large descriptor tables.

Different hardware architectures will show different performance trade-offs between using sets of root constants and root descriptors versus using descriptor tables. Therefore it might be necessary to tune the ratio between root parameters and descriptor tables depending on the hardware target platforms.

Expected Patterns of Change

To understand which kinds of change incur an additional cost, we have to analyze first how game engines typically change data, descriptors, descriptor tables, and root signatures.

Let’s start with what is called constant data. Most game engines store usually all constant data in "system memory." The game engine will change data in CPU accessible memory and then later on during the frame, a whole block of constant data is copied/mapped into GPU memory and then read by the GPU through a constant buffer view or through the root descriptor.

If the constant data is provided through SetGraphicsRoot32BitConstants() as a root constant, the entry in the root descriptor does not change but the data might change. If it is provided through a CBV == descriptor and then a descriptor table, the descriptor doesn’t change but the data might change.

In case we need several constant buffer views—for example, for double or triple buffered rendering— the CBV or descriptor might change for each frame in the root signature.

For texture data, it is expected that the texture is allocated in GPU memory during startup. Then an SV == descriptor will be created, stored in a descriptor table or a static sampler, and then referenced in the root descriptor. The data and the descriptor or static sample do not change after that.

For dynamic data like changing texture or buffer data (for example, textures with rendered localized text, buffers of animated vertices or procedurally generated meshes), we allocate a render target or buffer, provide an RTV or UAV, which are descriptors, and then these descriptors might not change from there on. The data in the render target or buffer might change.

In case we need several render targets or buffers—for example, for double or triple buffered rendering—the descriptors might change for each frame in the root signature.

For the following discussion, a change is considered important for binding resources if it does the following:

Changes/replaces a descriptor in a descriptor table, for example, the CBVs, RTVs, or UAVs described above
Changes any entry in the root signature

Descriptors in Descriptor Tables with Haswell/Broadwell

On platforms based on Haswell/Broadwell, the cost of changing one descriptor table in the root signature is equivalent to changing all descriptor tables. Changing one argument means that the hardware has to make a copy (version) of all the current arguments. The number of root parameters in a root signature is the amount of data that the hardware has to version when any subset changes.

Note: All the other types of memory in DirectX 12, like descriptor heaps, buffer resources, and so on, are not versioned by hardware.

In other words, changing all of the parameters is roughly the same cost as just changing one (see [Lauritzen] and [MSDN]). Changing none is still the cheapest, but not that useful.

Note: Other hardware, that has for example a split between fast / slow (spill) root argument storage only has to version the region of memory where the argument changed – either the fast area or the spill area.

On Haswell/Broadwell, an additional cost of changing descriptor tables can come from the limited size of the binding table in hardware.

Descriptor tables on those hardware platforms use "binding table" hardware. Each binding table entry is a single DWORD that can be considered an offset into the descriptor heap. The 64 KB ring can store 16,384 binding table entries.

In other words the amount of memory consumed per draw call is dependent on the total number of descriptors that are indexed in a descriptor table and then referenced through a root signature.

In case we run out of the 64 KB memory for the binding table entries, the driver will allocate another 64 KB binding table. The switch between those tables leads to a pipeline stall as shown in Figure 2.

Figure 2. Pipeline stall (courtesy of Andrew Lauritzen).

For example a root signature references 64 descriptors in a descriptor table. The stall will happen every 16,384 / 64 = 256 draw calls.

Because changing a root signature is considered cheap, having multiple root signatures with a low number of descriptors in the descriptor table is favorable over having root signatures with a larger amount of descriptors in the descriptor table.

Therefore it is favorable on Haswell/Broadwell to keep the number of descriptors referenced in descriptor tables as low as possible.

What does this mean for renderer designs? Using more descriptor tables with less descriptors and therefore more root signatures should increase the number of pipeline state objects (PSO), because with an increased number of root signatures the number of PSOs needs to increase because of the one-to-one relationship between these two.

Having more pipeline state objects might lead to a larger number of shaders that, in this case, might be more specialized, instead of longer shaders that offer a wider range of features, which is the common recommendation.

Root Constants/Descriptors on Haswell/Broadwell

Similar to where changing one descriptor table is the same cost compared to changing all of them, changing one root constant or root descriptor is the equivalent to changing all of them (see [Lauritzen]).

Root constants are implemented with "push constants" that are a buffer that hardware uses to prepopulate Execution Unit (EU) registers. Because the values are immediately available when the EU thread launches, it can be a performance win to store constant data as root constants, instead of storing them with descriptor tables.

Root descriptors are implemented as "push constants" as well. They are just pointers passed as constants to the shader, reading data through the general memory path.

Descriptor Tables versus Root Constants/Descriptors on Haswell/Broadwell

Now that we looked at the way descriptor tables, root constants, and descriptors are implemented, we can answer the main question of this article: is one favorable over the other? Because of the limited size of binding table hardware and the potential stalls resulting from crossing this limit, changing root constants and root descriptors is expected to be cheaper on Haswell/Broadwell hardware because they do not use the binding table hardware. For root descriptors and root constants, this is especially recommended in case the data changes every draw call.

Static Samplers on Haswell/Broadwell

As described in the previous article, it is possible to define samplers in the root signature or right in the shader with HLSL root signature language. These are called static samplers.

On Haswell/Broadwell hardware, the driver will place static samplers in the regular sampler heap. This is equivalent to putting them into descriptors manually. Other hardware implements samplers in shader registers, so static samplers can be compiled directly into the shader.

In general static samplers should be a win on many platforms, so there is no downside to using them. On Haswell/Broadwell hardware there is still the chance that by increasing the number of descriptors in a descriptor table, we end up more often with a pipeline stall, because descriptor table hardware has only 16,384 slots to offer.

Here is the syntax for a static sampler in HLSL:

StaticSampler( sReg,
               [ filter = FILTER_ANISOTROPIC, 
               addressU = TEXTURE_ADDRESS_WRAP,
               addressV = TEXTURE_ADDRESS_WRAP,
               addressW = TEXTURE_ADDRESS_WRAP,
               mipLODBias = 0.f,     maxAnisotropy = 16,
               comparisonFunc = COMPARISON_LESS_EQUAL,
               borderColor = STATIC_BORDER_COLOR_OPAQUE_WHITE,
               minLOD = 0.f, maxLOD = 3.402823466e+38f,
               space = 0, visibility = SHADER_VISIBILITY_ALL ])

Most of the parameters are self-explanatory because they are similar to the C++ level usage. The main difference is the border color: on the C++ level it offers a full color range while the HLSL level is restricted to opaque white/black and transparent black. An example for a static shader is:

StaticSampler(s4, filter=FILTER_MIN_MAG_MIP_LINEAR)

Skylake

Skylake allows dynamic indexing of the entire descriptor heap (~1 million resources) in one descriptor table. That means one descriptor table could be enough to index all the available descriptor heap memory.

Compared to previous architectures, it is not necessary to change descriptor table entries in the root signature as often. That also means that the number of root signatures can be reduced. Obviously different materials will require different shaders and therefore different PSOs. But those PSOs can reference the same root signatures.

With modern rendering engines utilizing less shaders than their DirectX 9 and 11 ancestors so that they can avoid the cost of changing shaders and the attached states, reducing the number of root signatures and therefore the number of PSOs is favorable and should result in a performance gain on any hardware platform.

Conclusion

Focusing on Haswell/Broadwell and Skylake, the recommendation for developing performant DirectX 12 applications are dependent on the underlying platform. While for Haswell/Broadwell, the number of descriptors in a descriptor table should be kept low, for Skylake it is recommended to keep this number high and decrease the number of descriptor tables.

To achieve optimal performance, the application programmer can check during startup for the type of hardware and then pick the most efficient resource binding pattern. (There is a GPU detect example that shows how to detect different Intel® hardware architectures at https://software.intel.com/en-us/articles/gpu-detect-sample/) The choice of resource binding pattern will influence how shaders for the system are written.

About the Author

Wolfgang is the CEO of Confetti. Confetti is a think-tank for advanced real-time graphics research and a service provider for the video game and movie industry. Before cofounding Confetti, Wolfgang worked as the lead graphics programmer in Rockstar's core technology group RAGE for more than four years. He is the founder and editor of the ShaderX and GPU Pro books series, a Microsoft MVP, the author of several books and articles on real-time rendering and a regular contributor to websites and conferences worldwide. One of the books he edited, ShaderX4, won the Game developer Front line award in 2006. Wolfgang is in many advisory boards throughout the industry; one of them is the Microsoft’s Graphics Advisory Board for DirectX 12. He is an active contributor to several future standards that drive the Game Industry. You can find him on twitter at wolfgangengel. Confetti's website is www.conffx.com

Acknowledgement

I would like to thank the reviewers of this article:

Andrew Lauritzen
Robin Green
Michal Valient
Dean Calver
Juul Joosten
Michal Drobot

References and Related Links

[Lauritzen] Andrew Lauritzen et al., "Efficient Rendering with DirectX 12 on Intel Graphic", GDC 2015
[MSDN] MSDN, "Advanced use of Descriptor Tables," https://msdn.microsoft.com/en-us/library/windows/desktop/dn859250(v=vs.85).aspx
Microsoft DirectX blog: http://blogs.msdn.com/b/directx/
DirectX 12 on Twitter: @DirectX12 https://twitter.com/DirectX12
Direct3D* 12 - Console API Efficiency & Performance on PCs (https://software.intel.com/en-us/articles/console-api-efficiency-performance-on-pcs)
Microsoft DirectX 12 Graphics Education (YouTube channel): https://www.youtube.com/channel/UCiaX2B8XiXR70jaN7NK-FpA

** Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark* and MobileMark*, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.