Let me explain the PCI-E resizable BAR: https://docs.microsoft.com/en-us/windows-hardware/drivers/display/resizable-bar-support
Why is it better to have the full GPU memory visible from the CPU side, compared to a small 256 MB region?
Thread...
Traditionally people allocate an upload heap, which is CPU system memory visible to the GPU.
The CPU writes data there, and the GPU can read the data directly over the PCI-E bus. Recently I measured 28 GB/s GPU read bandwidth from CPU system memory over PCI-E 4.0.
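A minimal D3D12 sketch of that pattern (helper names are my own, error handling and real sizes omitted): allocate a buffer on an upload heap, map it, and write from the CPU; the GPU then reads that memory over PCI-E.

```cpp
// Sketch only: upload heap buffer creation + CPU write (D3D12).
#include <d3d12.h>
#include <cstring>

ID3D12Resource* CreateUploadBuffer(ID3D12Device* device, UINT64 sizeInBytes)
{
    D3D12_HEAP_PROPERTIES heapProps = {};
    heapProps.Type = D3D12_HEAP_TYPE_UPLOAD;          // CPU system memory, GPU-visible over PCI-E

    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width = sizeInBytes;
    desc.Height = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels = 1;
    desc.SampleDesc.Count = 1;
    desc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;     // required layout for buffers

    ID3D12Resource* buffer = nullptr;
    device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE, &desc,
                                    D3D12_RESOURCE_STATE_GENERIC_READ, nullptr,
                                    __uuidof(ID3D12Resource),
                                    reinterpret_cast<void**>(&buffer));
    return buffer;
}

void WriteToUploadBuffer(ID3D12Resource* buffer, const void* data, size_t sizeInBytes)
{
    // Upload heaps can stay persistently mapped. The GPU reads this memory
    // directly over the PCI-E bus when a shader or copy accesses it.
    void* mapped = nullptr;
    D3D12_RANGE readRange = { 0, 0 };                 // the CPU won't read this memory back
    buffer->Map(0, &readRange, &mapped);
    std::memcpy(mapped, data, sizeInBytes);
    buffer->Unmap(0, nullptr);
}
```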
The two most common use cases are:
1. Dynamic data: CPU writes to upload heap. GPU reads it from there directly in pixel/vertex/compute shader. Examples: constant buffers, dynamic vertex data...
2. Static data: CPU writes to upload heap. GPU timeline copy to a GPU resource (see the sketch after this list).
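For use case 2, a hedged D3D12 sketch of the GPU timeline copy (the resources, command list and synchronization are assumed to exist elsewhere; the barrier states are illustrative):

```cpp
// Sketch only: copy static data from the upload heap (CPU memory) to a
// DEFAULT heap resource (VRAM) on the GPU timeline.
#include <d3d12.h>

void UploadStaticBuffer(ID3D12GraphicsCommandList* cmdList,
                        ID3D12Resource* gpuBuffer,     // D3D12_HEAP_TYPE_DEFAULT (GPU memory)
                        ID3D12Resource* uploadBuffer,  // D3D12_HEAP_TYPE_UPLOAD (CPU memory)
                        UINT64 sizeInBytes)
{
    // The copy itself travels over the same PCI-E bus as a direct shader read,
    // but afterwards the GPU reads the data at full VRAM bandwidth.
    cmdList->CopyBufferRegion(gpuBuffer, 0, uploadBuffer, 0, sizeInBytes);

    // Transition the destination so shaders can read it.
    D3D12_RESOURCE_BARRIER barrier = {};
    barrier.Type = D3D12_RESOURCE_BARRIER_TYPE_TRANSITION;
    barrier.Transition.pResource = gpuBuffer;
    barrier.Transition.Subresource = D3D12_RESOURCE_BARRIER_ALL_SUBRESOURCES;
    barrier.Transition.StateBefore = D3D12_RESOURCE_STATE_COPY_DEST;
    barrier.Transition.StateAfter = D3D12_RESOURCE_STATE_NON_PIXEL_SHADER_RESOURCE;
    cmdList->ResourceBarrier(1, &barrier);
}
```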
Reading from CPU memory on the GPU side has limited bandwidth: around 28 GB/s over PCI-E 4.0 and 14 GB/s over PCI-E 3.0. Thus for all persistent GPU data, you want to copy it from the upload heap (in CPU memory) to GPU memory (200-1000 GB/s). Only keep single-use data in the upload heap.
Why don't you copy single-use data to GPU memory too? The copy operation is also restricted by the same PCI-E bus bandwidth, so the copy would likely take the same time as reading the data in the shader. There's no gain here. Only copy if you are using the data repeatedly...
GPUs have already exposed a 256 MB pool of CPU-visible memory. While this pool is great for small dynamic data, it doesn't help us with the extra copy operation we must do for static/reusable data. 256 MB is way too small for big static data such as textures.
The biggest gain from RBAR is thus that we can access the whole GPU memory from the CPU side. This means we can write persistent data directly from the CPU to GPU memory, without needing an additional copy. The data written by the CPU will be immediately optimal for the GPU to access!
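In Vulkan this shows up as a memory type that is both DEVICE_LOCAL and HOST_VISIBLE. Without RBAR that heap is typically only the 256 MB window; with RBAR it can cover (nearly) the whole VRAM. A minimal sketch for picking such a memory type (the helper name is my own):

```cpp
// Sketch only: find a CPU-visible VRAM memory type (Vulkan).
#include <vulkan/vulkan.h>
#include <cstdint>

int32_t FindDeviceLocalHostVisibleMemoryType(VkPhysicalDevice physicalDevice,
                                             uint32_t memoryTypeBits)
{
    VkPhysicalDeviceMemoryProperties memProps;
    vkGetPhysicalDeviceMemoryProperties(physicalDevice, &memProps);

    const VkMemoryPropertyFlags wanted =
        VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT |   // lives in VRAM: fast for the GPU
        VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT |   // mappable by the CPU through the BAR
        VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;   // no manual flushes needed

    for (uint32_t i = 0; i < memProps.memoryTypeCount; ++i)
    {
        const bool allowed = (memoryTypeBits & (1u << i)) != 0;
        const bool matches = (memProps.memoryTypes[i].propertyFlags & wanted) == wanted;
        if (allowed && matches)
            return static_cast<int32_t>(i);
    }
    return -1; // no CPU-visible VRAM type available for this resource
}
```

Memory allocated from that type can be persistently mapped with vkMapMemory. CPU writes are typically write-combined: write sequentially and never read it back on the CPU.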
In some corner cases we also see benefits for dynamic data. If the dynamic data is read once and is cache friendly, there's no difference. However, if the dynamic data access pattern is not optimal, the GPU might need to load the same cache line multiple times over PCI-E.
RBAR helps with non-cache-friendly dynamic data (such as skinning matrices). Since you can fit all your dynamic data in GPU memory (without having to do gymnastics with the 256 MB pool), you guarantee that accessing that data is fast even when the memory access pattern is not optimal.
I recently made the mistake of putting my 1 million cube index buffer in GPU-visible CPU memory (the same kind of memory as a common upload heap). I was ONLY reading the index buffer from CPU memory, and even on PCI-E 4.0 this bottlenecked the rendering. Uploading it to GPU memory made it 5x faster.
This result shows that AMD's claims of "Smart Access Memory" providing up to 11% gains are completely reasonable. There are cases where PCI-E memory access is the bottleneck. Even if the throughput is not the bottleneck, the added latency causes stalls, which add up.
The benefit is largest for frames where you upload a lot of new data and avoid the GPU copies. This doesn't affect the average frame rate that much; it reduces stalls/judder and improves the minimum frame rate.
However, if you can avoid direct upload heap reads in shaders, you will see some performance gains even in frames where no loading happens. 11% is not unreasonable here at all, especially if those loads are not cache friendly. Even a constant buffer load over PCI-E has a small impact.
I can measure this once Nvidia releases their new driver that exposes RBAR on Vulkan.
If you intend to write texture data directly to GPU memory using RBAR, you need to use standard swizzle.
https://docs.microsoft.com/en-us/windows/win32/api/d3d12/ne-d3d12-d3d12_texture_layout
So far there's no Vulkan support for standard swizzle.
Related tweet: https://twitter.com/SebAaltonen/status/1327171850485559296?s=20
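A hedged D3D12 sketch of what that looks like (format and sizes here are illustrative; standard swizzle support is optional, so query it first):

```cpp
// Sketch only: request standard swizzle layout so the CPU can write texel
// data directly into a GPU-memory texture through the BAR.
#include <d3d12.h>

bool SupportsStandardSwizzle(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS, &options, sizeof(options));
    return options.StandardSwizzle64KBSupported != 0;
}

D3D12_RESOURCE_DESC MakeStandardSwizzleTextureDesc(UINT width, UINT height)
{
    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension = D3D12_RESOURCE_DIMENSION_TEXTURE2D;
    desc.Width = width;
    desc.Height = height;
    desc.DepthOrArraySize = 1;
    desc.MipLevels = 1;
    desc.Format = DXGI_FORMAT_R8G8B8A8_UNORM;
    desc.SampleDesc.Count = 1;
    // Documented, CPU-addressable swizzle pattern: texels can be written in
    // place without a GPU copy, instead of the driver's opaque layout.
    desc.Layout = D3D12_TEXTURE_LAYOUT_64KB_STANDARD_SWIZZLE;
    return desc;
}
```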