Surprise Season 4! Where I try to port a BC7 texture compressor to a compute shader, and after going in circles, return to Metal. Previous season: https://twitter.com/aras_p/status/1342920648897785856 1/n
In previous episodes: DX11 was a pain, with the shader compiler (fxc) taking 10 minutes to compile a very stripped-down version of the compressor, and failing to compile a fuller version of the compressor code. 2/n
That made me look at Vulkan, write "baby's first vulkan code" etc., try out two of the possible shader compiler toolchains (DXC and glslang), both with some issues. I can't get the same results as the CPU compressor with either Vulkan or DX11, 3/n
And I'm not sure whether this is due to my PC GPU (I only have one here), or such differences in precision/rounding/whatever/something are allowed by the specs etc. Trying to optimize something that does not even produce correct results is meh, so... 4/n
Back to Metal it is! Season 1 was where Metal was producing the same bit-exact results as the CPU compressor (great!), all the way from default/basic to "ultrafast" quality levels. Issues were a 10 minute shader load time (1st time, later Metal caches kick in), and... 5/n
The compute shader compression performance was not great, going at about 50-70% performance of the CPU compressor. All this trouble for a slower compressor is not excellent. Let's see what we can do! 6/n
My understanding of shader compiler stacks gives an impression that they tend to have "much worse than linear" compile times on some "complex" code. What exactly is "complex" is nebulous, but let's try to split up our giant compute shader. 7/n
The compressor code is essentially done like: compress with BC7 mode 6, measure error. Then try mode 1, if error is lower then use that. Then try mode 0, if error is lower then use that. And so on.
Let's split up into one shader per BC7 mode, saving temp results & error. 8/n
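A rough sketch of that restructuring, in Python rather than shader code. Everything here is hypothetical and only illustrates the control flow: `make_toy_encoder` is a made-up stand-in for a real BC7 mode encoder, and `BestSoFar` stands in for whatever the real temporary buffer stores per block. The point is that running one pass per mode, each pass updating a "best so far" record, should pick the same winner as the single giant loop.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Block = List[int]  # stand-in for the pixel data of one 4x4 block
ModeEncoder = Callable[[Block], Tuple[bytes, int]]  # returns (encoded, error)


@dataclass
class BestSoFar:
    """Hypothetical per-block temp record: best error, its mode, its encoding."""
    error: int
    mode: int
    encoded: bytes


def make_toy_encoder(mode: int) -> ModeEncoder:
    """Made-up encoder: quantizes values with a mode-dependent step (NOT real BC7)."""
    step = mode + 2

    def encode(block: Block) -> Tuple[bytes, int]:
        quantized = [p - p % step for p in block]
        error = sum((p - q) ** 2 for p, q in zip(block, quantized))
        return bytes([mode]) + bytes(quantized), error

    return encode


def compress_monolithic(blocks: Sequence[Block],
                        encoders: Dict[int, ModeEncoder],
                        mode_order: Sequence[int]) -> List[BestSoFar]:
    """One giant "shader": for each block, try every mode and keep the best."""
    out = []
    for block in blocks:
        best: Optional[BestSoFar] = None
        for mode in mode_order:
            encoded, error = encoders[mode](block)
            if best is None or error < best.error:
                best = BestSoFar(error, mode, encoded)
        out.append(best)
    return out


def run_mode_pass(blocks, temp, mode, encoder) -> None:
    """One small "shader": try a single mode on all blocks, update temp in place."""
    for i, block in enumerate(blocks):
        encoded, error = encoder(block)
        if temp[i] is None or error < temp[i].error:
            temp[i] = BestSoFar(error, mode, encoded)


def compress_split(blocks, encoders, mode_order) -> List[BestSoFar]:
    """Same result as compress_monolithic, but as one pass per mode."""
    temp: List[Optional[BestSoFar]] = [None] * len(blocks)
    for mode in mode_order:
        run_mode_pass(blocks, temp, mode, encoders[mode])
    return temp
```

Both versions use the same strict "lower error wins" comparison in the same mode order, so ties resolve the same way and the outputs match.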
This means we need a new temporary buffer for compression, but that after some packing looks to be about 4 bytes/pixel (i.e. same extra space as input texture), so not terrible.
So a giant shader gets split up into like 10 smaller compute shaders, and... 9/n
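To put the temporary buffer size in perspective, here is a small back-of-the-envelope calculation. The 4 bytes/pixel figures for input and temp come from the thread; BC7 storing each 4x4 block in 16 bytes is part of the format. The function name and the choice to size the temp buffer per padded block are my assumptions, not the actual layout used.

```python
BLOCK_DIM = 4               # BC7 works on 4x4 pixel blocks
BC7_BLOCK_BYTES = 16        # each compressed BC7 block is 128 bits
INPUT_BYTES_PER_PIXEL = 4   # input texture, per the thread
TEMP_BYTES_PER_PIXEL = 4    # packed temp data, per the thread ("about 4 bytes/pixel")


def buffer_sizes(width: int, height: int) -> dict:
    """Return approximate buffer sizes in bytes for a width x height texture."""
    blocks_x = (width + BLOCK_DIM - 1) // BLOCK_DIM   # round up to whole blocks
    blocks_y = (height + BLOCK_DIM - 1) // BLOCK_DIM
    blocks = blocks_x * blocks_y
    pixels_per_block = BLOCK_DIM * BLOCK_DIM
    return {
        "input": width * height * INPUT_BYTES_PER_PIXEL,
        # Assumption: temp storage is allocated per (padded) block.
        "temp": blocks * pixels_per_block * TEMP_BYTES_PER_PIXEL,
        "output": blocks * BC7_BLOCK_BYTES,
        "temp_per_block": pixels_per_block * TEMP_BYTES_PER_PIXEL,
    }
```

So under these assumptions the temp data works out to 64 bytes per block, four times the size of the final compressed output.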
Looks pretty good!
- "fast" modes only: shader loads in 7 seconds (was 40s), runs at 60 Mpix/s (was 41 Mpix/s). CPU runs at 52 Mpix/s. We're a bit faster than CPU!
- "basic"/default quality mode: shader loads in 59 seconds (was 600s), runs at 45 Mpix/s (was 30 Mpix/s). CPU runs at 35 Mpix/s. We're faster here too! The shader still takes a minute to load, but much better than 10 minutes, eh. 11/n
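For the ratio-minded, the numbers above can be turned into speedup factors with a few lines of Python. The measurements are copied from the two bullets; the dictionary keys and the `summarize` helper are just names made up for this sketch.

```python
# Measurements from the two bullets above (load times in seconds, speeds in Mpix/s).
results = {
    "fast": {"load_before_s": 40, "load_after_s": 7,
             "gpu_before": 41, "gpu_after": 60, "cpu": 52},
    "basic": {"load_before_s": 600, "load_after_s": 59,
              "gpu_before": 30, "gpu_after": 45, "cpu": 35},
}


def summarize(r: dict) -> dict:
    """Turn raw before/after numbers into ratios (higher is better)."""
    return {
        "load_speedup": r["load_before_s"] / r["load_after_s"],
        "gpu_speedup": r["gpu_after"] / r["gpu_before"],
        "gpu_vs_cpu": r["gpu_after"] / r["cpu"],
    }


if __name__ == "__main__":
    for name, r in results.items():
        s = summarize(r)
        print(f"{name}: load {s['load_speedup']:.1f}x faster, "
              f"GPU {s['gpu_speedup']:.2f}x faster than before, "
              f"{s['gpu_vs_cpu']:.2f}x the CPU speed")
```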
A minute of shader load time is still long though. Then I try to manually re-unroll some of the almost identical calculation blocks in the original compressor code (e.g. the 3 almost identical m_uber1_mask branches around here https://github.com/BinomialLLC/bc7e/blob/f62e4617e1/bc7e.ispc#L2234)... 13/n
A bunch of code merging and shuffling like that later... all the shaders load in 11 seconds! The run time got slightly slower though, but hey at around 10 seconds per shader edit life's good. Certainly better than 10 minutes per shader edit. 14/n
Let's see about performance. Apple Xcode has a Metal frame capture that can also display "some performance things". Oh Xcode, what thou shalt say? The UI does look pretty, but it just says "low occupancy". How low? It won't say :) 15/n
OK, so the code uses "way too many variables" to run at good occupancy; that was my guess too. It would be great if the tools showed the actual number of registers (here it just says "100% of vector register pressure" -- 100% of *what*?), 16/n
Or perhaps point me to "hey look, this piece of code here contributes umpteen registers, you might want to look at it" or similar. But I guess I'm out of luck.
(this is on Xcode 12, macOS Catalina, Radeon GPU -- maybe on Apple's own GPU it shows better info? no idea) 17/n
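For context on why register count matters for occupancy, here is a simplified model. The default numbers are assumptions in the style of AMD GCN hardware (a budget of 256 vector registers per lane per SIMD, at most 10 waves in flight per SIMD, registers allocated in groups of 4); I don't know the actual limits of the GPU used here, and real hardware has other limiting factors (scalar registers, local memory) that this ignores.

```python
def waves_per_simd(vgprs_per_thread: int,
                   vgpr_budget: int = 256,   # assumption: GCN-style budget
                   max_waves: int = 10,      # assumption: GCN-style wave limit
                   granularity: int = 4) -> int:
    """Estimate how many waves fit on one SIMD, limited by vector registers only."""
    if vgprs_per_thread < 1:
        raise ValueError("a shader uses at least one register")
    # Round the register count up to the allocation granularity.
    allocated = -(-vgprs_per_thread // granularity) * granularity
    if allocated > vgpr_budget:
        raise ValueError("does not fit; a real compiler would spill to memory")
    return min(max_waves, vgpr_budget // allocated)


def occupancy(vgprs_per_thread: int,
              vgpr_budget: int = 256,
              max_waves: int = 10,
              granularity: int = 4) -> float:
    """Fraction of the maximum wave count that can be resident at once."""
    waves = waves_per_simd(vgprs_per_thread, vgpr_budget, max_waves, granularity)
    return waves / max_waves
```

In this model a shader using 24 registers can still run at full occupancy, while one using 128 fits only 2 waves out of 10, which is the kind of number it would be nice for the tools to show.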
Anyway, now with the giant shader split into multiple smaller ones, with compile/load times not terrible, I guess I could continue here trying to manually optimize it. Or do the same shader split in either DX11 or Vulkan and see how it goes there. 18/n
But all of that will have to be next year! Adieu 2020, you won't be missed. That's it for now, see you later. 19/19