Vega packed 16-bit float performance

I'm currently optimising a realtime path tracer and want to squeeze a bit more performance out of the higher-end Vega cards. These cards can process two packed 16-bit floats in place of a single 32-bit float, potentially doubling performance (or better, thanks to lower register use) where 16 bits of precision is enough.


Is there any info about this on macOS? Or is it just a case of changing data types to packed half?


Also, is there a recommended way to detect cards that support this? AFAIK doing this on older cards is going to hurt performance, meaning it's going to be necessary to have separate shaders for older GPUs.

Accepted Reply

Ok, so I've kind of figured this out now. I was able to run a GCN shader disassembler on my app's shader cache, and the Vega ISA is public. It turned out some of my code was using the fast packed 16-bit operations, some wasn't, and there were additional conversion operations that likely accounted for the performance hit.


If anyone else is looking at this: basically you need to use packed_half2/4 types and limit your code to add, mul, fma, min and max. A bit more is possible with integer types - see AMD's Vega ISA document for details. It's certainly useful for ML-type work, less so for general shaders. You'll probably want to profile on older AMD cards, Intel and Nvidia, because using this might lead to a performance drop (or gain!) depending on the hardware.
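

For reference, here's a minimal Metal sketch of the kind of code that stays inside that subset (the kernel and buffer names are made up for illustration; whether the compiler actually emits the packed instructions is something you'd still want to confirm in the disassembly):


#include <metal_stdlib>
using namespace metal;

// Sketch only: keep the arithmetic to add/mul/fma/min/max on half vectors so
// the compiler can, in principle, map pairs of 16-bit lanes onto Vega's packed
// instructions (v_pk_fma_f16, v_pk_min_f16, ...), and avoid float<->half
// conversions in the inner loop.
kernel void scale_bias_clamp(device packed_half4 *dst [[buffer(0)]],
                             const device packed_half4 *src [[buffer(1)]],
                             constant half4 &scale [[buffer(2)]],
                             constant half4 &bias [[buffer(3)]],
                             uint gid [[thread_position_in_grid]])
{
    half4 v = half4(src[gid]);                  // stay in 16-bit, no float round trip
    v = fma(v, scale, bias);                    // packable multiply-add
    v = min(max(v, half4(0.0h)), half4(1.0h));  // packable min/max (a saturate)
    dst[gid] = packed_half4(v);
}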

Replies

IMO, hardware-specific optimizations like this rarely work out when they're based on assumptions alone. Grab a debug capture, change the shader code during execution, and observe the result. Metal's debug capture gives you a nice statistics profile when you hot-reload shader code, and shows the deltas in how occupied the different parts of the GPU are.

Yeah, this is why I'd like some solid information about what works and what doesn't - all I have to go on is AMD's product info and docs, and the fact that it kind of looks like it should be supported on macOS.


I have my own shader editing tool, and by running a very simple test (take a 4-component vector and sin() it ~5000 times) I can compare the results. With a float4, performance is ~25% higher than with packed_half4. There's no performance difference between half4 and packed_half4.
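

Roughly, the test kernel looks like this - a simplified sketch with made-up names rather than the exact code from my tool; swap float4 for half4 (or load/store packed_half4 and do the maths in half4) to compare the variants:


#include <metal_stdlib>
using namespace metal;

// Simplified sketch of the test: apply sin() to a 4-component vector ~5000
// times and write the result out so the loop can't be optimised away.
kernel void sin_loop_test(device float4 *out [[buffer(0)]],
                          uint gid [[thread_position_in_grid]])
{
    float4 v = float4(gid) * 0.001f;
    for (uint i = 0; i < 5000; ++i) {
        v = sin(v);
    }
    out[gid] = v;
}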

You could also try using sin() on a different range of values to get more complete results. I found that a precise::sqrt() calculation took a significantly different amount of time on Intel 5xx/6xx depending on the value you fed it (and I assume both functions are evaluated iteratively), so the argument value might also have an impact. If that makes the sine calculations more expensive, the conversion overhead might become less significant.
Also, could you give me some tips on disassembling AMD shaders? I didn't know you could get the actual GPU-ready code from Metal, and it seems really useful for debugging some driver-related bugs.
EDIT: I was able to disassemble some shaders from the cache by borrowing some parts of the Pyramid disassembler, but it still requires a little bit of guesswork to figure out where the actual code starts. That's good enough for my purposes, but it would be great to know if someone has managed to actually reverse-engineer the format of functions.[data|map]

sin() isn't going to help in any case here, as I'm trying to use the packed math features to increase ALU throughput, and there's no equivalent to sin() for packed 16-bit (i.e. it would just run sin() separately on each value rather than doing two at once).
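

To illustrate with a made-up function (going by the Vega ISA doc rather than anything measured), the first two lines below can each map to a single packed instruction covering two half lanes, while the sin() call has no packed form and is evaluated per component:


#include <metal_stdlib>
using namespace metal;

// Illustration only: fma and min have packed 16-bit forms (v_pk_fma_f16,
// v_pk_min_f16), but there is no packed sin, so the last line gains nothing
// from packed math.
half4 mostly_packed(half4 v, half4 scale, half4 bias)
{
    v = fma(v, scale, bias);
    v = min(v, half4(1.0h));
    return sin(v);
}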


To disassemble AMD shaders, you need to get the CLRX (CLRadeonExtender) disassembler from here:


https://github.com/CLRX/CLRX-mirror


After that, you need to actually run the shader (or perhaps building the pipeline state object is enough; I haven't checked). It's highly advisable to do this in a small sample app that ONLY has that one pipeline state and shader, or you'll end up with a huge disassembly file. Doing that compiles the shader, which gets helpfully cached. You can find the cache location with this command:


getconf DARWIN_USER_CACHE_DIR


Inside there you should find your app's cache, then the GPU you're interested in, and finally the 'functions.data' file, which is the raw shader binary (plus, unfortunately, other stuff). You can then disassemble it with:


clrxdisasm -g vega10 -r functions.data


(You'll need to replace 'vega10' with your GPU architecture; 'vega10' is right for a Vega 56/64.)


That gives you the assembly, plus some garbage around it (other data in the file gets interpreted as shader code too). I find it best to search for "s_endpgm" (end of shader program) and work backwards. There will probably be two or more shaders in there; hopefully you can spot some obvious code to tell which one you want.


Finally you'll want the relevant AMD ISA document, which is easily found on the web. The Vega ISA is here:


Vega Instruction Set Architecture (AMD PDF)

Thanks for such a detailed reply!
It seems like a much cleaner approach than the one I eventually took.