I have a Metal compute kernel for dense matrix multiplication, and I'd like to optimize it with simdgroup_float8x8 and simdgroup_half8x8.
However, it seems that no one has applied them in Metal code yet, or at least I can't find any examples.
Could you share some more examples of how to use them, other than the one in the Metal Shading Language Specification, Version 2.4? Thanks!