I would like to write a ReductionSum Metal Shader like this:
https://github.com/alibaba/MNN/blob/master/source/backend/metal/MetalReduction.metal#L32
Sometimes the reduced dimension is large while the other dimensions are small, so only a few threads can be launched and the kernel runs inefficiently.
Is there any way to optimize it?
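One direction that may help, sketched here under assumptions and not taken from MNN: let a whole threadgroup cooperate on each output element, so the number of threads no longer depends on the small dimensions. The kernel name reduce_sum_axis, the buffer indices, and the contiguous [outer, axis] layout are all hypothetical. The sketch also assumes a GPU and MSL version that support simd_sum.

```metal
#include <metal_stdlib>
using namespace metal;

// Assumed layout: `in` is [outer, axis_size] with the reduced axis contiguous.
// Assumed dispatch: one 1D threadgroup per outer index.
kernel void reduce_sum_axis(
    const device float *in         [[buffer(0)]],
    device float       *out        [[buffer(1)]],
    constant uint      &axis_size  [[buffer(2)]],
    uint tg_id      [[threadgroup_position_in_grid]],
    uint tid        [[thread_index_in_threadgroup]],
    uint tg_size    [[threads_per_threadgroup]],
    uint simd_lane  [[thread_index_in_simdgroup]],
    uint simd_id    [[simdgroup_index_in_threadgroup]],
    uint simd_count [[simdgroups_per_threadgroup]])
{
    // One partial sum per SIMD-group; 32 slots cover 1024 threads at width 32.
    threadgroup float partial[32];

    // Step 1: each thread strides over the reduced axis and accumulates privately.
    const device float *row = in + tg_id * axis_size;
    float acc = 0.0f;
    for (uint i = tid; i < axis_size; i += tg_size) {
        acc += row[i];
    }

    // Step 2: combine the private sums within each SIMD-group.
    acc = simd_sum(acc);
    if (simd_lane == 0) {
        partial[simd_id] = acc;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Step 3: the first SIMD-group combines the per-SIMD-group partial sums.
    if (simd_id == 0) {
        float v = (simd_lane < simd_count) ? partial[simd_lane] : 0.0f;
        v = simd_sum(v);
        if (simd_lane == 0) {
            out[tg_id] = v;
        }
    }
}
```

With this arrangement the host would launch as many threadgroups as there are output elements, each with a threadgroup size that is a multiple of the SIMD width. Whether it beats the one-thread-per-output version depends on how large the reduced dimension actually is.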
Captured from the WWDC 2020 video (session 10603), Optimize Metal apps and games with GPU counters:
How can I get information about the System Level Cache? Is it inside the SoC? If not, where is it?
To optimize buffer read, I intend to use threadgroup memory.
But it seems:
(1) There is no API like std::memcpy in MSL;
(2) Also, there is no API like [setBuffer: atIndex:] to set data for threadgroup memory.
The amount of data is 2 to 4 KB. What is the fastest way to copy data from device memory to threadgroup memory? Thanks!
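A common pattern for this, shown as a minimal sketch and not as a measured fastest path, is a cooperative copy: every thread in the threadgroup loads a strided share of the data as float4 values, then a barrier makes the tile visible to all threads. The kernel name, the TILE_ELEMS size, and the buffer layout are hypothetical, and src is assumed to hold at least TILE_ELEMS elements.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical tile size: 256 float4 values = 4 KB of threadgroup memory.
#define TILE_ELEMS 256

kernel void stage_in_threadgroup(
    const device float4 *src [[buffer(0)]],
    device float4       *dst [[buffer(1)]],
    uint tid     [[thread_index_in_threadgroup]],
    uint tg_size [[threads_per_threadgroup]],
    uint gid     [[thread_position_in_grid]])
{
    threadgroup float4 tile[TILE_ELEMS];

    // Cooperative copy: thread t loads elements t, t + tg_size, t + 2 * tg_size, ...
    // Neighbouring threads read neighbouring 16-byte values from device memory.
    for (uint i = tid; i < TILE_ELEMS; i += tg_size) {
        tile[i] = src[i];
    }

    // All threads wait here until the whole tile has been written.
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Placeholder use of the shared data: any thread may now read any tile element.
    dst[gid] = tile[gid % TILE_ELEMS];
}
```

On the host side, setThreadgroupMemoryLength:atIndex: on the compute command encoder only sizes a threadgroup buffer argument. It does not fill it with data, so the copy still has to happen inside the kernel as above.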
What I want to do is shown in the following two pictures:
My needs are:
(1) build a framework for macOS apps;
(2) compile a .metallib to speed up shader initialization;
(3) the framework can be used on old macOS versions that only support Metal 1 features;
(4) the framework uses new Metal 2 features on new macOS versions and M1 silicon.
What I know:
(1) When the Deployment Target is set, the MSL version adjusts to match it. This does not meet my needs, so I made the settings shown above.
(2) Apple has published a document on this issue, Detecting GPU Features and Metal Software Versions, but it does not fully resolve my concerns.
My concerns are:
(1) Can a .metallib compiled with a newer MSL version be loaded on an older macOS version?
(2) Will the App Store reject a macOS app that uses this kind of setting?
Help:
Can you tell me whether these settings are right, or suggest a better solution? Thanks!
OS: macOS 12.2.1
Hardware: MacBook Pro 2020, M1
Metal: 2.4
Xcode: 13.2.1
Here is my test compute kernel, which reads the input buffer with simdgroup_load and writes the output buffer with simdgroup_store:
kernel void fun(
    const device half * Src [[ buffer(0) ]],
    constant uint4 & SrcShape [[ buffer(1) ]],
    device half * Dst [[ buffer(2) ]],
    constant uint4 & DstShape [[ buffer(3) ]],
    const device half * Weight [[ buffer(4) ]],
    ushort3 threadgroup_position_in_grid [[ threadgroup_position_in_grid ]],
    ushort3 thread_position_in_threadgroup [[ thread_position_in_threadgroup ]],
    ushort3 threads_per_threadgroup [[ threads_per_threadgroup ]],
    ushort3 thread_position_in_grid [[ thread_position_in_grid ]])
{
    const int SrcSlices = (int)SrcShape[0];
    const int SrcHeight = (int)SrcShape[1];
    const int SrcWidth = (int)SrcShape[2];
    const int DstSlices = (int)DstShape[0];
    const int DstHeight = (int)DstShape[1];
    const int DstWidth = (int)DstShape[2];
    const int Kernel_X = 3;
    const int KernelElemNum = 3 * 3;
    const int N_Pack = 8;
    // test only 1 thread
    if(thread_position_in_grid.z != 0 || thread_position_in_grid.y != 0 || thread_position_in_grid.x * N_Pack != 0) return;
    simdgroup_half8x8 sgMatY;
    simdgroup_load(sgMatY, Src);
    simdgroup_store(sgMatY, Dst);
}
It's a simple shader; however, the output buffer only contains the first 2 values from the input buffer, and the other 62 values are all zero.
Here is the result from the Xcode Metal capture:
How can I debug or fix it?
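A possible cause, offered as an assumption and not a confirmed diagnosis: simdgroup matrix functions are executed cooperatively by all threads of a SIMD-group, with each thread holding only part of the 8x8 matrix. If the early return leaves a single active thread, only that thread's share would be stored. With a 32-wide SIMD-group that share is 2 of the 64 elements, which would match the 2 values observed. The variant below is a sketch with a hypothetical kernel name that drops the early return so every thread of the SIMD-group reaches the load and the store.

```metal
#include <metal_stdlib>
using namespace metal;

// Assumed dispatch: exactly one threadgroup containing one full SIMD-group
// (32 threads on Apple GPUs), so that a single 8x8 tile is copied.
kernel void copy_tile_8x8(
    const device half *Src [[ buffer(0) ]],
    device half       *Dst [[ buffer(1) ]])
{
    // No per-thread early return: all threads in the SIMD-group execute
    // the load and the store together.
    simdgroup_half8x8 sgMatY;
    simdgroup_load(sgMatY, Src);   // reads 64 contiguous half values as an 8x8 tile
    simdgroup_store(sgMatY, Dst);  // writes the same 64 values back
}
```

If work has to be limited to part of the grid, the condition would need to be uniform across each SIMD-group, for example a test on the threadgroup position and not on the individual thread position.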
I have a Metal compute kernel for dense matrix multiplication, and I'd like to optimize it with simdgroup_float8x8 and simdgroup_half8x8.
However, it seems that hardly anyone uses them in Metal.
Can you give me some more examples of how to use them, beyond those in the Metal Shading Language Specification Version 2.4?
Thanks!
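For reference, here is a minimal sketch of a tiled multiply built on simdgroup_float8x8. It is an untuned illustration under assumptions, not a reference implementation: the kernel name and buffer indices are hypothetical, A, B and C are assumed row-major, M, N and K are assumed to be multiples of 8, and each threadgroup is assumed to contain exactly one SIMD-group that computes one 8x8 tile of C.

```metal
#include <metal_stdlib>
using namespace metal;

// C[M x N] = A[M x K] * B[K x N], all row-major; dims = (M, N, K).
// Assumed dispatch: (N / 8, M / 8) threadgroups of one SIMD-group each.
kernel void matmul_simdgroup_8x8(
    const device float *A    [[buffer(0)]],
    const device float *B    [[buffer(1)]],
    device float       *C    [[buffer(2)]],
    constant uint3     &dims [[buffer(3)]],
    uint2 tile [[threadgroup_position_in_grid]])
{
    const uint N = dims.y;
    const uint K = dims.z;

    simdgroup_float8x8 a;
    simdgroup_float8x8 b;
    // Accumulator tile, initialised to zero.
    simdgroup_float8x8 acc = make_filled_simdgroup_matrix<float, 8, 8>(0.0f);

    // Walk along K in steps of 8, multiplying one 8x8 tile of A by one 8x8 tile of B.
    for (uint k = 0; k < K; k += 8) {
        // The third argument is the row stride, in elements, of the source matrix.
        simdgroup_load(a, A + (tile.y * 8) * K + k, K);
        simdgroup_load(b, B + k * N + tile.x * 8, N);
        // acc = a * b + acc
        simdgroup_multiply_accumulate(acc, a, b, acc);
    }

    // Write the finished 8x8 tile of C.
    simdgroup_store(acc, C + (tile.y * 8) * N + tile.x * 8, N);
}
```

The same structure should carry over to simdgroup_half8x8 by switching the element type. A faster kernel would likely have each SIMD-group compute several tiles and stage A and B through threadgroup memory, which this sketch leaves out.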
I'm developing AI apps with Metal compute functions.
As shown in the following picture captured from the Metal Debugger, the preview picture is wrong while the picture in the Attachment View is correct.
The final result matches the preview picture, which is not what I expect.
This error does not appear on an x86_64 MacBook, but it does appear on an M1 MacBook.
I think there may be an issue (such as concurrent reads) of the kind mentioned in Bring your Metal app to Apple silicon Macs.
Can you give me some ideas on how to debug it?
Thanks!