Post

Replies

Boosts

Views

Activity

How to optimize ReductionSum with Metal?
I would like to write a ReductionSum Metal shader like this one: https://github.com/alibaba/MNN/blob/master/source/backend/metal/MetalReduction.metal#L32 Sometimes the reduced dimension is large while the other dimensions are small, so only a few threads can be launched and the kernel runs inefficiently. Is there any way to optimize it?
1
0
738
Jun ’22
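One common way to approach the ReductionSum question above is to give each output element a whole threadgroup instead of a single thread, so that a large reduced dimension is split across many threads. The sketch below is only an illustration under stated assumptions, not the MNN implementation: the kernel name `reduce_sum_rows`, the `[outer][axis]` row-major layout, the `float` element type, and the buffer indices are all hypothetical. It also assumes the threadgroup size is a multiple of the SIMD width and that the number of SIMD-groups per threadgroup is at most the SIMD width and at most 32 (for example, 256 threads with a SIMD width of 32).

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical layout: src is [outer][axis] row-major, dst is [outer].
// Assumed dispatch: one threadgroup per output element, i.e.
//   threadgroups          = (outer, 1, 1)
//   threadsPerThreadgroup = (tgSize, 1, 1), tgSize a multiple of the SIMD width.
kernel void reduce_sum_rows(const device float *src  [[buffer(0)]],
                            device float       *dst  [[buffer(1)]],
                            constant uint      &axis [[buffer(2)]],
                            uint row     [[threadgroup_position_in_grid]],
                            uint tid     [[thread_index_in_threadgroup]],
                            uint tgSize  [[threads_per_threadgroup]],
                            uint lane    [[thread_index_in_simdgroup]],
                            uint sgIndex [[simdgroup_index_in_threadgroup]],
                            uint sgCount [[simdgroups_per_threadgroup]])
{
    // One partial sum per SIMD-group (assumes sgCount <= 32).
    threadgroup float partial[32];

    // Step 1: each thread accumulates a strided slice of the reduced axis.
    float acc = 0.0f;
    for (uint i = tid; i < axis; i += tgSize) {
        acc += src[row * axis + i];
    }

    // Step 2: sum across the lanes of this SIMD-group.
    acc = simd_sum(acc);
    if (lane == 0) {
        partial[sgIndex] = acc;
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Step 3: the first SIMD-group combines the per-SIMD-group partial sums
    // (assumes sgCount <= SIMD width, so one lane per partial sum suffices).
    if (sgIndex == 0) {
        float v = (lane < sgCount) ? partial[lane] : 0.0f;
        v = simd_sum(v);
        if (lane == 0) {
            dst[row] = v;
        }
    }
}
```

With this layout the amount of parallelism is governed by the threadgroup size rather than by the number of output elements, which is the case the question describes. Whether it is actually faster depends on the shapes involved, so it is worth profiling against the one-thread-per-output version.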
[Metal] Is it right to set different versions for the Deployment Target and the Metal version?
What I want to do is shown in the following two pictures.
My needs are:
(1) build a framework for macOS apps;
(2) compile a .metallib to speed up shader initialization;
(3) the framework can be used on old macOS versions, which only use Metal 1 features;
(4) the framework uses new Metal 2 features on new macOS versions and on M1 Apple silicon.
What I know:
(1) When the Deployment Target is set, the MSL version adjusts to the corresponding version. This design can't meet my needs, so I made the settings above.
(2) Apple has posted a document, "Detecting GPU Features and Metal Software Versions", on this issue, but it does not fully resolve my concerns.
My concerns are:
(1) Can a .metallib compiled with a newer MSL version be loaded by an older macOS version?
(2) Will the App Store reject a macOS app that uses this kind of setting?
Help: Can you tell me whether these settings are right, or suggest a better solution? Thanks!
0
0
523
Apr ’22
[Metal] What's wrong with simdgroup_load or simdgroup_store?
OS: macOS 12.2.1
Hardware: MacBook Pro 2020, M1
Metal: 2.4
Xcode: 13.2.1
Here is my test compute kernel, which reads the input buffer with simdgroup_load and writes the output buffer with simdgroup_store:
    kernel void fun(
        const device half * Src [[ buffer(0) ]],
        constant uint4 & SrcShape [[ buffer(1) ]],
        device half * Dst [[ buffer(2) ]],
        constant uint4 & DstShape [[ buffer(3) ]],
        const device half * Weight [[ buffer(4) ]],
        ushort3 threadgroup_position_in_grid [[ threadgroup_position_in_grid ]],
        ushort3 thread_position_in_threadgroup [[ thread_position_in_threadgroup ]],
        ushort3 threads_per_threadgroup [[ threads_per_threadgroup ]],
        ushort3 thread_position_in_grid [[ thread_position_in_grid ]])
    {
        const int SrcSlices = (int)SrcShape[0];
        const int SrcHeight = (int)SrcShape[1];
        const int SrcWidth = (int)SrcShape[2];
        const int DstSlices = (int)DstShape[0];
        const int DstHeight = (int)DstShape[1];
        const int DstWidth = (int)DstShape[2];
        const int Kernel_X = 3;
        const int KernelElemNum = 3 * 3;
        const int N_Pack = 8;
        // test only 1 thread
        if(thread_position_in_grid.z != 0 || thread_position_in_grid.y != 0 || thread_position_in_grid.x * N_Pack != 0)
            return;
        simdgroup_half8x8 sgMatY;
        simdgroup_load(sgMatY, Src);
        simdgroup_store(sgMatY, Dst);
    }
It's a simple shader. However, the output buffer only contains the first 2 values from the input buffer; the other 62 values are all zero. Here is the result from the Xcode Metal capture. How can I debug or fix it?
3
0
2.0k
Mar ’22
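A likely explanation for the simdgroup_load / simdgroup_store question above is that SIMD-group matrix functions are executed cooperatively by all threads of a SIMD-group, with each thread holding only part of the matrix. On a 32-wide SIMD-group an 8x8 matrix works out to 2 elements per thread, which would match the 2 values observed when every thread but one returns early. The sketch below is a minimal variant under that assumption; the kernel name `copy_8x8` is hypothetical, and it assumes the host dispatches whole SIMD-groups (for example, 32 threads per threadgroup on Apple GPUs).

```metal
#include <metal_stdlib>
#include <metal_simdgroup_matrix>
using namespace metal;

// Hypothetical test kernel: copies one 8x8 half matrix (64 values) from Src to Dst.
// Assumed dispatch: threadsPerThreadgroup = (32, 1, 1), so each threadgroup
// holds exactly one full SIMD-group on Apple GPUs.
kernel void copy_8x8(const device half *Src [[buffer(0)]],
                     device half       *Dst [[buffer(2)]],
                     uint3 tgPos [[threadgroup_position_in_grid]])
{
    // Restrict the test to the first threadgroup. This condition is uniform
    // across the SIMD-group, so either all of its threads continue or none do.
    if (tgPos.x != 0 || tgPos.y != 0 || tgPos.z != 0) {
        return;
    }

    // Every thread of the SIMD-group reaches these calls together.
    simdgroup_half8x8 sgMatY;
    simdgroup_load(sgMatY, Src);   // elements_per_row defaults to the column count, 8
    simdgroup_store(sgMatY, Dst);
}
```

The difference from the original kernel is only the early-out condition: it is based on the threadgroup position rather than the thread position, so no thread inside the participating SIMD-group is dropped before the load and store.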
What's wrong with this Metal compute function on Apple silicon?
I'm developing AI apps with Metal compute functions. As shown in the following picture captured from the Metal debugger, the preview picture is wrong while the picture in the Attachment View is correct. The final result is the same as the preview picture, which is not what I expect. This error does not appear on an x86_64 MacBook, but it does appear on an M1 MacBook. I think it may be one of the issues (such as concurrent reads) mentioned in "Bring your Metal app to Apple silicon Macs". Can you give me some ideas on how to debug it? Thanks!
1
0
866
Jan ’22