Post

Replies

Boosts

Views

Activity

issues with heavy register usage
I'm running a kernel with heavy register usage. I'd appreciate explanation for any of three issues. Why do these issues exist and what might I do to mitigate? Maybe relevant: I'm reading and writing back to a host-shared buffer on an M2 chip. Only once have I gotten error reporting that there's not enough register space. Otherwise the kernel fails by returning the wrong answer (all zeros), reporting nothing. Why? Why can it not report properly? Below when I say the thread fails, I mean it returns all zeros. I run the kernel with and without a threadgroup_barrier(mem_flags:mem_none), and this difference alone determines whether the kernel succeeds or fails. How can this be? What resource is required for a threadgroup_barrier even just for execution synchronization (rather than memory synchronization) that can crash the kernel? Is it not just a matter of waiting with no resource consumption? I run the kernel in a certain configuration and it succeeds. I run it similarly but with half as many registers and it fails. These issues don't make sense to me, much less I know how to fix them, so I must be misunderstanding some fundamental principles of execution.
1
0
448
Mar ’23
Are threadgroups executed by cores or execution units on Apple GPUs?
In CUDA terminology, a threadgroup is executed by a "streaming multiprocessor." In Metal terminology, is a threadgroup executed by a "core" or an "execution unit" (within a core)? I can find no resources to answer online, but resources imply differently. Regardless the answer, why do Apple GPU's have this two-layer architecture of cores and execution units, whereas Nvidia has the single layer of streaming multiprocessors? Are both layers visible/accessible to the Metal programmer, or only one layer (whichever corresponds to threadgroups)? What's the purpose of the other layer?
3
0
818
Feb ’23
What's the benefit of metal function pointers?
What's the benefit of metal function pointers (focusing on visible functions rather than intersection functions), when the same functionality can be achieved without these newer features? For the same functionality, in place of every index of the function table, have an external function. In place of every indexing into the tableT[i], have a switch statement that matches i to the corresponding external function. For the case of visible functions statically linked, instead of using the functions array of linkedFunctions, use the private functions array. For the case of visible functions dynamically linked, instead of using the binaryFunctions array of linkedFunctions, use functions from a dynamic library.
0
1
593
Dec ’22
metal binary linked functions causing segmentation fault
I'm trying to follow WWDC2020/21 in making simplest use of Metal's binary linked functions. I attach a visible function with CompileToBinary to the binary linked functions of a compute pipeline descriptor. I'm getting a segmentation fault (I think), which occurs as the program suspending with Thread 1: EXC_BAD_ACCESS (code=1, address=0x10) where the device creates the pipeline (the program not reaching the NSLog). I can't figure out how to address this with memory debugging tools. Some points: Checking all relevant objects right before making the call ([device newComputePipeline...]) yields expected results. I get the same error when making the call asynchronously. Adding visible_func to linked functions instead of binaryFunctions works. I've also been following the FunctionPointersAndStitching example code (for rendering rather than compute) to make sure I'm setting needed properties and creating things in the right order. I can't even begin to use these linked functions (creating a visible function table), until I create the pipeline, so stopping short after creating the pipeline doesn't seem a candidate issue. #import <Foundation/Foundation.h> #import <Metal/Metal.h> int main() { @autoreleasepool { id<MTLDevice> device = MTLCreateSystemDefaultDevice(); id<MTLLibrary> lib = [device newDefaultLibrary]; id<MTLFunction> kernel_func = [lib newFunctionWithName:@"kernel_func"]; MTLFunctionDescriptor * visible_func_desc = [MTLFunctionDescriptor functionDescriptor]; visible_func_desc.name = @"visible_func"; visible_func_desc.options = MTLFunctionOptionCompileToBinary; NSError * error = nil; id<MTLFunction> visible_func = [lib newFunctionWithDescriptor:visible_func_desc error: &error]; MTLLinkedFunctions *linked_funcs = [MTLLinkedFunctions linkedFunctions]; linked_funcs.binaryFunctions = @[visible_func]; MTLComputePipelineDescriptor * pipeline_desc = [MTLComputePipelineDescriptor new]; pipeline_desc.linkedFunctions = linked_funcs; pipeline_desc.computeFunction = kernel_func; MTLAutoreleasedComputePipelineReflection * reflection = nil; id<MTLComputePipelineState> pipeline = [device newComputePipelineStateWithDescriptor:pipeline_desc options:MTLPipelineOptionNone reflection:reflection error:&error]; if(pipeline == nil) { NSLog(@"Error making pipeline: %@",error); } } } #include <metal_stdlib> using namespace metal; [[visible]] int visible_func(int x, int y) { return x + y; } [[kernel]] void kernel_func(uint gid [[ thread_position_in_grid ]]) { gid; }
0
0
577
Dec ’22
what to do when counter sample buffers unavailable?
Counters seem an important tool, eg I don't know how else to use timestamps. But newCounterSampleBufferWithDescriptor:error: discussion says The method may produce an error if the GPU driver has exhausted its underlying resources for counter sample buffers. And indeed, when I make the call I get error code 14 with localized description "out of memory." How to resolve this? Might the counter buffers be exhausted by other applications or what? Is there no sharing of these resources like other GPU resources? I'd be surprised if tools as important as timestamps/counters can simply be unavailable at any time with no explanation. If this is indeed the sad state of affairs, are there any tools like in Xcode I can use for timestamping? If relevant, I have: Intel UHD Graphics 617 1536 MB
4
0
812
Dec ’22
how to read metal feature set table?
The Metal feature set tables list features for each GPU family. I assumed that if a feature is present in a family, then it must be supported for the GPUs in the family, otherwise what's the point of specifying families? But I see exceptions to this assumption, so I'm confused how to make sense of this table. For example, I assumed that features listed for the Metal 3 family must be supported on all GPUs in the Metal 3 family. The Metal 3 family includes A13, which is the Apple 6 family. But I see features available for Metal 3 not available for Apple 6. (Could it be that a feature available for A13 is not available for Apple 6?) For example, see "SIMD-scoped reduction operations."
2
0
1.1k
Dec ’22