Xcode 14's release notes say that DocC now supports Objective-C and C APIs. To clarify, does this mean that C++ and Objective-C++ APIs are not yet supported? I was attempting to convert from Doxygen, but the documentation build fails with the error Command ExtractAPI failed with a nonzero exit code, citing the first line of the header: #include <string>.
Hello! I am starting to dig into the docs on object and mesh shaders. I see that the Metal API on the CPU side has new functions for setting object and mesh buffers in the new programmable stages, but I don't see corresponding changes to the MTLIndirectCommandBuffer API. Will we be able to use the GPU to encode draw commands using a pipeline that leverages the new shader types?
Thanks,
Hello -
I am in the early phase of developing an algorithm and was hopeful someone could help me understand how threadgroup memory persists before I go too far down the wrong path.
For simplicity, let's say I am working with 32 KB of threadgroup memory, and I have two kernels K1 and K2.
In the first pass, each threadgroup in K1 loads 8192 32-bit values into threadgroup memory (using all 32 KB).
In the next pass, K2 accesses the threadgroup memory from K1 and performs some operation on the data.
Since threadgroup memory usually persists only during the lifetime of the threadgroup, what can K2 access from K1's threadgroup memory in this mid-render kernel example?
For example, say we have:
kernel void K1(threadgroup uint * mem_k1 [[ threadgroup(0) ]] );
kernel void K2(threadgroup uint * mem_k2 [[ threadgroup(0) ]] );
Say we launch both kernels with 10 threadgroups each. Can K2 access every block of threadgroup memory initialized by K1, or does [[ threadgroup(0) ]] refer to only a single 32 KB block?
If we launch K1 and K2 with a different number of threadgroups per grid, does that change anything?
Or is [[threadgroup(0)]] completely dependent on what the host code allocates via the Metal API?
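For reference, the host-side setup I have in mind for the mid-render dispatches is roughly the following sketch (the encoder, pipeline states, and tile sizes are placeholders, and I may well be holding this wrong):

// Bind 32 KB of threadgroup memory at [[ threadgroup(0) ]], then dispatch each kernel mid-render.
[renderEncoder setRenderPipelineState:K1_tilePipeline];
[renderEncoder setThreadgroupMemoryLength:32768 offset:0 atIndex:0];
[renderEncoder dispatchThreadsPerTile:MTLSizeMake(16, 16, 1)];

[renderEncoder setRenderPipelineState:K2_tilePipeline];
[renderEncoder setThreadgroupMemoryLength:32768 offset:0 atIndex:0];
[renderEncoder dispatchThreadsPerTile:MTLSizeMake(16, 16, 1)];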
Thank you in advance.
Hello -
I was hopeful someone could help me understand the syntax of the address space qualifiers that I have seen in the Metal standard library. I have seen the qualifiers used on constructors and on member functions, such as the following:
struct A {
    A() thread { }
    A() device { }
    void do_it(device int& param) thread { } // 1
    void do_it(thread int& param) thread { } // 2
    void do_it(device int& param) device { } // 3
    void do_it(thread int& param) device { } // 4
};
What exactly do the address space qualifiers mean in this context? Are these overloads that depend on where an instance of A resides in memory? For example:
kernel void process_data(device A& d_var, device uint& d_out) {
    thread A p_var { };      // OK
    threadgroup A t_var { }; // ERROR?
    thread int p_out = 0;
    threadgroup int t_out = 0;
    p_var.do_it(d_out); // calls 1?
    p_var.do_it(p_out); // calls 2?
    p_var.do_it(t_out); // Error?
    d_var.do_it(d_out); // calls 3?
    d_var.do_it(p_out); // calls 4?
    d_var.do_it(t_out); // Error?
}
Or does this mean something else entirely? Would it be valid to add an address space qualifier to a global function? For example:
void do_it() device; // declared in program scope
If so, what does it mean for a function to be in (e.g.) device memory? Thank you for helping me understand this better.
Is there any documentation aside from the Metal Shading Language Specification that explains the SIMD-group functions such as simd_or, simd_and, simd_xor, etc.?
CUDA has functions comparable to the vote/ballot SIMD functions, but I am not aware of anything similar to the ones above. The MSL specification is vague, stating only that (for example) simd_or returns the bitwise OR across all active threads.
Does that mean the T data argument is combined bitwise across the value in each SIMD lane? What exactly does that return? Do you get the result from only the highest lane, only the lowest lane, or some sum of the bitwise operation?
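My current reading is that it is a horizontal reduction broadcast back to every lane, along the lines of this sketch (my interpretation only; the kernel and names are made up for illustration):

#include <metal_stdlib>
using namespace metal;

// Sketch of how I read simd_or: each active lane contributes its value,
// and every lane receives the bitwise OR of all contributions.
kernel void simd_or_example(device uint * out [[ buffer(0) ]],
                            uint lane [[ thread_index_in_simdgroup ]],
                            uint tid [[ thread_position_in_grid ]])
{
    uint contribution = 1u << (lane % 32); // each lane sets a different bit
    uint reduced = simd_or(contribution);  // would every lane see the same OR-reduced value?
    out[tid] = reduced;
}

Is that the right mental model?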
Hello -
I am seeking clarification on the description of the additional capabilities for argument buffers on Tier 2 hardware described in Section 2.12.1 of MSL Specification v. 2.3.
The specification states: "You can access argument buffers through pointer indexing. This syntax shown below refers to an array of consecutive, independently encoded argument buffers:" and then provides an example.
In a kernel function, I am able to replicate this behavior on an iPad Pro that supports only Tier 1 argument buffers. The argument buffer is simple, along the lines of the following:
typedef struct MyAB
{
    constant float * buffer [[ id(0) ]];
    texture2d<float> texture [[ id(1) ]];
} MyAB;

kernel void process(const device MyAB * arguments [[ buffer(0) ]]) // kernel name is a placeholder
{
    MyAB argument1 = arguments[0];
    MyAB argument2 = arguments[1];
    /* etc */
}
Here arguments is a Metal buffer whose size equals the argument encoder's encodedLength multiplied by the number of elements to be accessed via pointer indexing (i.e., in the pseudo case above, the size is the encoded length times two).
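Concretely, the host-side encoding is along these lines (a sketch; kernelFunction and count are placeholders):

id<MTLArgumentEncoder> encoder = [kernelFunction newArgumentEncoderWithBufferIndex:0];
NSUInteger count = 2;
id<MTLBuffer> arguments = [device newBufferWithLength:encoder.encodedLength * count
                                              options:MTLResourceStorageModeShared];
for (NSUInteger i = 0; i < count; i++)
{
    // Encode each element at a stride of encodedLength.
    [encoder setArgumentBuffer:arguments offset:encoder.encodedLength * i];
    // ... setBuffer:offset:atIndex: and setTexture:atIndex: for this element ...
}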
The MSL specification goes on to say that Tier 2 argument buffers can contain pointers to other argument buffers. In experimenting, I was able to replicate this behavior on Tier 1 hardware as well. The method newArgumentEncoderForBufferAtIndex: always returns nil, but if I instead create the argument encoder from an array of MTLArgumentDescriptor objects, I can encode that buffer in the same manner as described above and nest it inside another argument buffer.
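The workaround looks roughly like this (a sketch; the descriptors mirror the MyAB layout above):

// Describe the nested argument buffer's members by hand.
MTLArgumentDescriptor * bufArg = [MTLArgumentDescriptor argumentDescriptor];
bufArg.dataType = MTLDataTypePointer;
bufArg.index = 0;

MTLArgumentDescriptor * texArg = [MTLArgumentDescriptor argumentDescriptor];
texArg.dataType = MTLDataTypeTexture;
texArg.textureType = MTLTextureType2D;
texArg.index = 1;

// Built from descriptors because newArgumentEncoderForBufferAtIndex: returns nil here.
id<MTLArgumentEncoder> nestedEncoder = [device newArgumentEncoderWithArguments:@[ bufArg, texArg ]];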
I expected Metal to throw an error based on the MSL section cited above. Is this undefined behavior that just happens not to crash on this device, or am I misunderstanding what the specification is saying?
Of note, Metal does throw an error if I try to do this in a vertex or fragment function.
Hello -
I was hopeful someone could help me with a few ideas for troubleshooting a test program I am writing to try out encoding compute dispatch commands on the GPU using an indirect command buffer. My simple setup is described below; please advise if more information would be helpful. The problem I find myself with is that no commands are being encoded into the ICB.
Kernel: A simple function that accepts three parameters:
A buffer that holds the ICB
A buffer that holds the compute pipeline state for another kernel function
A camera uniforms buffer that consists of a view and projection matrix
The thread_position_in_grid is used to index into the ICB within the kernel. The body of the kernel is essentially:
compute_command cmd(bufferA->icb, thread_id);
cmd.set_compute_pipeline_state(bufferB->pipeline);
cmd.set_kernel_buffer(uniforms, 1);
cmd.concurrent_dispatch_threads(threads, threadsPerThreadGroup);
I am not getting any errors from the Metal API. The second kernel function is just a dummy for testing purposes; it takes the uniforms buffer and does nothing with it. This seems straightforward enough, but still, when I inspect the buffers, the GPU debugger says no commands have been encoded. Also, if I attempt to execute the commands in the ICB from the render loop and take a frame capture, I get a bad access error.
I did confirm that I am calling useResource:options: on the ICB in the render loop.
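For completeness, the ICB is created along these lines (a sketch; the exact counts are placeholders):

MTLIndirectCommandBufferDescriptor * desc = [MTLIndirectCommandBufferDescriptor new];
desc.commandTypes = MTLIndirectCommandTypeConcurrentDispatchThreads;
desc.inheritPipelineState = NO; // the pipeline state is set from the GPU
desc.maxKernelBufferBindCount = 2;

id<MTLIndirectCommandBuffer> icb = [device newIndirectCommandBufferWithDescriptor:desc
                                                                   maxCommandCount:maxCommands
                                                                           options:0];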
Thoughts on how I should troubleshoot?
Thanks,
Hello - I have a few follow-up questions from the WWDC presentation on the new binary archive feature for pipeline descriptors. I am experimenting with this snippet of code:
id<MTLBinaryArchive> binary = [device newBinaryArchiveWithDescriptor:descriptor error:&error];
if (error) @throw error;
id<MTLFunction> vtx = [library newFunctionWithName:@"vertexShader"];
id<MTLFunction> frg = [library newFunctionWithName:@"fragmentShader"];
MTLRenderPipelineDescriptor * psd = [MTLRenderPipelineDescriptor new];
psd.vertexDescriptor = [MTLVertexDescriptor newDefaultMTLVertexDescriptor];
psd.vertexFunction = vtx;
psd.fragmentFunction = frg;
psd.binaryArchives = @[binary];
id<MTLRenderPipelineState> pso = [device newRenderPipelineStateWithDescriptor:psd error:&error];
An exception breakpoint hits on the last line, the newRenderPipelineStateWithDescriptor:error: call:
[MTLDebugBinaryArchive findProgramObjectForFunction:]: unrecognized selector sent to instance
If I comment out the psd.binaryArchives = @[binary] line, the pipeline compiles as expected. Any ideas why an exception is being thrown when I set the binary archives? The binary serializes to disk and deserializes back into the test app without any errors.
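For reference, the archive was populated and written out along these lines in the earlier run (a sketch; archiveURL is a placeholder):

// Add the pipeline's functions to the archive, then write it to disk.
[binary addRenderPipelineFunctionsWithDescriptor:psd error:&error];
[binary serializeToURL:archiveURL error:&error];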
My second question relates to the statement in the presentation that a binary archive can only be used by the same device that created it. Does this mean we have to create an archive for A9, A10, A11, etc.? And further, does that mean we need a device with every kind of chip in order to ship an app that can leverage this feature across a variety of devices?
Thanks in advance.