Very nice! Please consider adding this to the "New in Metal 3.2" section for easier discoverability. Are there any best practices for using this feature with respect to performance? Also, Metal still does not offer any formal forward-progress guarantees across threadgroups, right? So we can't use these new features to implement something like decoupled look-back?
Thank you! Based on these examples, it seems that access to shared memory can be non-uniform even within a SIMD group, which is why an extra synchronization point is required. I wish the official documentation were clearer about this. It would also be great to have an official comment from the Apple GPU team. P.S. The second video is odd... why are they using a SIMD width of 64 if Apple Silicon uses 32-wide SIMDs?
Great catch! Now we have three types of display links to choose from :) I'd love to add this to my list but it doesn't seem I can edit it...
@philipturner the first point might not be as straightforward as one thinks. Physical memory is shared, but the CPU and the GPU have different virtual memory tables, and it's not clear that they can be shared with ease... there might be subtle hardware differences that make this hard or even impossible on current hardware.
But the host code only starts a single thread. So the index is always 0.
@LeonardoDev you can use mesh shaders to generate the geometry, but a better choice is probably to use SDFs. The thing is, the hardware does not have a built-in capability to rasterise 3D lines of arbitrary thickness. It can do triangles or basic lines (which are probably rasterised as degenerate triangles anyway). If you want special functionality, you have to implement it yourself. APIs like OpenGL just hide the complexity from you, often resulting in poor performance.
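To illustrate the SDF route: a 3D line of arbitrary thickness is just a capsule, and its signed distance function is a few lines of math that you can raymarch in a fragment or compute shader. Here's a sketch in plain C++ so the math is easy to check (the `Vec3` helpers are mine; in MSL you'd use `float3` and `metal::clamp` directly):

```cpp
#include <cmath>
#include <algorithm>

struct Vec3 { float x, y, z; };

static Vec3 sub(Vec3 u, Vec3 v) { return {u.x - v.x, u.y - v.y, u.z - v.z}; }
static float dot(Vec3 u, Vec3 v) { return u.x * v.x + u.y * v.y + u.z * v.z; }
static float length(Vec3 v) { return std::sqrt(dot(v, v)); }

// Signed distance from point p to the capsule around segment a->b with
// radius r. Negative inside, positive outside; raymarching this SDF
// renders a thick 3D line segment with round caps.
float sdCapsule(Vec3 p, Vec3 a, Vec3 b, float r) {
    Vec3 pa = sub(p, a), ba = sub(b, a);
    // Project p onto the segment, clamping to the endpoints.
    float h = std::clamp(dot(pa, ba) / dot(ba, ba), 0.0f, 1.0f);
    Vec3 closest = {a.x + ba.x * h, a.y + ba.y * h, a.z + ba.z * h};
    return length(sub(p, closest)) - r;
}
```

For a polyline you'd take the minimum over the segments (or use a spatial structure if there are many), which is exactly the kind of thing the rasteriser can't do for you.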
Thank you! Are there plans to add such an API? In my testing, the GPU remains in the low-power state even when the delay between submitting short-running kernels is as short as 1 ms. So one can be submitting GPU work hundreds of times per second without ever reaching maximum performance. This makes sense for most applications, but it can be a problem for software that cannot continuously submit GPU work but still wants things done as quickly as possible.
Hi, thanks, that looks almost like what I need :) Only in our case we would need to load an imageblock slice from a texture, not store it. Yet the documentation does not mention a slice read function… I can imagine loading multiple pixels in each thread (but that would complicate the code a lot), or dispatching a larger thread grid and wasting most of the threads. Neither is a good solution… regardless, thanks for the effort!
Hi Eugene, thanks for your answer. I think I understand the basics, but what evades me is how to copy the texture to threadgroup memory efficiently. For example, let's imagine the following scenario: my compute shader works on 16x16 pixels at a time and needs access to a certain neighbourhood, so I want to load a 32x32 block from the texture. I can then imagine the following pseudocode:
threadgroup pixel data[32 * 32];
if (thread_index_in_threadgroup == 0) {
    // hypothetical block transfer -- this is the part I can't find an API for
    copy_texture_region(texture, data, <base offset>, 32, 32);
}
threadgroup_barrier(mem_flags::mem_threadgroup);
output.write(compute_value_for_pixel(thread_position_in_grid, data));
but I have no idea how to implement the copy_texture_region() part efficiently. The talk implies that there is an efficient block-transfer function for this, but I can't find it in the documentation. A naive way would be to load a bunch of pixels in each thread, but that makes everything ridiculously complicated...
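For reference, by "load a bunch of pixels in each thread" I mean a strided cooperative load: with a 16x16 threadgroup and a 32x32 tile, each thread loads 4 pixels. Here's the index arithmetic sketched in plain C++ so it can be checked on the CPU (the names and sizes are just my scenario above; in MSL the innermost statement would be a texture.read() into threadgroup memory):

```cpp
#include <array>

constexpr int kGroupW = 16, kGroupH = 16;   // threads per threadgroup
constexpr int kTileW = 32, kTileH = 32;     // pixels in threadgroup memory

// Simulates what all 256 threads of one threadgroup do together: each
// thread (tx, ty) loads the tile pixels whose coordinates are congruent
// to (tx, ty) modulo the threadgroup size. Returns a per-pixel hit count,
// which should be exactly 1 everywhere (each pixel loaded once).
std::array<int, kTileW * kTileH> coverTile() {
    std::array<int, kTileW * kTileH> hits{};
    for (int ty = 0; ty < kGroupH; ++ty) {
        for (int tx = 0; tx < kGroupW; ++tx) {
            // The inner double loop is what a single thread executes.
            for (int y = ty; y < kTileH; y += kGroupH) {
                for (int x = tx; x < kTileW; x += kGroupW) {
                    // In the shader this would be:
                    // data[y * kTileW + x] = texture.read(base + uint2(x, y));
                    hits[y * kTileW + x] += 1;
                }
            }
        }
    }
    return hits;
}
```

The index math itself is trivial; my complaint is that handling texture-edge clamping and the 2x2 loop per thread clutters what feels like it should be a one-line block transfer.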
P.S. Sorry for terrible formatting, the forum collapses my comment in a very awkward way...