Comment on Metal 3.2 device memory coherency
Very nice! Please consider adding this to the "New in Metal 3.2" section for easier discoverability. Are there any best practices for using this feature with respect to performance? Also, Metal still does not offer any formal forward-progress guarantees across threadgroups, right? So we can't use these new features to implement something like decoupled look-back?
Jun ’24
Comment on When is a `simdgroup_barrier()` required?
Thank you! Based on these examples, it seems that access to shared memory can be non-uniform even within a SIMD-group, which is why an extra synchronization point is required. I wish the official documentation were clearer about this. It would also be great to have an official comment from the Apple GPU team. P.S. The second video is odd... why are they using a SIMD width of 64 if Apple Silicon uses 32-wide SIMDs?
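For concreteness, here is a minimal sketch of the pattern in question (my own construction, not taken from the videos, and assuming a 32-wide SIMD): lanes of a single SIMD-group exchanging values through threadgroup memory.

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical example: each SIMD lane publishes a value to threadgroup
// memory, then reads its neighbour's value. Without the simdgroup_barrier()
// the load below may observe stale data, because even within one SIMD-group
// the stores are not guaranteed to be visible without a synchronization point.
kernel void neighbour_sum(device float *out [[buffer(0)]],
                          uint lane [[thread_index_in_simdgroup]],
                          uint gid  [[thread_position_in_grid]])
{
    threadgroup float scratch[32];   // one slot per lane, 32-wide SIMD assumed

    scratch[lane] = float(gid);      // each lane publishes its value

    // The extra synchronization point: makes the stores above visible
    // to the loads below, even among threads of the same SIMD-group.
    simdgroup_barrier(mem_flags::mem_threadgroup);

    out[gid] = scratch[lane] + scratch[(lane + 1) % 32];
}
```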
Mar ’24
Comment on Metal 3D Line Width
@LeonardoDev you can use mesh shaders to generate the geometry, but a better choice is probably to use SDFs. The thing is, the hardware does not have a built-in capability to rasterise 3D lines of arbitrary width. It can do triangles or basic lines (which are probably rasterised as degenerate triangles anyway). If you want special functionality, you have to implement it yourself. APIs like OpenGL just hide this complexity from you, resulting in poor performance.
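To make the SDF suggestion concrete, here is a hedged sketch (the vertex-output layout and names are my own assumptions, not an official recipe): expand each segment into a screen-space quad, then have the fragment shader compute the distance to the segment and fade fragments beyond the desired half-width.

```metal
#include <metal_stdlib>
using namespace metal;

// Assumed vertex output: the CPU (or a mesh shader) expands each line
// segment into a quad and passes the projected endpoints in pixels.
struct LineFragmentIn {
    float4 position [[position]];
    float2 a;            // segment start, screen space (pixels)
    float2 b;            // segment end, screen space (pixels)
    float  half_width;   // desired half line width in pixels
};

// Distance from point p to segment ab: the core of the line SDF.
static float segment_distance(float2 p, float2 a, float2 b) {
    float2 pa = p - a, ba = b - a;
    float h = clamp(dot(pa, ba) / dot(ba, ba), 0.0f, 1.0f);
    return length(pa - ba * h);
}

fragment float4 line_fragment(LineFragmentIn in [[stage_in]]) {
    float d = segment_distance(in.position.xy, in.a, in.b);
    // Smooth one-pixel antialiased edge instead of a hard cutoff.
    float alpha = 1.0f - smoothstep(in.half_width - 0.5f,
                                    in.half_width + 0.5f, d);
    if (alpha <= 0.0f) discard_fragment();
    return float4(1.0, 1.0, 1.0, alpha);
}
```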
Oct ’22
Comment on Ensuring peak M1 GPU performance for short running kernels
Thank you! Are there plans to add such an API? In my testing, the GPU remains in the low-power state if the delay between submitting short-running kernels is as short as 1 ms. So one can be submitting GPU work hundreds of times per second without ever reaching maximum performance. This makes sense for most applications, but it can be a problem for software where you cannot continuously submit GPU work but do want things done as quickly as possible.
Mar ’22
Comment on Using threadgroup memory for image convolution
Hi, thanks, that looks almost like what I need :) The only issue is that in our case we would need to load an imageblock slice from a texture, not store one, yet the documentation does not mention a slice read function… I can imagine loading multiple pixels in each thread (but that would complicate the code a lot), or dispatching a larger thread grid and wasting most of the threads. Neither is a good solution… regardless, thanks for the effort!
Jun ’21
Comment on Using threadgroup memory for image convolution
Hi Eugene, thanks for your answer. I think I understand the basics, but what evades me is how to copy the texture to threadgroup memory efficiently. For example, let's imagine the following scenario: my compute shader works on 16x16 pixels at a time and needs access to a certain neighbourhood, so I want to load a 32x32 block from the texture. I can then imagine the following pseudocode:

    threadgroup pixel *data;
    if (thread_index_in_threadgroup == 0) {
        copy_texture_region(texture, data, <base offset>, 32, 32);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
    output.write(compute_value_for_pixel(thread_position_in_grid, data));

but I have no idea how to do the copy_texture_region() part efficiently. The talk implies that there is an efficient block-transfer function for this, but I can't find it in the documentation. A naive way would be to load a bunch of pixels in each thread, but that makes everything ridiculously complicated... P.S. Sorry for the terrible formatting, the forum collapses my comment in a very awkward way...
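For what it's worth, the "load a bunch of pixels in each thread" variant is less bad than it sounds for these particular sizes: with a 16x16 threadgroup and a 32x32 tile, each thread loads exactly a 2x2 sub-block and no threads are wasted. A sketch under those assumptions (the kernel name, buffer bindings, and the 8-pixel apron are mine, not an official API):

```metal
#include <metal_stdlib>
using namespace metal;

kernel void convolve(texture2d<float, access::read>  src [[texture(0)]],
                     texture2d<float, access::write> dst [[texture(1)]],
                     uint2 tid  [[thread_position_in_threadgroup]],
                     uint2 gid  [[thread_position_in_grid]],
                     uint2 tgid [[threadgroup_position_in_grid]])
{
    // 32x32 tile: a 16x16 output block plus an 8-pixel apron on each side.
    threadgroup float4 tile[32][32];

    // Each of the 16x16 threads loads a 2x2 sub-block, so the whole
    // 32x32 tile is filled cooperatively with no idle threads.
    int2 base = int2(tgid) * 16 - 8;   // tile origin in texture coordinates
    for (uint dy = 0; dy < 2; ++dy) {
        for (uint dx = 0; dx < 2; ++dx) {
            uint2 local = tid * 2 + uint2(dx, dy);
            int2  coord = base + int2(local);
            // Clamp to the texture edge (equivalent to clamp-to-edge
            // addressing on a sampler).
            coord = clamp(coord, int2(0),
                          int2(src.get_width() - 1, src.get_height() - 1));
            tile[local.y][local.x] = src.read(uint2(coord));
        }
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Every thread can now read its neighbourhood from `tile`;
    // thread (x, y) of the group sits at tile[tid.y + 8][tid.x + 8].
    // Placeholder: the actual convolution would go here.
    dst.write(tile[tid.y + 8][tid.x + 8], gid);
}
```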
Jun ’21