Using threadgroup memory for image convolution

In the talk "Create image processing apps powered by Apple Silicon", Harsh Patil mentioned that one should use threadgroup memory to load a chunk of the image containing all the pixels required to run a convolution kernel. Unfortunately, there was no code example, and I have difficulty figuring out how something like that would be set up. I can imagine using imageblocks, but how would one load/store them in the shader? Could anyone offer some guidance (ideally with a code snippet)?

Replies

Hi jcookie!

When you use threadgroup memory in your compute kernel, you are essentially using the same "local" memory that imageblocks use. That is exactly what Harsh mentioned: you can explicitly use tile memory by declaring a threadgroup memory allocation. For reference, in other APIs this type of memory is called "shared" or "local".

Below is a basic compute example (it doesn't use threadgroup memory, but it shows the general setup): https://developer.apple.com/documentation/metal/processing_a_texture_in_a_compute_function?language=objc

I can't immediately find a threadgroup example on the official Apple Developer website. The idea is that you first bring texture/buffer data into threadgroup memory in your shader, then do all the calculations ONLY on this local threadgroup memory. Because this memory is much faster (keep bank conflicts in mind, though), the ALUs stay busier and spend less time waiting on memory. At the end of the computation, you write (flush) the threadgroup memory back to device memory. That is what Harsh called the "flush", and you do it yourself in the compute kernel.
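
Roughly, the shader side of that pattern looks like this. Note this is only an untested sketch, not code from the talk or the docs: the 16x16 tile, the kernel name `tg_pattern`, and the trivial three-tap blur are placeholders, and it assumes the dispatch grid matches the texture size and that the texture dimensions are multiples of 16.

    #include <metal_stdlib>
    using namespace metal;

    // Load -> barrier -> compute -> flush, using one 16x16 tile per 16x16 threadgroup.
    kernel void tg_pattern(texture2d<half, access::read>  src [[ texture(0) ]],
                           texture2d<half, access::write> dst [[ texture(1) ]],
                           ushort2 lid [[ thread_position_in_threadgroup ]],
                           ushort2 gid [[ thread_position_in_grid ]])
    {
        threadgroup half4 tile[16][16];

        // 1. Each thread stages its own pixel in threadgroup memory.
        tile[lid.y][lid.x] = src.read(gid);
        threadgroup_barrier(mem_flags::mem_threadgroup);

        // 2. Compute using ONLY the threadgroup copy. Here: a three-tap box blur
        //    along x, clamped at the tile edge. (A real filter would also load a
        //    border region, which is what the follow-up question below is about.)
        uint xl = (lid.x == 0)  ? 0  : lid.x - 1;
        uint xr = (lid.x == 15) ? 15 : lid.x + 1;
        half4 result = (tile[lid.y][xl] + tile[lid.y][lid.x] + tile[lid.y][xr]) / 3.0h;

        // 3. Flush the result back to device memory.
        dst.write(result, gid);
    }

Host-side you would dispatch with 16x16 threads per threadgroup so the threadgroup and the tile line up.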

  • Hi Eugene, thanks for your answer. I think I understand the basics, but what evades me is how to copy the texture to the threadgroup memory efficiently. For example, let's imagine the following scenario: my compute shader works on 16x16 pixels at a time and needs access to a certain neighbourhood, so I want to load a 32x32 block from the texture. I can then imagine the following pseudocode:

        threadgroup pixel *data;
        if (thread_index_in_threadgroup == 0) {
            copy_texture_region(texture, data, <base offset>, 32, 32);
        }
        threadgroup_barrier();
        output.write(compute_value_for_pixel(thread_position_in_grid, data));

    but I have no idea how to do the copy_texture_region() part efficiently. The talk implies that there is an efficient block-transfer function for this, but I can't find it in the documentation. A naive way would be to load a bunch of pixels in each thread (roughly like the sketch at the end of this comment), but that makes everything ridiculously complicated...

    P.S. Sorry for the terrible formatting; the forum collapses my comment in a very awkward way...
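
    Edit: to make that concrete, here is roughly the naive per-thread loading I have in mind. This is just an untested sketch: the 8-pixel apron simply matches the 32x32 number above (a 3x3 filter would need far less), and the 3x3 box blur stands in for the real convolution.

        #include <metal_stdlib>
        using namespace metal;

        // 16x16 threads per group; each thread loads a 2x2 block of texels so the
        // group fills a 32x32 threadgroup tile (8-pixel apron on every side).
        #define TG_SIZE 16
        #define APRON   8
        #define TILE    (TG_SIZE + 2 * APRON)   // 32

        kernel void convolve(texture2d<half, access::read>  src [[ texture(0) ]],
                             texture2d<half, access::write> dst [[ texture(1) ]],
                             ushort2 lid      [[ thread_position_in_threadgroup ]],
                             ushort2 gid      [[ thread_position_in_grid ]],
                             ushort2 group_id [[ threadgroup_position_in_grid ]])
        {
            threadgroup half4 tile[TILE][TILE];

            // Top-left corner of the 32x32 region in texture coordinates.
            int2 tileOrigin = int2(group_id) * TG_SIZE - APRON;

            // Cooperative load: 256 threads x 4 texels each = 1024 = 32x32 texels.
            for (int ty = lid.y; ty < TILE; ty += TG_SIZE) {
                for (int tx = lid.x; tx < TILE; tx += TG_SIZE) {
                    int2 coord = tileOrigin + int2(tx, ty);
                    // Clamp to the texture edges so the apron never reads out of bounds.
                    coord = clamp(coord, int2(0),
                                  int2(src.get_width() - 1, src.get_height() - 1));
                    tile[ty][tx] = src.read(uint2(coord));
                }
            }

            // Make sure the whole tile is resident before anyone reads from it.
            threadgroup_barrier(mem_flags::mem_threadgroup);

            // Placeholder convolution: 3x3 box blur read entirely from threadgroup memory.
            int2 center = int2(lid) + APRON;   // this thread's pixel inside the tile
            half4 sum = half4(0.0h);
            for (int dy = -1; dy <= 1; ++dy)
                for (int dx = -1; dx <= 1; ++dx)
                    sum += tile[center.y + dy][center.x + dx];

            if (gid.x < dst.get_width() && gid.y < dst.get_height())
                dst.write(sum / 9.0h, gid);
        }

    That per-thread addressing in the load loop is exactly what feels clumsy to me compared to a single block-transfer call.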

  • Hi jcookie!

    Is this sample from the Metal Shading Language specification, version 2.4, what you are looking for?

    6.13.3 Writing an Imageblock Slice to a Region in a Texture

    Example:

        struct Foo {
            half4 a;
            int b;
            float c;
        };

        kernel void
        my_kernel(texture2d<half> src [[ texture(0) ]],
                  texture2d<half, access::write> dst [[ texture(1) ]],
                  imageblock<Foo> img_blk,
                  ushort2 lid [[ thread_position_in_threadgroup ]],
                  ushort2 gid [[ thread_position_in_grid ]])
        {
            // Read the pixel from the input image using the thread ID.
            half4 clr = src.read(gid);

            // Get the image slice.
            threadgroup_imageblock Foo* f = img_blk.data(lid);

            // Write the pixel in the imageblock using the thread ID in threadgroup.
            f->a = clr;

            // A barrier to make sure all threads finish writing to the imageblock.
            // In this case, each thread writes to its own location in the imageblock,
            // so a barrier is not necessary.
            threadgroup_barrier(mem_flags::mem_threadgroup_imageblock);

            // Process the pixels in the imageblock, and update the elements in the slice.
            process_pixels_in_imageblock(img_blk, gid, lid);

            // A barrier to make sure all threads finish writing to the elements in the
            // imageblock.
            threadgroup_barrier(mem_flags::mem_threadgroup_imageblock);

            // Write a specific element in an imageblock to the output image.
            // Only one thread in the threadgroup performs the imageblock write.
            if (lid.x == 0 && lid.y == 0)
                dst.write(img_blk.slice(f->a), gid);
        }

  • Sorry, the formatting is insane; I am impressed by how all the CR/LFs just turned into nothing. But please check section 6.13.3 in the MSL 2.4 spec.