jcookie’s Profile | Apple Developer Forums

When is a `simdgroup_barrier()` required?

Metal offers both threadgroup_barrier() and simdgroup_barrier(). I understand the need for threadroup barriers — it would not be possible to rely on well cooperation between threads in a threadgroup without them, as different threads can execute on different SIMD partitions at different times. But I don't really get the simdgroup_barrier() — it was my impression that all threads in a simdgroup execute in lockstep and this if one thread in a simdgroup makes progress, all other active threads in the simdgroup are also guaranteed to make progress. If this were not the case we'd need to insert simdgroup barrier pretty much any time we read or write any storage or perform SIMD-scoped operations. It doesn't seem like Apple uses simdgroup_barrier() in any of their sample code. In fact, it seems like it's a no-op on current Apple Silicon hardware. Is there a situation when I need to use simdgroup barriers or is this a superfluous operation? P.S. It seems that Apple engineers are as confused by this as I am, see https://github.com/ml-explore/mlx/blame/1f6ab6a556045961c639735efceebbee7cce814d/mlx/backend/metal/kernels/scan.metal#L355

Metal

Posted

by

jcookie.

Last updated

.

Tile shading pipeline without fragment shader?

I am experimenting with some alternative rendering techniques, where the scene is represented as a mixture of parametrised SDFs and the final shading is done evaluating and mixing the SDFs for each fragment. The basic algorithm divides the screen into tiles, collects and sorts the SDFs intersecting each tile, and then invokes the final compute shader. There can be multiple SDFs affecting each pixel as they are partially transparent. It seems to me that Apple's TBDR tile shading pipeline would be an ideal for for this type of algorithm, but I am not quite sure how to utilise it efficiently. Essentially, I was thinking about rendering bounding rects over the SDFs and leveraging the binning hardware to arrange them into tiles for me. What I need the rasterisation pipeline to spit out is simply the list of primitives per tile. But there is no "per-primitive-per-tile" shader stage, so this has to be done in the fragment shader. I could of course record the primitive ID per pixel, but this is complicated by the fact that I can have multiple primitives affecting each pixel. Plus, there will be a lot of duplicates, as there are usually not more than 5-6 primitives per tile, and sorting the duplicates out seems like a waste. What would be the most efficient way to handle this? Is there a way to utilize the tile shading pipeline to simply build out a list of primitive IDs in the tile?

Metal

Posted

by

jcookie.

Last updated

.

Problems with mesh shaders that dispatch large amount of threadgroups

I was familiarising myself with the Metal mesh shaders and run into some issues. First, a trivial application that uses mesh shaders to generate simple rectangular geometry hangs the GPU when dispatching 2D grids of mesh shader threadgroups, but it's really weird as it is sensitive to the grid shape. E.g. // these work! meshGridProperties.set_threadgroups_per_grid(uint3(512, 1, 1)); meshGridProperties.set_threadgroups_per_grid(uint3(16, 8, 1)); meshGridProperties.set_threadgroups_per_grid(uint3(32, 5, 1)); // these (and anything "bigger") hang! meshGridProperties.set_threadgroups_per_grid(uint3(16, 9, 1)); meshGridProperties.set_threadgroups_per_grid(uint3(32, 6, 1)); The sample shader code is attached. The invocation is trivial enough: re.drawMeshThreadgroups( MTLSizeMake(1, 1, 1), threadsPerObjectThreadgroup: MTLSizeMake(1, 1, 1), threadsPerMeshThreadgroup: MTLSizeMake(1, 1, 1) ) For apple engineers: a bug has been submitted under FB10367407 Mesh shader code: 2d_grid_mesh_shader_hangs.metal I also have a more complex application where mesh shaders are used to generate sphere geometry: each mesh shader thread group generates a single slice of the sphere. Here the problem is similar: once there more than X slices to render, some of the dispatched mesh threadtroups don't seem to do anything (see screenshot below). But the funny thing is that the geometry is produced, as it would occasionally flicker in and out of existence, and if I manually block out some threadgroups from running (e.g. by using something like if(threadgroup_index > 90) return; in the mesh shader, the "hidden" geometry works! It almost looks like different mesh shaders thread group would reuse the same memory allocation for storing the output mesh data and output of some threadgroups is overwritten. I have not submitted this as a bug, since the code is more complex and messy, but can do so if someone from the Apple team wants to have a look.

Metal

Posted

by

jcookie.

Last updated

.

A17 Pro documentation

Is the documentation of the new hardware features in A17 Pro available yet? I was unable to find anything in the official tech notes.

Posted

by

jcookie.

Last updated

.

Changes in Metal 3.1

Is there a comprehensive list of what has changed in Metal 3.1? The API diff is not really helpful and there is no such document for the shading language. From a cursory glance, this is what I have noticed: curve support for raytracing bfloat support on the GPU option to specify vertex stride at the time of binding rather than as property of the render pass linked function support for mesh shading pipeline (essentially VK_EXT_shader_object from what I understand)? Anything substantial changes that I have missed?

Metal

Posted

by

jcookie.

Last updated

.

Vulkan and Metal (some observations)

As Vulkan spec has been released few days ago, I think it might be interesting to look at how it compares to what Apple gives us with Metal. First of all, a disclaimer: this is mostly from an academic standpoint, I am interested to comparing the APIs, the provided fetures and their relative merits. I hope that some other people here who are curious about GPUs, APIs and API design could offer their thouhgts on the matter.Some folks (me included) were quite dissapointed to learn that Apple is jumping the Vulkan bandwagon. After reading the spec, I think I understand why. And I am starting to believe this might be a very reasonable move by Apple. Here are some thoughts.I think it should be fairly clear that Vulkan offers higher performance potential then Metal. Metal still does a lot of hand holding and behind-the-scenes management for you, while with Vulkan you are responsible for — literally — everything. And man they were NOT kidding when they said that the API is explicit. Its actually quite ridiculous how diffficult and detailed the API is. Of course, the nice thing is that you can optimise the resource usage very precisely in regards to the specifics of your engine, and you get quite precise performance guarantees. On the other side, you need to make sure that the data you use for a particular pass is in the device memory, which means juggling data around, recreating resources, breaking down yoru renderign commands and doing all kinds of weird memory dances. In fact, I can't imagine that many people will use Vulkan directly, instead we will see a bunch of wrapper libraries that abstract the tedious tasks like manual memory management and operation synchronisation. At the same time — and that is the funny thing — Vulkan does not seem that much more powerful to me. Yes, it supports stuff like geometry and tesselation shaders, it has batched bindings updates, sparse ressources, command buffer reuse and atomic texture operations. But all these things can be trivially added to Metal (and I'm sure Apple is working on that already). The ressource binding model of Vulkan is more efficient, that for sure, but it is certainly not more powerful — it does not allow you to build more complex shader inputs than what Metal already offers.The explicit nature of Vulkan might offer additional optimisation opportinuties to applications seeking to squeeze those 100% out of the hardware, but at the extreme expense of usability. Metal is a more casual API, which is very convenient to use and still offers very good performance (and performance guarantees) that will satisfy an overwhelming majority of applications, both for graphics and compute. With some extensions, it will basically have feature parity with Vulkan, and it can easily borrow some of Vulkan's optimisations without sacrifising ease of use (e.g. batched binding updates, reusable command buffers as well as synchronisation primitives). And let's be honest here — applications that really need explicit control like Vulkan provides are high-end game titles, which are not targeted at the Apple platform anyway (because they require really beefy GPUs, which Apple simply does not ship in their machines). I think Apple might have lost the initial interest in Vulkan after they saw what it was shaping up to become. They were interested in having a convenient and efficient replacement for the difficult to maintain and erratic OpenGL. Vulkan is certainly efficient but I wouldn't call it 'convenient'. Its not an API that would draw developers (especially small-time developers) away from using OpenGL or encourage them to make more titles for OS X. Instead, Metal hits the spot exactly. I still would like to see Vulkan on OS X and iOS at some point (to make it easier for devs to port from other platforms), and from what I gathered, it should be actually possible to implement a Vulkan wrapper on top of Metal (which will of course lack features such as sparse resouces, tesselation shaders etc. — but thats is still perfectly legal according to the Vulkan spec). Personally however, I'd be much more interested in a Metal implementation on top of Vulkan to use on Windows/Linux.

Metal

Posted

by

jcookie.

Last updated

.

Clarification of Tier 2 Argument Buffer limits

Apple documentation states about Tier 2 Argument Buffer hardware capability The maximum per-app resources available at any given time are: 500,000 buffers or textures What does it mean exactly? Does this number refer to the maximal count of attachment points (e.g. unique indices) across all bound argument buffers, the maximal count of only bound resources across the argument buffers (e.g. when using dynamic indexing and sparsely binding resources) or the number of resource objects that the application can create and manage at a given time? Prompted by some discussions in the community I decided to run some tests and was surprised to discover that I could bind many millions buffer attachments to a single argument buffer in a Metal shader on my M1 Max laptop, way in excess of the quoted 500,000 limit. Is that just undefined behaviour that one should not rely on or does "500,000" refer to something else instead of the number of attachment points? Hope that someone from Apple Gpu team can clarify this. If this is not the correct venue for this question, please tell me where I can send my inquiry.

Metal

Posted

by

jcookie.

Last updated

.

Metal 3: GPU addresses and argument buffer encoding

Looking at the new Metal 3 APIs diffs, I noticed that objects now expose a new gpuHandle/gpuRessourceD property, and that the MTLArgumentEncoder is marked as deprecated and there seems to be the family of new MTLBinding APIs that looks like a replacement for it. Does this mean that we are getting some new resource binding model? I was not able to find any details in the documentation and Tuesday's Metal session did not mention these API changes at all. And the APIs themselves seem to be in flux, as gpuHandle is already marked as deprecated even though it is still beta :) Will there be a WWDC session about these APIs or could you share some details here?

Metal

Posted

by

jcookie.

Last updated

.

Ensuring peak M1 GPU performance for short running kernels

There is currently an ongoing discussion about the validity of GPU compute performance estimates like those offered by popular benchmarking tools such as Geekbench 5. It has been observed that Apple GPUs have a relatively slow frequency ramp up do not reach their peak performance if the submitted kernels have a runtime under a few seconds. I understand that these GPUs are designed for throughtput rather than latency, but sometimes one does work with “small” work packages (such as processing a single image). Is there an official way to tell the system that it should use peak performance for such work? E.g. some sort of hint along the lines of “I will now submit some GPU work and I want you to power up all the relevant subsystems” instead of relying on the OS to lazily adjust the performance profile?

Metal

Posted

by

jcookie.

Last updated

.

Using threadgroup memory for image convolution

In the talk „Create image processing apps powered by Apple Silicon“ Harsh Patil mentioned that one should use threadgroup memory to load a chunk of image containing all the required pixels to run a convolution kernel. Unfortunately there was no code example and I have difficulty figuring out how something like that would be set up. I can imagine using imageblocks, but how would one load/store them in the shader? Could anyone offer some guidance (ideally with a code snippet)?

Metal

Posted

by

jcookie.

Last updated

.

Writing to device memory from vertex function

Metal shading language specification in section 5.10 states: "If a vertex function does writes to one or more buffers or textures, its return type must be void" However, writing to buffers from vertex functions works correctly on Intel, AMD and A13 GPUs. Has this restriction been removed on later hardware? Can one rely on this behavior going forward? Or is it just a fluke?

Metal

Posted

by

jcookie.

Last updated

.

Any info on “ supportsPullModelInterpolation”?

I found this new API in MTLDevice “supportsPullModelInterpolation()” but there was no additional info. Anyone knows what is this about?

Metal

Posted

by

jcookie.

Last updated

.

Whats new in A14 GPU?

Apple so far has been very enigmatic about the capabilities of the A14. Updated Metal feature tables suggest that A14 gains some features up to now reserved to the desktop GPUs (e.g. barycentrics support). Is there anything else? Will there be an updated tech note?

Metal

Posted

by

jcookie.

Last updated

.

User Profile

jcookie

Posts

Posts

When is a `simdgroup_barrier()` required?

Tile shading pipeline without fragment shader?

Problems with mesh shaders that dispatch large amount of threadgroups

A17 Pro documentation

Changes in Metal 3.1

Vulkan and Metal (some observations)

Clarification of Tier 2 Argument Buffer limits

Metal 3: GPU addresses and argument buffer encoding

Ensuring peak M1 GPU performance for short running kernels

Using threadgroup memory for image convolution

Writing to device memory from vertex function

Any info on “ supportsPullModelInterpolation”?

Whats new in A14 GPU?