Post · Replies · Boosts · Views · Activity

What is the purpose of threadgroup memory in Metal?
I have been working with Metal for a little while now and I have encountered the threadgroup address space. After reading a little about it in Apple’s MSL reference, I am aware of how threadgroups are formed and how they can be split into SIMD groups; however, I have not yet seen threadgroup memory in action. Can someone give me some examples of when/how threadgroup memory is used? Specifically, how is the [[threadgroup(n)]] attribute used in both kernel and fragment shaders? References to WWDC videos, articles, and/or other resources would be appreciated.
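To make the question concrete, here is my best guess at what a kernel using the attribute might look like — a parallel reduction sketch pieced together from the MSL spec (entirely unverified, and all names are my own):

```metal
#include <metal_stdlib>
using namespace metal;

// Each threadgroup sums a slice of `input` into one partial sum.
// `sharedVals` lives in the threadgroup address space, so it is visible to
// every thread in the threadgroup (and, I assume, backed by tile memory on
// Apple GPUs).
kernel void reduce_sum(device const float *input      [[buffer(0)]],
                       device float       *partials   [[buffer(1)]],
                       threadgroup float  *sharedVals [[threadgroup(0)]],
                       uint tid    [[thread_position_in_threadgroup]],
                       uint gid    [[thread_position_in_grid]],
                       uint tgid   [[threadgroup_position_in_grid]],
                       uint tgSize [[threads_per_threadgroup]])
{
    // Each thread stages one element in the shared pool.
    sharedVals[tid] = input[gid];
    threadgroup_barrier(mem_flags::mem_threadgroup);

    // Tree reduction over the threadgroup.
    for (uint stride = tgSize / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            sharedVals[tid] += sharedVals[tid + stride];
        }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }

    // Thread 0 writes the threadgroup's result.
    if (tid == 0) {
        partials[tgid] = sharedVals[0];
    }
}
```

If I understand correctly, the host would size the [[threadgroup(0)]] allocation with the compute command encoder's setThreadgroupMemoryLength(_:index:), and there is an analogous setThreadgroupMemoryLength(_:offset:index:) on the render command encoder. What I am missing is examples of this actually being used, particularly in fragment shaders.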
2
0
1.9k
Sep ’20
Metal Debugger Issues
I have been unable to use the Metal debugger ever since Apple released Xcode 12 as an update on the App Store, and it is very frustrating. Xcode 12.0.1 simply crashed on frame capture or after trying to debug a fragment/vertex shader. Now, Xcode 12.2 issues the following message: "Shader Debugger is not supported in this system configuration. Please install an Xcode with an SDK that is aligned to your target device OS version." I am on macOS 10.15.7 and have not yet upgraded to Big Sur. I downloaded Xcode 11.7 from the developer website, but again, Xcode simply crashes. I will try other, older Xcode versions, but this should not be something developers face, especially those working with Metal, as it is nearly impossible to debug shaders without the shader debugger. Has anybody else had this issue? If so, what did you do to resolve it?
6
0
1.9k
Nov ’20
GPU Hardware and Metal concerning Tile Memory
In the WWDC talks on Metal that I have watched so far, many of the videos talk about Apple's A_ (fill in the blank, 11, 12, etc.) chip and the power it gives to the developer, such as allowing developers to leverage tile memory through TBDR. On macOS (at least on Intel Macs without the M1 chip), TBDR is unavailable, and other features that leverage tile memory, like image blocks, are also unavailable. That made me wonder about the structure of the GPUs on macOS and of external GPUs like the Blackmagic eGPU (which is currently hooked up to my computer). Are the concepts of tile memory ubiquitous across GPU architectures? For example, if in a Metal kernel function we declared

```metal
threadgroup float tgfloats[16];
```

is this array stored in tile memory (threadgroup memory) on the Blackmagic? Or is there an equivalent storage that is dependent on hardware but available on all hardware in some form? I know there are some WWDC sessions that deal with multiple GPUs which will probably be helpful, but extra information is always useful. Any links to information about GPU hardware architectures would be appreciated as well.
2
0
2.0k
Nov ’20
MTLSharedEvent scheduled block called before command buffer scheduling and not in-flight
I am using an MTLSharedEvent to occasionally relay new information from the CPU to the GPU by writing into a MTLBuffer with storage mode .storageModeManaged within a block registered by the shared event (using the notify(_:atValue:block:) method of MTLSharedEvent, with a MTLSharedEventListener configured to be notified on a background dispatch queue). The process looks something like this:

```swift
let device = MTLCreateSystemDefaultDevice()!
let synchronizationQueue = DispatchQueue(label: "com.myproject.synchronization")

let sharedEvent = device.makeSharedEvent()!
let sharedEventListener = MTLSharedEventListener(dispatchQueue: synchronizationQueue)

// Updated only occasionally on the CPU (on user interaction). Mostly written to
// on the GPU
let managedBuffer = device.makeBuffer(length: 10, options: .storageModeManaged)!

var doExtraWork = true

func computeSomething(commandBuffer: MTLCommandBuffer) {

    // Do work on the GPU every frame

    // After writing to the buffer on the GPU, synchronize the buffer (required)
    let blitToSynchronize = commandBuffer.makeBlitCommandEncoder()!
    blitToSynchronize.synchronize(resource: managedBuffer)
    blitToSynchronize.endEncoding()

    // Occasionally, add extra information on the GPU
    if doExtraWork {

        // Register a block to write into the buffer
        sharedEvent.notify(sharedEventListener, atValue: 1) { event, value in

            // Safely write into the buffer. Make sure we call `didModifyRange(_:)` after

            // Update the counter
            event.signaledValue = 2
        }

        commandBuffer.encodeSignalEvent(sharedEvent, value: 1)
        commandBuffer.encodeWaitForEvent(sharedEvent, value: 2)
    }

    // Commit the work
    commandBuffer.commit()
}
```

The expected behavior is as follows:

1. The GPU does some work with the managed buffer.
2. Occasionally, the buffer needs to be updated with new information on the CPU. In that frame, we register a block of work to be executed. We do so in a dedicated block because we cannot guarantee that, by the time execution on the main thread reaches this point, the GPU is not simultaneously reading from or writing to the managed buffer. Hence, it is unsafe to simply write to it right away; we must make sure the GPU is not doing anything with this data.
3. When the GPU schedules this command buffer for execution, commands encoded before the encodeSignalEvent(_:value:) call are executed, and then execution on the GPU stops until the block increments the signaledValue property of the event passed into the block.
4. When execution reaches the block, we can safely write into the managed buffer because we know the CPU has exclusive access to the resource. Once we've done so, GPU execution resumes.

The issue is that Metal does not seem to call the block while the GPU is executing the command, but rather *before* the command buffer is even scheduled. Worse, the system seems to "work" with the initial command buffer (the very first command buffer, before any others are scheduled). I first noticed this issue when I looked at a GPU frame capture after my scene would vanish following a CPU update, which is where I saw that the GPU had NaNs all over the place. I then ran into this strange situation when I purposely waited on the background dispatch queue with a sleep(_:) call.
Quite correctly, my shared resource semaphore (not shown; signaled in a completion block of the command buffer and waited on in the main thread) reached a value of -1 after committing three command buffers to the command queue (three being the number of recycled shared MTLBuffers holding scene uniform data, etc.). This suggests that the first command buffer had not finished executing by the time the CPU was more than three frames ahead, which is consistent with the sleep(_:) behavior. Again, what isn't consistent is the ordering: Metal seems to call the block before even scheduling the buffer. Further, in subsequent frames, Metal doesn't seem to care that the sharedEventListener block is taking so long; it schedules the command buffer for execution even while the block is running, and the block finishes dozens of frames later. This behavior is completely inconsistent with what I expect. What is going on here?

P.S. There is probably a better way to periodically update a managed buffer whose contents are mostly modified on the GPU, but I have not yet found a way to do so. Any advice on this subject is appreciated as well. Of course, a triple-buffer system *could* work, but it would waste a lot of memory, as the managed buffer is quite large (whereas the shared buffers managed by the semaphore are quite small).
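For reference, the frame-pacing semaphore mentioned above looks roughly like this (a reconstructed sketch; it is not part of the snippet above):

```swift
// Classic triple-buffered frame pacing: at most three command buffers in
// flight, each with its own small shared uniform buffer.
let maxFramesInFlight = 3
let frameSemaphore = DispatchSemaphore(value: maxFramesInFlight)

func drawFrame(using commandQueue: MTLCommandQueue) {
    // The main thread blocks here if three frames are already in flight.
    frameSemaphore.wait()

    let commandBuffer = commandQueue.makeCommandBuffer()!
    // ... encode the frame's work ...

    commandBuffer.addCompletedHandler { _ in
        // The GPU is finished with this frame's resources; release one slot.
        frameSemaphore.signal()
    }
    commandBuffer.commit()
}
```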
2
0
1.1k
Feb ’21
Threadgroup memory write then read without barrier
I posted this question to StackOverflow. Perhaps it is better suited here, where Apple developers are more likely to see it. I was looking through the project linked on the page "Selecting Device Objects for Compute Processing" in the Metal documentation (linked here - https://developer.apple.com/documentation/metal/gpu_selection_in_macos/selecting_device_objects_for_compute_processing). There, I noticed a clever use of threadgroup memory that I am hoping to adopt in my own particle simulator. However, before I do so, I need to understand a particular aspect of threadgroup memory and what the developers are doing in this scenario. The code contains a segment like so:

```metal
// In AAPLKernels.metal

// Parameter of the kernel
threadgroup float4* sharedPosition [[threadgroup(0)]]

// Body
...

// For each particle / body
for (i = 0; i < params.numBodies; i += numThreadsInGroup)
{
    // Because sharedPosition uses the threadgroup address space, 'numThreadsInGroup' elements
    // of sharedPosition will be initialized at once (not just one element at lid as it
    // may look like)
    sharedPosition[threadInGroup] = oldPosition[sourcePosition];

    j = 0;

    while (j < numThreadsInGroup)
    {
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    } // while

    sourcePosition += numThreadsInGroup;
} // for
```

In particular, I found the comment just before the assignment to sharedPosition (the one starting with "Because...") confusing. I haven't read anywhere that threadgroup memory writes happen on all threads in the same threadgroup simultaneously; in fact, I thought a barrier would be needed before reading from the shared memory pool again to avoid undefined behavior, since *each* thread subsequently reads from the entire pool of threadgroup memory after the assignment (the assignment being a write, of course). Why is a barrier unnecessary here?
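For contrast, here is the pattern I would have expected — my own minimal, untested kernel, not the sample's code — with an explicit barrier between the threadgroup write and the reads:

```metal
#include <metal_stdlib>
using namespace metal;

// Write to threadgroup memory, barrier, then have every thread read every
// element written by the other threads.
kernel void expected_pattern(device const float *input      [[buffer(0)]],
                             device float       *output     [[buffer(1)]],
                             threadgroup float  *sharedVals [[threadgroup(0)]],
                             uint tid    [[thread_position_in_threadgroup]],
                             uint gid    [[thread_position_in_grid]],
                             uint tgSize [[threads_per_threadgroup]])
{
    sharedVals[tid] = input[gid];

    // Without this barrier, I would expect the loop below to exhibit
    // undefined behavior, since each thread reads elements written by
    // other threads in the group.
    threadgroup_barrier(mem_flags::mem_threadgroup);

    float sum = 0.0f;
    for (uint j = 0; j < tgSize; ++j) {
        sum += sharedVals[j];
    }
    output[gid] = sum;
}
```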
1
0
1k
Feb ’21
Post-tessellation Vertex Function and Raytracing: Getting More Detailed Geometries for Acceleration Structures
I have recently gained some interest in the raytracing API provided by the Metal framework. I understand that you can attach a vertex buffer to a geometry descriptor that Metal will use to create the acceleration structure later (on a MTLPrimitiveAccelerationStructureDescriptor instance, for example). This made me wonder if it were possible to write the output of the tessellator into a separate vertex buffer from the post-tessellation vertex shader and pass that along to the raytracer. I thought that perhaps you could get more detailed geometry and still render without rasterization. For example, I might have the following simple post-tessellation vertex function:

```metal
// Control Point struct
struct ControlPoint {
    float4 position [[attribute(0)]];
};

// Patch struct
struct PatchIn {
    patch_control_point<ControlPoint> control_points;
};

// Vertex-to-Fragment struct
struct FunctionOutIn {
    float4 position [[ position ]];
    half4  color    [[ flat ]];
};

[[patch(triangle, 3)]]
vertex FunctionOutIn tessellation_vertex_triangle(PatchIn patchIn [[stage_in]],
                                                  float3 patch_coord [[ position_in_patch ]])
{
    // Barycentric coordinates
    float u = patch_coord.x;
    float v = patch_coord.y;
    float w = patch_coord.z;

    // Convert to cartesian coordinates
    float x = u * patchIn.control_points[0].position.x +
              v * patchIn.control_points[1].position.x +
              w * patchIn.control_points[2].position.x;
    float y = u * patchIn.control_points[0].position.y +
              v * patchIn.control_points[1].position.y +
              w * patchIn.control_points[2].position.y;

    // Output
    FunctionOutIn vertexOut;
    vertexOut.position = float4(x, y, 0.0, 1.0);
    vertexOut.color = half4(u, v, w, 1.0h);
    return vertexOut;
}
```

However, a version of this function that also writes each tessellated vertex into a separate output buffer does not compile (where outPutBuffer would be some struct*, not void*). I noticed that the function doesn't compile when I don't use the data in the control points as output, like so:

```metal
[[patch(triangle, 3)]]
vertex FunctionOutIn tessellation_vertex_triangle(PatchIn patchIn [[stage_in]],
                                                  float3 patch_coord [[ position_in_patch ]])
{
    // Barycentric coordinates
    float u = patch_coord.x;
    float v = patch_coord.y;
    float w = patch_coord.z;

    // Convert to cartesian coordinates
    float x = u * patchIn.control_points[0].position.x +
              v * patchIn.control_points[1].position.x +
              w * patchIn.control_points[2].position.x;
    float y = u * patchIn.control_points[0].position.y +
              v * patchIn.control_points[1].position.y +
              w * patchIn.control_points[2].position.y;

    // Output
    FunctionOutIn vertexOut;

    // Does not use x or y (and therefore the `patch_control_point<T>`'s values
    // are not used as output into the rasterizer)
    vertexOut.position = float4(1.0, 1.0, 0.0, 1.0);
    vertexOut.color = half4(1.0h, 1.0h, 1.0h, 1.0h);
    return vertexOut;
}
```

I looked at the patch_control_point<T> template that was publicly exposed but didn't see anything enforcing this. What is going on here? In particular, how would I go about increasing the quality of the geometry fed into the raytracer? Would I simply have to use more complex assets? Tessellation has its place in the rasterization pipeline, but can it be used elsewhere? Of course, this would leave a much larger memory footprint if we were storing the tessellated patches.
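For context, this is the kind of host-side setup I have in mind, where the tessellated vertices would feed the acceleration structure (a sketch in Swift; the buffer and triangle count are hypothetical):

```swift
import Metal

// Hypothetical: `tessellatedVertexBuffer` is the buffer I hoped to fill from
// the post-tessellation vertex function before building the acceleration
// structure from it.
func makePrimitiveDescriptor(tessellatedVertexBuffer: MTLBuffer,
                             triangleCount: Int) -> MTLPrimitiveAccelerationStructureDescriptor {
    let geometry = MTLAccelerationStructureTriangleGeometryDescriptor()
    geometry.vertexBuffer = tessellatedVertexBuffer
    geometry.vertexStride = MemoryLayout<SIMD3<Float>>.stride
    geometry.triangleCount = triangleCount

    let descriptor = MTLPrimitiveAccelerationStructureDescriptor()
    descriptor.geometryDescriptors = [geometry]
    return descriptor
}
```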
0
0
725
Mar ’21
Storage of `ray_data` in ray tracing payload
This is a duplicate of my StackOverflow post linked here - https://stackoverflow.com/questions/67336596/storage-of-ray-data-in-ray-tracing-payload

I am currently working with Metal's ray tracing API. I remembered that you can pass data from an intersection function to the compute kernel that started the ray intersection process. After rewatching the WWDC 2020 talk Discover ray tracing with Metal by Sean James (linked here), I found the relevant section around 16:13 where he talks about the ray payload. However, I am curious where this payload is stored as it is passed to the intersection function. When declared with the relevant [[ payload ]] attribute in the intersection function, it must be in the ray_data address space. According to the Metal Shading Language Specification (version 2.3), pg. 64, the data passed into the intersection function is copied into the ray_data address space and copied back out once the intersection function returns. However, this doesn't specify whether, e.g., the data is stored in tile memory (like data in the threadgroup address space) or in per-thread memory (the thread address space). The video did not specify this either. In fact, the declarations of the intersect function (see pg. 204) that include the payload parameter put it in the thread address space (which makes sense). So where does the copied ray_data "version" of the data, whose original lives in the thread address space in the kernel, actually go?
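For reference, here is a minimal sketch of the setup I am describing; the payload type and function names are my own invention:

```metal
#include <metal_stdlib>
using namespace metal;
using namespace metal::raytracing;

// Toy payload type for illustration.
struct RayPayload {
    float hitDistance;
    uint  hitCount;
};

// The payload parameter lives in the ray_data address space; per the spec,
// the kernel's thread-address-space value is copied in on entry and copied
// back out when the intersection function returns. My question is where this
// ray_data copy physically resides.
[[intersection(triangle, triangle_data)]]
bool countingIntersection(float2 barycentricCoords     [[barycentric_coord]],
                          float hitT                   [[distance]],
                          ray_data RayPayload &payload [[payload]])
{
    payload.hitCount += 1;
    payload.hitDistance = hitT;
    return true;
}
```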
4
0
1.1k
Apr ’21
Metal Sample Code in Swift?
I've noticed that all of the sample Metal code I've downloaded, from the basic "Creating and Sampling from Textures" to "Modern Rendering with Metal," is written in Objective-C. I'm hopeful that someday one of the demo projects will be written in Swift, since that's what I've used to write my Metal apps and I'm looking for interesting uses of the Swift language as it relates to working with Metal. I understand the Objective-C provided, but it would be neat if the samples were written in Swift. Will there ever be a sample project in Swift that uses Metal? Perhaps released with WWDC 2021?
3
0
3.3k
May ’21
Distributing two variants of an iOS app to a single business entity managed under two separate contracts
I am working on an iOS app that will be privately distributed to an organization. The organization has a single Organization Identifier we use in App Store Connect for submission. However, our company handles two "branches" of the same organization separately, with different contracts, agreements, projects, etc. Our app would need to be tailored to both branches. The core functionality of the app would remain largely the same for both clients, but each would ultimately contain its own unique content. Sharing code between targets seems like an obvious choice; however, the situation is interesting because we will likely need to add authentication to our app and thus restrict users to a particular "version" of it. Moreover, certain users within the organization may be restricted to viewing only the content of a single branch, while other users might need to interact with both branches. Essentially, we may need two very similar apps to service the organization adequately, but it may be possible to achieve this with only a single app. How should we go about providing our app to our client? Should we:

- create a new project, extract as much code as possible into Swift packages/frameworks, and submit each project as a separate app?
- create multiple targets for each "version" of the app and distribute those apps separately?
- submit a single app, using a single target, that dynamically changes according to the credentials of the user? For example, if user X can view A and B, the app will function such that A and B are visible.
0
0
685
Jul ’21
Including Intellectual Property Information in an App
I am working on an app, to be distributed to a business, that includes patented technologies (used legally; I am working with the owner of the patents). We have been advised to have the patent numbers visible within the app, along with a description of the patents. Where is information like this best displayed in an app? We are trying to strike a balance between making it clear that some of the functionality within the app is backed by patents and not interfering with the main UI for our eventual day-to-day users.
0
0
689
Jul ’21
UISplitViewController not hiding master view controller in all situations
I have an application that uses a UISplitViewController as its window's root view controller. The app was created from what was the Master-Detail Xcode template. The master VC is a UITableViewController which, when one of its cells is pressed, "hides" itself using `self.splitViewController?.hide(.primary)`. We've updated the VC to be a double-column style split view controller introduced in iOS 14. The method does hide the primary column most of the time; however, there are two cases where our master view controller fails to be dismissed with this animation:

- Portrait mode on the iPhone
- The Zoomed "Display Zoom" setting in iPhone landscape mode

We have not had any issues with the iPad. The documentation for the hide(_:) method reads: "When you call this method, the split view interface transitions to the closest display mode available for the current split behavior where the specified column is hidden." Clearly, though, there are conditions under which the primary column isn't hidden by this method. I have searched around for solutions for hiding the master view controller, but most are lacking in relevance, either because they are many years old (sometimes 10+) or because they don't apply to Swift or to iOS 14 with the new split view controller styles. Why is the column not always hiding, and how can I ensure that the master view controller disappears?
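For what it's worth, here is the selection handler with the compact-width fallback I am currently experimenting with (my own guess at a workaround, not a confirmed fix):

```swift
// In the master (primary) table view controller.
override func tableView(_ tableView: UITableView, didSelectRowAt indexPath: IndexPath) {
    guard let splitVC = splitViewController else { return }
    if splitVC.isCollapsed {
        // In compact width (iPhone portrait, Zoomed landscape) the interface
        // is collapsed, so hiding a column appears to have no visible effect;
        // showing the secondary column pushes it onto the navigation stack.
        splitVC.show(.secondary)
    } else {
        splitVC.hide(.primary)
    }
}
```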
0
0
726
Jul ’21
vImage vs CoreImage vs MetalPerformanceShaders strengths and weaknesses
While the above three frameworks (viz. vImage, CoreImage, and MetalPerformanceShaders) serve different overall purposes, what are the strengths and weaknesses of each of the three in terms of image-processing performance? Any of the three seems to be highly performant; but where does each framework shine?
1
3
1.2k
Jan ’22
DocC Documentation for targets other than static and dynamic libraries
Is it possible to add DocC documentation to a target that does not produce either a static library or a framework? It doesn't yet appear to be a feature of DocC. If not, will there be support in the future for adding documentation to "regular" Xcode projects that don't produce a static library or framework? I think it could be useful to have documentation for larger apps that may use multiple frameworks in complex ways.
2
0
1.5k
Jan ’22
Will TSAN or the Swift compiler identify possible Swift async-await race conditions?
I haven't followed the Swift forums very closely, so perhaps there is news buried deep somewhere mentioning this. Will the Swift compiler, and/or TSAN at runtime, eventually be able to identify possible race conditions associated with Swift async-await (excluding data races that are "erased" by async-await)? I suppose this could equate to proving a function is reentrant in some scenarios (from a compiler's perspective, though I'm not knowledgeable about compilers). Consider, e.g., the scenario described in "Protect Mutable State with Swift Actors" around 9:15, where Dario talks about actor reentrancy with the cache for the image URL.
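To illustrate the kind of hazard I mean, here is a sketch modeled loosely on that talk's example (the types and names are my own):

```swift
import Foundation

struct Image {} // placeholder type for illustration

actor ImageDownloader {
    private var cache: [URL: Image] = [:]

    func image(for url: URL) async throws -> Image {
        if let cached = cache[url] {
            return cached
        }
        // `await` is a suspension point: the actor may interleave another call
        // to image(for:) here. There is no data race (the actor serializes
        // access to `cache`), but there is a logical race: two callers can
        // both miss the cache and download the same URL twice.
        let image = try await downloadImage(from: url)
        cache[url] = image
        return image
    }

    private func downloadImage(from url: URL) async throws -> Image {
        // Stand-in for a real network fetch.
        try await Task.sleep(nanoseconds: 100_000_000)
        return Image()
    }
}
```

TSAN presumably stays quiet here because there is no memory-level data race; my question is whether any tool will eventually flag the higher-level reentrancy issue.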
0
0
649
Jan ’22