
Controlling simdgroup structure in planar compute dispatches
In the "Discover advances Metal for A15 Bionic" Tech Talk right around the 20:00 mark, the presenter (Katelyn Hinson) says: The output image is split into a set of SIMD groups, where each SIMD group is a 4-by-8 chunk, [with] each thread writing to a single output. Supposing that we know the simdgroup will contain 32 threads (which they mention in the talk is true for Apple Silicon), is the only way to ensure that the threads in each simdgroup will be arranged into a 4 x 8 chunk to perform a dispatch with threadgroups that have a width dividing the number of threads per simdgroup? I can't think of another way to control the shape of a simdgroup directly within threadgroups since there is no explicit API to do so. For example, if we perform a dispatchThreadgroups(_:threadsPerThreadgroup:) with a threadgroup size of 8 x 8 to attempt to recreate the visuals in the presentation, wouldn't the resulting simdgroup shape be an 8 x 4 region and not a 4 x 8 region? The assumptions made in the video about where to sample the source texture and which shuffle functions to use are heavily influenced by the shape of the simdgroup. I'm trying to implement a similar reduction but I'm currently figuring out how to shape each simdgroup. If we don't know whether the simdgroup is 32 threads (I believe it's possible simdgroups have 64 threads?). What would be a reliable way to control the structure of the simdgroups? I believe if we always ensure that the width of the threadgroup divides the number of threads in the simdgroup we should get the behavior that we want, but I'm looking to confirm this logic. IIRC, simdgroups will always have a multiple of 8 threads (maybe it was only 4?), so perhaps a width of 8 (or 4) would always suffice for the threadgroup and you could specify a height of computePipelineState.maxTotalThreadsPerThreadgroup / 4 for example. Finally, must we only use uniform threadgroups (viz. we couldn't use dispatchThreads(_:threadsPerThreadgroup:)) for reliable results? I'm thinking that non-uniform threadgroups would again violate our assumptions about the simdgroup shape
Replies: 0 · Boosts: 1 · Views: 469 · Jun ’22
Difference between `thread_execution_width` and `threads_per_simdgroup`
I have a compute kernel that makes use of simdgroup operations such as simd_shuffle_up, simd_or, etc., and I'm looking to rewrite the kernel to support older hardware. One such computation requires that I know the index of the thread in the simdgroup (thread_index_in_simdgroup). I was hoping to derive it from the thread's position in its threadgroup (thread_position_in_threadgroup) and the thread execution width (thread_execution_width), along with other knowledge about the size of the threadgroup, when I noticed there was also the threads_per_simdgroup attribute. The spec describes the two respectively as:

thread_execution_width: The execution width of the compute unit.
threads_per_simdgroup: The thread execution width of a SIMD-group.

Under what conditions, if any, could these two values differ? If they do differ, is there a way to determine a thread's position in the simdgroup on hardware that doesn't support Metal 2.2?
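For what it's worth, this is the derivation I had in mind, a sketch assuming threads fill simdgroups in linear order (x varying fastest) within a uniform threadgroup; the kernel name is hypothetical:

```metal
#include <metal_stdlib>
using namespace metal;

kernel void myReduction(ushort2 positionInGroup [[thread_position_in_threadgroup]],
                        ushort2 threadsPerGroup [[threads_per_threadgroup]],
                        ushort  execWidth       [[thread_execution_width]])
{
    // Linearize the 2D position (x varies fastest), then take the remainder
    // by the execution width to recover the presumed lane index.
    ushort linearIndex = positionInGroup.y * threadsPerGroup.x + positionInGroup.x;
    ushort laneIndex   = linearIndex % execWidth;
    // ... use laneIndex in place of thread_index_in_simdgroup ...
}
```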
Replies: 1 · Boosts: 1 · Views: 577 · Jun ’22
UITextInput autocorrection: where do we receive text replaced by the system?
I have an implementation of UITextInput that is used to implement note taking with text. We have a custom UIMenuItem that lists suggested text replacements for a misspelled word, which a user can interact with to fix the word. This works well on iPhone and iPad, where the only path for changing text is via this menu.

On Mac Catalyst, however, the system also presents text replacement options with the best replacement; and when users attempt to replace text with the menu options provided by the system, our UITextInput handler seems to only receive a call to setSelectedTextRange: (the code is in Objective-C). I would expect a call to, for example, replaceRange:withText: after an autocorrection is made.

Any ideas what we could be implementing incorrectly? I.e., how can we receive the text that the system attempts to replace?
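For context, here is a trimmed sketch of the relevant UITextInput methods in our implementation, instrumented with logging to show which entry point Catalyst actually hits; the backing-store details are omitted and applyEdit:inRange: is our own (hypothetical) helper:

```objc
// UITextInput conformance (simplified). On iPhone/iPad, menu-driven fixes
// arrive via -replaceRange:withText:. On Catalyst, the system's replacement
// menu appears to trigger only the selection setter below.
- (void)replaceRange:(UITextRange *)range withText:(NSString *)text {
    NSLog(@"replaceRange:withText: %@", text);
    [self applyEdit:text inRange:range]; // our own helper
}

- (void)setSelectedTextRange:(UITextRange *)selectedTextRange {
    NSLog(@"setSelectedTextRange: %@", selectedTextRange);
    _selectedTextRange = selectedTextRange;
}
```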
Replies: 0 · Boosts: 0 · Views: 519 · Apr ’22
Unexpected behavior for shared MTLBuffer during CPU work
I have an image processing pipeline that performs some work on the CPU after the GPU processes a texture and then writes its result into a shared buffer (i.e. storageMode = .shared) used by the CPU for its computation. After the CPU does its work, it similarly writes at a different offset into the same shared MTLBuffer object. The buffer is arranged like so:

uint | uint | ... | uint | float
offsets (contiguous): 0 | ...

where the floating-point slot is written into by the CPU and later used by the GPU in subsequent compute passes. I haven't been able to explain or find documentation on the following strange behavior. The compute pipeline with the above buffer (call it bufferA) is as follows (without the force unwraps):

```swift
let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let commandBuffer = commandQueue.makeCommandBuffer()!
let sharedEvent = device.makeSharedEvent()!
let sharedEventQueue = DispatchQueue(label: "my-queue")
let sharedEventListener = MTLSharedEventListener(dispatchQueue: sharedEventQueue)

// Compute pipeline
kernelA.encode(commandBuffer: commandBuffer,
               sourceTexture: sourceTexture,
               destinationBuffer: bufferA)

commandBuffer.encodeCPUExecution(for: sharedEventObject, listener: sharedEventListener) { [self] in
    var value = Float(0.0)
    bufferA.unsafelyWrite(&value, offset: Self.targetBufferOffset)
}

kernelB.setTargetBuffer(histogramBuffer, offset: Self.targetBufferOffset)
kernelB.encode(commandBuffer: commandBuffer,
               sourceTexture: sourceTexture,
               destinationTexture: destinationTexture)
```

Note that commandBuffer.encodeCPUExecution is simply a convenience function around the shared event object (encodeSignalEvent and encodeWaitEvent) that signals and waits on event.signaledValue + 1 and event.signaledValue + 2, respectively.

In the example above, kernelB does not see the writes made during the CPU execution. It can, however, see the values written into the buffer from kernelA. The strange part: if you write to that same location in the buffer before the GPU schedules this work (e.g. during the encoding instead of in the middle of the GPU execution, or whenever before), kernelB does see the values written by the CPU.

This is odd behavior that to me suggests there is undefined behavior. If the buffer were .managed I could understand the behavior, since changes on each side must be made explicit; but with a .shared buffer this behavior seems quite unexpected, especially considering that the CPU can read the values written by the preceding kernel (viz. kernelA).

What explains this strange behavior with Metal?

Note: This behavior occurs on an M1 Mac running Mac Catalyst and an iPad Pro (5th generation) running iOS 15.3.
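For completeness, here is roughly what encodeCPUExecution does, sketched from the description above; the exact bookkeeping in my helper may differ:

```swift
import Metal

extension MTLCommandBuffer {
    // Sketch of the convenience described above: the GPU signals
    // `base + 1`, the CPU block runs and bumps the value once more,
    // and the GPU waits on `base + 2` before executing later work.
    func encodeCPUExecution(for event: MTLSharedEvent,
                            listener: MTLSharedEventListener,
                            _ work: @escaping () -> Void) {
        let base = event.signaledValue
        encodeSignalEvent(event, value: base + 1)
        event.notify(listener, atValue: base + 1) { event, _ in
            work()
            event.signaledValue = base + 2
        }
        encodeWaitEvent(event, value: base + 2)
    }
}
```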
Replies: 4 · Boosts: 0 · Views: 1.2k · Feb ’22
Metal Quadgroups Example Usage
I've started working with simdgroups, and as I was looking through the MSL documentation I noticed that there exist, in addition to simdgroups, quadgroups. The shading language documentation merely states:

A quad-group function is a SIMD-group function (see section 6.9.2) with an execution width of 4.

However, there doesn't appear to be a clear reason for using quadgroups over simdgroups, and I have yet to find demonstrations of using quadgroups within a compute kernel. What are quadgroups, and how are they used in conjunction with, or in place of, simdgroups?
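To make the question concrete, here is the kind of usage I'm imagining but haven't seen demonstrated: a sketch (kernel and texture bindings hypothetical) that folds values across a quadgroup's four lanes, assuming the dispatch arranges each quadgroup's threads over the 2x2 block of pixels being reduced:

```metal
#include <metal_stdlib>
using namespace metal;

// Hypothetical sketch: average four neighboring samples across a quadgroup.
kernel void quadgroupReduce(texture2d<float, access::read>  src [[texture(0)]],
                            texture2d<float, access::write> dst [[texture(1)]],
                            uint2  gid  [[thread_position_in_grid]],
                            ushort lane [[thread_index_in_quadgroup]])
{
    float4 value = src.read(gid);
    value += quad_shuffle_xor(value, 1); // exchange with the horizontal neighbor lane
    value += quad_shuffle_xor(value, 2); // exchange with the vertical neighbor lane
    if (lane == 0) {
        dst.write(value * 0.25f, gid / 2); // write the average once per quad
    }
}
```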
Replies: 1 · Boosts: 1 · Views: 677 · Jan ’22
Will TSAN or the Swift compiler identify possible Swift async-await race conditions?
I haven't followed the Swift forums very closely, so perhaps there is news buried deep somewhere mentioning this. Will the Swift compiler, and/or TSan at runtime, in the future be able to identify possible race conditions associated with Swift async/await (excluding data races that are "erased" by async/await)? I suppose this could equate to proving a function is reentrant in some scenarios (from a compiler's perspective, though I'm not knowledgeable about compilers). Consider, e.g., the scenario described in "Protect Mutable State with Swift Actors" around 9:15, where Dario talks about actor reentrancy with the cache for the image URL.
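For reference, the hazard in question looks roughly like this, a sketch in the spirit of the talk's downloader example (the names are mine, not the talk's exact code):

```swift
import UIKit

actor ImageDownloader {
    private var cache: [URL: UIImage] = [:]

    func image(from url: URL) async throws -> UIImage {
        if let cached = cache[url] {
            return cached
        }
        // Suspension point: the actor can service other calls here, so
        // another task may download and cache the same URL concurrently.
        let (data, _) = try await URLSession.shared.data(from: url)
        let image = UIImage(data: data) ?? UIImage()
        // By the time we resume, cache[url] may already hold a (possibly
        // different) image; this unconditional write silently replaces it.
        cache[url] = image
        return image
    }
}
```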
Replies: 0 · Boosts: 0 · Views: 595 · Jan ’22
DocC Documentation for targets other than static and dynamic libraries
Is it possible to add DocC documentation to a target that does not result in either a static library or a framework? It doesn't yet appear to be a feature of DocC. If not, will there be support in the future for adding documentation to "regular" Xcode projects that don't produce a static library or framework? I think it could be useful to have documentation for larger apps that may use multiple frameworks in complex ways.
Replies: 2 · Boosts: 0 · Views: 1.4k · Jan ’22
vImage vs CoreImage vs MetalPerformanceShaders strengths and weaknesses
While the above three frameworks (viz. vImage, CoreImage, and MetalPerformanceShaders) serve different overall purposes, what are the strengths and weaknesses of each of the three frameworks in terms of performance with respect to image processing? It seems that any of the three frameworks is highly performant; but where does each framework shine?
Replies: 1 · Boosts: 3 · Views: 1.1k · Jan ’22
UISplitViewController not hiding master view controller in all situations
I have an application that uses a UISplitViewController as its window's root view controller. The app used what was the Master-Detail Xcode template when it was made. The master VC is a UITableViewController which, when one of its cells is pressed, "hides" itself using self.splitViewController?.hide(.primary). We've updated the VC to be a double-column-style split view controller introduced in iOS 14.

The method does hide the primary column most of the time; however, there are two cases where our master view controller fails to be dismissed with this animation:

- Portrait mode on the iPhone
- Using the Zoomed "Display Zoom" setting on iPhone in landscape mode

We have not had any issues with the iPad. The documentation for the hide(_:) method reads:

When you call this method, the split view interface transitions to the closest display mode available for the current split behavior where the specified column is hidden.

Clearly, though, there are conditions under which the primary column isn't hidden with this method. I have searched around for solutions about hiding the master view controller, but most are lacking in relevance, either because they are many years old (sometimes 10+) and/or not applicable to Swift or iOS 14 with the new split view controller styles.

Why is the column not always hiding, and how can I ensure that the master view controller disappears?
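One thing I've been experimenting with, sketched below, is branching on the collapsed state, on the assumption that iPhone portrait (and zoomed landscape) run the split view collapsed, where there is no separate primary column to hide; the selection-handler shape is hypothetical:

```swift
func tableView(_ tableView: UITableView, didSelectRowAt indexPath: IndexPath) {
    guard let splitViewController = splitViewController else { return }
    if splitViewController.isCollapsed {
        // When the interface is collapsed (e.g. iPhone portrait), the columns
        // share one navigation stack, so reveal the secondary column instead.
        splitViewController.show(.secondary)
    } else {
        splitViewController.hide(.primary)
    }
}
```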
Replies: 0 · Boosts: 0 · Views: 650 · Jul ’21
Including Intellectual Property Information in an App
I am working on an app, to be distributed to a business, that includes patented technologies (legally; I am working with the owner of the patents). We have been advised to have the patent numbers visible within the app, along with a description of the patents. Where is information like this best displayed in an app? We are trying to strike a balance between making it clear that some of the functionality within the app is backed by patents and not interfering with the main UI for our eventual day-to-day users.
Replies: 0 · Boosts: 0 · Views: 640 · Jul ’21
Distributing two variants of an iOS app to a single business entity managed under two separate contracts
I am working on an iOS app that will be privately distributed to an organization. The organization has a single Organization Identifier we use in App Store Connect for submission. However, our company handles two "branches" of the same organization separately, with different contracts, agreements, projects, etc. Our app would need to be tailored to both branches. The core functionality of the app would remain largely the same for both clients, but each would ultimately contain its own unique content.

Sharing code between targets seems like an automatic choice; however, the situation is interesting because we will likely need to add authentication into our app and thus restrict users to a particular "version" of our app. Moreover, certain users within the organization may be restricted to viewing only the content of a single branch, while other users might need to interact with both branches. Essentially, we may need two very similar apps to service the organization adequately. But it may be possible to achieve this with only a single app.

How should we go about providing our app to our client? Should we:

- create a new project, extract as much code as possible into Swift packages/frameworks, and submit each project as a separate app?
- create multiple targets for each "version" of the app and distribute those apps separately?
- submit a single app by having our app dynamically change according to the credentials of the user, using a single target? For example, if user X can view A and B, the app will function such that A and B are visible.
Replies: 0 · Boosts: 0 · Views: 605 · Jul ’21
Metal Sample Code in Swift?
I've noticed that all of the sample Metal code I've downloaded, from the basic "Creating and Sampling from Textures" to "Modern Rendering with Metal," is written in Objective-C. I'm hopeful that someday one of the demo projects will be written in Swift, since that's what I've used to write my Metal apps, and I'm looking for some interesting uses of the Swift language as it relates to working with Metal. I understand the Objective-C provided, but it would be neat if the samples were written in Swift. Will there ever be a sample project in Swift that uses Metal? Perhaps released with WWDC 2021?
Replies: 3 · Boosts: 0 · Views: 2.6k · May ’21
Storage of `ray_data` in ray tracing payload
This is a duplicate of my StackOverflow post linked here: https://stackoverflow.com/questions/67336596/storage-of-ray-data-in-ray-tracing-payload

I am currently working with Metal's ray tracing API. I remembered that I could pass data from an intersection function to the compute kernel that started the ray intersection process. After rewatching the WWDC 2020 talk "Discover ray tracing with Metal" by Sean James (linked here), I found the relevant section around 16:13 where he talks about the ray payload. However, I was curious where this payload is stored as it is passed to the intersection function. When declared with the relevant [[payload]] attribute in the intersection function, it must be in the ray_data address space. According to the Metal Shading Language Specification (version 2.3), pg. 64, the data passed into the intersection function is copied into the ray_data address space and is copied back out once the intersection function returns. However, this doesn't specify whether, e.g., the data is stored in tile memory (like data in the threadgroup address space is) or in per-thread memory (the thread address space). The video did not specify this either. In fact, the declarations for the intersect function (see pg. 204) that include the payload term are in the thread address space (which makes sense).

So where does the copied ray_data "version" of the data, stored in the thread address space in the kernel, go?
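For reference, the pattern in question looks like this, a sketch assuming a triangle intersection function with a simple payload struct (the names are mine):

```metal
#include <metal_stdlib>
using namespace metal;

struct RayPayload {
    float3 color;
    uint hitCount;
};

// The kernel declares the payload as a local (thread address space) variable
// and passes it to intersect(); the intersection function receives a copy
// in the ray_data address space via the [[payload]] attribute.
[[intersection(triangle, triangle_data)]]
bool countHit(float2 barycentric [[barycentric_coord]],
              ray_data RayPayload &payload [[payload]])
{
    payload.hitCount += 1; // mutates the ray_data copy, written back on return
    return true;
}
```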
Replies: 4 · Boosts: 0 · Views: 932 · Apr ’21
Post-tessellation Vertex Function and Raytracing: Getting More Detailed Geometries for Acceleration Structures
I have recently gained some interest in the raytracing API provided by the Metal framework. I understand that you can attach a vertex buffer to a geometry descriptor that Metal will use to create the acceleration structure later (on a MTLPrimitiveAccelerationStructureDescriptor instance, for example). This made me wonder if it were possible to write the output of the tessellator into a separate vertex buffer from the post-tessellation vertex shader and pass that along to the raytracer. I thought that perhaps you could get more detailed geometry and still render without rasterization. For example, I might have the following simple post-tessellation vertex function:

```metal
// Control Point struct
struct ControlPoint {
    float4 position [[attribute(0)]];
};

// Patch struct
struct PatchIn {
    patch_control_point<ControlPoint> control_points;
};

// Vertex-to-Fragment struct
struct FunctionOutIn {
    float4 position [[ position ]];
    half4  color    [[ flat ]];
};

[[patch(triangle, 3)]]
vertex FunctionOutIn tessellation_vertex_triangle(PatchIn patchIn [[stage_in]],
                                                  float3 patch_coord [[ position_in_patch ]])
{
    // Barycentric coordinates
    float u = patch_coord.x;
    float v = patch_coord.y;
    float w = patch_coord.z;

    // Convert to cartesian coordinates
    float x = u * patchIn.control_points[0].position.x +
              v * patchIn.control_points[1].position.x +
              w * patchIn.control_points[2].position.x;
    float y = u * patchIn.control_points[0].position.y +
              v * patchIn.control_points[1].position.y +
              w * patchIn.control_points[2].position.y;

    // Output
    FunctionOutIn vertexOut;
    vertexOut.position = float4(x, y, 0.0, 1.0);
    vertexOut.color = half4(u, v, w, 1.0h);
    return vertexOut;
}
```

However, the same function doesn't compile when given an extra device buffer parameter to write the tessellated vertices into, where outPutBuffer would be some struct* (not void*). I noticed that the function also doesn't compile when I don't use the data in the control points as output, like so:

```metal
[[patch(triangle, 3)]]
vertex FunctionOutIn tessellation_vertex_triangle(PatchIn patchIn [[stage_in]],
                                                  float3 patch_coord [[ position_in_patch ]])
{
    // Barycentric coordinates
    float u = patch_coord.x;
    float v = patch_coord.y;
    float w = patch_coord.z;

    // Convert to cartesian coordinates
    float x = u * patchIn.control_points[0].position.x +
              v * patchIn.control_points[1].position.x +
              w * patchIn.control_points[2].position.x;
    float y = u * patchIn.control_points[0].position.y +
              v * patchIn.control_points[1].position.y +
              w * patchIn.control_points[2].position.y;

    // Output
    FunctionOutIn vertexOut;

    // Does not use x or y (and therefore the `patch_control_point<T>`'s values
    // are not used as output into the rasterizer)
    vertexOut.position = float4(1.0, 1.0, 0.0, 1.0);
    vertexOut.color = half4(1.0h, 1.0h, 1.0h, 1.0h);
    return vertexOut;
}
```

I looked at the patch_control_point<T> template that was publicly exposed but didn't see anything enforcing this. What is going on here?

In particular, how would I go about increasing the quality of the geometry fed into the raytracer? Would I simply have to use more complex assets? Tessellation has its place in the rasterization pipeline, but can it be used elsewhere? Of course, this would leave a much larger memory footprint if we were storing the tessellated patches.
Replies: 0 · Boosts: 0 · Views: 685 · Mar ’21
Threadgroup memory write then read without barrier
I posted this question to StackOverflow. Perhaps it is better suited here, where Apple developers are more likely to see it.

I was looking through the project linked on the page "Selecting Device Objects for Compute Processing" in the Metal documentation (linked here: https://developer.apple.com/documentation/metal/gpu_selection_in_macos/selecting_device_objects_for_compute_processing). There, I noticed a clever use of threadgroup memory that I am hoping to adopt in my own particle simulator. However, before I do so, I need to understand a particular aspect of threadgroup memory and what the developers are doing in this scenario. The code contains a segment like so:

```metal
// In AAPLKernels.metal

// Parameter of the kernel
threadgroup float4* sharedPosition [[threadgroup(0)]]

// Body
...

// For each particle / body
for(i = 0; i < params.numBodies; i += numThreadsInGroup)
{
    // Because sharedPosition uses the threadgroup address space, 'numThreadsInGroup' elements
    // of sharedPosition will be initialized at once (not just one element at lid as it
    // may look like)
    sharedPosition[threadInGroup] = oldPosition[sourcePosition];

    j = 0;

    while(j < numThreadsInGroup)
    {
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
        acceleration += computeAcceleration(sharedPosition[j++], currentPosition, softeningSqr);
    } // while

    sourcePosition += numThreadsInGroup;
} // for
```

In particular, I found the comment just before the assignment to sharedPosition (the one starting with "Because...") confusing. I haven't read anywhere that threadgroup memory writes happen on all threads in the same threadgroup simultaneously; in fact, I thought a barrier would be needed before reading from the shared memory pool again to avoid undefined behavior, since each thread subsequently reads from the entire pool of threadgroup memory after the assignment (the assignment being a write, of course). Why is a barrier unnecessary here?
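For contrast, this is the pattern I expected to see: a fragment (in the sample's own terms, not a complete kernel) with explicit barriers around the shared writes and reads:

```metal
// Each thread stages one element into the shared pool...
sharedPosition[threadInGroup] = oldPosition[sourcePosition];

// ...then waits until every thread in the threadgroup has finished its write
// before any thread reads the whole pool.
threadgroup_barrier(mem_flags::mem_threadgroup);

for (uint j = 0; j < numThreadsInGroup; ++j) {
    acceleration += computeAcceleration(sharedPosition[j], currentPosition, softeningSqr);
}

// A second barrier keeps the next iteration's writes from racing ahead
// of slower threads that are still reading.
threadgroup_barrier(mem_flags::mem_threadgroup);
```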
Replies: 1 · Boosts: 0 · Views: 919 · Feb ’21