In the "Discover advances Metal for A15 Bionic" Tech Talk right around the 20:00 mark, the presenter (Katelyn Hinson) says:
The output image is split into a set of SIMD groups, where each SIMD group is a 4-by-8 chunk, [with] each thread writing to a single output.
Supposing that we know the simdgroup will contain 32 threads (which they mention in the talk is true for Apple Silicon), is the only way to ensure that the threads in each simdgroup will be arranged into a 4 x 8 chunk to perform a dispatch with threadgroups that have a width dividing the number of threads per simdgroup? I can't think of another way to control the shape of a simdgroup directly within threadgroups since there is no explicit API to do so.
For example, if we perform a dispatchThreadgroups(_:threadsPerThreadgroup:) with a threadgroup size of 8 x 8 to attempt to recreate the visuals in the presentation, wouldn't the resulting simdgroup shape be an 8 x 4 region and not a 4 x 8 region?
The assumptions made in the video about where to sample the source texture and which shuffle functions to use are heavily influenced by the shape of the simdgroup. I'm trying to implement a similar reduction but I'm currently figuring out how to shape each simdgroup.
And if we don't know whether a simdgroup contains 32 threads (I believe simdgroups can have 64 threads on some hardware?), what would be a reliable way to control the structure of the simdgroups? I believe that if we always ensure the width of the threadgroup divides the number of threads in the simdgroup, we should get the behavior we want, but I'm looking to confirm this logic.
IIRC, simdgroups always have a multiple of 8 threads (or maybe it was only 4?), so perhaps a width of 8 (or 4) would always suffice for the threadgroup, and you could specify a height of computePipelineState.maxTotalThreadsPerThreadgroup divided by that width, for example. Finally, must we only use uniform threadgroups (viz. we couldn't use dispatchThreads(_:threadsPerThreadgroup:)) for reliable results? I'm thinking that non-uniform threadgroups would again violate our assumptions about the simdgroup shape.
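To make the question concrete, here is a minimal sketch of the dispatch I have in mind, assuming threads are linearized row-major within the threadgroup (so a 32-thread simdgroup would cover an 8 x 4 block); pipelineState, encoder, and outputTexture are just placeholders:

import Metal

// Sketch: use a threadgroup width of 8, which should divide any simdgroup size,
// and derive the height from the pipeline's maximum threadgroup size.
func dispatchRowMajor(encoder: MTLComputeCommandEncoder,
                      pipelineState: MTLComputePipelineState,
                      outputTexture: MTLTexture) {
    let width = 8
    let height = pipelineState.maxTotalThreadsPerThreadgroup / width
    let threadsPerThreadgroup = MTLSize(width: width, height: height, depth: 1)

    // Uniform threadgroups only: round the grid up so no partial threadgroups
    // (and hence, presumably, no partial simdgroups) appear at the edges.
    let threadgroups = MTLSize(width: (outputTexture.width + width - 1) / width,
                               height: (outputTexture.height + height - 1) / height,
                               depth: 1)

    encoder.setComputePipelineState(pipelineState)
    encoder.dispatchThreadgroups(threadgroups, threadsPerThreadgroup: threadsPerThreadgroup)
}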
I have a compute kernel that makes use of simdgroup operations such as simd_shuffle_up, simd_or, etc., and I'm looking to rewrite the kernel to support older hardware. One such computation requires that I know the index of the thread within its simdgroup (thread_index_in_simdgroup). I was hoping to derive it from the thread's position in its threadgroup (thread_position_in_threadgroup) and the thread execution width (thread_execution_width), along with other knowledge about the size of the threadgroup, when I noticed there is also the threads_per_simdgroup attribute. The spec describes the two, respectively, as
thread_execution_width: The execution width of the compute unit.
threads_per_simdgroup: The thread execution width of a SIMD-group.
Under what conditions, if any, could these two values differ? If they do differ, is there a way to determine a thread's position in the simdgroup on hardware that doesn't support Metal 2.2?
I have an implementation of UITextInput that is used to implement note taking with text. We have a custom UIMenuItem that lists suggested text replacements for a misspelled word that a user can interact with to fix the word. This works well on iPhone and iPad where the only path for changing text is via this menu.
On Mac Catalyst, however, the system also presents text replacement options with the best replacement; and when users attempt to replace text with the menu options provided by the system, our UITextInput handler seems to receive only a call to setSelectedTextRange: (the code is in Objective-C). I would expect a call to, for example, replaceRange:withText: after an autocorrection is made.
Any ideas as to what we could be implementing incorrectly? I.e., how can we receive the text that the system attempts to replace?
I have an image processing pipeline that performs some work on the CPU after the GPU processes a texture and then writes its result into a shared buffer (i.e. storageMode = .shared) used by the CPU for its computation. After the CPU does its work, it similarly writes at a different offset into the same shared MTLBuffer object. The buffer is arranged as so:
uint | uint | ... | uint | float
offsets (contiguous): 0 | ...
where the floating point slot is written into by the CPU and later used by the GPU in subsequent compute passes.
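To make the offsets concrete, this is roughly how I compute the offset of the float slot; the count of 256 is just a placeholder for however many uint slots the kernel actually writes:

// Sketch of the layout: `uintCount` contiguous 32-bit uints followed by one Float.
let uintCount = 256                                                // placeholder count
let targetBufferOffset = uintCount * MemoryLayout<UInt32>.stride   // where the Float lives
let bufferLength = targetBufferOffset + MemoryLayout<Float>.stride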
I haven't been able to explain or find documentation on the following strange behavior. The compute pipeline that uses the above buffer (call it buffer A) looks like this (the force unwraps are just for brevity):
let device = MTLCreateSystemDefaultDevice()!
let commandQueue = device.makeCommandQueue()!
let commandBuffer = commandQueue.makeCommandBuffer()!
let sharedEvent = device.makeSharedEvent()!
let sharedEventQueue = DispatchQueue(label: "my-queue")
let sharedEventListener = MTLSharedEventListener(dispatchQueue: sharedEventQueue)
// Compute pipeline
kernelA.encode(commandBuffer: commandBuffer, sourceTexture: sourceTexture, destinationBuffer: bufferA)
commandBuffer.encodeCPUExecution(for: sharedEvent, listener: sharedEventListener) { [self] in
    var value = Float(0.0)
    bufferA.unsafelyWrite(&value, offset: Self.targetBufferOffset)
}
kernelB.setTargetBuffer(bufferA, offset: Self.targetBufferOffset)
kernelB.encode(commandBuffer: commandBuffer, sourceTexture: sourceTexture, destinationTexture: destinationTexture)
Note that commandBuffer.encodeCPUExecution is simply a convenience function around the shared event (using encodeSignalEvent(_:value:) and encodeWaitForEvent(_:value:)) that signals on event.signaledValue + 1 and waits on event.signaledValue + 2, respectively.
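For completeness, this is roughly what that convenience looks like; just a minimal sketch, under the assumption that nothing else touches event.signaledValue while the command buffer is in flight:

import Metal

extension MTLCommandBuffer {
    // Sketch of the convenience described above: the GPU signals `base + 1`,
    // the listener runs `work` on its dispatch queue, then signals `base + 2`
    // so the GPU can resume the commands encoded after the wait.
    func encodeCPUExecution(for event: MTLSharedEvent,
                            listener: MTLSharedEventListener,
                            work: @escaping () -> Void) {
        let base = event.signaledValue
        event.notify(listener, atValue: base + 1) { event, _ in
            work()
            event.signaledValue = base + 2
        }
        encodeSignalEvent(event, value: base + 1)
        encodeWaitForEvent(event, value: base + 2)
    }
}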
In the pipeline above, kernelB does not see the writes made during the CPU execution. It can, however, see the values written into the buffer by kernelA.
The strange part: if you write to that same location in the buffer before the GPU schedules this work (e.g. during encoding, or any time beforehand, instead of in the middle of GPU execution), kernelB does see the values written by the CPU.
This is odd behavior that suggests to me something here is undefined. If the buffer were .managed I could understand it, since changes on each side must be made explicit; but with a .shared buffer this behavior seems quite unexpected, especially considering that the CPU can read the values written by the preceding kernel (viz. kernelA).
What explains this strange behavior with Metal?
Note: this behavior occurs on an M1 Mac running Mac Catalyst and on an iPad Pro (5th generation) running iOS 15.3.
While the above three frameworks (viz. vImage, CoreImage, and MetalPerformanceShaders) serve different overall purposes, what are the strengths and weaknesses of each of the three in terms of image-processing performance? It seems that any of the three is highly performant; but where does each framework shine?
Is it possible to add DocC documentation to a target that does not result in either a static library or a framework? It doesn't yet appear to be a feature of DocC. If not, will there be support in the future to add documentation to "regular" Xcode projects that don't result in a static library or framework? I think it could be useful to have documentation for larger apps that may use multiple frameworks in complex ways
I've started working with simdgroups and as I was looking through the MSL documentation I noticed that there exists, in addition to simdgroups, quadgroups. The shading language documentation merely states that
A quad-group function is a SIMD-group function (see section 6.9.2) with an execution width of 4.
However, it doesn't appear there's a clear reason for using quadgroups over simdgroups, and I have yet to find demonstrations of using quadgroups within a compute kernel.
What are quadgroups and how are they used in conjunction with/replacement of simdgroups?
I haven't followed the Swift forums very closely, so perhaps there is news buried deep somewhere mentioning this.
Will the Swift compiler and/or TSan at runtime be able, in the future, to identify possible race conditions associated with Swift async/await (excluding data races that are "erased" by async/await)? I suppose this could equate to proving a function is reentrant in some scenarios (from a compiler's perspective, though I'm not knowledgeable about compilers)? Consider, e.g., the scenario described in "Protect Mutable State with Swift Actors" around 9:15, where Dario talks about actor reentrancy, with the cache for the image URL.
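For reference, a minimal sketch of the kind of interleaving I mean, loosely modeled on the cache example from that session (Image and downloadImage(from:) are placeholders, not the actual code from the talk):

import Foundation

struct Image {}   // placeholder for whatever the cache stores

actor ImageCache {
    private var cache: [URL: Image] = [:]

    // Placeholder download; in a real app this would hit the network.
    private func downloadImage(from url: URL) async throws -> Image { Image() }

    func image(for url: URL) async throws -> Image {
        if let cached = cache[url] { return cached }
        // Suspension point: the actor can service another image(for:) call here,
        // so two callers asking for the same URL may both download, and the
        // later write silently replaces the earlier one -- the reentrancy issue
        // the session describes.
        let downloaded = try await downloadImage(from: url)
        cache[url] = downloaded
        return downloaded
    }
}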
I am using a MTLSharedEvent to occasionally relay new information from the CPU to the GPU by writing into a MTLBuffer with storage mode .storageModeManaged within a block registered by the shared event (using the notify(_:atValue:block:) method of MTLSharedEvent, with a MTLSharedEventListener configured to be notified on a background dispatch queue). The process looks something like this:
let device = MTLCreateSystemDefaultDevice()!
let synchronizationQueue = DispatchQueue(label: "com.myproject.synchronization")

let sharedEvent = device.makeSharedEvent()!
let sharedEventListener = MTLSharedEventListener(dispatchQueue: synchronizationQueue)

// Updated only occasionally on the CPU (on user interaction). Mostly written to
// on the GPU
let managedBuffer = device.makeBuffer(length: 10, options: .storageModeManaged)!

var doExtraWork = true

func computeSomething(commandBuffer: MTLCommandBuffer) {

    // Do work on the GPU every frame
    // After writing to the buffer on the GPU, synchronize the buffer (required)
    let blitToSynchronize = commandBuffer.makeBlitCommandEncoder()!
    blitToSynchronize.synchronize(resource: managedBuffer)
    blitToSynchronize.endEncoding()

    // Occasionally, add extra information on the CPU
    if doExtraWork {

        // Register a block to write into the buffer
        sharedEvent.notify(sharedEventListener, atValue: 1) { event, value in

            // Safely write into the buffer. Make sure we call `didModifyRange(_:)` after

            // Update the counter so the GPU can resume
            event.signaledValue = 2
        }
        commandBuffer.encodeSignalEvent(sharedEvent, value: 1)
        commandBuffer.encodeWaitForEvent(sharedEvent, value: 2)
    }

    // Commit the work
    commandBuffer.commit()
}
The expected behavior is as follows:
The GPU does some work with the managed buffer
Occasionally, the information needs to be updated with new information on the CPU. In that frame, we register a block of work to be executed. We do so in a dedicated block because we cannot guarantee that, by the time execution on the main thread reaches this point, the GPU is not simultaneously reading from or writing to the managed buffer. Hence, it is unsafe to simply write to it immediately; we must first make sure the GPU is not doing anything with this data.
When the GPU schedules this command buffer to be executed, commands executed before the encodeSignalEvent(_:value:) call are executed and then execution on the GPU stops until the block increments the signaledValue property of the event passed into the block
When execution reaches the block, we can safely write into the managed buffer because we know the CPU has exclusive access to the resource. Once we've done so, we resume execution of the GPU
The issue is that it seems Metal is not calling the block when the GPU is executing the command, but rather *before* the command buffer is even scheduled. Worse, the system seems to "work" with the initial command buffer (the very first command buffer, before any other are scheduled).
I first noticed this issue when I looked at a GPU frame capture after my scene would vanish following a CPU update, which is where I saw that the GPU had NaNs all over the place. I then ran into this strange situation when I purposely waited on the background dispatch queue with a sleep(_:) call. Quite correctly, my shared resource semaphore (not shown, signaled in a completion block of the command buffer and waited on in the main thread) reached a value of -1 after committing three command buffers to the command queue (three being the number of recycled shared MTLBuffers holding scene uniform data, etc.). This suggests that the first command buffer has not finished executing by the time the CPU is more than three frames ahead, which is consistent with the sleep(_:) behavior. Again, what isn't consistent is the ordering: Metal seems to call the block before even scheduling the buffer. Further, in subsequent frames, Metal doesn't seem to care that the sharedEventListener block is taking so long, and it schedules the command buffer for execution even while the block is running, which finishes dozens of frames later.
This behavior is completely inconsistent with what I expect. What is going on here?
P.S.
There is probably a better way to periodically update a managed buffer whose contents are mostly modified on the GPU, but I have not yet found a way to do so. Any advice on this subject is appreciated as well. Of course, a triple-buffer system *could* work, but it would waste a lot of memory, as the managed buffer is quite large (whereas the shared buffers managed by the semaphore are quite small).
I have an application that uses a UISplitViewController as its window's root view controller. The app was created from what was the Master-Detail Xcode template at the time. The master VC is a UITableViewController which, when one of its cells is tapped, "hides" itself using
self.splitViewController?.hide(.primary)
We've updated the split view controller to use the double-column style introduced in iOS 14. The method does hide the primary column most of the time; however, there are two cases where our master view controller fails to be dismissed with this animation:
Portrait mode on the iPhone
Using the Zoomed "Display Zoom" on iPhone landscape mode
We have not had any issues with the iPad. The documentation for the hide(_:) method reads
When you call this method, the split view interface transitions to the closest display mode available for the current split behavior where the specified column is hidden.
Clearly, though, there are conditions under which the primary column isn't hidden by this method. I have searched around for solutions for hiding the master view controller, but most are lacking in relevance, either because they are many years old (sometimes 10+) and/or not applicable to Swift or to iOS 14 with the new split view controller styles.
Why is the column not always hiding, and how can I ensure that the master view controller disappears?
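For context, here is roughly the call site, plus the compact-width fallback I've been experimenting with; the fallback is only an assumption on my part, not something the documentation prescribes:

import UIKit

extension UITableViewController {
    // Hypothetical helper called from tableView(_:didSelectRowAt:).
    func dismissPrimaryColumn() {
        guard let split = splitViewController else { return }
        if split.isCollapsed {
            // Compact width (iPhone portrait, zoomed Display Zoom): the split view
            // is collapsed onto a single navigation stack, so hiding the primary
            // column appears to be a no-op; showing the secondary column is my
            // experimental fallback here.
            split.show(.secondary)
        } else {
            // Regular width: this animates the primary column away as expected.
            split.hide(.primary)
        }
    }
}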
I am working on an app that will be distributed to a business that includes patented technologies (legally; I am working with the owner of the patents). We have been advised to have the patent numbers visible within the app, along with a description of the patents.
Where is information like this best displayed in an app?
We are trying to strike a balance between making it clear that some of the functionality within the app is backed by patents and not interfering with the main UI for our eventual day-to-day users.
I am working on an iOS app that will be privately distributed to an organization. The organization has a single Organization Identifier we use in App Store Connect for submission. However, our company handles two "branches" of the same organization separately, with different contracts, agreements, projects, etc. Our app would need to be tailored to both branches. The core functionality of the app would remain largely the same for both clients; but each would ultimately contain its own unique content.
Sharing code between targets seems like a given; however, the situation is interesting because we will likely need to add authentication to our app and thus restrict users to a particular "version" of it. Moreover, certain users within the organization may be restricted to viewing the content of only a single branch, while other users might need to interact with both branches.
Essentially, we may need two very similar apps to service the organization adequately. But it may be possible to achieve this with only a single app.
How should we go about providing our app to our client?
Should we
create a new project, extract as much code as possible into Swift packages/frameworks, and submit each project as a separate app?
create multiple targets, one for each "version" of the app, and distribute those apps separately?
submit a single app by having our app dynamically change according to the credentials of the user using a single target? For example, if user X can view A and B, the app will function such that A and B are visible?
I've noticed that all of the sample Metal code I've downloaded, from the basic "Creating and Sampling from Textures" to "Modern Rendering with Metal," is written in Objective-C. I'm hopeful that someday one of the demo projects will be written in Swift, since that's what I've used to write my Metal apps, and I'm looking for interesting uses of the Swift language as it relates to working with Metal. I understand the Objective-C that's provided, but it would be neat if the samples were written in Swift.
Will there ever be a sample project in Swift that uses Metal? Perhaps one released with WWDC 2021?
This is a duplicate of my StackOverflow post linked here - https://stackoverflow.com/questions/67336596/storage-of-ray-data-in-ray-tracing-payload
I am currently working with Metal's ray tracing API. I remembered that I could pass data from an intersection function to the compute kernel that started the ray intersection process. After rewatching the WWDC 2020 talk "Discover ray tracing with Metal" by Sean James, I found the relevant section around 16:13 where he talks about the ray payload.
However, I was curious where this payload is stored as it is passed to the intersection function. When declared with the [[ payload ]] attribute in the intersection function, it must be in the ray_data address space. According to the Metal Shading Language Specification (version 2.3), pg. 64, the data passed into the intersection function is copied into the ray_data address space and copied back out once the intersection function returns. However, this doesn't specify whether, e.g., the data is stored in tile memory (like data in the threadgroup address space is) or in per-thread memory (the thread address space). The video did not specify this either.
In fact, the declarations of the intersect function (see pg. 204) that include the payload parameter take it in the thread address space (which makes sense).
So where does the ray_data "copy" of the data, which lives in the thread address space in the kernel, actually go?
I have recently gained some interest in the raytracing API provided by the Metal framework. I understand that you can attach a vertex buffer to a geometry descriptor that Metal will use to create the acceleration structure later (on a MTLPrimitiveAccelerationStructureDescriptor instance for example).
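For reference, here is roughly the host-side setup I mean; all of the names (vertexBuffer, vertexCount) are placeholders, and the stride assumes the buffer holds 16-byte SIMD3<Float> vertices:

import Metal

// Sketch: a triangle geometry descriptor pointing at a vertex buffer, wrapped in
// a primitive acceleration structure descriptor that Metal later builds from.
func makePrimitiveDescriptor(vertexBuffer: MTLBuffer,
                             vertexCount: Int) -> MTLPrimitiveAccelerationStructureDescriptor {
    let geometry = MTLAccelerationStructureTriangleGeometryDescriptor()
    geometry.vertexBuffer = vertexBuffer
    geometry.vertexBufferOffset = 0
    geometry.vertexStride = MemoryLayout<SIMD3<Float>>.stride   // 16 bytes per vertex (assumption)
    geometry.triangleCount = vertexCount / 3

    let descriptor = MTLPrimitiveAccelerationStructureDescriptor()
    descriptor.geometryDescriptors = [geometry]
    return descriptor
}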
This made me wonder if it were possible to write the output of the tessellator into a separate vertex buffer from the post-tessellation vertex shader and pass that along to the raytracer. I thought that perhaps you could get more detailed geometry and still render without rasterization. For example, I might have the following simple post-tessellation vertex function:
// Control Point struct
struct ControlPoint {
    float4 position [[attribute(0)]];
};

// Patch struct
struct PatchIn {
    patch_control_point<ControlPoint> control_points;
};

// Vertex-to-Fragment struct
struct FunctionOutIn {
    float4 position [[ position ]];
    half4 color [[ flat ]];
};
[[patch(triangle, 3)]]
vertex FunctionOutIn tessellation_vertex_triangle(PatchIn patchIn [[stage_in]],
                                                  float3 patch_coord [[ position_in_patch ]])
{
    // Barycentric coordinates
    float u = patch_coord.x;
    float v = patch_coord.y;
    float w = patch_coord.z;

    // Convert to cartesian coordinates
    float x = u * patchIn.control_points[0].position.x + v * patchIn.control_points[1].position.x + w * patchIn.control_points[2].position.x;
    float y = u * patchIn.control_points[0].position.y + v * patchIn.control_points[1].position.y + w * patchIn.control_points[2].position.y;

    // Output
    FunctionOutIn vertexOut;
    vertexOut.position = float4(x, y, 0.0, 1.0);
    vertexOut.color = half4(u, v, w, 1.0h);
    return vertexOut;
}
However, a version of this function that also writes its output into a device buffer parameter does not compile, where outputBuffer would be some struct* (not void*). I also noticed that the function doesn't compile when I don't use the data in the control points as output, like so:
[[patch(triangle, 3)]]
vertex FunctionOutIn tessellation_vertex_triangle(PatchIn patchIn [[stage_in]],
                                                  float3 patch_coord [[ position_in_patch ]])
{
    // Barycentric coordinates
    float u = patch_coord.x;
    float v = patch_coord.y;
    float w = patch_coord.z;

    // Convert to cartesian coordinates
    float x = u * patchIn.control_points[0].position.x + v * patchIn.control_points[1].position.x + w * patchIn.control_points[2].position.x;
    float y = u * patchIn.control_points[0].position.y + v * patchIn.control_points[1].position.y + w * patchIn.control_points[2].position.y;

    // Output
    FunctionOutIn vertexOut;

    // Does not use x or y (and therefore the `patch_control_point<T>` values
    // are not used as output into the rasterizer)
    vertexOut.position = float4(1.0, 1.0, 0.0, 1.0);
    vertexOut.color = half4(1.0h, 1.0h, 1.0h, 1.0h);
    return vertexOut;
}
I looked at the patch_control_point<T> template that is publicly exposed but didn't see anything enforcing this. What is going on here?
In particular, how would I go about increasing the quality of the geometry fed into the ray tracer? Would I simply have to use more complex assets? Tessellation has its place in the rasterization pipeline, but can it be used elsewhere? Of course, this would leave a much larger memory footprint if we were storing the tessellated patches.