My experience has been that once the metal code is optimized, it runs faster, at least on all desktop architectures (from 2012 MacBook Air to the DTK and lots in between). Two things that I am curious about in your sample, 1) you are doing 8X the work per thread in the OpenCL code, I wonder if you made them really equivalent (meaning unrolled 8-way in your metal kernel) what that would do and 2) coalesce the read from global memory into shared. Only other thing I could imagine that would make metal slower here is how you are compiling - if you are turning on fast math for example for the transcendental in openCL but using higher precision in the metal kernel.
Post
Replies
Boosts
Views
Activity
I saw you were moving on from trying it in metal - but figured I would post the followup just in case.
1) your OpenCL kernel is doing 8X more work (float8 vs float), which means you have 8X the number of threads in metal. This will be slower because you're adding instruction overhead. So to make them "equivalent", you would need to do 8 elements per thread in your metal kernel. Be careful with how you address when you do this to make sure you have no bank conflicts (i.e. do sequential rather than interleaved addressing). Then, code wise, these are equivalent kernels. But, because you have the asinh and tan, the fast math comment was just about are you sure that you compiled them equivalently (did you for example turn on "fast math" in one but not the other, which is less accurate but way faster, and would significantly impact these two built in functions I would imagine).
2) Although these are equivalent now, my experience has been that Metal is more sensitive to the use of shared (threadgroup) memory than OpenCL was for me. So I am pretty sure that if you first loaded from global (device) to shared (threadgroup) memory, then did the math, that this would be faster. So something simple like this at the top of the kernel:
// load from global to local - but - if you 8 way unrolled this would obviously need to do 8 loads
sharedxyin[localid] = xyin[globalid];
threadgroupbarrier(memflags::memthreadgroup);
Anyway, just thoughts.
Amazing - markdown - good to know. thanks!
I didn't write the OpenCL version. I'm pretty sure it isn't doing 8x the amount of work. I assume one of the many glue functions is doing the math somewhere. Its not doing 8X the amount of work overall, its doing 8X the amount of work in a single kernel invocation. Specifically, it is doing 8 floats per kernel execution, whereas the metal kernel is doing a single float. This means for example when you submit in OpenCl in clEnqueueNDRangeKernel your work dimension is 8X smaller than in metal. This is way less kernel invocations, and so you are adding overhead in the metal code. Metal is never going to win under these circumstances. So to make them equivalent, you would need to do 8 x/y pairs in metal. Maybe an apple engineer can chime in here.
Also, as I mentioned before, my SIMD and 3D experience predates Metal, OpenCL, and OpenGL. Most of the discussions involving Metal assume that someone is 1) writing a game, 2) has used Metal on iOS, 3) has used OpenGL or OpenCL, DirectX, etc. I don't meet any of those assumptions. For example, the word "global" doesn't appear in Apple's Metal Shading Language Specification. I know what it means in other contexts, but not here. I assume you mean it is the same as "device". For example, I do know what "shading" means in a 3D context, but I'm also pretty sure that I will never actually perform that operation. I have very specific goals for this project. Agreed. Metal documentation is assuming you know things. But partly this is because there is so much documentation out there on CUDA especially, and OpenCL to a much lesser degree. Once you get the big picture design principles on Metal as a system, and what's different compared to how you run things, it's really largely the same inside the actual kernels. And so many of the normal optimization tricks that are extensively discussed in CUDA land work (i.e. limiting number of kernel invocations, using shared memory, watching how you access memory). Apple calls the things different names (because they "think different" perhaps), but conceptually what they are is largely the same. So with memory - yes - global is device, const is constant, shared is threadgroup. How you use them is always dependent on the architecture to some degree, but mostly similar in concept. But even within device memory there are options, like private, managed, etc, these all have equivalents in OpenCL, and can make a huge difference depending on what you are doing, you just have to figure out the mapping between OpenCL and metal to be able to use them efficiently.
I only do compute with Metal, coming from OpenCL. I have found precision on float is identical to OpenCL (and equivalent to CPU float code), but you have to watch how you compile/write if you need highly precise float code. If you need double, you are indeed out of luck.
After you do the simple things (like increasing the amount of work per kernel, and using thread groups appropriately), that's when I typically look at GPU frame capture and take a look. For me the biggest help there is with register pressure inside a kernel which is hard for me to eyeball.
Thanks so much for the thoughtful reply, I really appreciate you being on here looking at these discussions. Would you indulge two followup questions? Point 1 is of course an easy fix, use an atomic counter instead of threadgroup_position_in_grid to have the thread groups be sequenced (I had tried that but stripped it out for the example here).
Point 3 I am a little stuck on - since only one threadgroup writes to a given index, isn't it that as long as the relative order of writing the inclusive sum before writing the status flag within a threadgroup that matters, because the subsequent threadgroup waits until the single byte is written? I had assumed that single byte write is intrinsically atomic, but even if that assumption is wrong, the other thread group can't do anything until something (anything) is written. I guess I am not understanding how if only one thread group can write, how another thread group can squash the result. Is it really that it's just so slow to write to device that the thread group timeout gets hit before the next threadgroup sees the write?
Point 2 is obviously a deal killer, but is also a limitation of other platforms that people have gotten around. Was going to try to implement "Inter-Block GPU Communication via Fast Barrier Synchronization" (10.1109/IPDPS.2010.5470477). Any other ideas or resources about how to do global synchronization in Metal?
In the interim, I will take your advice, and do it the old way:) Thanks again!!!
After thinking about your responses more, I think I get it, so since I can't edit my last post figured I would just post my updated thoughts in case anyone else was thinking about this.
About point 3, when I was thinking about a single byte write with volatile already being atomic and ordered within a thread group, I was missing the point that it still doesn't ensure that its globally ordered, and that assumption is obviously wrong in hindsight. AND, this naive strategy wouldn't necessarily work in OpenCL or CUDA either. We would use something like CLK_GLOBAL_MEM_FENCE in OpenCL or __threadfence() in CUDA to make sure that the memory transaction completes. So will just have to wait for that to get added to Metal at some point to make an algorithm like this work.
About point 2, I know you all are really busy, but it would be useful, if it's not purposefully a secret, to have more documentation on how the scheduler is actually working beyond "no assumptions allowed". For example, once a threadgroup begins execution, if it stalls does it get evicted from execution or can we assume it will finish after the stall is cleared (i.e. once a thread group is running, it will run until it completes)? If there were ways to handle the memory transaction part of the data dependency, and if the dependencies were all forward, i.e. thread group 2 depends on thread group 1, thread group 1 depends on thread group 0 (no cycles), AND we correctly handled the thread groups being submitted out of order (by using atomic counter as group_id for example), could this deadlock or would be OK?
M1 max is 1024. You can query a given pipeline with with [MTLComputePipelineState maxTotalThreadsPerThreadgroup] or you can override it by setting maxTotalThreadsPerThreadgroup on MTLComputePipelineDescriptor when you make your pipeline. But, if Metal is saying that it is by default enforcing 256 for a given kernel, you aren't likely to do better than it because its calculating how much shared memory/registers a given kernel is using, and is calculating what the threadgroup size should be to achieve maximum occupancy.
Here for more info:
https://developer.apple.com/documentation/metal/mtlcomputepipelinedescriptor/2966560-maxtotalthreadsperthreadgroup
So bottom line, if Metal is saying 256 for a given kernel, you will likely just want to use 256. If you must go higher, then you have to use a pipeline descriptor and set the property there when you instantiate the pipeline.
Good luck!
p.s. I was assuming you were converting the OpenCL into Metal because its a Metal forum. If you want to do in OpenCL it is a different answer obviously.
That would be strange. Are you are saying that CL_DEVICE_MAX_WORK_GROUP_SIZE comes back as 256? That's got to be a bug. Or is that coming back from CL_KERNEL_WORK_GROUP_SIZE?
If its coming back as just the kernel (meaning device max kernel max), I vaguely remember that you had to set an attribute on the kernel at compile time, so something like __attribute__((reqd_work_group_size(1024, 1, 1))) would override the default (or fail).
I checked myself:
CL_DEVICE_MAX_WORK_ITEM_SIZES: 256 / 256 / 256
CL_DEVICE_MAX_WORK_GROUP_SIZE: 256
You're totally right.
FB9017493
thank you!