Metal performance compared to OpenCL

I am working on porting some code to Metal.

I have a popular open source library that runs entirely on the CPU. I have another package that performs the same functionality on OpenCL. I am trying to port the OpenCL package to Metal so I can get GPU performance on all devices.

The OpenCL package has a nice test set that compares its own output against the reference project. On the two simplest test cases, OpenCL runs about 14 and 24 times as fast as the CPU version.

My Metal version is consistently 4 times slower than OpenCL. It is still several times better than the CPU version, but why would it be so slow compared to OpenCL?

These two test cases are the absolute easiest and simplest ones there are. It is doing a Mercator map projection, so this is all there is for the Metal code:

Code Block
kernel void project_mercator_s(
    device const float2 * xy_in [[buffer(0)]],
    device float2 * xy_out [[buffer(1)]],
    device const spherical_params * params [[buffer(2)]],
    uint index [[thread_position_in_grid]])
{
    float lambda = radians(xy_in[index].x);
    float phi = radians(xy_in[index].y);
    float x = lambda;
    float y = asinh(tan(phi));
    xy_out[index].x = params->x + params->scale * x;
    xy_out[index].y = params->y + params->scale * y;
}


The OpenCL version is the same:
Code Block
__kernel void pl_project_mercator_s(
    __global float16 *xy_in,
    __global float16 *xy_out,
    const unsigned int count,
    float scale, float x0, float y0)
{
    int i = get_global_id(0);
    float8 lambda = radians(xy_in[i].even);
    float8 phi = radians(xy_in[i].odd);
    float8 x, y;
    x = lambda;
    y = asinh(tan(phi));
    xy_out[i].even = x0 + scale * x;
    xy_out[i].odd = y0 + scale * y;
}

I have removed all the Metal setup code from the time comparison. My test data set has 200,000 pairs of floats. I'm using shared memory buffers. I tried using a private buffer but my dataset is so small that the time to do the copying into the private buffer was longer than just using a shared memory buffer.
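The shared-buffer setup amounts to something like this (a simplified sketch, not my actual code; device is the MTLDevice):

Code Block
let count = 200_000
let length = count * MemoryLayout<SIMD2<Float>>.stride
// shared storage: CPU and GPU read/write the same memory, no copies
let inBuffer = device.makeBuffer(length: length, options: .storageModeShared)!
let outBuffer = device.makeBuffer(length: length, options: .storageModeShared)!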

Is this normal? Is there something I'm missing?
Just to follow up. I worked on the code and got it to the point where OpenCL was only about 3 times as fast as Metal. I was doing these tests on my 2014 MacBook Pro, usually with integrated graphics, running 10.16. When I tried it on my 2017 MacBook Pro, with both discrete and integrated graphics, the difference in performance was much smaller. I was also using an older version of the underlying PROJ library for comparison. Most of the overall GPU speedup I was seeing was due to the newer, slower version of PROJ. Metal still runs at about half the speed of OpenCL, but I guess I can live with this.

I have learned a few things about Metal optimization. I tried 3 different optimization schemes:
1) Use buffers big enough for all threads so I don't have to worry about reduced efficiency or boundary checks.
2) Use loops in the Metal code to avoid per-pixel threads that are supposedly less efficient.
3) Use private buffers with blits.

None of these optimization strategies yielded any improvement. Using private buffers imposed another 50% speed reduction.
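For reference, the private-buffer variant staged the data with a blit, roughly like this (a simplified sketch; sharedStaging stands in for the shared buffer holding the input):

Code Block
let privateIn = device.makeBuffer(length: length, options: .storageModePrivate)!
let blit = commandBuffer.makeBlitCommandEncoder()!
// copy input from the shared staging buffer into GPU-private memory
blit.copy(from: sharedStaging, sourceOffset: 0,
          to: privateIn, destinationOffset: 0, size: length)
blit.endEncoding()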

I think I will go ahead and file a DTS ticket. It should be faster than this, shouldn't it?
Have you looked at the performance information from Capture GPU Frame? Unfortunately, I don't think you get the fine grained stats on Intel/Discrete that are available on A10X and newer SOCs. If you expect the code to live into the Apple Silicon era, exploring performance on a modern iPad might not be a waste of time. Also, would OpenCL be using both CPU and GPU while Metal is only using GPU?

Have you looked at the performance information from Capture GPU Frame?

No. I'm a total GPU newbie. My knowledge of SIMD computing dates from before GPUs were popular.

Unfortunately, I don't think you get the fine grained stats on Intel/Discrete that are available on A10X and newer SOCs. If you expect the code to live into the Apple Silicon era, exploring performance on a modern iPad might not be a waste of time.

This code might never actually be deployed on Intel. Apple Silicon, on iOS or macOS, is all I care about. I will definitely look at GPU performance when I have this running on iOS. But for now, it is still a proof of concept to just get the logic implemented on Metal. I'm using the OpenCL code as a path of least resistance to port to Metal.

Also, would OpenCL be using both CPU and GPU while Metal is only using GPU? 

I don't think so. This code has the ability to use either, but I think it has to be one or the other.

I think it is just a side effect of using this older 2014 machine. I think OpenCL is optimized for it more than it is for the 2017. On the newer machine, Metal does much better. OpenCL still wins 90% of the time, but sometimes Metal wins on the 2017.
My experience has been that once the Metal code is optimized, it runs faster, at least on all desktop architectures (from a 2012 MacBook Air to the DTK and lots in between). Two things I am curious about in your sample: 1) you are doing 8X the work per thread in the OpenCL code; I wonder what making them really equivalent (meaning unrolled 8-way in your Metal kernel) would do, and 2) coalesce the read from global memory into shared. The only other thing I could imagine that would make Metal slower here is how you are compiling - if you are, for example, turning on fast math for the transcendentals in OpenCL but using higher precision in the Metal kernel.

1) you are doing 8X the work per thread in the OpenCL code

What do you mean?

I wonder what making them really equivalent (meaning unrolled 8-way in your Metal kernel) would do

Sorry, not following here.

2) coalesce the read from global memory into shared.

??? Not a clue.

The only other thing I could imagine that would make Metal slower here is how you are compiling - if you are, for example, turning on fast math for the transcendentals in OpenCL but using higher precision in the Metal kernel.

What higher precision? I've abandoned this idea because I decided that Metal's limited precision was unacceptable. I did a brief experiment trying to do double precision in software, but it is way too slow for the GPU to make up the difference.
I saw you were moving on from trying it in metal - but figured I would post the followup just in case.

1) your OpenCL kernel is doing 8X more work (float8 vs float), which means you have 8X the number of threads in Metal. This will be slower because you're adding instruction overhead. So to make them "equivalent", you would need to do 8 elements per thread in your Metal kernel; see the sketch below. Be careful with how you address when you do this to make sure you have no bank conflicts (i.e. do sequential rather than interleaved addressing). Then, code-wise, these are equivalent kernels. But, because you have the asinh and tan, the fast math comment was just about whether you are sure that you compiled them equivalently (did you, for example, turn on "fast math" in one but not the other, which is less accurate but way faster, and would significantly impact these two built-in functions I would imagine).
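To make that concrete, here is a rough, untested sketch of an 8-pairs-per-thread Metal kernel (assuming the same spherical_params struct as your original; the grid then needs one eighth as many threads):

Code Block
kernel void project_mercator_s8(
    device const float2 * xy_in [[buffer(0)]],
    device float2 * xy_out [[buffer(1)]],
    device const spherical_params * params [[buffer(2)]],
    uint index [[thread_position_in_grid]])
{
    // each thread projects 8 consecutive coordinate pairs
    // (sequential rather than interleaved addressing)
    uint base = index * 8;
    for (uint i = 0; i < 8; i++) {
        float2 p = xy_in[base + i];
        float lambda = radians(p.x);
        float phi = radians(p.y);
        xy_out[base + i] = float2(params->x + params->scale * lambda,
                                  params->y + params->scale * asinh(tan(phi)));
    }
}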

2) Although these are equivalent now, my experience has been that Metal is more sensitive to the use of shared (threadgroup) memory than OpenCL was for me. So I am pretty sure that if you first loaded from global (device) to shared (threadgroup) memory, then did the math, that this would be faster. So something simple like this at the top of the kernel:

    // load from global to local - but - if you 8-way unrolled this would obviously need to do 8 loads
    shared_xy_in[local_id] = xy_in[global_id];
    threadgroup_barrier(mem_flags::mem_threadgroup);
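
Fleshed out, that staging pattern might look something like this (an untested sketch; TG_SIZE and the variable names are just illustrative, and must match the threadgroup size used at dispatch):

Code Block
#define TG_SIZE 256 // must match threadsPerThreadgroup on the host side

kernel void project_mercator_tg(
    device const float2 * xy_in [[buffer(0)]],
    device float2 * xy_out [[buffer(1)]],
    device const spherical_params * params [[buffer(2)]],
    uint gid [[thread_position_in_grid]],
    uint lid [[thread_position_in_threadgroup]])
{
    // stage the input through threadgroup (shared) memory
    threadgroup float2 shared_xy_in[TG_SIZE];
    shared_xy_in[lid] = xy_in[gid];
    threadgroup_barrier(mem_flags::mem_threadgroup);

    float lambda = radians(shared_xy_in[lid].x);
    float phi = radians(shared_xy_in[lid].y);
    xy_out[gid] = float2(params->x + params->scale * lambda,
                         params->y + params->scale * asinh(tan(phi)));
}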

Anyway, just thoughts.

your OpenCL kernel is doing 8X more work (float8 vs float), which means you have 8X the number of threads in Metal. 

I didn't write the OpenCL version. I'm pretty sure it isn't doing 8x the amount of work. I assume one of the many glue functions is doing the math somewhere. One of my experiments did include trying to do more work inside this function in a loop. It didn't have any effect. Metal doesn't seem to have a float8 or a float16 anyway.

Although these are equivalent now, my experience has been that Metal is more sensitive to the use of shared (threadgroup) memory than OpenCL was for me. So I am pretty sure that if you first loaded from global (device) to shared (threadgroup) memory, then did the math, that this would be faster. So something simple like this at the top of the kernel:

I'm still not sure what you are saying here. For one thing, the new forums automatically use markdown, so you either have to use a code tag or escape any underscores.

Also, as I mentioned before, my SIMD and 3D experience predates Metal, OpenCL, and OpenGL. Most of the discussions involving Metal assume that someone is 1) writing a game, 2) has used Metal on iOS, 3) has used OpenGL or OpenCL, DirectX, etc. I don't meet any of those assumptions. For example, the word "global" doesn't appear in Apple's Metal Shading Language Specification. I know what it means in other contexts, but not here. I assume you mean it is the same as "device". For example, I do know what "shading" means in a 3D context, but I'm also pretty sure that I will never actually perform that operation. I have very specific goals for this project.

I did try all of this with private buffers, but that had no effect.

I will definitely still be using Metal, but not for any calculations. I need more precision than it provides and I can't catch up in software, even with Metal. I can use it for sampling, resizing, and filtering images. Next year I might do some actual 3D drawing.
How did you dispatch the work in host code?
Especially regarding the threads per threadgroup. You may want to check https://developer.apple.com/documentation/metal/calculating_threadgroup_and_grid_sizes

This can make a big difference in efficiency.
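For example, a typical host-side dispatch in Swift might look like this (a sketch; pipeline, encoder, and count stand in for whatever you already have):

Code Block
let w = pipeline.threadExecutionWidth
let threadsPerGroup = MTLSize(width: w, height: 1, depth: 1)
let grid = MTLSize(width: count, height: 1, depth: 1)
// dispatchThreads lets Metal handle a non-uniform final threadgroup,
// so the kernel needs no boundary checks
encoder.dispatchThreads(grid, threadsPerThreadgroup: threadsPerGroup)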
Apart from that, as Etresoft already mentioned, you should check the performance data provided by GPU Frame Capture.
Amazing - markdown - good to know. thanks!

I didn't write the OpenCL version. I'm pretty sure it isn't doing 8x the amount of work. I assume one of the many glue functions is doing the math somewhere.

It's not doing 8X the amount of work overall, it's doing 8X the amount of work in a single kernel invocation. Specifically, it is doing 8 floats per kernel execution, whereas the Metal kernel is doing a single float. This means, for example, that when you submit in OpenCL via clEnqueueNDRangeKernel, your work dimension is 8X smaller than in Metal. That is far fewer kernel invocations, so you are adding overhead in the Metal code. Metal is never going to win under these circumstances. So to make them equivalent, you would need to do 8 x/y pairs in Metal. Maybe an Apple engineer can chime in here.

Also, as I mentioned before, my SIMD and 3D experience predates Metal, OpenCL, and OpenGL. Most of the discussions involving Metal assume that someone is 1) writing a game, 2) has used Metal on iOS, 3) has used OpenGL or OpenCL, DirectX, etc. I don't meet any of those assumptions. For example, the word "global" doesn't appear in Apple's Metal Shading Language Specification. I know what it means in other contexts, but not here. I assume you mean it is the same as "device". For example, I do know what "shading" means in a 3D context, but I'm also pretty sure that I will never actually perform that operation. I have very specific goals for this project.

Agreed. Metal documentation assumes you know things. But partly this is because there is so much documentation out there on CUDA especially, and OpenCL to a much lesser degree. Once you get the big-picture design principles of Metal as a system, and what's different in how you run things, it's really largely the same inside the actual kernels. And so many of the normal optimization tricks that are extensively discussed in CUDA land work (i.e. limiting the number of kernel invocations, using shared memory, watching how you access memory). Apple calls things by different names (because they "think different" perhaps), but conceptually they are largely the same. So with memory - yes - global is device, const is constant, shared is threadgroup. How you use them always depends on the architecture to some degree, but they are mostly similar in concept. Even within device memory there are options, like private, managed, etc.; these all have equivalents in OpenCL, and can make a huge difference depending on what you are doing. You just have to figure out the mapping between OpenCL and Metal to be able to use them efficiently.

I only do compute with Metal, coming from OpenCL. I have found precision on float is identical to OpenCL (and equivalent to CPU float code), but you have to watch how you compile/write if you need highly precise float code. If you need double, you are indeed out of luck.

After you do the simple things (like increasing the amount of work per kernel, and using threadgroups appropriately), that's when I typically take a look at GPU Frame Capture. For me the biggest help there is with register pressure inside a kernel, which is hard to eyeball.


So to make them equivalent, you would need to do 8 x/y pairs in Metal.

As I said before, I tried doing this in a loop. It had no effect.

If you need double, you are indeed out of luck.

Precisely.
Code Block
device const spherical_params * params [[buffer(2)]] --- Metal
...
xy_out[index].x = params->x + params->scale * x;
xy_out[index].y = params->y + params->scale * y;

vs

Code Block
float scale, float x0, float y0 --- OpenCL
...
xy_out[i].even = x0 + scale * x;
xy_out[i].odd = y0 + scale * y;

The compiler's code optimizer most likely likes the OpenCL approach. The params-> indirection may be costly.




This is an interesting thread, thanks for the orig post. I was under the impression that Metal was almost always faster than OpenCL because OpenCL is an old and clunky higher-level API, but Metal "goes right to the metal" as they say. I guess not?

I would suggest looking into the global memory access times if you can, e.g. the read-write refs to the "device" declared arrays. I am unsure of the latest tools for this, but you can try to check into the problem by removing device arrays from the picture and just doing fake calculations. E.g. try to set up some local arrays in thread or threadgroup storage instead of device storage for the arrays xy_in, xy_out and spherical_params (see https://developer.apple.com/metal/Metal-Shading-Language-Specification.pdf section 4.3) and don't even bother to initialize them. Just let the array dereferences and calculations stay the same in your kernel and let it run on bogus float data in those threadgroup arrays. See if it is faster. It should be a lot faster. If it is faster, then you can see the problem is the GPU / kernel code "reaching out" to global memory space for reads and writes.

I have seen that sometimes the performance tools show slow calculations on a given line of code, but that may be due to memory read-write access, not the computation itself. A common pattern is copying "device" or global arrays into a chunk of threadgroup array storage, e.g. in parallel, then letting the kernel calcs read/write on the threadgroup arrays, then block-copying the threadgroup results back into device/global memory in parallel. There are some cool CUDA examples of this in the CUDA intro tutorials... there may be some for Metal also.

Oh, I also see that "mattke" in this thread has the threadgroup mem / sync example. Before going down that more complicated path (which may be required eventually), you could try stuff like this:

Copy globals to a local variable (not threadgroup), do all processing on local variables, then copy out to global in one step. E.g. for the 2 lines that use the "params" global:

xy_out[index].x = params->x + params->scale * x;
xy_out[index].y = params->y + params->scale * y;

Instead, you could try:

spherical_params local_params;
float2 local_xy_out;
local_params = *params; // copy whole struct to local mem

// now all mem refs are local in these 2 lines:

local_xy_out.x = local_params.x + local_params.scale * x;
local_xy_out.y = local_params.y + local_params.scale * y;

// copy the float2 back to global mem w/ one index/deref operation

xy_out[index] = local_xy_out;

PS: this Apple service is making some words italic but not in the preview I am typing...that is not intentional.

Similarly, you could access the input array only once at the top of the kernel:

float2 local_xy_in;
local_xy_in = xy_in[index];

Now use local_xy_in where you compute lambda and phi.
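
Putting both of those ideas together, the whole kernel might look roughly like this (an untested sketch, assuming the same spherical_params struct):

Code Block
kernel void project_mercator_local(
    device const float2 * xy_in [[buffer(0)]],
    device float2 * xy_out [[buffer(1)]],
    device const spherical_params * params [[buffer(2)]],
    uint index [[thread_position_in_grid]])
{
    // one device-memory read, one struct copy, one device-memory write
    float2 local_xy_in = xy_in[index];
    spherical_params local_params = *params;

    float lambda = radians(local_xy_in.x);
    float phi = radians(local_xy_in.y);

    float2 local_xy_out;
    local_xy_out.x = local_params.x + local_params.scale * lambda;
    local_xy_out.y = local_params.y + local_params.scale * asinh(tan(phi));

    xy_out[index] = local_xy_out;
}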



mattke is right, the comparison is unfair as the OpenCL code uses vectors, while the Metal code just uses a scalar type. You have a bunch of options to fix that. You can either use a loop over 8 scalars in your Metal kernel function, which the Metal compiler will unroll (i.e. vectorize) for you automatically, and/or you can use SIMD code. You also need to adjust the thread grid after those changes.

Some other things that come to my mind (some of them were already suggested):
  • Enable fast-math for the Metal compiler to prevent a massive slowdown if the GPU encounters denormals.

  • For the input buffers you probably want to use constant as the address space specifier instead of device (i.e. device const is not the same as constant).

  • Use & instead of * for function arguments where possible (see the sketch below).
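
Applied to the original kernel, the last two bullets would look roughly like this (an untested sketch):

Code Block
kernel void project_mercator_s(
    constant float2 * xy_in [[buffer(0)]],
    device float2 * xy_out [[buffer(1)]],
    constant spherical_params & params [[buffer(2)]],
    uint index [[thread_position_in_grid]])
{
    // constant address space for read-only inputs,
    // reference instead of pointer for the params argument
    float lambda = radians(xy_in[index].x);
    float phi = radians(xy_in[index].y);
    xy_out[index].x = params.x + params.scale * lambda;
    xy_out[index].y = params.y + params.scale * asinh(tan(phi));
}

As for fast-math, it can be set in Xcode's Metal compiler build settings, or via fastMathEnabled on MTLCompileOptions if you compile the shader at runtime.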
