I have a popular open source library that runs entirely on the CPU. I have another package that performs the same functionality on OpenCL. I am trying to port the OpenCL package to Metal so I can get GPU performance on all devices.
The OpenCL package has a nice test set that compares its own output against the reference project. On the two simplest test cases, OpenCL runs about 14 and 24 times as fast as the CPU version.
My Metal version is consistently 4 times slower than OpenCL. It is still several times better than the CPU version, but why would it be so slow compared to OpenCL?
These two test cases are the absolute easiest and simplest ones there are. The test is a Mercator map projection, so this is the entire Metal kernel:
```metal
kernel void project_mercator_s(
    device const float2 *xy_in  [[buffer(0)]],
    device float2       *xy_out [[buffer(1)]],
    device const spherical_params *params [[buffer(2)]],
    uint index [[thread_position_in_grid]])
{
    float lambda = radians(xy_in[index].x);
    float phi    = radians(xy_in[index].y);
    float x = lambda;
    float y = asinh(tan(phi));
    xy_out[index].x = params->x + params->scale * x;
    xy_out[index].y = params->y + params->scale * y;
}
```
The OpenCL version computes the same projection:
```c
__kernel void pl_project_mercator_s(
    __global float16 *xy_in,
    __global float16 *xy_out,
    const unsigned int count,
    float scale, float x0, float y0)
{
    int i = get_global_id(0);
    float8 lambda = radians(xy_in[i].even);
    float8 phi    = radians(xy_in[i].odd);
    float8 x, y;
    x = lambda;
    y = asinh(tan(phi));
    xy_out[i].even = x0 + scale * x;
    xy_out[i].odd  = y0 + scale * y;
}
```
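One difference I can see between the two: the OpenCL kernel loads a float16 per work item and projects eight coordinate pairs via `.even`/`.odd`, while my Metal kernel projects a single float2 pair per thread, so each Metal thread does one eighth of the work. A vectorized Metal variant might look like this (untested sketch; `spherical_params` is the same struct as above, `project_mercator_s4` is a name I made up, and it assumes the pair count is a multiple of two so every float4 is full):

```metal
kernel void project_mercator_s4(
    device const float4 *xy_in  [[buffer(0)]],  // two (lon, lat) pairs per float4
    device float4       *xy_out [[buffer(1)]],
    device const spherical_params *params [[buffer(2)]],
    uint index [[thread_position_in_grid]])
{
    // Each thread projects two pairs, so dispatch count/2 threads.
    float4 v = radians(xy_in[index]);
    float4 out;
    out.xz = params->x + params->scale * v.xz;             // lambda components
    out.yw = params->y + params->scale * asinh(tan(v.yw)); // phi components
    xy_out[index] = out;
}
```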
I have excluded all of the Metal setup code from the timing. My test data set is 200,000 pairs of floats. I'm using shared memory buffers; I tried a private buffer, but the dataset is so small that copying into the private buffer took longer than just using shared memory.
Is this normal? Is there something I'm missing?