Hello Compute performance...

The HelloCompute sample code (which demonstrates how to perform data-parallel computations using the GPU) needs 13 ms on an iPhone 5S to run this very basic kernel on a 512 x 512 RGBA texture.


kernel void
grayscaleKernel(texture2d<half, access::read>  inTexture  [[texture(AAPLTextureIndexInput)]],
                texture2d<half, access::write> outTexture [[texture(AAPLTextureIndexOutput)]],
                uint2                          gid         [[thread_position_in_grid]])
{
    half4 inColor  = inTexture.read(gid);
    outTexture.write(inColor, gid);
}


Unless I'm mistaken, this represents only about 80 MB/s of loads and 80 MB/s of writes (a 512 x 512 RGBA texture is roughly 1 MB, read once and written once in 13 ms). Considering my current NEON code does 10 times more loads and stores per second, those numbers seem very low, no?

Replies

Apple's samples are rarely optimized apps; they're intended only as stripped-down, basic demos of the given APIs. I wouldn't read too much into this one's measured performance.

Texture read/write bandwidth is several GB/s on an iPhone. What you're measuring is not just the time it takes your compute kernel to run but also the time needed to set up the GPU, etc. To see exactly how long the kernel took, look at kernelStart/EndTime on MTLCommandBuffer.
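
Something along these lines logs both times once the command buffer has completed (a minimal sketch; the timestamps require iOS 10.3 or later):


// Minimal sketch: kernelStartTime/kernelEndTime cover the CPU-side scheduling of
// the command buffer, GPUStartTime/GPUEndTime cover the actual GPU execution.
[commandBuffer addCompletedHandler:^(id<MTLCommandBuffer> buffer) {
    CFTimeInterval scheduling = buffer.kernelEndTime - buffer.kernelStartTime;
    CFTimeInterval execution  = buffer.GPUEndTime    - buffer.GPUStartTime;
    NSLog(@"CPU scheduling: %.3f ms, GPU execution: %.3f ms",
          scheduling * 1000.0, execution * 1000.0);
}];
[commandBuffer commit];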


In other words, this sort of thing only makes sense if the speed gains offset the overhead from using the GPU.

My app needs a buffer to accumulate results from every pass. To accomplish this, my current implementation uses two textures that I ping-pong on every pass: the destination of the previous pass becomes the source, and the source becomes the destination...


So one pass uses texture A as the source and renders into texture B (then renders B on-screen).

On the next pass, it uses texture B as the source and renders into texture A (then renders A on-screen).
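
In rough code, the per-pass swap looks like this (a simplified sketch; the pass-encoding and present methods here are placeholders, not my actual ones):


// Simplified sketch of the ping-pong; -encodePassFrom:to: and -presentTexture:
// stand in for the real pass encoding and on-screen presentation.
id<MTLTexture> source      = _textureA;
id<MTLTexture> destination = _textureB;

for (NSUInteger pass = 0; pass < passCount; pass++) {
    [self encodePassFrom:source to:destination];  // pass reads source, accumulates into destination
    [self presentTexture:destination];            // then destination is rendered on-screen

    // the previous destination becomes the next source
    id<MTLTexture> tmp = source;
    source      = destination;
    destination = tmp;
}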


With Metal compute, I saw an opportunity to reduce memory usage (by using a single texture) and improve performance, but unfortunately the performance is not there. And I really don't understand why it is so much slower.


Using the render pipeline, the shader must read the complete source texture and write the complete destination texture, in addition to going through all the other pipeline stages like vertex processing and rasterization.


Using compute, it only needs to read and write the complete texture. So I really don't understand. The only difference between the fragment shader and the compute function is that one returns a value and the other writes a value. Otherwise, both functions do the same thing. I even tried using two textures so the compute function reads from one texture and writes to another (as when I use the render pipeline), but it didn't improve performance.


I will make a sample program and request technical support...

So I made a sample program that converts a 2048 x 2048 source texture to grayscale using either a render pipeline or a compute pipeline.


On an iPhone XS Max running iOS 12.1, the fragment shader and the compute kernel both take 1.9 ms to complete (according to the Metal System Trace instrument). I would have expected the compute kernel to execute faster, but at least it's not slower.


But on an iPhone 5S running iOS 11.4, the fragment shader takes 3 ms and the compute kernel takes 177 ms to execute!!!


On an iPhone 7 Plus running iOS 11.1, the numbers are 1.6 ms and 3.2 ms respectively.


Here are my compute and render methods:


- (void)compute {
    
    id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];
    
    id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
    [computeEncoder setComputePipelineState:_computePipelineState];
    [computeEncoder setTexture:_srcTexture atIndex:0];
    [computeEncoder setTexture:_dstTexture atIndex:1];
    [computeEncoder dispatchThreadgroups:_threadgroupCount threadsPerThreadgroup:_threadgroupSize];
    [computeEncoder endEncoding];
    
    [commandBuffer commit];
    
    [commandBuffer waitUntilCompleted];
}
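
(_threadgroupSize and _threadgroupCount aren't shown here; a typical setup, along the lines of the HelloCompute sample and not necessarily my exact values, would be:)


// Typical one-time setup: 16 x 16 threads per threadgroup, and enough
// threadgroups to cover the whole 2048 x 2048 texture (rounded up).
_threadgroupSize  = MTLSizeMake(16, 16, 1);
_threadgroupCount = MTLSizeMake((_srcTexture.width  + _threadgroupSize.width  - 1) / _threadgroupSize.width,
                                (_srcTexture.height + _threadgroupSize.height - 1) / _threadgroupSize.height,
                                1);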


- (void)render {
    
    id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];

    MTLRenderPassDescriptor *renderPassDescriptor = [MTLRenderPassDescriptor renderPassDescriptor];
    renderPassDescriptor.colorAttachments[0].loadAction = MTLLoadActionDontCare;
    renderPassDescriptor.colorAttachments[0].texture = _dstTexture;
    renderPassDescriptor.colorAttachments[0].storeAction = MTLStoreActionStore;
    
    id<MTLRenderCommandEncoder> renderEncoder = [commandBuffer renderCommandEncoderWithDescriptor:renderPassDescriptor];
    [renderEncoder setRenderPipelineState:_renderPipelineState];
    [renderEncoder setFragmentTexture:_srcTexture atIndex:0];
    [renderEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip vertexStart:0 vertexCount:4];
    [renderEncoder endEncoding];
    
    [commandBuffer commit];
    
    [commandBuffer waitUntilCompleted];
}


And the fragment and kernel functions:


// Grayscale Fragment Function
fragment half4 grayscaleFragment(RasterizerData in [[stage_in]],
                                 texture2d<half> inTexture [[texture(0)]])
{
    constexpr sampler textureSampler;
    
    half4 inColor  = inTexture.sample(textureSampler, in.textureCoordinate);
    half  gray     = dot(inColor.rgb, kRec709Luma);
    return half4(gray, gray, gray, 1.0);
}


// Grayscale Kernel Function
kernel void grayscaleKernel(uint2 gid [[thread_position_in_grid]],
                            texture2d<half, access::read>  inTexture  [[texture(0)]],
                            texture2d<half, access::write> outTexture [[texture(1)]])
{
    half4 inColor  = inTexture.read(gid);
    half  gray     = dot(inColor.rgb, kRec709Luma);
    outTexture.write(half4(gray, gray, gray, 1.0), gid);
}


Should I request technical support or file a radar?

Did you ever find a solution? I am seeing this exact same issue on iOS 12, using an iPhone 6s. I haven't tested on our high-end devices, as it's the low-end ones causing problems from our perspective; it's these devices that would gain the most from the expected compute shader performance.


When I dive into the shader details, it helpfully informs me that even trivial shaders compile into 200+ instructions, but sadly I have no way of actually working out what these 200 instructions are.


At this stage, Metal compute appears to be a total and absolute waste of time, which is not really very helpful to us.

In fact, testing two of our worst compute shaders produces the following results (in 'Pipeline statistics'):


XS Max: 22, 42 cycles.

iPhone 7: 46, 66 cycles.

iPhone 6s: 209, 212 cycles (!!!!!).