Instance rendering low performance problem

I want to draw wide lines using triangle primitives. As the number of vertices rises, the frame rate drops sharply. The code below shows the vertex structures, the draw call, and the shader.


Vertex structure

struct Point
{
    var position:float2
    var color:float4
    init() {
        position = float2(0, 0)
        color = float4(0,0,0,0)
    }

    init(position:float2,color:float4) {
        self.position = position
        self.color = color
    }
}
struct Line
{
    var begin:Point
    var end:Point
}



Draw method

commandEncoder?.drawPrimitives(type: .triangleStrip, vertexStart: 0, vertexCount: (size - 1) * 4 , instanceCount: size - 1)




Shader code


#include <metal_stdlib>
using namespace metal;

struct InputVertex {
    float2 position;
    float4 color;
};

struct Vertex {
    float4 position [[position]];
    float4 color;
};

struct Point
{
    float2 position;
    float4 color;
};

struct Line
{
    Point begin;
    Point end;
};

struct XCoord
{
    float beginX;
    float endX;
    float leftX;
    float rightX;
    float x;
};

struct Uniforms {
    float4x4 modelMatrix;
};

vertex Vertex vertex_func(constant Line *lines [[buffer(0)]],
                          constant Uniforms &uniforms [[buffer(1)]],
                          uint vertexId [[vertex_id]],
                          uint instanceId [[instance_id]]) {
    float4x4 matrix = uniforms.modelMatrix;
    float thickness = 0.004;
    Line line = lines[instanceId];
    uint index = vertexId % 4;
    float4 startPosition = matrix * float4(line.begin.position.x, line.begin.position.y, 0, 1);
    float4 endPosition = matrix * float4(line.end.position.x, line.end.position.y, 0, 1);
    float4 position;
    float4 color;
    float4 v = endPosition - startPosition;
    float2 p0 = float2(startPosition.x, startPosition.y);
    float2 v0 = float2(v.x, v.y);
    float2 v1 = thickness * normalize(v0) * float2x2(float2(0,-1), float2(1,0));
    if (index == 0)
    {
        float2 pa = p0 + v1;
        position = float4(pa.x, pa.y, 0, 1);
        color = line.begin.color;
    }
    else if (index == 1)
    {
        float2 pb = p0 - v1;
        position = float4(pb.x, pb.y, 0, 1);
        color = line.begin.color;
    }
    else if (index == 2)
    {
        float2 pc = p0 - v1 + v0;
        position = float4(pc.x, pc.y, 0, 1);
        color = line.end.color;
    }
    else if (index == 3)
    {
        float2 pd = p0 + v1 + v0;
        position = float4(pd.x, pd.y, 0, 1);
        color = line.end.color;
    }
    else
    {
        float2 pd = p0 + v1 + v0;
        position = float4(pd.x, pd.y, 0, 1);
        color = line.end.color;
    }

    Vertex out;
    out.position = position;
    out.color = color;
    return out;
}

fragment float4 fragment_func(Vertex vert [[stage_in]]) {
    return vert.color;
}

Accepted Reply

Hello


Sorry for the delay. Alas, I was not able to run this: I have an older Xcode and am currently working on urgent projects for a client, so I cannot reinstall it. But I think I know what the problem is. You invoke drawing like this:

commandEncoder?.drawPrimitives(type: .triangleStrip, vertexStart: 0, vertexCount: (size - 1) * 4 , instanceCount: size - 1)

And this is not what you want. If you want instancing of line segments, each drawn as a quad (a 4-vertex triangle strip), then you should make the call with vertexCount: 4, not (size - 1) * 4. As written, every one of the size - 1 instances processes (size - 1) * 4 vertices, so you do O(n^2) work instead of O(n) (assuming drawing a single quad is O(1)). For size = 51, as it is set now, that is 50 times the work you actually want to do.
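To make that arithmetic concrete, here is a small Swift sketch that is not part of the thread. The helper `vertexInvocations` is hypothetical, and it assumes the vertex function runs once per vertex per instance with no reuse between invocations.

```swift
// Hypothetical helper: total vertex-function invocations for an instanced draw,
// assuming one invocation per vertex per instance.
func vertexInvocations(vertexCount: Int, instanceCount: Int) -> Int {
    return vertexCount * instanceCount
}

let size = 51             // value mentioned in the reply
let segments = size - 1   // one instance per line segment

// Original call: vertexCount grows with the number of segments.
let original = vertexInvocations(vertexCount: segments * 4, instanceCount: segments)
// Suggested call: each instance is a single 4-vertex strip.
let suggested = vertexInvocations(vertexCount: 4, instanceCount: segments)

print(original, suggested, original / suggested)   // 10000 200 50

// With the same encoder as in the question, the corrected call would read:
// commandEncoder?.drawPrimitives(type: .triangleStrip, vertexStart: 0,
//                                vertexCount: 4, instanceCount: size - 1)
```

The ratio between the two counts equals the number of segments, which is why the slowdown gets worse as more points are added.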

Hope that helps

Michal

Replies

Hello


First of all, you should get rid of that big "if". Conditionals that diverge within a SIMD unit carry a large performance cost on GPUs. So instead of computing a single color and position in four different ways, try declaring four positions and two colors (as arrays), computing all of them, and picking the proper ones without ifs, like this:

float2 positions[4];
float4 colors[2];
// fill tables for all vertices
positions[0] = p0 + v1;
positions[1] = p0 - v1;
positions[2] = positions[0] + v0;
positions[3] = positions[1] + v0;

colors[0] = line.begin.color;
colors[1] = line.end.color;
Vertex out;
out.position.xy = positions[vertexId & 0x3];
out.position.zw = float2(0, 1);
// vertices 0 and 1 sit at the begin point, 2 and 3 at the end point
out.color = colors[(vertexId >> 1) & 0x1];

That should help performance - please give it a try.


Other than that, I am not sure your code is 100% okay. For example, you start with float4 vectors and transform them with a 4x4 matrix, and you also compute "v" as a float4 vector. These are all 3D operations, which can leave the fourth coordinate different from 1, and then you drop into 2D by just picking the first two coordinates, which isn't a proper projection technique. Of course it will work for some matrices and vectors, and perhaps you know all that and just want faster rendering; in that case ignore what I said here.
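A minimal sketch of that concern, written in plain Swift rather than Metal and not taken from the thread: if the transformed point ends up with w different from 1, taking x and y directly gives a different 2D point than dividing by w first. The `Matrix4` alias, the `transform` and `toNDC` helpers, and the example matrix are all assumptions made for illustration.

```swift
// Column-major 4x4 matrix stored as four columns (illustrative type).
typealias Matrix4 = [SIMD4<Float>]

// Matrix times vector as a weighted sum of the columns.
func transform(_ m: Matrix4, _ v: SIMD4<Float>) -> SIMD4<Float> {
    return m[0] * v.x + m[1] * v.y + m[2] * v.z + m[3] * v.w
}

// Convert a homogeneous point to 2D by dividing by w (assumes w != 0).
func toNDC(_ p: SIMD4<Float>) -> SIMD2<Float> {
    return SIMD2<Float>(p.x / p.w, p.y / p.w)
}

// Hypothetical diagonal matrix that leaves x, y, z alone and doubles w.
let m: Matrix4 = [
    SIMD4<Float>(1, 0, 0, 0),
    SIMD4<Float>(0, 1, 0, 0),
    SIMD4<Float>(0, 0, 1, 0),
    SIMD4<Float>(0, 0, 0, 2),
]

let clip = transform(m, SIMD4<Float>(0.5, 0.25, 0, 1))
let naive = SIMD2<Float>(clip.x, clip.y)   // just picking the first two coordinates
let divided = toNDC(clip)                  // dividing by w first

print(naive, divided)
```

For an affine matrix whose last row is (0, 0, 0, 1), w stays 1 and both results agree, which is the "works for some matrices" case mentioned above.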


Hope that helps

Michal

I removed all the matrix calculations from the vertex shader. Even so, when the number of vertices grows only slightly, the frame rate still drops very low, and most of the time is spent on the GPU. Using Instruments, I found that the performance bottleneck is the vertex shader. By the way, I am using Xcode 9 beta 6.


line vertices (length 6) -> vertex shader -> triangleStrip vertices (length 20)

[p0, p1], [p2, p3], [p3, p4] -> [p0a, p0b, p0c, p0d], [p1a, p1b, p1c, p1d], [p2a, p2b, p2c, p2d], [p3a, p3b, p3c, p3d]


Maybe the following rendering method is what leads to the poor performance, but I don't know how to improve it. Could this be a bug in Metal? What are some good ways to efficiently draw a line with a given width?


commandEncoder?.drawPrimitives(type: .triangleStrip, vertexStart: 0, vertexCount: (size - 1) * 4 , instanceCount: size - 1)



vertex Vertex vertex_func(constant Line *lines [[buffer(0)]],
                          constant Uniforms &uniforms [[buffer(1)]],
                          uint vertexId [[vertex_id]],
                          uint instanceId [[instance_id]]) {

    Line line = lines[instanceId];
    float2 position[4];
    float4 color[2];
    color[0] = line.begin.color;
    color[1] = line.end.color;
    position[0] = float2(-0.5,-0.5);
    position[1] = float2(0.5,-0.5);
    position[2] = float2(-0.5,0.5);
    position[3] = float2(0.5,0.5);


    Vertex out;
    out.position.xy = position[vertexId & 0x3];
    out.position.zw = float2(0,1);
    out.color = color[vertexId & 0x1];
    return out;
}

OK, but right now you're drawing very thick "lines" (squares, in fact), with each segment covering a quarter of the screen. So when the above vertex shader gets called you get a lot of overdraw. Please change your original code to something like this:

float thickness = 0.004;
Line line = lines[instanceId];
float2 startPosition = line.begin.position;
float2 endPosition = line.end.position;
float2 v0 = endPosition - startPosition;
float2 p0 = startPosition;
float2 tmp = normalize(v0);
float2 v1 = thickness * float2(tmp.y, -tmp.x);     // 2d vector (x,y) rotated by 90 degrees is (y, -x)

float2 positions[4];
float4 colors[2];
// fill tables for all vertices
positions[0] = p0 + v1;
positions[1] = p0 - v1;
positions[2] = positions[0] + v0;
positions[3] = positions[1] + v0;

colors[0] = line.begin.color;
colors[1] = line.end.color;
Vertex out;
out.position.xy = positions[vertexId & 0x3];
out.position.zw = float2(0, 1);
// vertices 0 and 1 sit at the begin point, 2 and 3 at the end point
out.color = colors[(vertexId >> 1) & 0x1];
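
The geometry in this snippet can be checked on the CPU before touching the shader. The following Swift sketch is an illustration only: `quadCorners` and `quadArea` are hypothetical helpers, and the sample segment and thickness were chosen so that the results are exact in floating point.

```swift
// CPU-side mirror of the quad expansion, useful for checking the math
// outside the shader. All names are illustrative.
func quadCorners(start: SIMD2<Float>, end: SIMD2<Float>, thickness: Float) -> [SIMD2<Float>] {
    let v0 = end - start
    let length = (v0.x * v0.x + v0.y * v0.y).squareRoot()
    let dir = v0 / length                               // assumes start != end
    let v1 = SIMD2<Float>(dir.y, -dir.x) * thickness    // perpendicular to dir
    // Same order as the positions[] table: two corners at start, two at end.
    return [start + v1, start - v1, start + v1 + v0, start - v1 + v0]
}

// Area covered by one segment's quad: length times full width (2 * thickness).
func quadArea(start: SIMD2<Float>, end: SIMD2<Float>, thickness: Float) -> Float {
    let v0 = end - start
    return (v0.x * v0.x + v0.y * v0.y).squareRoot() * 2 * thickness
}

let start = SIMD2<Float>(0, 0)
let end = SIMD2<Float>(0.5, 0)
let corners = quadCorners(start: start, end: end, thickness: 0.25)

// Pick a corner per vertex id with the same mask the shader uses.
for vertexId in 0..<4 {
    print(vertexId, corners[vertexId & 0x3])
}

let area = quadArea(start: start, end: end, thickness: 0.25)
print(area)
```

For comparison, the unit square in the earlier test shader has area 1 in normalized device coordinates, a quarter of the 2 x 2 viewport, whereas a segment of length 0.1 with thickness 0.004 covers about 0.0008. That difference is the overdraw being described.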

Then please test this version with some data, say pairs of random points, or some geometric figure, whatever you like. Then, if performance still disappoints you, please describe:

- How you tested it (e.g. 1000 random line segments all over the screen; a thickness of 0.004 is probably too fine, I'd try something a bit larger)

- On what device/GPU

- What was the performance (FPS, GPU/CPU loads showed on Metal performance monitoring tools)
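
One possible way to produce the "1000 random line segments" test data is sketched below in Swift. It is an assumption-laden illustration, not code from the project: `SeededGenerator`, `Segment`, and `randomSegments` are hypothetical names, colors are left out, and coordinates are drawn from [-1, 1] on the assumption that positions are given in normalized device coordinates. A seeded generator is used so the same data set can be regenerated when comparing runs.

```swift
// Small deterministic generator so a test run can be reproduced.
// This is a plain linear congruential generator, good enough for test data.
struct SeededGenerator: RandomNumberGenerator {
    var state: UInt64
    mutating func next() -> UInt64 {
        state = state &* 6364136223846793005 &+ 1442695040888963407
        return state
    }
}

// Illustrative segment type holding only the two end positions.
struct Segment {
    var begin: SIMD2<Float>
    var end: SIMD2<Float>
}

// Generate `count` segments with both end points inside [-1, 1] x [-1, 1].
func randomSegments(count: Int, seed: UInt64) -> [Segment] {
    var rng = SeededGenerator(state: seed)
    var result: [Segment] = []
    result.reserveCapacity(count)
    for _ in 0..<count {
        let begin = SIMD2<Float>(Float.random(in: -1...1, using: &rng),
                                 Float.random(in: -1...1, using: &rng))
        let end = SIMD2<Float>(Float.random(in: -1...1, using: &rng),
                               Float.random(in: -1...1, using: &rng))
        result.append(Segment(begin: begin, end: end))
    }
    return result
}

let segments = randomSegments(count: 1000, seed: 42)
print(segments.count)
```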


And then I'll try to help. It is hard to relate to something like "I did this and it is slow", because "slow" alone doesn't mean a thing, except that you expected more.

You can also upload the whole Xcode project to some file-hosting site; maybe somebody else could then download it, run it, and see.


Regards

Michal

Hi Michal

Thank you so much.


I appreciate your valuable advice. I just uploaded an example, you can build and run on a real device in Xcode 9.0 beta 6.

https://s3-ap-southeast-1.amazonaws.com/cgwang/MetalKline.zip


Zoom the screen; when the number of vertices reaches a few hundred, the frame rate becomes very low.


Thank you very much indeed!

I have submitted a bug about Metal performance.

Completely solved my problem, 100% perfect, thank you very much!

I'm full of power again😁