Hi,

I've spent a lot of time unsuccessfully trying to figure out what I need to do to avoid stalls in performance-critical pieces of my NEON code. For example, here is a screenshot of CPU profiler results:

[Screenshot: CPU profiler results]

This is the function that I wrote for testing the issue:

    // CheckSolid_mem(unsigned char const* src, unsigned int* dst, int count):
    function CheckSolid_mem, export=1
        add         r12, r2, #1                        // r12 = count + 1 (loop counter)
    0:
        sub         r12, r12, #1
        mov         r2, r0                             // r2 = src
        mov         r3, r1                             // r3 = dst
        cmp         r12, #0
        vld1.32     {d16, d17, d18, d19}, [r2]!        // load sixteen 32-bit ints into q8-q11
        vld1.32     {d20, d21, d22, d23}, [r2]!
        vdup.32     q12, d16[0]                        // broadcast the first int
        vceq.i32    q8, q8, q12                        // compare every int against the first
        vceq.i32    q9, q9, q12
        vceq.i32    q10, q10, q12
        vceq.i32    q11, q11, q12
        vand        q8, q8, q9                         // combine the comparison masks
        vand        q10, q10, q11
        vand        q8, q8, q10
        vand        d16, d16, d17
        vpmin.u32   d16, d16, d16                      // d16 = all ones only if all sixteen are equal
        vbic.i16    d24, #0x7                          // bitwise ops on d24 (copy of the first int)
        vbic.i16    d24, #0x700
        vorr.i32    d24, #0x2000000
        vand        d16, d16, d24                      // apply the mask to the value in d24
        vst1.32     {d16}, [r3]!                       // the store that the profiler blames
        bgt         0b
        mov         r0, r2
        bx          lr
    endfunc

In short, CheckSolid_mem reads sixteen 32-bit ints and checks whether they are all equal (building a mask of all zeros or all ones in d16 based on that equality). It then does a few bitwise ops on d24 and does a vand of the mask in d16 with the result in d24. The final result is stored in the memory pointed to by the dst pointer. This entire block is repeated count times. For testing I run it with count = 100M or so, and I get that profile picture.

As you can see, I get that stall on the `vst1.32 {d16}, [r3]!` line, which is common in my NEON code. That instruction alone takes 50% of the entire runtime of the function. This is something that's expected to happen in NEON code when trying to store the value of a NEON register to memory or to an ARM register. I want to understand why exactly this happens and what I have to do to avoid or mitigate the issue.
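To make the description of CheckSolid_mem above easier to check, here is a rough scalar model in plain C of what a single loop iteration appears to compute. This is only a sketch based on my reading of the assembly: the name `check_solid_ref` is hypothetical, and the constant `0x07070707` assumes that the two `vbic.i16` instructions together clear the low three bits of every byte of the first int.

```c
#include <stdint.h>

/* Hypothetical scalar model of one iteration of CheckSolid_mem.
 * Reads sixteen 32-bit ints from src and writes two 32-bit ints to dst,
 * matching the 8-byte store done by `vst1.32 {d16}, [r3]!`. */
static void check_solid_ref(const uint32_t *src, uint32_t dst[2])
{
    uint32_t first = src[0];
    uint32_t mask = 0xFFFFFFFFu;   /* all ones while every int equals the first */

    for (int i = 0; i < 16; i++) {
        if (src[i] != first)
            mask = 0;              /* any mismatch clears the mask */
    }

    /* vbic.i16 #0x7 and vbic.i16 #0x700 clear 0x0707 in each 16-bit lane,
     * i.e. 0x07070707 per 32-bit int; vorr.i32 then sets 0x02000000. */
    uint32_t value = (first & ~0x07070707u) | 0x02000000u;

    dst[0] = value & mask;
    dst[1] = value & mask;
}
```

With sixteen copies of 0x12345678 as input, this model writes 0x12305078 to both output ints; if any input differs from the first, it writes zeros.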
Normally I know what to do in such a case, but when dealing with iPhones I don't get why it never works: no matter what I try and no matter how I reshuffle my code, I always get these stalls in places where they simply kill the performance of my code.

From my understanding, the code shown has a few issues:

1) The result in d16 isn't immediately available, so some latency is added before it can be stored. If I store d2 instead (which isn't used anywhere in that function), that line takes roughly 200x instead of 255x, so supposedly the stall from writing a NEON register to memory is 200x in this function.

Normally there shouldn't be any stalls from writing a NEON register to memory. All the specs list vst1 as an opcode that takes just a few cycles, and in my case I use the simplest case, which takes 1 cycle. Now, if I tried to access that memory using an ARM register, then I would expect a stall on reading that memory, as it should take 10-20 cycles before this piece of memory is available on the ARM side. The same goes for moving results from NEON registers to ARM registers: `vmov.32 r3, d16[0]` takes a cycle, but if I then try to access r3 I'll get the same epic stall. In short, accessing any data on the ARM side will stall if that data originated on the NEON side.

This was my understanding of NEON for many years. To avoid stalls when dealing with NEON->ARM transfers, you run some other unrelated code for 10-20 more cycles; the data then becomes ready on the ARM side and can be accessed without that epic stall that takes 50% of the function's runtime.

So apparently there is something wrong with either the profiler, my understanding of NEON, or even Apple's chips, but I cannot figure out what I could possibly do to avoid these stalls.
I tried inserting about 50 nop instructions before storing the NEON register, and I tried adding about 50 nop instructions after storing it: the timing changes, but no matter what, this specific instruction always takes a hugely disproportionate amount of time compared to all the other instructions.

Can some engineers from Apple clarify what's going on with Apple's chips and why I cannot avoid these stalls? I spent weeks trying different approaches without any success. People experienced in this type of high-performance optimization strongly advise me against using iPhones for this type of profiling work, as they say that I will never get correct results, or the results I would normally get from ARM chips. But my target software mainly runs on iPhones, so I would really like to understand what is going on.

I use an iPhone 6 for development. I build this test code in 32-bit ARM mode.
Posted
by pps83.