Pack high bit of every byte in ARM NEON, for 64 bytes like AVX512 vpmovb2m?
__builtin_ia32_cvtb2mask512() is the GNU C builtin for vpmovb2m k, zmm. The Intel intrinsic for it is _mm512_movepi8_mask. It extracts the most-significant bit from each byte, producing an integer mask. The SSE2 and AVX2 instructions pmovmskb and vpmovmskb (_mm_movemask_epi8 and _mm256_movemask_epi8) do the same thing for 16- or 32-byte vectors, producing the mask in a GPR instead of an AVX-512 mask register.

I would like an implementation for ARM that is faster than the one below.
I would like an implementation for ARM NEON.
I would like an implementation for ARM SVE.

I have attached a basic scalar implementation in C. For those trying to implement this on ARM: we care about the high bit, but each byte's high bit (in a 128-bit vector) can easily be shifted to the low bit using the ARM NEON intrinsic vshrq_n_u8().

Note that I would prefer not to store the bitmap to memory; it should just be the return value of the function, similar to the following function.

    #define _(n) __attribute((vector_size(1<<n),aligned(1)))
    typedef char V _(6);      // 64 bytes, 512 bits
    typedef unsigned long U;  // assumed to be 64 bits wide (LP64)
    #undef _

    U generic_cvtb2mask512(V v) {
        U mask = 0;
        int i = 0;
        while (i < 64) {
            // shift mask left by 1 and OR in the MSB of byte v[i];
            // v[0] ends up in bit 63 and v[63] in bit 0
            // (vpmovb2m itself puts byte 0 in bit 0)
            mask = (mask << 1) | ((v[i] & 0x80) >> 7);
            i++;
        }
        return mask;
    }

This is also a dup of: https://stackoverflow.com/questions/79225312
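For reference, here is a minimal sketch of one possible NEON approach, not a measured or tuned implementation. It assumes AArch64 (vpaddq_u8 is A64-only) and takes a pointer to the 64 bytes rather than the V type from the question. The names msb_mask64, lane_msb_bits and lane_bits are hypothetical. It uses the vpmovb2m bit order (byte 0 in bit 0), which is the mirror image of the scalar loop in the post. Each byte is turned into 0xFF or 0x00 with an arithmetic shift, ANDed with a per-lane bit, and the four vectors are then folded together with pairwise adds. On other targets the block falls back to a plain scalar loop so the result can be cross-checked.

```c
#include <stdint.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>

/* Hypothetical helper: for each byte of v, keep that lane's bit
   (1 << (lane % 8)) if the byte's MSB is set, else 0. */
static inline uint8x16_t lane_msb_bits(uint8x16_t v, uint8x16_t bits)
{
    /* Arithmetic shift right by 7: 0xFF if the MSB was set, else 0x00. */
    int8x16_t spread = vshrq_n_s8(vreinterpretq_s8_u8(v), 7);
    return vandq_u8(vreinterpretq_u8_s8(spread), bits);
}

/* Bit i of the result is the most-significant bit of p[i], i = 0..63. */
uint64_t msb_mask64(const uint8_t *p)
{
    static const uint8_t lane_bits[16] = {
        0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
        0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80
    };
    const uint8x16_t bits = vld1q_u8(lane_bits);

    uint8x16_t t0 = lane_msb_bits(vld1q_u8(p),      bits);
    uint8x16_t t1 = lane_msb_bits(vld1q_u8(p + 16), bits);
    uint8x16_t t2 = lane_msb_bits(vld1q_u8(p + 32), bits);
    uint8x16_t t3 = lane_msb_bits(vld1q_u8(p + 48), bits);

    /* Three rounds of pairwise adds collapse every group of 8 lanes into
       one byte; the lane bits are distinct, so the adds never carry. */
    uint8x16_t s0 = vpaddq_u8(t0, t1);
    uint8x16_t s1 = vpaddq_u8(t2, t3);
    s0 = vpaddq_u8(s0, s1);
    s0 = vpaddq_u8(s0, s0);

    /* The low 64 bits now hold the 8 mask bytes in input order. */
    return vgetq_lane_u64(vreinterpretq_u64_u8(s0), 0);
}

#else

/* Portable scalar fallback with the same bit order, for cross-checking. */
uint64_t msb_mask64(const uint8_t *p)
{
    uint64_t mask = 0;
    for (int i = 0; i < 64; i++)
        mask |= (uint64_t)(p[i] >> 7) << i;
    return mask;
}

#endif
```

The mask comes back as the function's return value and is never stored to memory. If the bit order of generic_cvtb2mask512() is required instead, the 64-bit result would need to be bit-reversed afterwards.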
Nov ’24