Using SIMD instructions in a kext

Question

Created Jun ’20

Replies 4

Participants 3

I would like to know whether my kext is responsible for saving and restoring vector registers if I use SIMD instructions in it, or whether the scheduler will do this for me. The SIMD instructions would run on a pool of dedicated threads which are created at driver load time. (The threads block when there is no work to be done.)

Does this behavior differ between the x86_64 and arm64e kernels?

Answer 1

DTS Engineer OP

Apple

Jun ’20

This is actually covered in the Kernel Programming Style:

There are a number of issues that you should consider when deciding whether to use floating point math or AltiVec vector math in the kernel.

First, the kernel takes a speed penalty whenever floating-point math or AltiVec instructions are used in a system call context (or other similar mechanisms where a user thread executes in a kernel context), as floating-point and AltiVec registers are only maintained when they are in use.

Note In cases where AltiVec or floating point has already been used in user space in the calling thread, there is no additional penalty for using them in the kernel. Thus, for things like audio drivers, the above does not apply.

In general, you should avoid using floating-point math or AltiVec instructions in the kernel unless doing so will result in a significant speedup. It is not forbidden, but is strongly discouraged.

The last time I looked at this, the kernel had a lazy mechanism for saving and restoring the non-general-purpose registers. Let’s focus on floating point for the moment. On entry to the kernel, the system disables the FPU. If your kernel code accesses the FPU, it traps within the kernel, which saves the FPU state to the user thread’s context, clears the registers, and then returns to your kernel code. The FPU state must then be restored as you leave the kernel.
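The lazy save/restore sequence described above can be sketched as a small user-space toy model. This is only an illustration of the idea, not kernel code: every name here (toy_fpu, toy_thread, kernel_enter, fp_trap, kernel_fp_write, kernel_exit) is hypothetical and does not correspond to any XNU symbol, and the four-entry register file is an arbitrary stand-in.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model only: all names are hypothetical, not XNU symbols. */
#define NREGS 4

typedef struct {
    double regs[NREGS];   /* stand-in for the FP register file */
    bool enabled;         /* false means the next FP access traps */
    int traps;            /* number of simulated traps taken */
} toy_fpu;

typedef struct {
    double saved[NREGS];  /* user FP state kept in the thread context */
    bool fp_saved;        /* true if a restore is owed on kernel exit */
} toy_thread;

/* Kernel entry: disable the FPU instead of saving it eagerly. */
static void kernel_enter(toy_fpu *fpu, toy_thread *t) {
    fpu->enabled = false;
    t->fp_saved = false;
}

/* Simulated trap: save user state, clear registers, enable the FPU. */
static void fp_trap(toy_fpu *fpu, toy_thread *t) {
    memcpy(t->saved, fpu->regs, sizeof t->saved);
    memset(fpu->regs, 0, sizeof fpu->regs);
    t->fp_saved = true;
    fpu->enabled = true;
    fpu->traps++;
}

/* Kernel code writing an FP register; only the first use traps. */
static void kernel_fp_write(toy_fpu *fpu, toy_thread *t, int r, double v) {
    assert(r >= 0 && r < NREGS);
    if (!fpu->enabled) {
        fp_trap(fpu, t);
    }
    fpu->regs[r] = v;
}

/* Kernel exit: restore user state only if a trap actually saved it. */
static void kernel_exit(toy_fpu *fpu, toy_thread *t) {
    if (t->fp_saved) {
        memcpy(fpu->regs, t->saved, sizeof t->saved);
    }
    fpu->enabled = true;
}
```

In this model, calling kernel_enter, then kernel_fp_write twice, then kernel_exit takes exactly one trap and leaves the user's original register values in place. A kernel path that never touches FP takes no trap and owes no restore, which is the point of doing it lazily.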

I have no reason to believe this general model has changed over the years but, then again, the last time I looked at this in detail AltiVec was bleeding edge (-:

Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@apple.com"

Answer 2

Systems Engineer OP

Apple

Jun ’20

On x86_64, for standalone kernel threads, you should be able to use up to (and including) AVX without issue and without having to manually save and restore SIMD state.
That changes if, for example, your code executes in the context of a user thread (e.g. top-half processing like system calls). In that case, you will need to save and restore any SIMD state you use beyond what the ABI already treats as clobbered across function calls (e.g. %xmm0-%xmm15, %mm0-%mm7, and %st0-%st7 are not preserved across calls, so you need not save/restore these, but AVX state would need to be saved/restored).
Primary interrupt handlers, on the other hand, must save and restore ALL FP state they touch, since the user context they interrupted is not guaranteed to have saved FP state.
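As a compact summary of the three x86_64 cases in this reply, here is a minimal sketch that maps an execution context to what the kext itself has to save. The enum and function names (exec_context, save_policy, simd_save_policy) are made up for illustration and are not part of any kernel API; the mapping only restates the rules given above.

```c
#include <assert.h>

/* Hypothetical names for illustration; not XNU types. */
typedef enum {
    CTX_STANDALONE_KERNEL_THREAD,  /* thread created by the kext itself */
    CTX_USER_THREAD_IN_KERNEL,     /* e.g. system call (top-half) path */
    CTX_PRIMARY_INTERRUPT          /* primary interrupt handler */
} exec_context;

typedef enum {
    SAVE_NOTHING,     /* up to and including AVX, per the reply above */
    SAVE_AVX_STATE,   /* legacy xmm/mm/st state need not be saved */
    SAVE_ALL_TOUCHED  /* every piece of FP state the handler touches */
} save_policy;

/* Restates the three cases from the reply as a lookup. */
static save_policy simd_save_policy(exec_context ctx) {
    switch (ctx) {
    case CTX_STANDALONE_KERNEL_THREAD: return SAVE_NOTHING;
    case CTX_USER_THREAD_IN_KERNEL:    return SAVE_AVX_STATE;
    case CTX_PRIMARY_INTERRUPT:        return SAVE_ALL_TOUCHED;
    }
    assert(0 && "unknown context");
    return SAVE_ALL_TOUCHED;  /* conservative fallback */
}
```

The fallback returns the most conservative policy, so an unexpected context value errs on the side of saving too much rather than too little.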

Answer 3

tstanding OP

Jun ’20

Thank you so much for both of these replies. I will make sure that I only ever use the SIMD code in threads which I create using kernel_thread_start(), and never on user threads which enter the kernel context or in primary interrupt handlers.

I know that saving and restoring the registers will incur a performance hit, but since I will be processing 1 to 3 gigabytes a second in 4 MB chunks, I think it will be worth the expense.

When you state that they work with AVX, does this include the AVX-512 extension or just AVX (the 256-bit instructions)?

Lastly, will this behavior be the same on arm64e ISA Macs?

Answer 4

Systems Engineer OP

Apple

Jun ’20

At this time, AVX-512 is not supported in the kernel, and there are quite a few bear traps when using it in userspace as well (e.g. CPU frequency throttling).

Note that since the kernel does lazy floating-point state restoration, each time your thread is preempted or descheduled and then run again, the first FP instruction will result in an FP restoration trap. These traps are handled very quickly (think on the order of ~1 µs).
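To put the ~1 µs figure in perspective against the workload mentioned earlier in the thread (4 MB chunks at 1 to 3 gigabytes a second), here is a rough back-of-the-envelope sketch. The helper names are hypothetical, and the number of traps per chunk is an assumption you would have to measure; one trap per chunk would correspond to the thread being descheduled about once per chunk.

```c
#include <assert.h>

/* Time to process one chunk, in microseconds. */
static double chunk_time_us(double chunk_bytes, double bytes_per_sec) {
    assert(bytes_per_sec > 0.0);
    return chunk_bytes / bytes_per_sec * 1e6;
}

/* Fraction of per-chunk processing time spent in FP restoration traps.
 * traps_per_chunk is an assumed value, not something the kernel reports. */
static double trap_overhead(double traps_per_chunk, double trap_us,
                            double chunk_us) {
    assert(chunk_us > 0.0);
    return traps_per_chunk * trap_us / chunk_us;
}
```

For example, chunk_time_us(4.0 * 1024 * 1024, 3e9) is roughly 1400 µs, so with one ~1 µs trap per chunk, trap_overhead comes out well under 0.1 percent even at the top of the stated throughput range.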

Use of SIMD in the arm64 kernel is a bit different: there is no lazy state restoration, so the state is restored immediately on context switch.
