Is there any reference with actual performance numbers for BLAS with AMX? I don't use BLAS in the first place, so to know whether it's worth adopting, I need to know the performance characteristics of BLAS/AMX: what the startup overhead is, and what the actual speedup is on common ops. Without that I can't justify spending any time on it (it could easily wind up being slower than what I already do). At the least, I'd want to see some matrix multiply times with and without it for various matrix sizes. A 5% speedup isn't going to do it; I need to see some "factor-of" improvements.
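In case nobody has published numbers, here is a minimal sketch of how the measurement I'm asking about could be set up. It times the first call separately (as a rough stand-in for startup overhead) from the best steady-state call, across several matrix sizes. Everything in it is my own placeholder, not any library's API: `naive_matmul`, `make_matrix`, `bench`, and `compare` are hypothetical helper names, and the pure-Python baseline is only there so the script runs on its own.

```python
import time

def naive_matmul(a, b):
    """Placeholder baseline: plain multiply of two lists-of-lists."""
    bt = list(zip(*b))  # transpose b so each output entry is a row-by-row dot product
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]

def make_matrix(n, seed):
    """Deterministic n x n test matrix (hypothetical helper, values in [0, 1))."""
    return [[((i * n + j + seed) % 7) / 7.0 for j in range(n)] for i in range(n)]

def bench(matmul, n, repeats=3):
    """Return (first_call_seconds, best_later_call_seconds) for an n x n multiply."""
    a, b = make_matrix(n, 1), make_matrix(n, 2)
    t0 = time.perf_counter()
    matmul(a, b)                      # first call: includes any one-time setup cost
    first = time.perf_counter() - t0
    best = float("inf")
    for _ in range(repeats):          # later calls: steady-state cost
        t0 = time.perf_counter()
        matmul(a, b)
        best = min(best, time.perf_counter() - t0)
    return first, best

def compare(baseline, candidate, sizes=(16, 32, 64)):
    """Rows of (n, baseline_s, candidate_s, candidate_first_call_s, speedup)."""
    rows = []
    for n in sizes:
        _, base = bench(baseline, n)
        first, cand = bench(candidate, n)
        speedup = base / cand if cand > 0 else float("inf")
        rows.append((n, base, cand, first, speedup))
    return rows

if __name__ == "__main__":
    # Both slots use the placeholder here; swap in the real routines to compare.
    for n, base, cand, first, speedup in compare(naive_matmul, naive_matmul):
        print(f"n={n:4d} baseline={base:.6f}s candidate={cand:.6f}s "
              f"first_call={first:.6f}s speedup={speedup:.2f}x")
```

To get a meaningful answer, `baseline` would be replaced with whatever I already do and `candidate` with a wrapper around the BLAS call, using sizes that match the real workload. The gap between the first-call time and the steady-state time is only a crude proxy for startup overhead, and any data conversion done inside the wrapper gets counted against the candidate.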