i think i have traced the problem to drand48(), the C library random number generator (not a system call), which i use extensively.
it seems that when you call drand48 a lot in the block sent to dispatch, the various threads
that run the block get serialized or otherwise jammed up on this function, so your code doesn't speed
up by the expected amount when you dispatch concurrent blocks. it just runs at the same speed as on 1 thread,
or slower (slower due to the extra overhead of other threads, dispatch, thread syncing, etc).
the thing i found is that this slowdown doesn't seem to show up in the Instruments app. it shows a little
bit of drand48 taking up CPU as expected, but not a huge amount. my guess is that the time is not spent using CPU power
but waiting for other threads to finish their memory access. such waiting may show up in some portion of Instruments i didn't look at.
this post seems to get into the details of why this is occurring:
https://stackoverflow.com/questions/22660535/pthreads-and-drand48-concurrency-performance
will post a workaround. tentatively working on pre-generating some randoms in per-thread arrays, or in a global
array read through a per-thread index. if you use a global index to pull from the pre-generated randoms, it shows a
slowdown similar to drand48's.
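
a rough sketch of the per-thread array idea, not the final workaround. it is plain C, with pthreads standing in for dispatch blocks so it also builds outside of macOS. the names rand_pool, run_pools, free_pools, NTHREADS and POOL_SIZE are made up for illustration, and summing the values is just a placeholder for the real work. the pools are filled with drand48 on one thread first, then each worker reads only its own array through an index that lives on its own stack, so nothing is shared while the threads run.

```c
#define _XOPEN_SOURCE 600   /* expose drand48/srand48 under strict -std modes */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define NTHREADS  4         /* hypothetical worker count */
#define POOL_SIZE 100000    /* hypothetical number of randoms per worker */

/* one pool per worker thread; no pool is touched by two workers */
typedef struct {
    double *values;   /* pre-generated randoms in [0, 1) */
    size_t  count;    /* number of entries in values */
    double  sum;      /* placeholder result, written once at the end */
} rand_pool;

static void *worker(void *arg)
{
    rand_pool *p = arg;
    size_t next = 0;      /* per-thread index, kept on this thread's stack */
    double sum = 0.0;     /* local accumulator, avoids writing shared memory */

    assert(p->count > 0);
    for (size_t i = 0; i < p->count; i++) {
        sum += p->values[next];
        next = (next + 1) % p->count;   /* wraps, so values would repeat */
    }
    p->sum = sum;
    return NULL;
}

/* fill every pool serially with drand48, then consume them concurrently.
   returns 0 on success, -1 on failure; call free_pools either way. */
static int run_pools(rand_pool pools[NTHREADS], long seed)
{
    pthread_t tids[NTHREADS];
    int started = 0, rc = 0;

    for (int t = 0; t < NTHREADS; t++) {
        pools[t].values = NULL;
        pools[t].count = 0;
        pools[t].sum = 0.0;
    }

    srand48(seed);
    for (int t = 0; t < NTHREADS; t++) {
        pools[t].values = malloc(POOL_SIZE * sizeof(double));
        if (pools[t].values == NULL)
            return -1;
        pools[t].count = POOL_SIZE;
        for (size_t i = 0; i < POOL_SIZE; i++)
            pools[t].values[i] = drand48();   /* only one thread calls this */
    }

    for (int t = 0; t < NTHREADS; t++) {
        if (pthread_create(&tids[t], NULL, worker, &pools[t]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tids[t], NULL);
    return rc;
}

static void free_pools(rand_pool pools[NTHREADS])
{
    for (int t = 0; t < NTHREADS; t++) {
        free(pools[t].values);
        pools[t].values = NULL;
    }
}
```

to try it, declare `rand_pool pools[NTHREADS];`, call `run_pools(pools, 12345L)`, read `pools[t].sum`, then call `free_pools(pools)`. the index and the accumulator are locals on purpose: if every thread kept bumping a counter stored next to another thread's counter, the threads could end up fighting over the same memory again. the cost is that a pool is finite, so the real code has to size it for the work or refill it.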