Adding concurrency reduces processing time – and then increases it!

I've got a little project running simulations of a tennis match. This mostly involves a great number of loops generating random numbers, if statements to decide who gets the point, and incrementing a tally to keep track of points / games / sets / matches / tournaments.
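To give a flavour of it, the hot loop is shaped roughly like this (a minimal sketch with hypothetical names and probabilities, not my actual code):

```swift
import Foundation

// Rough shape of the inner loop: draw a random number, branch on who
// wins the point, and bump the tally. Names here are placeholders.
struct MatchSimulator {
    var pointsA = 0
    var pointsB = 0

    mutating func playPoint(winProbabilityA: Double) {
        if drand48() < winProbabilityA {
            pointsA += 1
        } else {
            pointsB += 1
        }
    }
}
```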

I run 10000 simulations (to give a resulting probability distribution), and the process takes 21s to come up with an answer if done sequentially as a single task in a task group. So I then run the same number of simulations, but this time adding 2 tasks of 5000 simulations each to the group.
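The splitting looks roughly like this (a simplified sketch; `simulateTournament` and the result type are stand-ins for my real code):

```swift
// Split `total` simulations across `taskCount` child tasks in a task group.
// `simulateTournament()` stands in for one full tournament simulation.
func simulateTournament() -> Int {
    // ... play out one tournament, return the winner's index ...
    0
}

func runAll(total: Int, taskCount: Int) async -> [Int] {
    let perTask = total / taskCount
    return await withTaskGroup(of: [Int].self) { group in
        for _ in 0..<taskCount {
            group.addTask {
                // Each task owns its own simulator state; nothing is shared.
                (0..<perTask).map { _ in simulateTournament() }
            }
        }
        var all: [Int] = []
        for await chunk in group {
            all.append(contentsOf: chunk)
        }
        return all
    }
}
```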

This time the result takes 12.5s. Woohoo, I think. That's a 40% reduction in wall-clock time.

Of course the next thought is: if less is more, then how much more will more be?!? So I keep increasing the task count.

To my disappointment, not only did the marginal returns decrease, they actually went negative.

  • 1x = 21s
  • 2x = 12.5s
  • 3x = 9.9s
  • 4x = 10.6s
  • 5x = 10.7s
  • 100x = 13.9s

Can someone tell me whether this is similar to this posting, where the suspicion was memory bandwidth limitations?

I can't really include any code as it is quite long. I can say I'm using drand48 for randomness (as it's much faster than Double.random). Everything on the simulation side is a struct, with classes linking the matches together (so winners of one round move on to the next).
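For reference, the two random APIs being compared (worth noting that drand48 keeps its seed in global state shared by all callers, per POSIX):

```swift
import Foundation

srand48(42)                        // seed drand48's global, process-wide state once
let a = drand48()                  // uniform Double in [0, 1), plain C library call
let b = Double.random(in: 0..<1)   // stdlib equivalent, backed by the system RNG
```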

There is one tournament simulator class created for every task; the match simulator structs it manages are created once but mutate constantly.

This is on an M1 Max chip running macOS 12 and Xcode 14 (beta 5). I'm also running the app as an archived (i.e. release-optimised) build.

If memory bandwidth is the most likely problem, then I can live with knowing that. I just don't want to assume it's a bottleneck that can't be overcome and miss a chance to optimise my code.

It may be memory bandwidth or some other common factor. It's hard to say without seeing some code.

One thing to think about is cache line sharing (often called "false sharing"). If your objects are small, they may end up close enough together that they occupy the same cache line, which is almost as bad as actually accessing the same memory location from different cores.
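Here's a toy sketch of the effect (not your code; it assumes 128-byte cache lines, which is what Apple Silicon uses): two counters incremented from two cores at once, first sitting next to each other, then padded onto separate lines.

```swift
import Foundation

// Two counters bumped from two cores at once: adjacent (8 bytes apart,
// same cache line) versus padded 16 Ints (128 bytes) apart.
// Build without optimisation, or the compiler may fold the loop into one add.
func run(label: String, strideInInts: Int) {
    let counters = UnsafeMutablePointer<Int>.allocate(capacity: strideInInts * 2)
    counters.initialize(repeating: 0, count: strideInInts * 2)
    defer { counters.deallocate() }

    let start = Date()
    DispatchQueue.concurrentPerform(iterations: 2) { core in
        let slot = counters + core * strideInInts
        for _ in 0..<50_000_000 {
            slot.pointee += 1
        }
    }
    print(label, Date().timeIntervalSince(start))
}

run(label: "same cache line:", strideInInts: 1)
run(label: "separate lines: ", strideInInts: 16)
```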

Frankly, the speedup you are seeing is not at all bad compared to some of the things I've tried to do.

Question for anyone else reading this: what tools (e.g. in Instruments) are available for studying this?

Thanks for your response. There are so many hidden rocks under the surface of technology that take a lifetime to discover.

I would be interested in hearing what tools are available. Even the ones I know about in Instruments overwhelm me with detail, and I can't imagine what usable information there could be in the haystack of stochastic simulations over multiple cores.

I think I'll experiment with shifting where the concurrency occurs in my simulation (e.g. whether it's better to have two independent tournaments simulated in parallel, or have one simulation with matches "playing" at the same time) and see what I discover.
