Revisiting the recommendations from WWDC 2017-706 regarding GCD queue hierarchies.

TL;DR: Why is a hierarchy of serial queues the recommended way to manage concurrency in a modern application?


Years later, the recommendations made in WWDC 2017-706 "Modernizing Grand Central Dispatch Usage" regarding the use of a hierarchy of serial queues to manage concurrency in an application remain unclear to me. Old posts on former Apple mailing lists, StackOverflow and the Swift Forums add to the confusion. Hopefully there's an opportunity for some clarity here.


(I'm writing from the perspective of a macOS application developer.)


In the WWDC video, to improve concurrency performance, it's recommended that you split your application into sub-systems and back each sub-system with a serial queue. It's then recommended that those sub-systems target a single, "root" queue that is also a serial queue. The talk mentions that the use of serial queues improved concurrency performance in many of Apple's own applications.


But with laptops and desktops having so many cores, I'm struggling to reconcile how running everything through a single serial queue helps with concurrency. On the surface, it feels like you'd be seriously under-utilizing the available cores.


For example, I have an application that has the following sub-systems:


  • Export Service - Used for exporting images or videos.
  • Rendering Service - Used for rendering thumbnail previews.
  • Caching Service - Used for caching random data.
  • Database Service - Used for reads and writes to the database.


With the exception of maybe the database service, I want each of the other services to run as many concurrent requests as is reasonable for the given machine. So each of those services is backed by a concurrent queue with an appropriate quality-of-service level. On a multi-core system I should be able to render multiple thumbnails at once, so using a serial queue does not make any sense. The same goes for exporting files: an export of a small image should not have to wait for the export of a large video to finish in front of it. So a concurrent queue is used.
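A minimal sketch of that setup, with hypothetical queue labels and QoS levels (these are illustrative, not prescriptive):

```swift
import Foundation

// One queue per service; all but the database queue are concurrent.
let exportQueue   = DispatchQueue(label: "app.export",    qos: .utility,       attributes: .concurrent)
let renderQueue   = DispatchQueue(label: "app.rendering", qos: .userInitiated, attributes: .concurrent)
let cachingQueue  = DispatchQueue(label: "app.caching",   qos: .utility,       attributes: .concurrent)
let databaseQueue = DispatchQueue(label: "app.database",  qos: .userInitiated) // serial, to keep writes ordered

// A small image export does not wait behind a large video export.
exportQueue.async { print("export small image") }
exportQueue.async { print("export large video") }
```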


Along with using sub-systems, the WWDC talk recommends that all sub-systems target a single, root serial queue. This doesn't make much sense to me either, because it implies there's no reason to use a concurrent queue anywhere in your tree of queues: its concurrency is negated by the serial queue it targets. At least that's how I understand it.


So if I did back each service by a serial queue, then target a root serial queue, I'd be in the situation where a thumbnail request has to wait for an export request to complete, which is not what I would want at all.


(The WWDC talk also makes heavy use of DispatchSources, but those are serial in execution as well.)


For the example sub-systems above, I actually use a hierarchy of concurrent queues that all target a root concurrent queue. Each sub-system runs at a different quality of service to help manage execution priority. In some cases, I manually throttle the number of concurrent requests in a given sub-system based on the available cores, as that seems to help a lot with performance. (For example, generating thumbnails of RAW files where it's better for me to explicitly restrict that to a maximum limit rather than relying on GCD.)
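Concretely, the shape of that hierarchy is roughly this (names are made up, and the manual throttling is omitted here):

```swift
import Foundation

// A single concurrent root queue for the whole app (hypothetical label).
let rootQueue = DispatchQueue(label: "app.root", attributes: .concurrent)

// Each sub-system gets its own concurrent queue at its own QoS, targeting the root.
let renderQueue = DispatchQueue(label: "app.rendering", qos: .userInitiated,
                                attributes: .concurrent, target: rootQueue)
let exportQueue = DispatchQueue(label: "app.export", qos: .utility,
                                attributes: .concurrent, target: rootQueue)

// On a multi-core machine, several of these can be in flight at once.
for index in 0..<6 {
    renderQueue.async { print("render thumbnail \(index)") }
}
```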


As someone who builds a ton of concurrency into their apps, and as someone who felt that they had a reasonably good grasp on how to use GCD, I've never been able to understand why a hierarchy of serial queues is the recommended way for doing concurrency in a modern app. Hopefully someone can shed a bit more light on that for me.

WWDC 2017-706 "Modernizing Grand Central Dispatch Usage"

For those folks following along at home, that’s WWDC 2017 Session 706 Modernizing Grand Central Dispatch Usage.

Along with using sub-systems, the WWDC recommends that all sub-systems target a single, root serial queue.

It does? Can you point me to where it says that? It’s been a while since I watched the talk, but slide 153 says “Use a fixed number of serial queue hierarchies”, which means one root serial queue per subsystem, with a tree of queues layered on top of that. In short, it’s a small forest of trees, one for each subsystem in your process.

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

Consider the slide at 20:55. It appears to show two dispatch sources (S1, S2) that each target their own serial queue (Q1, Q2), which in turn target a single serial queue (EQ). My interpretation of that slide is that all of the work is serialized by the one root queue, which means that S1 and S2 do not provide any additional concurrency.
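My reading of that slide, reconstructed as code (the source type and handlers are my guesses, not Apple's sample):

```swift
import Foundation

// EQ: the root "mutual exclusion" (serial) queue from the slide.
let eq = DispatchQueue(label: "app.eq")

// Q1 and Q2: serial queues that target EQ.
let q1 = DispatchQueue(label: "app.q1", target: eq)
let q2 = DispatchQueue(label: "app.q2", target: eq)

// S1 and S2: dispatch sources whose handlers run on Q1 and Q2,
// and are therefore ultimately serialized by EQ.
let s1 = DispatchSource.makeUserDataAddSource(queue: q1)
let s2 = DispatchSource.makeUserDataAddSource(queue: q2)

s1.setEventHandler { print("S1 handled an event") }
s2.setEventHandler { print("S2 handled an event") }
s1.activate()
s2.activate()

s1.add(data: 1)   // merge an event into S1
s2.add(data: 1)   // merge an event into S2
```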


A minute later, the speaker mentions that the order of the work items is guaranteed by the root "mutual exclusion queue", but that would have been the case anyway with a single dispatch source.


A few more slides later, there's one titled "QoS and Target Queue Hierarchy" which attempts to explain why you'd want to use multiple dispatch sources. In this example, S1 has a low QoS while S2 has a high QoS. But since they both target a root queue, there's a good chance that the entire tree will run at the higher QoS if S2 is adding a lot of work items. That means that low-priority items, added by S1, will get boosted to a higher QoS, which is unlikely to be what I'd want. I'd much rather the system context switch over to the higher-QoS work item, execute it, then go back to the lower-QoS work item. That isn't possible in the presented design because of the root queue.


At 26:23, another example is presented using a "single mutual exclusion queue" as the root queue. In this example, the problem really seems to be that the jobs are too small to warrant individual work items. But the solution presented means that only a single event handler can be running at once.


At 28:30 the subject of subsystems is brought up. It's very possible I'm mis-interpreting this part of the talk. The solutions presented involve each sub-system targeting a serial queue. (Main Queue, Networking Queue, Database Queue.) Excluding the main queue because it's special, why would I want the networking and database queues to be serial? A long running work item on either would significantly slow down the overall application. Multiple requests to read from a database should be allowed to happen concurrently, IMHO.


My earlier comment regarding a single, root queue for the entire app was somewhat influenced by the subsequent slides that suggest using a "Fixed number of serial queue hierarchies."


If you look at 31:30, "Mutual Exclusion Context", they show a simple tree with a root serial queue (EQ). On the next slide, they reference EQ as the application queue, or at least that's how I read it.


Finally, consider the slide at 43:00 "Protecting the Queue Hierarchy". The first bullet point suggests that one should "Build your queue hierarchy bottom to top." In that diagram, I see EQ as a root queue for the application, with Q1/S1 and Q2/S2 being task related or subsystems if the application is large enough.


But even if I was wrong to conclude that there should be a single root serial queue, I'm still conflicted as to why I'd want all my subsystems to have serial queues. If all of my tasks are long enough to warrant their own work items, then I want as many of them running as is reasonably possible given the cores available to me. If I'm rendering thumbnails on a MacBook Pro, then I might want 4-6 thumbnail requests to run concurrently. If I'm running on a Mac Pro, then I can handle a lot more. I can't have that flexibility if I build a hierarchy of serial queues, yet that seems to be Apple's recommendation in some of the more recent WWDC videos related to GCD.
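By "that flexibility" I mean something as simple as deriving the limit from the hardware; a trivial sketch (the numbers are arbitrary):

```swift
import Foundation

// Hypothetical: pick a thumbnail concurrency limit from the core count,
// leaving some headroom for the rest of the app.
let cores = ProcessInfo.processInfo.activeProcessorCount
let maxConcurrentThumbnails = max(2, min(cores - 2, 12))
print("Allowing \(maxConcurrentThumbnails) concurrent thumbnail renders on \(cores) cores")
```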


Follow-up:


Proper use of GCD is obviously quite dependent on how your application is architected, so I'm clearly approaching this from my app's perspective. Out of interest, my app's architecture looks something like this:


  • A half-dozen or so "managers", that one could consider to be sub-systems.
  • Each manager has a single, concurrent execution queue with an appropriate QoS level.
  • Each manager is responsible for a certain type of request. (Database, Export, Caching, Rendering, etc...)
  • Requests are almost always submitted to each manager from the main thread as a result of a user event.
  • Requests are immutable and independent of each other. There are no locks or shared resources involved.
  • Requests are allowed to execute out-of-order where that's explicitly permitted. (e.g., two database reads can happen out-of-order, but writes cannot.)
  • Requests are relatively high-level and, with very few exceptions, run within their own work item. (i.e.: A request does not, in turn, spawn other GCD work items.)
  • An example request might be exporting an image, rendering a thumbnail, performing a database operation, etc.
  • A request might use a framework, like AVFoundation or Core Image, that in turn uses multi-threading. This is where some manual throttling needs to happen: if you have six cores and try to decode six RAW files concurrently, you'll get worse performance than decoding two or three concurrently, since Image I/O spawns a bunch of threads itself.


Using serial queues in each manager/sub-system would reduce the concurrency in my app, and I've tested that by limiting how many concurrent thumbnail requests I allow at any given time and the degradation is visually obvious.

So my app makes almost exclusive use of concurrent queues, with the odd barrier block when some form of synchronization is required. However, this seems very much at odds with the above-mentioned WWDC talk, as well as the tips listed on this page:


https://gist.github.com/tclementdev/6af616354912b0347cdf6db159c37057
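(For reference, the "odd barrier block" above is just the usual reader/writer pattern on a concurrent queue; a minimal sketch with a made-up cache type:)

```swift
import Foundation

// Hypothetical cache: concurrent reads, exclusive (barrier) writes.
final class ThumbnailCache {
    private var storage: [URL: Data] = [:]
    private let queue = DispatchQueue(label: "app.caching", attributes: .concurrent)

    func thumbnail(for url: URL) -> Data? {
        queue.sync { storage[url] }            // many reads may run concurrently
    }

    func store(_ data: Data, for url: URL) {
        queue.async(flags: .barrier) {         // waits for in-flight reads, then runs alone
            self.storage[url] = data
        }
    }
}
```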

I think the key misunderstanding here is one of scale. The presenters are assuming that each component within a subsystem is using a serial queue for its own serialisation purposes, and that the resulting work items are relatively small. They recommend that you use a queue hierarchy to avoid having a ridiculously high number of serial queues all competing for the limited number of computation resources available. That setup results in a lot of cache sloshing as the small work items bounce between threads that are scheduled on different CPUs.

Let’s focus on networking for the moment. There’s very little point running your networking code on multiple different threads because the network interface is an inherent serialisation point. Thus, all the components within your networking subsystem should ultimately target a single networking serialisation queue that ensures serialisation (simplifying your code) and avoids cache sloshing.
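In code, that might look something like this (labels are illustrative, not from the talk):

```swift
import Foundation

// One serialisation point for the whole networking subsystem.
let networkingQueue = DispatchQueue(label: "app.networking")

// Components keep their own queues for labelling and structure,
// but everything ultimately funnels through the one serial target.
let connectionQueue = DispatchQueue(label: "app.networking.connections", target: networkingQueue)
let parsingQueue    = DispatchQueue(label: "app.networking.parsing",     target: networkingQueue)

connectionQueue.async { print("handle socket event") }   // serialized with…
parsingQueue.async    { print("parse a response") }      // …this, via the shared target
```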

The drawback here is, as you noted, latency. If some component within your networking code sits on the CPU for too long, it blocks all other networking code, which is something you want to avoid. The solution there is to model long-running CPU-bound work kinda like you model long-running I/O: as an async operation that’s handled by its own subsystem.

The only part of this that I’m fuzzy on is how to implement a subsystem, like your thumbnail rendering subsystem, where the goal is to balance work across multiple CPUs [1]. If all the work comes in at once you can use dispatch_apply for this, but that’s not the case here.
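(In Swift, dispatch_apply surfaces as DispatchQueue.concurrentPerform; a minimal example assuming the whole batch really is known up front:)

```swift
import Foundation

// concurrentPerform blocks the caller until every iteration has finished,
// so it only fits the "all the work arrives at once" case.
let thumbnailCount = 12
DispatchQueue.concurrentPerform(iterations: thumbnailCount) { index in
    print("render thumbnail \(index) on an available worker thread")
}
```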

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

[1] This assumes that the work isn’t serialised elsewhere. For example, it’s unlikely that you’re doing your thumbnail rendering manually. You’re more likely to be calling a system service (Core Image, Metal) with its own serialisation points (like the GPU). However, this point still stands.

The only part of this that I’m fuzzy on …

This has bugged me for a long time so I used your question as an opportunity to raise it with the Dispatch team. Sadly, it turns out that this is a missing part of the Dispatch story (r. 34322779). If you’d like to see better support for this sort of thing, please file your own bug against Dispatch describing your specific use case. The team is happy to receive such bugs because it allows them to prioritise their future efforts.

As to what you can do about this now, there’s a variety of ad hoc solutions, none of which are ideal. My number one recommendation is that you try to carefully separate I/O from CPU. Once you do this you have a clear path forward in each case:

  • You can do your I/O on a serial queue because each individual work item will be fast. It might make sense to use more serial queues based on the hardware setup. For example, if you’re reading from the network and writing to the disk, there’s no contention at the hardware level and thus two serial queues could help.

  • For pure CPU work [1] there’s an obvious upper bound to the amount of parallelism, and you can deal with that by deploying one thread per core. How you set this up depends on the specific workload. If all the work items are of roughly the same size, you could simply round robin between a set of serial queues (a minimal sketch follows this list). If not, I can think of two options:

    • Use a thread API and have worker threads pull work off a queue [2] you manage. And yes, this means you’re kinda reimplementing Dispatch, hence the first paragraph of this post.

    • Use a concurrent queue and externally limit the number of work items you push to that queue. This is necessary to prevent overcommit, and hence thread explosion.
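A minimal sketch of the round-robin variant mentioned above (the worker labels and job loop are hypothetical):

```swift
import Foundation

// One serial queue per core; spread roughly-equal-sized work items across them.
let coreCount = ProcessInfo.processInfo.activeProcessorCount
let workers = (0..<coreCount).map { DispatchQueue(label: "app.cpu.worker.\($0)") }

var nextWorker = 0
func submit(_ work: @escaping () -> Void) {
    // Assumes submit() is always called from the same thread (e.g. the main thread).
    workers[nextWorker].async { work() }
    nextWorker = (nextWorker + 1) % workers.count
}

for job in 0..<20 {
    submit { print("CPU-bound job \(job)") }
}
```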

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

[1] I’m ignoring page faults here, but hopefully that won’t be a significant source of I/O. If it turns out that it is, that warrants a separate investigation.

[2] Here I’m referring to a queue in the general sense, not a Dispatch queue.

Very much appreciate the double-reply, thank you.


The thumbnail rendering situation was something I came across quite a while ago. I was using dispatch_apply to request thumbnails of dozens or more images at once, but that quickly overwhelmed disk I/O and various other frameworks like Image I/O. Back then, I posted a question on StackOverflow about exactly this issue:


https://stackoverflow.com/questions/23599251


In the end I just ended up using a normal concurrent queue with an explicit semaphore to throttle in-flight work items, similar to NSOperationQueue.maxConcurrentOperationCount. It works for now, but it's an ad-hoc solution based on the number of reported CPU cores.
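Roughly, the throttle looks like this (a simplified sketch; the serial gatekeeper queue, labels and limit are my own choices, not a canonical pattern):

```swift
import Foundation

// Concurrent queue plus a counting semaphore to cap in-flight work,
// similar in spirit to NSOperationQueue.maxConcurrentOperationCount.
let renderQueue    = DispatchQueue(label: "app.rendering", qos: .userInitiated, attributes: .concurrent)
let schedulerQueue = DispatchQueue(label: "app.rendering.scheduler")  // serial gatekeeper
let slots = DispatchSemaphore(value: max(2, ProcessInfo.processInfo.activeProcessorCount / 2))

func requestThumbnail(_ index: Int) {
    schedulerQueue.async {
        slots.wait()                      // blocks only the gatekeeper, not the caller
        renderQueue.async {
            defer { slots.signal() }
            print("rendering thumbnail \(index)")   // stand-in for the real render
        }
    }
}

for index in 0..<50 { requestThumbnail(index) }
```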


I fully accept that there are no hard-and-fast rules for this and each application is somewhat different. Comparing my app's architecture to the talking points in the WWDC video, I feel like the video is using rather small dispatch work items while my app uses rather large ones. Most of my work items are "jobs", like exporting an image, fetching results from a database or rendering a thumbnail.


For those types of operations, I'm not aiming for maximum throughput, but rather for the best user experience. For example, rendering three thumbnails concurrently might take longer to complete in an absolute sense, but if two of the thumbnails are for small images and one is for a massive panorama, then it's very likely the two small thumbnails will finish sooner and thus be shown to the user quickly. Had they had to wait for the panorama to finish, the user would see a blank screen for longer than needed. At least that's how I like to design things.


(This is particularly important for thumbnail rendering because it can be very hard to cancel a rendering that is in-progress. Many of the Image IO and QuickLook APIs don't have a way to cancel their requests, thus you're stuck waiting for a thumbnail to be generated even if the presenting view has scrolled off the screen.)

Similar concurrency thoughts apply to exporting. I'm OK with a small export job pre-empting a longer export job, because that allows the smaller job to complete sooner and the user to get the resulting files sooner. If a user initiates a small export while a large export is already underway, then chances are they want access to that small export ASAP. They shouldn't have to wait for the larger one to complete.

I realize this causes the total time to completion for all outstanding requests to increase, but from the user's perspective (IMHO), the system appears more performant. The only way I know how to do this is with a "tree of concurrent queues", so perhaps my use-case, architecture and trade-offs are significantly different than those used in the WWDC talking points.


Additional Commentary:


Earlier you pointed out that I was likely wrong in concluding that an application should have a single, serial root queue. Given this discussion, I agree. But what about in the context of a single, concurrent root queue?

I remember reading in older mailing lists that we shouldn't explicitly target any of the global queues. Perhaps the introduction of QoS was the reason? Regardless, my understanding is that if you have a concurrent queue running at a given QoS, Dispatch will boost that queue's QoS to the highest QoS of any in-flight work item and then return it to the original QoS once that work item has completed.

I've applied that strategy by having a single, concurrent root queue in my application whose QoS is utility. The various sub-systems have their own concurrent queues with higher QoS values, but all target the root queue. This appears to work OK, but I'm sort of just guessing with this strategy. Sadly, imptrace, the tool for monitoring QoS boosting, does not appear to work in Catalina.
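Concretely, the setup I'm describing (labels and QoS values are just my current guesses):

```swift
import Foundation

// Single concurrent root queue for the app, deliberately kept at utility QoS.
let rootQueue = DispatchQueue(label: "app.root", qos: .utility, attributes: .concurrent)

// Sub-system queues run at higher QoS but target the root.
let renderQueue   = DispatchQueue(label: "app.rendering", qos: .userInitiated,
                                  attributes: .concurrent, target: rootQueue)
let databaseQueue = DispatchQueue(label: "app.database", qos: .userInitiated, target: rootQueue)

// My assumption: while the userInitiated work below is in flight the hierarchy is
// boosted, and it settles back to utility afterwards.
renderQueue.async { print("render at (boosted) userInitiated") }
rootQueue.async  { print("housekeeping at utility") }
```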


Should I use a single concurrent queue as the root queue of an application? Does it make sense to run it with a low QoS and rely on boosting from targeted queues, or is that just silly?

Maybe this is just a case of premature optimization.


You have to remember that virtually everything Apple does is targeted towards its primary platforms and apps. So that WWDC video pretty much only applies to game developers on iOS. Unless you are heavy into data processing and GPU rendering, there is no point in getting fancy with concurrency. You want to keep the UI responsive, so that means dispatching tasks asynchronously on a background thread. Then you'll need to update said UI, which means dispatching tasks asynchronously back onto the main queue. For the vast majority of use cases, that's it.


If you have some slow-running tasks, you can carefully do those in parallel. In most cases, this means networking. I/O is really tricky: you can do very limited I/O reading in parallel; try too much and it just gets slower. Anything that involves an operating system service (like generating icons) is also tricky. In many cases, these can only be dispatched onto the main thread, and that is rarely documented.


GCD, overall, has very little documentation. If you try to get fancy, you'll find yourself several blocks into asynchronous code and realize there are maybe 30 distinct things that could happen at a given point. Then you go looking for documentation to figure out how it works, but there isn't any at that level. And as you have discovered, GCD pretty much assumes nothing ever fails. You don't know something has gone wrong until you max out the thread count and/or RAM, and there is no way to recover at that point.


I prefer to keep my use of GCD very limited. I avoid the GCD I/O routines entirely. Old-fashioned, well-documented BSD code is much easier to deal with. Maybe it is not as efficient at runtime, but it is a lot more efficient than me trying to figure out how GCD I/O works. And if there is a problem, I can stop the process before it takes down the whole device.

Hey,
I'm a game developer (engineer and artist both) who can answer this properly. The catch is that I have little time to do it, sorry about that.

I think the WWDC video has almost everything we need to understand GCD, but only once we're already fully versed in using it.
Fortunately, thanks to the heavy workload of game development over many years, and thanks to God Jesus and His Father, I've fully grasped what GCD is all about and how to use the Dispatch framework effectively.

Without further ado, right into the key answer:
The target queue is mainly used for reducing context switches and resource contention.
It is a great thing for GCD! Without it, the thread count will explode under the overwhelming number of dispatch queues you may end up creating for the complex problems in a game.

That's all. It is not that difficult to understand. However, to use it properly, you should understand all of GCD and the Dispatch framework.

For example, my game currently uses 31 dispatch queues with the help of target queues. You can also avoid context switches and resource contention without target queues if you use a semaphore instead, but a dispatch queue hierarchy is more efficient, even if it is more finicky to use. The artificial intelligence plus the tons of animations employed in the game are the source of this complexity.


I hope this may help.

My site is 'SungW.net', which I can NOT update properly due to lack of time.
However, if you have more questions, you can contact me through it.
