By definition a MAP_SHARED area must be coherent with any other read/write operations in the system on that file slice, which practically means the kernel has to use the same page cache that serves read/write requests.
I don't think POSIX actually requires that, but it is basically true of macOS/iOS. More specifically, what the UBC ("Universal Buffer Cache") actually does is provide a single system-wide cache for ALL file I/O. In practical terms, "all" file I/O is actually memory-mapped I/O: read() works by issuing requests to the UBC, which either returns copies of the data or copy-on-write references to its own memory (depending on size). mmap() basically just establishes the same UBC mapping directly. Similarly, write() works by performing modifications to the UBC's pages, while memory-mapped writes make the same changes "directly".
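To make that coherence concrete, here's a minimal sketch (the /tmp path is a placeholder and error handling is trimmed): data written with write() is immediately visible through a MAP_SHARED mapping of the same file, and vice versa, because both paths operate on the same UBC pages.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/ubc-demo", O_RDWR | O_CREAT | O_TRUNC, 0644);
    ftruncate(fd, 4096);                  /* one page backing the mapping */

    char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) return 1;

    pwrite(fd, "hello", 5, 0);            /* write() modifies UBC pages...   */
    printf("via mmap: %.5s\n", map);      /* ...so the mapping sees it now,  */
                                          /* no msync() required             */
    memcpy(map, "world", 5);              /* a store through the mapping...  */
    char buf[6] = {0};
    pread(fd, buf, 5, 0);                 /* ...is just as visible to read() */
    printf("via read: %s\n", buf);

    munmap(map, 4096);
    close(fd);
    return 0;
}
```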
This in turn means that after the app process terminates for any reason, the content of that memory will not be discarded but rather will be available on the next app start via open()/read() or mmap() on that file.
Yes, assuming the I/O reached the UBC and ignoring kernel panics.
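A quick way to observe this (a hypothetical two-run sketch; the path is a placeholder and error handling is omitted): the first run dirties a page and exits without ever calling msync(), and a later run still reads the data back, because the dirty page survived in the UBC or was flushed by the kernel in the meantime.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Run with an argument to write, with none to read back. */
int main(int argc, char **argv) {
    int fd = open("/tmp/persist-demo", O_RDWR | O_CREAT, 0644);
    ftruncate(fd, 4096);
    char *map = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) return 1;

    if (argc > 1) {
        /* Dirty the page and exit WITHOUT msync(); the data lives on in
         * the UBC and is eventually flushed by the kernel, not by us. */
        strncpy(map, argv[1], 4095);
    } else {
        printf("read back: %s\n", map);   /* a later process still sees it */
    }
    return 0;   /* mapping and fd are released on exit */
}
```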
In my case this appears to be important: if the kernel does automatic writebacks on its own, intensive logger traffic would put unneeded I/O load on the disk device. After some experiments I was able to figure out that Linux, for example, issues periodic writebacks without an explicit msync(). On OS X, according to "fs_usage -f diskio", no writeback occurs until the app terminates (or, more precisely, until the last reference in the system to that MAP_SHARED area is dropped).
I don't think the result you found there is accurate, at least not in the broad/general case. What were the specifics of what you were actually testing? Size of mapping, rate of modifications, overall system load, etc.? If you're only doing a very "minimal" test (a few pages, minimal modifications, etc.), then it wouldn't surprise me if the data remained "non-flushed" for long periods of time. However, it's not difficult to mmap a file that's larger than all physical memory. If you then write to each page in sequence, the kernel will inevitably need to either flush your older writes to disk or panic when it runs out of memory. Also, this doesn't account for external events like system sleep. For example, when writing out a hibernate file I'd expect the UBC to flush any unwritten data simply to avoid adding unnecessary data to the hibernate file.
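For reference, a sketch of that kind of pressure test (path and size are placeholders; the mapping must be sized larger than physical RAM, and the volume needs that much free space): dirtying every page in sequence leaves the kernel no option but to write older pages back while the loop is still running.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const size_t size = 32ULL << 30;      /* e.g. 32 GiB, > physical RAM */
    int fd = open("/tmp/pressure-demo", O_RDWR | O_CREAT | O_TRUNC, 0644);
    ftruncate(fd, (off_t)size);

    char *map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) return 1;

    long page = sysconf(_SC_PAGESIZE);
    for (size_t off = 0; off < size; off += (size_t)page)
        map[off] = 1;   /* one store per page marks it dirty; watch the
                         * resulting flushes with "fs_usage -f diskio" */

    munmap(map, size);  /* note: no msync() anywhere */
    close(fd);
    return 0;
}
```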
Finally, this isn't an area where you can assume the "system" has a single, universal behavior. The details of the caching behavior are largely controlled by the VFS driver, so different file systems can have very different behaviors.
I'm now interested to learn about iOS behavior. Is it the same as OS X (no automatic writebacks)?
They're broadly similar, but that doesn't mean "no automatic writebacks".
Having said that, I'm not sure what your actual concern here is:
Though it is vague, as the user cannot tell the kernel the nature of the planned access: read, write, or read-write. It is easy to imagine that this knowledge might allow some shortcuts in the UBC, including deferring writebacks.
What are you actually looking for/expecting here? How "long" do you want the system to defer your writes? The main issue developers have with the write cache is data NOT being written to disk, not that it's being written out too frequently. There isn't any formal contract on how long data can remain unwritten in the write cache (more on why shortly), but "seconds" isn't particularly unusual and "indefinitely" is probably possible. The general rule here is that unless you've flushed it to disk, you shouldn't assume it's been written to disk.
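For completeness, here's what flushing it yourself looks like (a hypothetical helper; the name is mine): msync(MS_SYNC) pushes the dirty pages out of the UBC, and on Apple platforms fcntl(F_FULLFSYNC) additionally asks the drive to commit its own cache, something plain fsync() doesn't guarantee.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>

/* Flush a region of a MAP_SHARED mapping all the way to storage.
 * addr and len must be page-aligned for msync(). */
int flush_mapping(void *addr, size_t len, int fd) {
    /* Push dirty pages from the UBC into the file system... */
    if (msync(addr, len, MS_SYNC) == -1)
        return -1;
    /* ...then ask the drive itself to commit its cache (Apple-specific;
     * plain fsync() does not guarantee the data has reached the media). */
    if (fcntl(fd, F_FULLFSYNC) == -1)
        return -1;
    return 0;
}
```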
In the cases where excessive writeback is a serious concern, the typical solution is to add an intermediate layer which collects data first before committing it to the I/O layer. You're focused on time as the primary variable here, but issuing writes that are properly aligned at the ideal I/O size will often do more than changing write frequency.
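A minimal sketch of such a layer, assuming a made-up chunk size (pick CHUNK to match the device's ideal I/O size): records accumulate in a private buffer, and only full, fixed-size chunks ever reach write(), so both the frequency and the shape of the I/O are controlled.

```c
#include <string.h>
#include <unistd.h>

#define CHUNK (128 * 1024)   /* placeholder: match the device's ideal I/O size */

typedef struct {
    int    fd;
    size_t used;
    char   buf[CHUNK];
} logbuf_t;

static void logbuf_flush(logbuf_t *lb) {
    if (lb->used > 0) {
        write(lb->fd, lb->buf, lb->used);   /* one large, aligned write */
        lb->used = 0;
    }
}

static void logbuf_append(logbuf_t *lb, const void *data, size_t len) {
    while (len > 0) {
        size_t room = CHUNK - lb->used;
        size_t n = len < room ? len : room;
        memcpy(lb->buf + lb->used, data, n);
        lb->used += n;
        data = (const char *)data + n;
        len  -= n;
        if (lb->used == CHUNK)
            logbuf_flush(lb);               /* only full chunks hit the I/O layer */
    }
}
```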
On the "why" side, there are a few different issues:
Actually it looks like a deficiency of the madvise interface overall, which only operates on abstract "accesses", never clarifying their direction.
-The point of madvise is to allow you to guide VM policy, primarily for performance (see the sketch after this list). From that perspective, VM writes aren't really relevant, as (on their own) they don't really affect performance.
-Every VM system reserves the right to flush the VM system "at will", since the (worst-case) alternative is to panic. More broadly, there are practical reasons (like preparing the system to sleep) and performance reasons (creating large, contiguous writes) which would cause the VM system to flush independent of any policy.
-One of the more direct optimization points is to delay writes (this is why the write cache exists at all), so the kernel already has a strong incentive to delay writes.
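As a rough illustration of the first point (these are the standard BSD/POSIX hints, nothing Apple-specific): the madvise() vocabulary only describes access patterns, never whether those accesses will be reads or writes, so there's no hook for steering writeback.

```c
#include <sys/mman.h>

void advise_examples(void *addr, size_t len) {
    madvise(addr, len, MADV_SEQUENTIAL);  /* "I'll touch this in order"      */
    madvise(addr, len, MADV_RANDOM);      /* "...or in no predictable order" */
    madvise(addr, len, MADV_WILLNEED);    /* "I'll touch this soon"          */
    madvise(addr, len, MADV_DONTNEED);    /* "I'm done with this for now"    */
    /* None of these say "the accesses will be writes", so there's no way
     * to ask for writeback to be deferred (or hastened). */
}
```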
Is there anything else that could be used by an iOS app to convince the kernel to retain the content of some app pages after app termination, and then have access to this content upon the next app start? I guess it would still boil down to mmap(MAP_SHARED), but for an object such that uncontrolled writebacks are not an issue, like a ramfs file or some shmem.
No, not really. The kernel "holds on" to memory/data like this because some other "client" needs it to. For files, that client is "the file system", but in all of the other cases the client is "some other process". The problem on iOS is that you can't really rely on any other process "being there" to hold the reference open, which means anything you try to build on this is inherently unreliable.
__
Kevin Elliott
DTS Engineer, CoreOS/Hardware