Hi,
I need to identify a file's content type, e.g. report that a file is in PDF format even if the user forced its name to end with .docx or .txt, or removed the extension altogether. In other words, identify the file type by its real contents. I need to do this fast, for lots of files.
I searched in vain through the various macOS APIs and frameworks, from LaunchServices, via mdls, NSWorkspace, NSURL, and NSFileManager, to no avail. These all provide wonderful APIs for identifying file types, but they miserably report the file type as “Microsoft Word” if the filename extension has been set to “.doc” or “.docx”, no matter the content.
I then found the ‘file’ command-line tool in Terminal, which does EXACTLY what I want and reports the correct type every time (well, maybe it fails sometimes, but I haven’t seen it fail once so far).
Reading ‘man file’ I learned that it examines a file in three stages. It starts with stat(2), which identifies Unix things like pipes, sockets, and symbolic links. Then it uses a Unix-style “magic number” mechanism, which employs a compiled magic file, /usr/share/file/magic.mgc, containing “binary signatures” or special “magic numbers” at known offsets that allow quick identification of file formats.
Poking into this file with the ‘strings’ command, I found a rather huge list of formats that macOS can identify out of the box. Plus, according to the man page of file, you should be able to add more “magic” files yourself!
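To make the “magic number” idea concrete, here is a minimal sketch in C of matching known byte signatures at fixed offsets. It is not how file itself is implemented, and the table is a tiny hand-picked sample rather than anything taken from magic.mgc. The names signature, kSignatures, and sniff_type are invented for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* One signature: the bytes expected at a fixed offset, plus a label. */
struct signature {
    size_t offset;
    const unsigned char *bytes;
    size_t length;
    const char *label;
};

/* Well-known leading bytes of a few formats. */
static const unsigned char kPDF[]  = { '%', 'P', 'D', 'F', '-' };
static const unsigned char kZIP[]  = { 'P', 'K', 0x03, 0x04 };
static const unsigned char kPNG[]  = { 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A };
static const unsigned char kOLE2[] = { 0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1 };

/* Illustrative table only; a real magic database holds far more entries. */
static const struct signature kSignatures[] = {
    { 0, kPDF,  sizeof kPDF,  "PDF document" },
    { 0, kZIP,  sizeof kZIP,  "ZIP container (includes .docx)" },
    { 0, kPNG,  sizeof kPNG,  "PNG image" },
    { 0, kOLE2, sizeof kOLE2, "OLE2 compound file (legacy .doc)" },
};

/* Return the label of the first matching signature, or NULL if none match.
 * buf holds the first len bytes of the file; the file name is never consulted. */
const char *sniff_type(const unsigned char *buf, size_t len) {
    size_t count = sizeof kSignatures / sizeof kSignatures[0];
    for (size_t i = 0; i < count; i++) {
        const struct signature *s = &kSignatures[i];
        if (len >= s->offset + s->length &&
            memcmp(buf + s->offset, s->bytes, s->length) == 0) {
            return s->label;
        }
    }
    return NULL;
}
```

Reading the first few bytes of a file into a buffer and passing them to sniff_type is enough to tell a PDF from a .docx regardless of the extension. Telling a .docx apart from any other ZIP archive takes deeper inspection, which is part of why the real database is so large.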
However, I don't want to spawn a 'file' process every time I need to identify a file. I'd rather call some code or framework from within my process. (This process is highly sensitive: it is an "Endpoint Security Client" and has lots of restrictions.)
Is there any public API (Cocoa, Unix, POSIX, Core Foundation, anything!) that uses this "magic" mechanism to tell me the type of a file?
Thank you very much.
There is no public API for doing this. If you’re curious how file actually works, you can see its source in Darwin; there is a version aligned with macOS 12.2. AFAIK it links the libmagic code directly into the binary.
I think it’d be reasonable for Apple to provide such an API and, if you’d like to see this supported in the future, I encourage you to file an enhancement request describing your requirements. Please post your bug number, just for the record.
In the meantime your options are kinda limited. It’s clear that running file is way too heavyweight for an ES client. Another option is to roll the libmagic code into your product [1]. The main question here is whether you bring your own magic database or not. On the one hand, bringing your own means that you miss out on updates. On the other, relying on the system’s copy of this database seems kinda sketchy compatibility-wise. If, for example, the magic format changed in an incompatible way, file would continue working because its copy of libmagic is rev-locked to the system, but your ES client would break.
Share and Enjoy
—
Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"
[1] Subject to licence constraints, of course. I can’t comment on that side of things.