Is there any public API on macOS to identify a file's type by its contents, like the 'file' command-line tool does?

Hi,

I’m required to identify a file's content type from its real contents. For example, I need to report that a file is in PDF format even if the user forced its name to end with .docx or .txt, or removed the extension altogether. I need to do this fast, for lots of files.

I searched through the various macOS APIs and frameworks, from LaunchServices, via mdls, NSWorkspace, and NSURL, to NSFileManager, to no avail. These all provide wonderful APIs for identifying file types, but they miserably report the file type as “Microsoft Word” if the filename extension has been set to “.doc” or “.docx”, no matter the content.

I then found the ‘file’ command-line tool in Terminal, which does EXACTLY what I want and reports the correct type every time (well, maybe it fails sometimes, but I haven’t seen it fail once so far).

Reading ‘man file’, I learned that it examines a file in three stages. It starts with stat(2), which identifies Unix things like pipes, sockets, and symbolic links. It then uses a Unix-style “magic number” mechanism, which relies on a “compiled magic file”, /usr/share/file/magic.mgc, containing binary signatures (“magic numbers”) at known offsets that allow quick identification of file formats. Files that match no magic entry are then examined to see whether they are text and, if so, in which language (the “language tests”).
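
To make the “magic number” idea concrete, here is a minimal sketch in Python. It is not how file or libmagic is implemented, and it does not read magic.mgc. The signature table and the name sniff_type are illustrative assumptions; the four byte signatures themselves are the well-known ones for PDF, PNG, ZIP, and tar.

```python
import os
import tempfile

# Illustrative signature table: (offset, magic bytes, label).
# A real magic database holds thousands of entries and far richer rules.
SIGNATURES = [
    (0, b"%PDF-", "PDF document"),
    (0, b"\x89PNG\r\n\x1a\n", "PNG image"),
    (0, b"PK\x03\x04", "ZIP archive"),
    (257, b"ustar", "tar archive"),
]

# Read just enough bytes to cover the farthest signature.
HEADER_SIZE = max(offset + len(magic) for offset, magic, _ in SIGNATURES)


def sniff_type(path):
    """Return the label of the first matching signature, or None."""
    with open(path, "rb") as f:
        header = f.read(HEADER_SIZE)
    for offset, magic, label in SIGNATURES:
        if header[offset:offset + len(magic)] == magic:
            return label
    return None


if __name__ == "__main__":
    # A PDF header saved under a misleading .docx name.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "report.docx")
        with open(path, "wb") as f:
            f.write(b"%PDF-1.7\n")
        print(sniff_type(path))
```

Only the first few hundred bytes of each file are read, which is why this style of check can be fast across many files. Note that .docx files are themselves ZIP containers, so a table this small would report them as “ZIP archive”; telling such formats apart needs the deeper rules a real magic database provides.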

Poking into this file with the ‘strings’ command, I found a rather huge list of formats identifiable by macOS out of the box. Plus, according to the man page for file, you should be able to add more “magic” files yourself!

However, I wouldn't want to spawn a 'file' process every time I need to identify a file. I'd rather call some code or framework from within my process. (This process is highly sensitive: it is an "Endpoint Security Client" and has lots of restrictions.)

Is there any public API (Cocoa, Unix, POSIX, Core Foundation, anything!) that will use this "magic" mechanism to tell me the type of a file?

Thank you very much.

Answered by DTS Engineer in 710127022

There is no public API for doing this. If you’re curious how file actually works, you can see it in Darwin. The version aligned with macOS 12.2 starts here. AFAIK it links the libmagic code directly into the binary.

I think it’d be reasonable for Apple to provide such an API and, if you’d like to see this supported in the future, I encourage you to file an enhancement request describing your requirements. Please post your bug number, just for the record.

In the meantime your options are kinda limited. It’s clear that running file is way too heavyweight for an ES client. Another option is to roll the libmagic code into your product [1]. The main question here is whether you bring your own magic database or not. On the one hand, bringing your own means that you miss out on updates. On the other, relying on the system’s copy of this database seems kinda sketchy compatibility-wise. If, for example, the magic format changed in an incompatible way, file would continue working because its copy of libmagic is rev-locked to the system, but your ES client would break.

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"

[1] Subject to licence constraints, of course. I can’t comment on that side of things.

According to the Wikipedia page for the file command:

https://en.wikipedia.org/wiki/File_%28command%29

“As of version 4.00 of the Ian Darwin/Christos Zoulas version of file, the functionality of file is incorporated into a libmagic library that is accessible via C (and C-compatible) linking.”

I don't think that macOS has this libmagic (maybe the file command is statically linked with it?), but I've not investigated properly; maybe someone else can confirm. In any case it is BSD-licensed so you could compile it yourself.

Is libmagic part of Darwin?

Yes, embedded within the file project itself.

Is there a way to "wrap" an instance of file as a daemon, regularly receiving input via stdin, and emitting its result to stdout, sparing the need to spawn it again and again?

Possibly. The file command won’t accept files from stdin by default but you can trigger that by passing - to the -f option:

% cat hello.txt 
Hello Cruel World!
% cat goodbye.txt 
Goodbye Cruel World!
% file -f -
hello.txt
hello.txt:              ASCII text
goodbye.txt
goodbye.txt:            ASCII text
^D

As per usual with Unix-y filter commands, you have to watch out for buffering issues.
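
As a rough illustration of that approach and of the buffering concern, here is a minimal sketch in Python that keeps one file process alive and queries it one path at a time. The class name FileTypeServer and its methods are hypothetical. It assumes that file's -n (--no-buffer) option, documented in file(1) as flushing stdout after each file, is available on your system, and that -b (brief) yields exactly one output line per input name. Neither assumption has been verified inside an ES client.

```python
import subprocess


class FileTypeServer:
    """Keep one long-lived 'file' process and query it line by line."""

    # -b: omit the filename prefix; -n: flush stdout after each file;
    # -f -: read the list of file names from stdin.
    DEFAULT_ARGV = ["file", "-b", "-n", "-f", "-"]

    def __init__(self, argv=None):
        # argv is overridable so the wrapper can be exercised with a stand-in.
        self._proc = subprocess.Popen(
            argv or self.DEFAULT_ARGV,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffer our side of the pipes
        )

    def identify(self, path):
        """Send one path and return the single line printed in response."""
        if "\n" in path:
            # One name per line is the whole protocol; a newline would break it.
            raise ValueError("path contains a newline")
        self._proc.stdin.write(path + "\n")
        self._proc.stdin.flush()
        return self._proc.stdout.readline().rstrip("\n")

    def close(self):
        """Close stdin (like ^D in the transcript) and wait for exit."""
        self._proc.stdin.close()
        self._proc.wait()
```

Typical use would be to create one FileTypeServer(), call identify("hello.txt") as often as needed, and call close() at shutdown. Without -n, file may block-buffer its output when stdout is a pipe, in which case readline() would hang until the buffer fills.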

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"
