й symbol is represented incorrectly in file names

When I rename a file in Finder and use the symbol й

U+0439 : CYRILLIC SMALL LETTER SHORT I

it is then converted by macOS/Finder into a combination of two Unicode characters

U+0438 : CYRILLIC SMALL LETTER I
U+0306 : COMBINING BREVE {short; Greek vrachy}

Does anybody know why this happens? This behaviour breaks regular expressions on various web pages, because they no longer recognise й as a valid Cyrillic letter. My macOS version is 14.1.2 (23B92).

Answered by DTS Engineer in 774063022

It's likely the result of Unicode "normalization", which chooses a preferred underlying representation of a character when there are multiple possible representations. Note that there's more than one normalization form, so the form that the file system uses may not be the same one used on other platforms.
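
For illustration, the two representations can be inspected with JavaScript's built-in `String.prototype.normalize`. This is only a minimal sketch; `codePoints` is a hypothetical helper name introduced for this example.

```javascript
// Precomposed й (U+0439), and the two-character sequence reported in the question.
const composed = "\u0439";
const decomposed = "\u0438\u0306";

// Hypothetical helper: list the code points of a string as "U+XXXX" labels.
const codePoints = (s) =>
  Array.from(s, (ch) =>
    "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));

console.log(codePoints(composed));                     // [ 'U+0439' ]
console.log(codePoints(composed.normalize("NFD")));    // [ 'U+0438', 'U+0306' ]
console.log(composed === decomposed);                  // false
console.log(decomposed.normalize("NFC") === composed); // true
```

NFD is the decomposed form and NFC the composed one; the sequence seen in the file name matches the decomposed form.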

[To be explicit: File systems do normalization so that filenames consisting of "the same characters" are treated as the same filename, regardless of the underlying representation of the characters. This doesn't necessarily have to be true of a file system, but it's a poor experience for users when they can have multiple files of what is apparently the "same" name.]

All that aside, the normalization form should not really cause any problems for properly-written code for doing string processing, such as regular expressions.

Tools that search text (whether by regular expressions or not) have to be explicit about whether they're doing normalized-form searching or not. If you're writing your own code to match against text copied from web pages, you'll have to give this issue some attention, too. It's hard to be more explicit without knowing more about what scenarios you're seeing that have these "troubles".
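
As a rough sketch of what normalized-form searching can look like in JavaScript, both the text and the query can be brought to the same form before they are compared. `includesNormalized` is a hypothetical helper written for this example, not an existing API.

```javascript
// Hypothetical helper: substring search that treats canonically
// equivalent text as equal by normalizing both sides first.
function includesNormalized(haystack, needle, form = "NFC") {
  return haystack.normalize(form).includes(needle.normalize(form));
}

// "йогурт.txt" with й stored as U+0438 + U+0306 (decomposed).
const fileName = "\u0438\u0306\u043E\u0433\u0443\u0440\u0442.txt";
// "йог" with й typed as the single code point U+0439 (precomposed).
const query = "\u0439\u043E\u0433";

console.log(fileName.includes(query));            // false: the raw code units differ
console.log(includesNormalized(fileName, query)); // true: both sides compared in NFC
```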

Thanks, Polyphonic. Unicode normalisation explains the observed behaviour.

As for a practical example of the issue: the website https://id.court.gov.ua/, which I do not own and whose code I cannot access, uses the following regexp to validate the file name when the user selects a file. It works on Windows for files with й in the name, but not on macOS.

const testFileName = /^[0-9A-Za-zА-Яа-яёїЁЇіІЄє\- /_/'/+/(/)/=/;/^/$/!.A-Za-z]+$/i;

In Java I would use the CANON_EQ flag, but for JavaScript I don't know what a reliable solution would be.
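
As far as I know, JavaScript's `RegExp` has no flag comparable to `CANON_EQ`. One possible approach, sketched here on the assumption that the page's code could be changed, is to compose the file name to NFC before testing it. The pattern is copied from above; `isValidFileName` and the sample name are hypothetical.

```javascript
// Pattern copied verbatim from the site; its character class has no entry
// for U+0306 COMBINING BREVE.
const testFileName = /^[0-9A-Za-zА-Яа-яёїЁЇіІЄє\- /_/'/+/(/)/=/;/^/$/!.A-Za-z]+$/i;

// Hypothetical wrapper: compose the name to NFC before validating it.
function isValidFileName(name) {
  return testFileName.test(name.normalize("NFC"));
}

// "файл.pdf" with й stored as U+0438 + U+0306, as in the question.
const macName = "\u0444\u0430\u0438\u0306\u043B.pdf";

console.log(testFileName.test(macName)); // false: U+0306 is outside the character class
console.log(isValidFileName(macName));   // true: NFC recombines the pair into U+0439
```

This only helps where the validation code can be edited. Adding U+0306 to the character class instead would also accept a stray breve after any other letter.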
