URL(string:) and URLComponents(string:) succeeding when they shouldn't?

Hi
My code reads a bunch of URLs from a file and does something with each one.

Code Block swift
for line in everyLineFromSomeTextFile {
    guard let url = URL(string: line) else { continue }
    doSomethingWith(url)
}


I recently noticed that some of the lines in the file had wrongly encoded path portions.

e.g:
Code Block
https://www.apple.com/us/search/caf%e9

(Or see here if the forum munges that up)


This looks like it was incorrectly percent-encoded from an ISO 8859 character set instead of UTF-8, as I am sure the last path component is meant to be 'café' when decoded.

However, my code didn't skip that line, because URL(string: line) didn't return nil. Instead, url contained only the scheme and the host; the path was empty.

I also tested what URLComponents did with that string, and it gave a similar result: valid scheme and host but an empty path.
However, URLComponents.percentEncodedPath actually returns the original malformed path:
Code Block
/us/search/caf%e9

To complicate things further:
Code Block
url.absoluteString ==> "https://www.apple.com/us/search/caf%e9"

Surely both of those initializers should fail if the string can't be properly and fully parsed?

As it happened, my code went ahead and incorrectly did:
Code Block
doSomethingWith(url)

where url was
Code Block
https://www.apple.com


I haven't even looked at what would happen if the host, query or fragment components were also incorrectly encoded in my source.

I realise that URL and URLComponents are just wrappers around NSURL and NSURLComponents, but they behave the same too.

(the url string wasn't really an Apple one - I used that for simplicity)

Replies

"Surely both of those initializers should fail if the string can't be properly and fully parsed?"

Parsing URLs is a complex issue. URL and URLComponents use different parsers (the former is based on RFC 1808 and the latter on RFC 3986). Neither can be too strict about rejecting seemingly malformed URLs because there’s a huge variety of those out there on the ’net. In fact, the example you gave could be considered well-formed if you go back to the RFC 1808 era, where text encodings were not part of the standard. And URL specifically tends to focus on round-trip fidelity: if you go from String to URL and back to String, you get back what you put in.
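As a minimal sketch of that round-trip behaviour, using the string from the question (the variable names are just for illustration):

```swift
import Foundation

let original = "https://www.apple.com/us/search/caf%e9"

// URL(string:) accepts the string even though %e9 is not valid UTF-8.
if let url = URL(string: original) {
    // Compare the string that comes back out with the one that went in.
    let roundTripped = url.absoluteString
    print(roundTripped == original)
}
```

This matches what the question reports for url.absoluteString, and it is why a successful URL(string:) on its own says little about whether the path can be decoded.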

If you really care about URL correctness then I generally recommend URLComponents. URL carries a lot of historical compatibility baggage, and URLComponents, being more modern, was designed with that history in mind. To illustrate this with your example, consider this:

Code Block
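import Foundation

let uc = URLComponents(string: "https://www.apple.com/us/search/caf%e9")!
// Convenience property, assumes UTF-8; reported as empty for this input in the question.
print("'\(uc.path)'")
// Fundamental property, no decoding; reported as '/us/search/caf%e9' in the question.
print("'\(uc.percentEncodedPath)'")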


URLComponents knows that it’s not possible to reliably undo percent encoding and thus it offers both a convenience property, path, that assumes UTF-8, and a fundamental property, percentEncodedPath, that does not.
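If you want the loop from the question to skip lines like that one, one option is to check the percent-encoded properties yourself. The following is only a sketch under stated assumptions: strictComponents(from:) is a hypothetical helper, not a Foundation API, and it treats "decodes as UTF-8" as the test for acceptability, which may be stricter than you want. It relies on String.removingPercentEncoding returning nil when the decoded bytes are not valid UTF-8.

```swift
import Foundation

// Hypothetical helper (not part of Foundation): returns the components only if
// the path, query and fragment all percent-decode as UTF-8.
func strictComponents(from string: String) -> URLComponents? {
    guard let components = URLComponents(string: string) else { return nil }
    let encodedParts: [String?] = [
        components.percentEncodedPath,
        components.percentEncodedQuery,
        components.percentEncodedFragment,
    ]
    for part in encodedParts {
        // removingPercentEncoding is nil when the decoded bytes are not valid UTF-8.
        if let part = part, part.removingPercentEncoding == nil {
            return nil
        }
    }
    return components
}

let lines = [
    "https://www.apple.com/us/search/caf%e9",     // ISO 8859 style encoding
    "https://www.apple.com/us/search/caf%C3%A9",  // UTF-8 encoding
]

// Keep only the lines whose encoded parts decode cleanly.
let accepted = lines.filter { strictComponents(from: $0) != nil }
print(accepted)
```

In the original loop, the guard would then become something like guard let url = strictComponents(from: line)?.url else { continue }. The host is not checked here; that would need separate handling.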

Share and Enjoy

Quinn “The Eskimo!” @ Developer Technical Support @ Apple
let myEmail = "eskimo" + "1" + "@" + "apple.com"