Background thread taking too long?

I'm using NLTokenizer in this code to extract words from text files:


import NaturalLanguage

func loadData() {
  var wordTokens: Set<String> = []
  let tokenizer = NLTokenizer(unit: .word)
  tokenizer.string = TextContent.sharedInstance.text.uppercased()
  let tokenRanges = tokenizer.tokens(for: tokenizer.string!.startIndex..<tokenizer.string!.endIndex)
  for r in tokenRanges {
    let word = String(tokenizer.string![r]).trimmingCharacters(in: .whitespacesAndNewlines)
    if !word.isEmpty {
      wordTokens.insert(word)
    }
  }
}


It's been working fine for most files, including some that are over 800KB in size. But when I input an even larger one (1.4MB), the call to tokens(for:) returns an empty tokenRanges array. I've checked the tokenizer's string, and it is initialized.


I have a limited understanding of threads, but I'm wondering whether tokens(for:) starts a background thread to do its work, and this thread isn't complete yet when the for loop executes. If this is what's happening, is it possible to somehow require the thread to complete before proceeding?


I've also tried this using the enumerateTokens(in:using:) function with a closure in place of tokens(for:), with the same result.
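For reference, a minimal sketch of what the enumerateTokens(in:using:) variant might look like; the sample text here is illustrative, not the original file content:

```swift
import NaturalLanguage

let text = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG."
var wordTokens: Set<String> = []

let tokenizer = NLTokenizer(unit: .word)
tokenizer.string = text

// The closure is called once per token; returning true keeps the enumeration going.
tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
    let word = String(text[range]).trimmingCharacters(in: .whitespacesAndNewlines)
    if !word.isEmpty {
        wordTokens.insert(word)
    }
    return true
}
```

Both tokens(for:) and enumerateTokens(in:using:) run synchronously on the calling thread, so if one returns nothing, the other should behave the same way.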

I don't think that could be the problem.


I looked at the doc:

https://developer.apple.com/documentation/foundation/nslinguistictagger/tokenizing_natural_language_text


They set the range a bit differently:

let range = NSRange(location: 0, length: text.utf16.count)


So, could you try:

let range = NSRange(location: 0, length: string!.utf16.count)
let tokenRanges = tokenizer.tokens(for: range)


Note: why do you need to unwrap string?

The string property of NLTokenizer is an optional, so that's why it needs to be unwrapped. Originally, I was using the range of the text string itself (the one that I assigned to the tokenizer), but this produced occasional "out of bounds" crashes.


Also, NLTokenizer's tokens(for:) requires a Range&lt;String.Index&gt; and won't compile with an NSRange.


You're right: the problem doesn't seem to originate with the thread. I've just got some other relevant results, which I'll put in a reply to my original post.

As Claude said, it wasn't a background thread problem.


At present, the app will handle PDF and HTML files, as well as plain text. My problem file was a 1.4MB PDF, but my app handled an 800KB PDF with no problem. I thought perhaps it was the size.


So I exported the problem PDF file to a plain text file, which was under 600KB. Since I got the same result with that text file, I guess it's not the size that's the problem. Somewhere in those files there's some text that the NLTokenizer can't handle. But it isn't producing any error messages.

You could convert NSRange to Range with this extension:


extension String {
    func rangeFromNSRange(nsRange: NSRange) -> Range<String.Index>? {
        return Range(nsRange, in: self)
    }
}


Credit: https://stackoverflow.com/questions/25138339/nsrange-to-rangestring-index
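With that extension, the earlier NSRange suggestion can be bridged to the Range&lt;String.Index&gt; that tokens(for:) expects. A self-contained sketch (the sample text stands in for the real file content):

```swift
import Foundation
import NaturalLanguage

extension String {
    // Bridges a UTF-16-based NSRange to a Range<String.Index> for this string.
    func rangeFromNSRange(nsRange: NSRange) -> Range<String.Index>? {
        return Range(nsRange, in: self)
    }
}

let text = "HELLO WORLD"  // illustrative
let tokenizer = NLTokenizer(unit: .word)
tokenizer.string = text

let nsRange = NSRange(location: 0, length: text.utf16.count)
var words: [String] = []
if let range = text.rangeFromNSRange(nsRange: nsRange) {
    let tokenRanges = tokenizer.tokens(for: range)
    words = tokenRanges.map { String(text[$0]) }
}
```

Note that the initializer returns nil if the NSRange doesn't fall on valid character boundaries, which is exactly the kind of mismatch that can cause "out of bounds" crashes when mixing UTF-16 offsets with String indices.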

You may have a character in the text that causes the problem. Could you filter the string to keep only "basic" characters?

I filtered the text string with .isASCII and it made no difference. However, I have discovered that NLTokenizer is able to tokenize sentences in the problem files, but not words. So it will still be of some use.
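For what it's worth, the ASCII filter I mean is along these lines (an assumption of the exact form; Character.isASCII is standard Swift):

```swift
let text = "CAFÉ – RÉSUMÉ AND PLAIN WORDS"  // illustrative mixed input
// Keep only ASCII characters, dropping accented letters and typographic dashes.
let filtered = String(text.filter { $0.isASCII })
```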

That may be a bug in NLTokenizer (or an undocumented limitation). You can send a bug report with code and data that reproduce the issue.

So, could you run a further test:


Tokenize sentences.

Then loop over the sentences, tokenizing the words in each one, to find out whether there is a problematic sentence.
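A minimal sketch of that test, under the assumption that the file content is already loaded into text (the sample here is illustrative):

```swift
import NaturalLanguage

let text = "FIRST SENTENCE HERE. SECOND SENTENCE HERE."  // illustrative
var failingSentences: [String] = []

let sentenceTokenizer = NLTokenizer(unit: .sentence)
sentenceTokenizer.string = text

for sentenceRange in sentenceTokenizer.tokens(for: text.startIndex..<text.endIndex) {
    let sentence = String(text[sentenceRange])
    // Run word tokenization on each sentence in isolation.
    let wordTokenizer = NLTokenizer(unit: .word)
    wordTokenizer.string = sentence
    let wordRanges = wordTokenizer.tokens(for: sentence.startIndex..<sentence.endIndex)
    // A non-blank sentence that yields no word tokens is the suspect.
    if wordRanges.isEmpty
        && !sentence.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty {
        failingSentences.append(sentence)
    }
}
```

If exactly one sentence ends up in failingSentences, you've isolated the input that NLTokenizer can't handle, which would make for a much smaller bug-report attachment.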
