Background thread taking too long?

I'm using NLTokenizer in this code to extract words from text files:


import NaturalLanguage

func loadData() {
  var wordTokens: Set<String> = []
  let tokenizer = NLTokenizer(unit: .word)
  tokenizer.string = TextContent.sharedInstance.text.uppercased()
  let tokenRanges = tokenizer.tokens(for: tokenizer.string!.startIndex..<tokenizer.string!.endIndex)
  for r in tokenRanges {
    let word = String(tokenizer.string![r]).trimmingCharacters(in: .whitespacesAndNewlines)
    if !word.isEmpty {
      wordTokens.insert(word)
    }
  }
}


It's been working fine for most files, including some that are over 800KB in size. But when I input an even larger one (1.4MB), the call to tokens(for:) returns an empty tokenRanges array. I've checked the tokenizer's string, and it is initialized.


I have a limited understanding of threads, but I'm wondering whether tokens(for:) starts a background thread to do its work, and this thread isn't complete yet when the for loop executes. If this is what's happening, is it possible to somehow require the thread to complete before proceeding?


I've also tried this using the enumerateTokens(in:using:) function with a closure in place of tokens(for:), with the same result.
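For reference, a minimal sketch of what the enumerateTokens(in:using:) variant might look like; the sample text here is illustrative, not the original file content:

```swift
import NaturalLanguage

let text = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG."
var wordTokens: Set<String> = []

let tokenizer = NLTokenizer(unit: .word)
tokenizer.string = text

// The closure is called once per token; returning true keeps the enumeration going.
tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { range, _ in
    let word = String(text[range]).trimmingCharacters(in: .whitespacesAndNewlines)
    if !word.isEmpty {
        wordTokens.insert(word)
    }
    return true
}
```

Both tokens(for:) and enumerateTokens(in:using:) run synchronously on the calling thread, so if one returns nothing, the other should behave the same way.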

I don't think that could be the problem.


I looked at the doc:

https://developer.apple.com/documentation/foundation/nslinguistictagger/tokenizing_natural_language_text


They set the range a bit differently:

let range = NSRange(location: 0, length: text.utf16.count)


So, could you try:

let range = NSRange(location: 0, length: string!.utf16.count)
let tokenRanges = tokenizer.tokens(for: range)


Note: why do you need to unwrap string?

The string property of NLTokenizer is an optional, so that's why it needs to be unwrapped. Originally, I was using the range of the text string itself (the one that I assigned to the tokenizer), but this produced occasional "out of bounds" crashes.


Also, NLTokenizer's tokens(for:) requires a Range&lt;String.Index&gt; and won't compile with an NSRange.


You're right: the problem doesn't seem to originate with the thread. I've just got some other relevant results, which I'll put in a reply to my original post.

As Claude said, it wasn't a background thread problem.


At present, the app will handle PDF and HTML files, as well as plain text. My problem file was a 1.4MB PDF, but my app handled an 800KB PDF with no problem. I thought perhaps it was the size.


So I exported the problem PDF file to a plain text file, which was under 600KB. Since I got the same result with that text file, I guess it's not the size that's the problem. Somewhere in those files there's some text that the NLTokenizer can't handle. But it isn't producing any error messages.

You could convert NSRange to Range with this extension:


extension String {
    func rangeFromNSRange(nsRange: NSRange) -> Range<String.Index>? {
        return Range(nsRange, in: self)
    }
}


Credit: https://stackoverflow.com/questions/25138339/nsrange-to-rangestring-index
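With that extension, the earlier NSRange suggestion can be bridged to the Range&lt;String.Index&gt; that tokens(for:) expects. A self-contained sketch (the sample text stands in for the real file content):

```swift
import Foundation
import NaturalLanguage

extension String {
    // Bridges a UTF-16-based NSRange to a Range<String.Index> for this string.
    func rangeFromNSRange(nsRange: NSRange) -> Range<String.Index>? {
        return Range(nsRange, in: self)
    }
}

let text = "HELLO WORLD"  // illustrative
let tokenizer = NLTokenizer(unit: .word)
tokenizer.string = text

let nsRange = NSRange(location: 0, length: text.utf16.count)
var words: [String] = []
if let range = text.rangeFromNSRange(nsRange: nsRange) {
    let tokenRanges = tokenizer.tokens(for: range)
    words = tokenRanges.map { String(text[$0]) }
}
```

Note that the initializer returns nil if the NSRange doesn't fall on valid character boundaries, which is exactly the kind of mismatch that can cause "out of bounds" crashes when mixing UTF-16 offsets with String indices.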

You may have a character in the text that causes the problem. Could you filter the string to keep only "basic" characters?

I filtered the text string with .isASCII and it made no difference. However, I have discovered that NLTokenizer is able to tokenize sentences in the problem files, but not words. So it will still be of some use.
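For what it's worth, the ASCII filter I mean is along these lines (an assumption of the exact form; Character.isASCII is standard Swift):

```swift
let text = "CAFÉ – RÉSUMÉ AND PLAIN WORDS"  // illustrative mixed input
// Keep only ASCII characters, dropping accented letters and typographic dashes.
let filtered = String(text.filter { $0.isASCII })
```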

That may be a bug in NLTokenizer (or an undocumented limitation). You can send a bug report with code and data that reproduce the issue.

So, could you run a further test:


Tokenize sentences.

Then loop over the sentences, tokenizing the words in each one, to find out whether there is a problematic sentence.
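A minimal sketch of that test, under the assumption that the file content is already loaded into text (the sample here is illustrative):

```swift
import NaturalLanguage

let text = "FIRST SENTENCE HERE. SECOND SENTENCE HERE."  // illustrative
var failingSentences: [String] = []

let sentenceTokenizer = NLTokenizer(unit: .sentence)
sentenceTokenizer.string = text

for sentenceRange in sentenceTokenizer.tokens(for: text.startIndex..<text.endIndex) {
    let sentence = String(text[sentenceRange])
    // Run word tokenization on each sentence in isolation.
    let wordTokenizer = NLTokenizer(unit: .word)
    wordTokenizer.string = sentence
    let wordRanges = wordTokenizer.tokens(for: sentence.startIndex..<sentence.endIndex)
    // A non-blank sentence that yields no word tokens is the suspect.
    if wordRanges.isEmpty
        && !sentence.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty {
        failingSentences.append(sentence)
    }
}
```

If exactly one sentence ends up in failingSentences, you've isolated the input that NLTokenizer can't handle, which would make for a much smaller bug-report attachment.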
