NLTagger errors

Question

I've recently started using the Natural Language framework, and I'm fascinated by NLTokenizer, and particularly NLTagger. I've written a simple app that takes a text file as input, then produces a table listing its tokens, lemmas, and lexical classes. I'm impressed by how well the tagging works. But there are some occasional quirks.

For example, it's unable to recognize "won't" as a form of "will", or "can't" as a form of "can". Instead, it tokenizes them as "wo" and "ca" respectively, although it does recognize both of them as verbs.

Is there any way of gaining access to the model NLTagger uses, and doing some further training on it?

Core ML

Posted by

Albinus

Answer 1

NLTaggerOptions provides an option, NLTaggerJoinContractions (joinContractions in Swift), which might help here. However, I agree that some of the tagging is erratic. It’s quite difficult to know where the limitations lie, or how to cure them, without some insight into what the system is doing.

Posted by

eastgate

Answer 2

What's the result if you pass omitPunctuation as an option? Do you get "cant" and "wont"?

Posted by

Claude31

Answer 3

No, those don't show up in either case. I get "can't" and "won't" if I enable joinContractions.
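
For anyone who wants to compare the two behaviours side by side, here is a minimal sketch of how the tagger might be run with and without the contraction option. The TaggedToken struct and the tagWords(in:joinContractions:) helper are names made up for this example; they are not part of the framework, and the actual tags and lemmas depend on the system's model.

```swift
import NaturalLanguage

// Hypothetical container for one row of the token table (not a framework type).
struct TaggedToken {
    let token: String
    let lexicalClass: String?
    let lemma: String?
}

// Hypothetical helper: tags each word in `text`, optionally joining contractions.
func tagWords(in text: String, joinContractions: Bool) -> [TaggedToken] {
    let tagger = NLTagger(tagSchemes: [.lexicalClass, .lemma])
    tagger.string = text

    // Skip whitespace and punctuation so only word tokens are reported.
    var options: NLTagger.Options = [.omitWhitespace, .omitPunctuation]
    if joinContractions {
        options.insert(.joinContractions)
    }

    var rows: [TaggedToken] = []
    let fullRange = text.startIndex..<text.endIndex
    tagger.enumerateTags(in: fullRange, unit: .word, scheme: .lexicalClass, options: options) { tag, tokenRange in
        // Look up the lemma for the same token under the lemma scheme.
        let (lemmaTag, _) = tagger.tag(at: tokenRange.lowerBound, unit: .word, scheme: .lemma)
        rows.append(TaggedToken(token: String(text[tokenRange]),
                                lexicalClass: tag?.rawValue,
                                lemma: lemmaTag?.rawValue))
        return true // keep enumerating
    }
    return rows
}

// Print the table for both settings so the tokenization can be compared.
let sample = "I won't go, and I can't stay."
for join in [false, true] {
    print("joinContractions: \(join)")
    for row in tagWords(in: sample, joinContractions: join) {
        print("  \(row.token)\t\(row.lexicalClass ?? "-")\t\(row.lemma ?? "-")")
    }
}
```

This only changes how the tokens are split. It does not retrain or expose the underlying model, so whether the lemma comes back as "will" or "can" is still up to the built-in tagger.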

Posted by

Albinus
