NLTagger errors

I've recently started using the Natural Language framework, and I'm fascinated by NLTokenizer, and particularly NLTagger. I've written a simple app that takes a text file as input, then produces a table listing its tokens, lemmas, and lexical classes. I'm impressed by how well the tagging works. But there are some occasional quirks.


For example, it's unable to recognize "won't" as a form of "will", or "can't" as a form of "can". Instead, it tokenizes them as "wo" and "ca" respectively, although it does recognize both of them as verbs.


Is there any way of gaining access to the model NLTagger uses, and doing some further training on it?

Replies

NLTaggerOptions provides an option, NLTaggerJoinContractions, which might help here. However, I agree that some of the tagging is erratic. It's quite difficult to know where the limitations lie, or how to cure them, without some insight into what the system is doing.
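
In case it's useful, here is a rough sketch of how that option can be passed in Swift, where it is spelled .joinContractions on NLTagger.Options. The sample sentence and the printed layout are my own assumptions, not taken from the original app, and I make no claim about which tags or lemmas the model will actually return for these words.

```swift
import NaturalLanguage

// Hypothetical sample text containing the two contractions from the question.
let text = "I won't go, and she can't stay."

// Ask for both schemes so lexical class and lemma can be read per token.
let tagger = NLTagger(tagSchemes: [.lexicalClass, .lemma])
tagger.string = text

// .joinContractions asks the tagger to keep contractions as single tokens.
let options: NLTagger.Options = [.omitWhitespace, .omitPunctuation, .joinContractions]

tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                     unit: .word,
                     scheme: .lexicalClass,
                     options: options) { tag, range in
    let token = String(text[range])
    // Look up the lemma for the same position; it may be nil.
    let (lemmaTag, _) = tagger.tag(at: range.lowerBound, unit: .word, scheme: .lemma)
    let lexicalClass = tag?.rawValue ?? "(none)"
    let lemma = lemmaTag?.rawValue ?? "(none)"
    print("\(token)\t\(lexicalClass)\t\(lemma)")
    return true // continue enumerating
}
```

Removing .joinContractions from the options set should make it easy to compare the two tokenizations side by side.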

What's the result if you pass omitPunctuation as an option? Do you get "cant" and "wont"?

No, those don't show up in either case. I get "can't" and "won't" if I enable joinContractions.