NLTagger: whimsical lemmatization

I’m working with NLTagger as an easy way to stem words, in order (as they say) to improve our users’ search experience. The results sometimes seem odd.


First, suppose the user searches for “strike”, hoping to find “the Bread and Roses strike” and also “Casey has struck out.” We need first to lemmatize the user’s search term, but NLTagger won’t lemmatize the isolated word “strike”. (Appending a space and “this” resolves the issue, but that’s clumsy.)
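A minimal sketch of that workaround, for reference. The helper name `lemma(forQuery:carrier:)` and the default carrier text are hypothetical. The sketch assumes the term has no leading whitespace, and it makes no claim about which lemma NLTagger actually returns for any given word.

```swift
import NaturalLanguage

/// Hypothetical helper: returns a lemma for a single search term,
/// or nil if the tagger offers none even with the carrier text appended.
func lemma(forQuery word: String, carrier: String = " this") -> String? {
    guard !word.isEmpty else { return nil }
    // Try the bare word first, then the word followed by the carrier text.
    for text in [word, word + carrier] {
        let tagger = NLTagger(tagSchemes: [.lemma])
        tagger.string = text
        // Ask for the tag of the token at the start of the string,
        // which is the search term itself.
        let (tag, _) = tagger.tag(at: text.startIndex, unit: .word, scheme: .lemma)
        if let tag = tag {
            return tag.rawValue
        }
    }
    return nil
}

// Fall back to the lowercased term when no lemma is available.
let term = "strike"
let searchKey = lemma(forQuery: term) ?? term.lowercased()
print(searchKey)
```

Trying the bare word first means the carrier text is only used when the tagger declines to lemmatize the isolated term.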


Second, let’s search for “in” in the string “IN THE WEEDS”. The lemmatizer thinks the first word in the string is “Indiana”! OK: all caps is arguably unusual. Let’s try “In The Weeds”. Now, the lemmatizer declines to tag the first word at all.


Both these examples are organic: they arose in adapting unit tests for our current, regex-based search. I expect that I'm Doing It Wrong™, but documentation is thin on the ground. (Tested on macOS 10.14.6 Beta, build 18G29g.)

Accepted Reply

A question, to quench my curiosity.


What do you get with

in the weeds

and with

we are in the weeds.

Replies

I've just started working with the Natural Language framework, and with NLTagger. I believe that NLTagger will only give reliable results when it has the context of an entire sentence as input; and perhaps it requires even more context than that. Hopefully some experts will be able to tell us about how the model works.


I'm having my own troubles with NLTagger, which I'm discussing in a separate topic.


Aha. Both "in the weeds" and "we are in the weeds" get lemmatized to "weed". Much better, thanks.


Worse, my code was looking at the wrong index, so for "IN THE WEEDS" it was lemmatizing “IN”, not “WEEDS”. “Indiana” is not the correct lemmatization here, but it's not insane: “IN” is the postal abbreviation for Indiana.
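One way to avoid that kind of index mix-up is to let the tagger hand back each lemma together with the range of the token it belongs to. This is a sketch only: `tokensWithLemmas(in:)` is a hypothetical helper name, and the lemmas printed will depend on the OS version and its language model, so no output is shown.

```swift
import NaturalLanguage

/// Hypothetical helper: pairs every word in `text` with the lemma
/// NLTagger assigns to it, or nil when the tagger declines to tag it.
func tokensWithLemmas(in text: String) -> [(token: String, lemma: String?)] {
    let tagger = NLTagger(tagSchemes: [.lemma])
    tagger.string = text
    var pairs: [(token: String, lemma: String?)] = []
    tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                         unit: .word,
                         scheme: .lemma,
                         options: [.omitWhitespace, .omitPunctuation]) { tag, range in
        // `range` identifies the token this tag belongs to,
        // so no separate index bookkeeping is needed.
        pairs.append((token: String(text[range]), lemma: tag?.rawValue))
        return true // keep enumerating
    }
    return pairs
}

// Compare the phrases discussed in this thread.
for phrase in ["in the weeds", "we are in the weeds.", "IN THE WEEDS"] {
    print(phrase)
    for (token, lemma) in tokensWithLemmas(in: phrase) {
        print("  \(token) -> \(lemma ?? "(no lemma)")")
    }
}
```

Printing the token next to its lemma makes it obvious which word a surprising result such as “Indiana” came from.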