3 Replies
      Latest reply on Jul 9, 2019 11:08 AM by eastgate
      eastgate Level 1 (0 points)

        I’m working with NLTagger as an easy way to stem words, in order (as they say) to improve our users’ search experience.  The results sometimes seem odd.


        First, suppose the user searches for “strike”, hoping to find “the Bread and Roses strike” and also “Casey has struck out.”  We need first to lemmatize the user’s search term, but NLTagger won’t lemmatize the isolated word “strike”. (Appending a space and “this” resolves the issue, but that’s clumsy.)


        Second, let’s search for “in” in the string “IN THE WEEDS”.  The lemmatizer thinks the first word in the string is “Indiana”!  OK: all caps is arguably unusual. Let’s try “In The Weeds”. Now, the lemmatizer declines to tag the first word at all.


        Both these examples are organic: they arose in adapting unit tests for our current, regex-based search.  I expect that I'm Doing It Wrong™, but documentation is thin on the ground. (macOS 10.14.6 Beta, build 18G29g)
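
        For readers following along, here is a minimal sketch of the padding workaround described above for an isolated search term. It is not the poster's actual code: the function name `lemmatizeSearchTerm` and the fallback to the lowercased term are assumptions made for illustration, and whether padding with " this" helps for any given word depends on the tagger's model.

```swift
import NaturalLanguage

/// Hypothetical helper: lemmatize a single search term by padding it with a
/// little context (" this"), as described in the post, then reading the lemma
/// of the first word. Falls back to the lowercased term when no lemma is
/// returned (the fallback is an assumption, not NLTagger behavior).
func lemmatizeSearchTerm(_ term: String) -> String {
    guard !term.isEmpty else { return "" }

    let padded = term + " this"
    let tagger = NLTagger(tagSchemes: [.lemma])
    tagger.string = padded

    // Ask for the lemma of the word that starts at the beginning of the string.
    let (tag, _) = tagger.tag(at: padded.startIndex, unit: .word, scheme: .lemma)
    return tag?.rawValue ?? term.lowercased()
}

print(lemmatizeSearchTerm("strike"))
```

        The fallback means a search still proceeds on the raw term when the tagger declines to produce a lemma.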

        • Re: NLTagger: whimsical lemmatization
          Albinus Level 1 (0 points)

          I've just started working with the Natural Language framework, and with NLTagger. I believe that NLTagger will only give reliable results when it has the context of an entire sentence as input, and perhaps it requires even more context than that. Hopefully some experts will be able to tell us how the model works.


          I'm having my own troubles with NLTagger, which I'm discussing in a separate topic.

          • Re: NLTagger: whimsical lemmatization
            Claude31 Level 8 (6,585 points)

            A question, to satisfy my curiosity.


            What do you get with

            in the weeds

            and with

            we are in the weeds.

              • Re: NLTagger: whimsical lemmatization
                eastgate Level 1 (0 points)

                Aha. Both "in the weeds" and "we are in the weeds" get lemmatized to "weed".  Much better, thanks.


                Worse, my code was looking at the wrong index, so for "IN THE WEEDS" it was lemmatizing “IN”, not “WEEDS”.  And “Indiana” is not the correct lemmatization here, but it's not insane.
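
                One way to avoid the wrong-index mistake is to keep each token paired with its own lemma instead of looking results up by position. The sketch below is only an illustration under assumptions: `lemmaPairs` is a hypothetical helper name, and falling back to the lowercased token when the tagger returns no lemma is a design choice, not something NLTagger does itself.

```swift
import NaturalLanguage

/// Hypothetical helper: return every word in `text` together with its lemma.
/// When the tagger supplies no lemma for a word, the lowercased word is used
/// instead (an assumption made for this sketch).
func lemmaPairs(in text: String) -> [(token: String, lemma: String)] {
    let tagger = NLTagger(tagSchemes: [.lemma])
    tagger.string = text

    var pairs: [(token: String, lemma: String)] = []
    tagger.enumerateTags(in: text.startIndex..<text.endIndex,
                         unit: .word,
                         scheme: .lemma,
                         options: [.omitWhitespace, .omitPunctuation]) { tag, range in
        let token = String(text[range])
        // The token and its lemma are recorded together, so they cannot
        // drift apart through an off-by-one index.
        pairs.append((token: token, lemma: tag?.rawValue ?? token.lowercased()))
        return true // keep enumerating
    }
    return pairs
}

for pair in lemmaPairs(in: "we are in the weeds") {
    print("\(pair.token) -> \(pair.lemma)")
}
```

                Printing the pairs side by side also makes it obvious which word a surprising lemma belongs to.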