Trying (and failing) to use CreateML to create a model for named entity recognition with the Natural Language framework

I'm trying to use CreateML to build a model I can use with the new Natural Language framework for domain-specific named entity recognition when scanning some text. It's very similar to the example in the WWDC '18 video introducing the Natural Language framework, where they add a bunch of product names and recognize them.


The problem I'm having is that the results I'm getting when I run text through an NLTagger with this model are very inaccurate.


Imagine an app for people visiting Las Vegas, NV. I want to be able to identify names of hotels, restaurants and other activities as such. I have training data that looks like this (there's a lot more but it all follows this pattern).


{"tokens":["Bellagio","Buffet at Bellagio","Fix Restaurant and Bar","Harvest","Jasmine","Lago","Le Cirque","Michael Mina","Noodles","Picasso","Prime","Spago","Yellowtail","Spa \u0026 Salon","Fountains of Bellagio","Gallery of Fine Art","Hyde Lounge","Lily Bar and Lounge","O ","Petrossian Bar"],"labels":["Hotel","Restaurant","Restaurant","Restaurant","Restaurant","Restaurant","Restaurant","Restaurant","Restaurant","Restaurant","Restaurant","Restaurant","Restaurant","Activity","Activity","Activity","Activity","Activity","Activity","Activity"]}


Here's a Playground with my test code, followed by the output it prints. With the below, I'd expect 'Bellagio' to come back as 'Hotel', but when I print the tokens and tags, they all come back as... not that. And sometimes the same token comes back as two different tags (e.g. 'MGM Grand' below).


What am I doing wrong? Bad training data? Bad training data format? Unrealistic expectations? I have no idea what I'm doing?


The last one is definitely true.


In the WWDC video demo it seems to work great, and it seems very similar to what I'm doing, so I'm not sure where I'm off.


import CreateML
import Foundation
import NaturalLanguage

// Load the labeled training data from the bundled JSON file.
let wordFilePath = Bundle.main.path(forResource: "vegas_words", ofType: "json")!
let wordFileURL = URL(fileURLWithPath: wordFilePath)

// Train the word tagger on the "tokens" and "labels" columns.
let trainingData = try MLDataTable(contentsOf: wordFileURL)
let model = try MLWordTagger(trainingData: trainingData, tokenColumn: "tokens", labelColumn: "labels")

// Wrap the trained Core ML model so the Natural Language framework can use it.
let compiledModel = try NLModel(mlModel: model.model)

let text = "When in Las Vegas I like to stay at the luxury hotel Bellagio or perhaps Wynn Las Vegas but not MGM Grand or the Luxor. Sometimes I like to dine at Delmonico at The Venetian or at one of the places at MGM Grand."
let range = text.startIndex..<text.endIndex

// Custom tag scheme backed by the model above.
let vegasTagScheme = NLTagScheme("Vegas")

let tagger = NLTagger(tagSchemes: [.nameType, vegasTagScheme])
tagger.string = text
tagger.setModels([compiledModel], forTagScheme: vegasTagScheme)
tagger.setLanguage(NLLanguage("en"), range: range)

// Walk the text word by word and print whatever tag the custom scheme assigns.
tagger.enumerateTags(in: range, unit: .word, scheme: vegasTagScheme, options: [.omitWhitespace, .joinNames, .omitPunctuation]) { (tag, tokenRange) -> Bool in
    let token = text[tokenRange]

    if let tag = tag {
        print("\(token): \(tag.rawValue)")
    }

    return true
}
When: Hotel
in: Hotel
Las Vegas: Hotel
I: Restaurant
like: Restaurant
to: Restaurant
stay: Restaurant
at: Restaurant
the: Restaurant
luxury: Restaurant
hotel: Restaurant
Bellagio: Restaurant
or: Restaurant
perhaps: Restaurant
Wynn Las Vegas: Restaurant
but: Restaurant
not: Restaurant
MGM Grand: Restaurant
or: Restaurant
the: Restaurant
Luxor: Restaurant
Sometimes: Activity
I: Activity
like: Activity
to: Activity
dine: Activity
at: Activity
Delmonico: Activity
at: Activity
The: Activity
Venetian: Activity
or: Activity
at: Activity
one: Activity
of: Activity
the: Activity
places: Activity
at: Activity
MGM Grand: Activity
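
One thing I may try next, to take NLTagger out of the equation, is asking the NLModel for a label directly. This is just a sketch bolted onto the Playground above; the tokens are ones straight from my training data, and I'm assuming a word-tagger NLModel hands back one of my labels from predictedLabel(for:).

// Sanity check of the trained model itself, bypassing NLTagger entirely.
// predictedLabel(for:) returns the model's label for a single token, or nil.
for token in ["Bellagio", "Spago", "Picasso"] {
    let label = compiledModel.predictedLabel(for: token) ?? "no label"
    print("\(token) -> \(label)")
}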

Replies

Lots of views, no replies.


I submitted this as one of my tech-support incidents. If I get a reply that's useful, I'll update this in case it can help others.

In talking with DTS, it sounds like my training data makes it very difficult for the tagger to generalize - there's not enough variety, and there just isn't a big enough sample.
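
If it helps anyone else reading this: the way I understand that feedback, my rows are just bare lists of entity names, so the tagger never sees the words around them or any "not an entity" examples. My guess at a more useful training row is below - the sentence is made up, and the "None" label for the filler words is just a name I picked, not anything prescribed by CreateML.

{"tokens":["We","had","dinner","at","Jasmine","inside","the","Bellagio","before","seeing","O"],"labels":["None","None","None","None","Restaurant","None","None","Hotel","None","None","Activity"]}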

I'm also having quite similar problems when I try to combine all of the sentences into one single tokens-and-labels row, which is different from the JSON file seen in the MLWordTagger example. However, when I try to separate it into multiple samples/sentences, the default CRF training algorithm seems to run indefinitely.
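
One thing I might try, just to confirm the setup is sound before committing to a long CRF run, is training on a small random slice of the per-sentence table first. This is only a sketch: allSentences is a placeholder for my full table, and the proportion and seed are arbitrary.

// Train on a small random slice first to make sure training completes
// and the columns are wired up correctly before the full run.
let (smallSlice, _) = allSentences.randomSplit(by: 0.1, seed: 7)

let trialModel = try MLWordTagger(trainingData: smallSlice,
                                  tokenColumn: "tokens",
                                  labelColumn: "labels")

// Summary of how the trial model did on its own training data.
print(trialModel.trainingMetrics)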