Word tagging tools for CreateML?

Question

Created Sep ’22

Replies 2

Participants 3

What tools are folks using to create the JSON file needed to train a custom word tagger model? I've tried Doccano, but it exports JSONL, which is very different from what Create ML expects.

(example of the required format here: https://developer.apple.com/documentation/naturallanguage/creating_a_word_tagger_model).

Are there standard tools or utilities that export/convert to the CreateML format?

Thanks.

Answer 1

OP

Apple

Nov ’22

It looks like the trailing ) might be breaking the URL, so I'm re-linking the format here for anyone else seeing this post.

https://developer.apple.com/documentation/naturallanguage/creating_a_word_tagger_model

As for going from JSONL to JSON, there appear to be some standard conversion tools between the two. A small Python script might be the easiest way to read in any annotation file and convert it to the required JSON format.
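As a rough sketch of what such a script could look like: the code below assumes each JSONL line is an object with a "text" string and a "label" list of [start, end, name] character spans, which is what some Doccano sequence-labeling exports look like. Your export may use different keys, so check a line of your file first. The function names, the whitespace tokenization, and the "NONE" default label are all illustrative choices, not anything Create ML requires.

```python
import json


def spans_to_labels(text, spans, default="NONE"):
    """Split text on whitespace and label each token by the span it overlaps."""
    tokens, labels = [], []
    pos = 0
    for token in text.split():
        # Locate this token's character range in the original text.
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        label = default
        for span_start, span_end, name in spans:
            # Any character overlap between token and span counts as a match.
            if start < span_end and end > span_start:
                label = name
                break
        tokens.append(token)
        labels.append(label)
    return {"tokens": tokens, "labels": labels}


def convert_jsonl(lines, default="NONE"):
    """Turn an iterable of JSONL lines into a list of tokens/labels records."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        item = json.loads(line)
        # Assumed keys: "text" and "label"; adjust to match your export.
        records.append(spans_to_labels(item["text"], item.get("label", []), default))
    return records


def convert_file(src_path, dst_path):
    """Read a JSONL annotation file and write a single JSON array."""
    with open(src_path, encoding="utf-8") as src:
        records = convert_jsonl(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(records, dst, indent=4, ensure_ascii=False)
```

Calling convert_file("export.jsonl", "training.json") (both file names are placeholders) would then write one array of {"tokens": [...], "labels": [...]} objects. Note that punctuation stays attached to words with this simple whitespace split, so you may want a different tokenizer if that matters for your data.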

Answer 2

roroCoder OP

Apr ’23

Hey there, I don't know if you've solved your issue, but this was a major problem for me that took some time to work out. For anyone else who runs into this problem, maybe consider this solution.

What I did was manually write JSON arrays and dictionaries that put a label on each token (a token being a single word of the sentence; imagine you take your input sentence and call textString.split(separator: " ") to split it on spaces).

What really sped things up was bringing ChatGPT in. It took a while to teach it what I needed, but you can get it to spit out correctly formatted JSON after giving it around 15-20 items you've written manually.

Here's the format:


[
    {
        "tokens": ["Remind", "me", "tomorrow", "at", "8", "am", "to", "leave", "for", "work"],
        "labels": ["NONE", "NONE", "TIME", "TIME", "TIME", "TIME", "NONE", "REMINDER", "REMINDER", "REMINDER"]
    },
    {
        "tokens": ["Set", "a", "reminder", "next", "tuesday", "to", "buy", "a", "large", "ruler"],
        "labels": ["NONE", "NONE", "NONE", "TIME", "TIME", "NONE", "REMINDER", "REMINDER", "REMINDER", "REMINDER"]
    }
]

Each word in the "tokens" array lines up with a label in the "labels" array. The ML model, once you successfully train it, will take in a sentence and spit out an array of labels that you can do a lot of things with, but that's another discussion.
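Since the two arrays have to line up one-to-one, it can be worth checking a hand-written or generated file before training. Here is a minimal Python sketch of such a check; the function name find_problems and the wording of the messages are made up for illustration, and it assumes every record is a JSON object using the "tokens" and "labels" keys shown above.

```python
import json


def find_problems(records):
    """Return (index, message) pairs for records whose arrays don't line up."""
    problems = []
    for i, rec in enumerate(records):
        tokens = rec.get("tokens")
        labels = rec.get("labels")
        if not isinstance(tokens, list) or not isinstance(labels, list):
            problems.append((i, "missing 'tokens' or 'labels' array"))
        elif len(tokens) != len(labels):
            problems.append((i, f"{len(tokens)} tokens but {len(labels)} labels"))
    return problems


# Small inline sample; in practice you would json.load() your training file.
sample = json.loads("""
[
    {"tokens": ["Remind", "me", "tomorrow"], "labels": ["NONE", "NONE", "TIME"]},
    {"tokens": ["Set", "a", "reminder"], "labels": ["NONE", "NONE"]}
]
""")

for index, message in find_problems(sample):
    print(f"record {index}: {message}")
```

With the sample above, only the second record (index 1) is reported, because it has three tokens but two labels.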

Hope this helps, roroDevelopment
