Create ML Trouble Loading CSV to Train Word Tagger With Commas in Training Data

I'm using Numbers to build a spreadsheet that I'm exporting as a CSV. I then import this file into Create ML to train a word tagger model. Everything has been working fine for all the models I've trained so far, but now I'm coming across a use case that has been breaking the import process: commas within the training data. This is a case that none of Apple's examples show.

My project takes Navajo text that has been tokenized by syllables and labels the parts-of-speech.

Case that works...

Raw text:

Naaltsoos yídéeshtah.

Tokens column:

Naal,tsoos, ,yí,déesh,tah,.

Labels column:

NObj,NObj,Space,Verb,Verb,VStem,Punct

Case that breaks...

Raw text:

óola, béésh łigaii, tłʼoh naadą́ą́ʼ, wáin, akʼah, dóó á,shįįh

Tokens column with tokenized text (commas quoted):

óo,la,",", ,béésh, ,łi,gaii,",", ,tłʼoh, ,naa,dą́ą́ʼ,",", ,wáin,",", ,a,kʼah,",", ,dóó, ,á,shįįh

(Create ML reports mismatched columns)

Tokens column with tokenized text (commas escaped):

óo,la,\,, ,béésh, ,łi,gaii,\,, ,tłʼoh, ,naa,dą́ą́ʼ,\,, ,wáin,\,, ,a,kʼah,\,, ,dóó, ,á,shįįh

(Create ML reports mismatched columns)

Tokens column with tokenized text (commas quoted, quotes backslash-escaped):

óo,la,\",\", ,béésh, ,łi,gaii,\",\", ,tłʼoh, ,naa,dą́ą́ʼ,\",\", ,wáin,\",\", ,a,kʼah,\",\", ,dóó, ,á,shįįh

(record not detected by Create ML)

Tokens column with tokenized text (commas quoted, quotes doubled):

óo,la,"","", ,béésh, ,łi,gaii,"","", ,tłʼoh, ,naa,dą́ą́ʼ,"","", ,wáin,"","", ,a,kʼah,"","", ,dóó, ,á,shįįh

(Create ML reports mismatched columns)

Labels column:

NSub,NSub,Punct,Space,NSub,Space,NSub,NSub,Punct,Space,NSub,Space,NSub,NSub,Punct,Space,NSub,Punct,Space,NSub,NSub,Punct,Space,Conj,Space,NSub,NSub

Sample From Spreadsheet

Solution Needed

It's simple enough to escape commas within CSV files, but the format needed by Create ML essentially packs an entire comma-separated list into a single column, so I end up needing a CSV record that mixes commas used as separators with commas meant as literal characters. That's where this gets complicated.

For this particular use case (which seems like it would frequently arise when training a word tagger model), how should I properly escape a comma literal?

In the (hopefully) short term, I am able to export the Numbers spreadsheet as a TSV and have written a crude converter (sketched below) that generates JSON from it that Create ML can handle properly. However, that adds an extra step that I would hope could be eliminated by exporting directly from Numbers for use in Create ML.
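For what it's worth, the converter does little more than the sketch below. This is a simplified illustration rather than my exact code: it assumes a two-column TSV with a header row and a non-comma delimiter (shown here as "|") between the items inside each cell, since commas are exactly what caused the trouble; the names convertTSVToJSON and TaggedSentence are placeholders.

import Foundation

// Create ML's word tagger accepts JSON shaped like:
// [{"tokens": ["Naal", "tsoos", ...], "labels": ["NObj", "NObj", ...]}, ...]
struct TaggedSentence: Codable {
    let tokens: [String]
    let labels: [String]
}

// Convert a two-column TSV (tokens <tab> labels) into that JSON.
// "|" stands in for whatever intra-cell delimiter the spreadsheet uses,
// since commas are what broke the CSV import in the first place.
func convertTSVToJSON(tsvURL: URL, jsonURL: URL, cellDelimiter: Character = "|") throws {
    let tsv = try String(contentsOf: tsvURL, encoding: .utf8)
    var sentences: [TaggedSentence] = []

    for (index, line) in tsv.split(separator: "\n").enumerated() {
        if index == 0 { continue }  // skip the header row
        let cells = line.split(separator: "\t", omittingEmptySubsequences: false)
        guard cells.count == 2 else { continue }

        let tokens = cells[0].split(separator: cellDelimiter, omittingEmptySubsequences: false).map(String.init)
        let labels = cells[1].split(separator: cellDelimiter, omittingEmptySubsequences: false).map(String.init)
        guard tokens.count == labels.count else {
            print("Row \(index): \(tokens.count) tokens but \(labels.count) labels")
            continue
        }
        sentences.append(TaggedSentence(tokens: tokens, labels: labels))
    }

    try JSONEncoder().encode(sentences).write(to: jsonURL)
}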

Can you share the CSV file as exported by Numbers? All commas within a cell need to be quoted or escaped. In your example there are only a few quoted or escaped commas.

I have attached the CSV. The lines in question are 290 and 291. Create ML will interpret those as 288 and 289, since it's zero-based and excludes the header row. Importing will work with no problems up to that point and will work with no problems if I delete both rows before exporting to CSV. If you open the CSV in Numbers, everything looks fine.

The CSV file is correct, but I'm having a hard time understanding how the tokens are encoded. The word tagger needs an array of strings, so you need to go from that cell's representation to an array of strings.

For example, this line, "?,.,!,"","","",:,;", after CSV processing becomes ?,.,!,",",",:,;. I assume it represents ["?", ".", "!", <comma>, <double quote>, ":", ";"], but you'll need to write custom code to make that interpretation. My suggestion is to use a different escape character to make that translation easier, for instance "?,.,!,\,,\"",:,;", and then interpret \ as an escape.
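As a minimal sketch of what that custom interpretation could look like (the function name and the exact escape convention here are illustrative, not anything Create ML provides):

// Split a cell such as ?,.,!,\,,\",:,; into tokens, treating \ as an
// escape so that \, is a literal comma and \" a literal double quote.
func splitEscapedTokens(_ cell: String, separator: Character = ",", escape: Character = "\\") -> [String] {
    var tokens: [String] = []
    var current = ""
    var escaping = false

    for character in cell {
        if escaping {
            current.append(character)   // take the escaped character literally
            escaping = false
        } else if character == escape {
            escaping = true
        } else if character == separator {
            tokens.append(current)      // an unescaped separator ends the current token
            current = ""
        } else {
            current.append(character)
        }
    }
    tokens.append(current)
    return tokens
}

// splitEscapedTokens(#"?,.,!,\,,\",:,;"#)
// -> ["?", ".", "!", ",", "\"", ":", ";"]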

The other option is to use JSON encoding: "[""?"", ""."", ""!"", "","", ""\"""", "":"", "";""]" and then use this to decode:

import Foundation
import TabularData

// Read the TOKENS column as raw data, then decode each cell's JSON array into [String].
var dataFrame = try DataFrame(
    contentsOfCSVFile: url,
    types: ["TOKENS": .data]
)
try dataFrame.decode([String].self, inColumn: "TOKENS", using: JSONDecoder())
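On the writing side, producing that kind of cell could be as simple as the sketch below, assuming the CSV is generated programmatically rather than exported from Numbers (csvCell is just an illustrative helper name):

import Foundation

// JSON-encode the token array, then quote it as a CSV cell by doubling any
// embedded double quotes and wrapping the whole value, per RFC 4180.
func csvCell(forTokens tokens: [String]) throws -> String {
    let json = String(data: try JSONEncoder().encode(tokens), encoding: .utf8)!
    let escaped = json.replacingOccurrences(of: "\"", with: "\"\"")
    return "\"\(escaped)\""
}

// csvCell(forTokens: ["?", ".", "!", ",", "\"", ":", ";"]) produces a cell
// in the same shape as the example above.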

Here is a more recent case to show what I'm trying to do, as that example with the punctuation was a proof-of-concept for testing. This includes a few commas within the text to be trained. Other examples include quotation marks.

Hodeeyáádą́ą́ʼ,Diyin,God,yótʼááh,hiníláii,índa,nahasdzáán,áyiilaa,.,Nahasdzáán,tʼáadoo,ánoolniní,da,",",índa,tʼáadoo,bikááʼ,siláhí,da,;,bikáaʼgi,tʼáá,átʼéé,nítʼééʼ,chahałheełgo,Diyin,God,biNíłchʼi,Diyinii,tó,yikááʼgóó,nahazleʼ,.,Áádóó,Diyin,God,ádííniid,",",Adinídíin,leʼ,.,Tʼáá,áko,adinídíín,hazlį́į́ʼ,.,Áko,Diyin,God,éí,adinídínígíí,yinééłʼį́įʼgo,bił,yáʼíítʼééh,",",áádóó,adinídínígíí,chahałheeł,yił,ałtsʼáyíínil,.
Adv,Adj,NSub,NObjPos,VPerf,Conj,NObj,VPerf,Punct,NSub,AdvNeg,VProg,PartNeg,Punct,Conj,AdvNeg,Adp,VImpf,PartNeg,Punct,Adp,Adv,VImpf,Adv,Adv,Adj,NSub,NSubPos,NSubPos,NAdp,AdpPos,VImpf,Punct,Conj,Adj,NSub,VPerf,Punct,NObj,VImp,Punct,Adv,Adv,NSub,VPerf,Punct,Adv,Adj,NSub,Pro,NObj,VPerfAdv,ProAdp,VPerf,Punct,Conj,NSub,NAdp,Adp,VPerf,Punct

The core problem is that Create ML does not seem to support several of the CSV escaping formats that common spreadsheet tools use, including Apple's own Numbers. It also does not accept any other file format that Numbers can export directly, as far as I could tell. That makes commas and quotation marks difficult to include in any training data.

I've been able to get around this by writing my own tool that imports TSV files from Numbers and converts them to JSON files that Create ML accepts, which adds two more steps to the training process each time. However, this post was originally about getting Create ML to accept a Numbers CSV file directly, without added steps every time. If this is not a bug and Create ML simply lacks the functionality, I will continue with my custom workaround, and we can consider this issue resolved.
