MLWordEmbedding compression

When generating an MLWordEmbedding model, some kind of lossy compression seems to be applied to the original input vectors.

Just take the example from the documentation:

Code Block swift
import CreateML

let vectors = [
  "Hello"   : [0.0, 1.2, 5.0, 0.0],
  "Hello"   : [0.0, 1.2, 5.0, 0.0],
  "Goodbye" : [0.0, 1.3, -6.2, 0.1]
]
let embedding = try! MLWordEmbedding(dictionary: vectors)
embedding.vector(for: "Hello") == vectors["Hello"] // false
embedding.vector(for: "Goodbye") == vectors["Goodbye"] // false
// unexpectedly compressed to same vector
embedding.vector(for: "Hello") == embedding.vector(for: "Goodbye") // true
embedding.distance(between: "Hello", and: "Goodbye") // 0


Larger datasets, like word2vec, seem to work a bit better. But input vectors are still changed in unexpected ways and more vector collisions occur.
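
For what it's worth, here is roughly how I have been quantifying the distortion: compare the cosine distance computed from the raw input vectors with the distance the trained embedding reports for the same pair. This is just a sketch; the cosineDistance helper is my own, it assumes the vectors dictionary and embedding from the first example, and I'm assuming distance(between:and:) reports a cosine distance.

Code Block swift
// Sketch: compare each pair's cosine distance computed from the raw input
// vectors with the distance the trained embedding reports. Assumes the
// `vectors` dictionary and `embedding` from the example above, and that
// distance(between:and:) is a cosine distance.
func cosineDistance(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let normA = a.reduce(0) { $0 + $1 * $1 }.squareRoot()
    let normB = b.reduce(0) { $0 + $1 * $1 }.squareRoot()
    return 1 - dot / (normA * normB)
}

let words = Array(vectors.keys)
for i in 0..<words.count {
    for j in (i + 1)..<words.count {
        let raw = cosineDistance(vectors[words[i]]!, vectors[words[j]]!)
        let trained = embedding.distance(between: words[i], and: words[j])
        print("\(words[i]) / \(words[j]): raw \(raw), trained \(trained)")
    }
}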

I'm curious which spatial properties are expected to hold after compression (are relative distances or nearest neighbors preserved, for example?). Is there some way to tune or disable the compression?

Thanks!

It also seems the vocabulary needs to be the same size as the embedding dimension, which is frustrating. The following example, for instance, only preserves the first three values of each vector; every component after the third comes back as zero.


Code Block swift
// All three entries deliberately share the same 31-dimensional vector.
let sharedVector: [Double] = [
    0.14456832, 0.100063734, 0.15700547, 0.14713864, 0.110189505,
    0.1219601, 0.081854515, 0.07345803, 0.19184813, 0.15609136,
    0.16970365, 0.14202964, 0.07074278, 0.15143657, 0.109310314,
    0.05455876, 0.15056399, 0.16634032, 0.08465124, 0.16243581,
    0.035854265, 0.10904387, 0.09732084, 0.12968284, 0.14430353,
    0.061719917, 0.1193506, 0.14363676, 0.14923467, 0.1795261,
    0.13087963
]

let vectors = [
    "Hello"       : sharedVector,
    "Goodbye"     : sharedVector,
    "Third value" : sharedVector
]

let wordEmbedding = try! MLWordEmbedding(dictionary: vectors)
print(wordEmbedding.vector(for: "Hello"))

The output is:

Optional([0.1445683240890503, 0.1000637337565422, 0.15700547397136688, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
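
For reference, this is the quick check I have been using to see how the number of surviving (non-zero) components tracks the vocabulary size. It is just a sketch: the random vectors are placeholders, and counting non-zero components is only a crude proxy for what the compression keeps.

Code Block swift
// Sketch: build vocabularies of increasing size from random 31-dimensional
// placeholder vectors and count how many components of a stored vector come
// back non-zero.
let dimension = 31
for vocabularySize in [2, 3, 8, 16, 31, 64] {
    var dictionary = [String: [Double]]()
    for i in 0..<vocabularySize {
        dictionary["word\(i)"] = (0..<dimension).map { _ in Double.random(in: 0...1) }
    }
    let testEmbedding = try! MLWordEmbedding(dictionary: dictionary)
    let nonZero = testEmbedding.vector(for: "word0")?.filter { $0 != 0 }.count ?? 0
    print("vocabulary: \(vocabularySize), non-zero components: \(nonZero)")
}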