MLWordEmbedding compression

When generating an MLWordEmbedding model, some kind of lossy compression seems to be applied to the original input vectors.

Just take the example from the documentation:

Code Block swift
import CreateML

let vectors = [
  "Hello"   : [0.0, 1.2, 5.0, 0.0],
  "Hello"   : [0.0, 1.2, 5.0, 0.0],
  "Goodbye" : [0.0, 1.3, -6.2, 0.1]
]
let embedding = try! MLWordEmbedding(dictionary: vectors)
embedding.vector(for: "Hello") == vectors["Hello"] // false
embedding.vector(for: "Goodbye") == vectors["Goodbye"] // false
// unexpectedly compressed to same vector
embedding.vector(for: "Hello") == embedding.vector(for: "Goodbye") // true
embedding.distance(between: "Hello", and: "Goodbye") // 0


Larger datasets, like word2vec, seem to work a bit better. But input vectors are still changed in unexpected ways and more vector collisions occur.
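
For what it's worth, here is roughly how I have been quantifying the distortion: compare the cosine distance computed from the raw input vectors with the distance the trained embedding reports for the same pair. This is just a sketch; the cosineDistance helper is my own, it assumes the vectors dictionary and embedding from the first example, and I'm assuming distance(between:and:) reports a cosine distance.

Code Block swift
// Sketch: compare each pair's cosine distance computed from the raw input
// vectors with the distance the trained embedding reports. Assumes the
// `vectors` dictionary and `embedding` from the example above, and that
// distance(between:and:) is a cosine distance.
func cosineDistance(_ a: [Double], _ b: [Double]) -> Double {
    let dot = zip(a, b).reduce(0) { $0 + $1.0 * $1.1 }
    let normA = a.reduce(0) { $0 + $1 * $1 }.squareRoot()
    let normB = b.reduce(0) { $0 + $1 * $1 }.squareRoot()
    return 1 - dot / (normA * normB)
}

let words = Array(vectors.keys)
for i in 0..<words.count {
    for j in (i + 1)..<words.count {
        let raw = cosineDistance(vectors[words[i]]!, vectors[words[j]]!)
        let trained = embedding.distance(between: words[i], and: words[j])
        print("\(words[i]) / \(words[j]): raw \(raw), trained \(trained)")
    }
}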

I'm curious which spatial properties are expected to hold after compression (are relative distances or nearest neighbors preserved, for example?). Is there some way to tune or disable the compression?

Thanks!

It also seems the vocabulary needs to be the same size as the embedding dimension, which is frustrating. The following example, for instance, only preserves the first three values of each vector; every component after the third comes back as zero.


Code Block swift
// All three entries deliberately share the same 31-dimensional vector.
let sharedVector: [Double] = [
    0.14456832, 0.100063734, 0.15700547, 0.14713864, 0.110189505,
    0.1219601, 0.081854515, 0.07345803, 0.19184813, 0.15609136,
    0.16970365, 0.14202964, 0.07074278, 0.15143657, 0.109310314,
    0.05455876, 0.15056399, 0.16634032, 0.08465124, 0.16243581,
    0.035854265, 0.10904387, 0.09732084, 0.12968284, 0.14430353,
    0.061719917, 0.1193506, 0.14363676, 0.14923467, 0.1795261,
    0.13087963
]

let vectors = [
    "Hello"       : sharedVector,
    "Goodbye"     : sharedVector,
    "Third value" : sharedVector
]

let wordEmbedding = try! MLWordEmbedding(dictionary: vectors)
print(wordEmbedding.vector(for: "Hello"))

The output is:

Optional([0.1445683240890503, 0.1000637337565422, 0.15700547397136688, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
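
For reference, this is the quick check I have been using to see how the number of surviving (non-zero) components tracks the vocabulary size. It is just a sketch: the random vectors are placeholders, and counting non-zero components is only a crude proxy for what the compression keeps.

Code Block swift
// Sketch: build vocabularies of increasing size from random 31-dimensional
// placeholder vectors and count how many components of a stored vector come
// back non-zero.
let dimension = 31
for vocabularySize in [2, 3, 8, 16, 31, 64] {
    var dictionary = [String: [Double]]()
    for i in 0..<vocabularySize {
        dictionary["word\(i)"] = (0..<dimension).map { _ in Double.random(in: 0...1) }
    }
    let testEmbedding = try! MLWordEmbedding(dictionary: dictionary)
    let nonZero = testEmbedding.vector(for: "word0")?.filter { $0 != 0 }.count ?? 0
    print("vocabulary: \(vocabularySize), non-zero components: \(nonZero)")
}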