Return character's position in alphabet in Swift

Hi,


I'm writing a form of binary search to optimise an algorhythm to build a database from its core data entities. We have to look up whether we have built an entity previously, and if not then we add it.


The entities are in an array, and I can sort the array by inserting them in alphabetical order.


I am wondering if there is a further optimisation to make: namely that if we know that the new entity name begins with, say a 'D', then we know it should be approximately 4/26s of the way into the index of existing entities. A traditional binary search would always start at half way through.


In ye olde days of 8-bit ASCII text this would clearly be easy - you'd treat the character as a byte, subtract the offset of ASCII to the first character, etc etc.


Is there an 'easy' way to do this in Swift that isn't more computationally expensive than letting the search begin at halfway?


Best wishes,


Andrew

Replies

as xharacters in String may be UTF-16 or unicode scalar, I do not see how to optimize. In addition, trying to turn around language own optimizations may be a bit risky for your algorithm.

>> I am wondering if there is a further optimisation to make: namely that if we know that the new entity name begins with, say a 'D', then we know it should be approximately 4/26s of the way into the index of existing entities.


One major problem is that you don't know (or if you do, you haven't mentioned anything about it) that all initial letters are equally probable. "D"s might actually start 19/26ths of the way down the array.


Let's say that it's worth spending time optimizing the search — that there's enough to be gained. Then, it'd probably work better to optimized a little harder than you suggest. For example, when constructing the sorted array, you can keep track of the array index at which the entries for each letter start, which tells you where to looking when searching the array.


That would also help solve the other problem with your 4/26ths solution. If the initial comparison finds the search value is greater than the value at 4/26ths, the binary search is going to take a huge jump 11/26ths down the alphabet. What you really want to do is to limit the search to the "D"s.


However, this kind of seat-of-the-pants optimization is terribly fragile, because sorting and searching is a pretty hard problem. You might, for example, do better to research some standard techniques, such as a B-tree (en.wikipedia.org/wiki/B-tree), or leverage existing solutions, such as using a dictionary instead of an array.


For that matter, if you have a sorted array, shouldn't the entries you want to skip be adjacent to the previous entries you created a Core Data entity from?

You have to be really careful of Unicode issues here. For example, your first step, sorting the strings, is tricky when dealing with Unicode because the sort order is locale depending. Thus, you have to make sure that either your sort is locale independent or you re-sort on locale changes. You also have to make sure that your binary search uses exactly the same comparison as the sort.

With regards your main question — setting the binary search start point based on the first letter — I’d only attempt that if you know that the strings are limited in some way (for example, you know that you’re dealing with alphabetic languages). Trying to do that over the full range of Unicode would be quite challenging.

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

WWDC runs Mon, 5 Jun through to Fri, 9 Jun. During that time all of DTS will be at the conference, helping folks out face-to-face. http://developer.apple.com/wwdc/