Partitioning arrays of doubles

I want to write a function that takes an array of doubles and partitions them into a given number of "bins", based on their value. Obviously this would be useful for graphing the distribution of the elements of a data set. Before writing all the code myself, I'm wondering if there are some built-in features of Swift collections that might save me a lot of time.


I'm asking because I remember the joy I felt when someone here introduced me to the map() and filter() functions.

Accepted Reply

And a more robust, dealing correctly with 0 or 1 bin:


struct ValueDistributor {


var dataArray: [Double]

var numberOfBins: Int


func feedBins() -> [(Range<Double>, Int)] {

var collector: [(Range<Double>, Int)] = []

if numberOfBins < 1 { return collector }

let minValue = dataArray.min() ?? 0

let maxValue = dataArray.max() ?? 0

let maxValueForRange = maxValue + 0.000000000001 // to include maxValue in the top bin

let binSize = (maxValue - minValue) / Double(numberOfBins)

var binRanges: [Range<Double>] = []

if numberOfBins == 1 {

binRanges.append(minValue ..< maxValueForRange)

} else {

binRanges.append(minValue ..< minValue + binSize)

}

for binNum in 1..<numberOfBins {

if let lastRange = binRanges.last {

if binNum < numberOfBins - 1 {

binRanges.append(lastRange.upperBound ..< lastRange.upperBound + binSize)

} else {

binRanges.append(lastRange.upperBound ..< maxValueForRange)

}

}

}

for r in binRanges {

let countedRanges = dataArray.filter { r.contains($0) }

collector.append((r, countedRanges.count))

}

let lastLower = collector[numberOfBins-1].0.lowerBound

collector[numberOfBins-1].0 = lastLower ..< maxValue // Hide the artefact

return collector

}


}


----------------------------------------


If you prefer getting the result as a property and not a func, you can make bins a computed var

struct ValueDistributor {


var dataArray: [Double]

var numberOfBins: Int


var bins : [(Range<Double>, Int)] {

var collector: [(Range<Double>, Int)] = []

if numberOfBins < 1 { return collector }

let minValue = dataArray.min() ?? 0

let maxValue = dataArray.max() ?? 0

let maxValueForRange = maxValue + 0.000000000001 // to include maxValue in the top bin

let binSize = (maxValue - minValue) / Double(numberOfBins)

var binRanges: [Range<Double>] = []

if numberOfBins == 1 {

binRanges.append(minValue ..< maxValueForRange)

} else {

binRanges.append(minValue ..< minValue + binSize)

}

for binNum in 1..<numberOfBins {

if let lastRange = binRanges.last {

if binNum < numberOfBins - 1 {

binRanges.append(lastRange.upperBound ..< lastRange.upperBound + binSize)

} else {

binRanges.append(lastRange.upperBound ..< maxValueForRange)

}

}

}

for r in binRanges {

let countedRanges = dataArray.filter { r.contains($0) }

collector.append((r, countedRanges.count))

}

let lastLower = collector[numberOfBins-1].0.lowerBound

collector[numberOfBins-1].0 = lastLower ..< maxValue // Hide the artefact

return collector

} // end of bins computed var


}

Replies

Please be more specific. partitions them into a given number of "bins", based on their value is not enough to judge if there is some simple way or not.


Showing your example input and the desired output would be some help.


Generally, there's no built-in feature to partition an array in Swift Standard Library, you should better start writing your code.

If you can explain what you really want to do, there may be some possible combinations of built-in features.

When you say doubles, do you mean Double numeric type (if so, how do you want to split it ?) or do you mean a pair of numbers (12.5, 22.6) ?

In that case, map function could be useful.


I don't either understand what bins are.

What I want to do is count the number of values in an arbitrary data set that fall into each of a given number of value ranges (the "bins"). So if I have an array of 1000 integers with a maximum value of 100, I might want to know how many of them fall into each the ranges 1...9, 10...19, 20...29, and so on. This is the sort of thing you'd want to do when creating a graph showing how some measurable property is distributed through a population.


Here's the code I wrote to do this:


struct ValueDistributor {

  var dataArray: [Double]
  var numberOfBins: Int
  var minValue = 0.0
  var maxValue = 0.0
  var bins: [(Range, Int)] = []

  init(dataArray: [Double], numberOfBins: Int) {

       self.dataArray = dataArray
       self.numberOfBins = numberOfBins
       minValue = dataArray.min() ?? 0
       maxValue = dataArray.max() ?? 0

       let binSize = (maxValue - minValue) / Double(numberOfBins)
       var binRanges: [Range] = []
       binRanges.append(minValue ..< minValue + binSize)
       for _ in 1 ... numberOfBins {
            if let lastRange = binRanges.last {
                 binRanges.append(lastRange.upperBound ..< lastRange.upperBound + binSize)
            }
       }

       var collector: [(Range, Int)] = []
       for item in dataArray {
            for r in binRanges {
                 if r.contains(item) {
                      collector.append((r, 1))
                 }
            }     
       }
       let unsortedBins = Dictionary(collector, uniquingKeysWith: {$0 + $1})
       bins = unsortedBins.sorted(by: {$0.key.lowerBound < $1.key.lowerBound})
  }
}


The struct is initialized with the data to be "binned", and the number of bins to put it in. The bins are created, then the number of values that fit into each one are counted. In the end, I end up with an array of tuples which pair each of the bins with the number of values it has.


It seems to work as it is, but no doubt there are places where it could be made much more efficient.

If that works well, there may be no need to optimize.


But it will probably not compile.

Should replace Range by Range<Double> everywhere

You get one more bin than specified number


        binRanges.append(minValue..        for _ in 1 ... numberOfBins {                                                // numberOfBins more bins
            if let lastRange = binRanges.last {
                binRanges.append(lastRange.upperBound..            }
       // At the end numberOfBins + 1


Should test also that numberOfBins is at least 1.


A few remarks.


I would not declare bins as part of the struct definition.


In fact, it is a result.

So, for better readability, you could rwrite as:


struct ValueDistributor {

    var dataArray: [Double]
    var numberOfBins: Int
   
    init(dataArray: [Double], numberOfBins: Int) {
       
        self.dataArray = dataArray
        self.numberOfBins = numberOfBins
    }
   
    func feedBins() -> [(Range, Int)] {
       
        let minValue = dataArray.min() ?? 0
        let maxValue = dataArray.max() ?? 0
        let maxValueForRange = maxValue + 0.000000000001   // to include maxValue in the top bin
        var bins: [(Range, Int)] = []
        let binSize = (maxValue - minValue) / Double(numberOfBins)
        var binRanges: [Range] = []
       
        binRanges.append(minValue ..< minValue + binSize)
        for binNum in 1..            if let lastRange = binRanges.last {
                if binNum < numberOfBins - 1 {
                    binRanges.append(lastRange.upperBound ..< lastRange.upperBound + binSize)
                } else {
                    binRanges.append(lastRange.upperBound ..< maxValueForRange)
                }
               
            }
        }
       
        var collector: [(Range, Int)] = []
        for item in dataArray {
            for r in binRanges {
                if r.contains(item) {
                    collector.append((r, 1))
                }
            }
        }
        let unsortedBins = Dictionary(collector, uniquingKeysWith: {$0 + $1})
        bins = unsortedBins.sorted(by: {$0.key.lowerBound < $1.key.lowerBound})
       
        let lastLower = bins[numberOfBins-1].0.lowerBound
        bins[numberOfBins-1].0 = lastLower ..< maxValue       // Hide the artefact
        return bins
    }

}


let collector = ValueDistributor2(dataArray: [5.0, 2.0, 3.0, 11.0, -5, 7.2], numberOfBins: 2)

print(collector.feedBins())


You get

[(Range(-5.0..<3.0), 2), (Range(3.0..<11.0), 4)]



Your code (after corrections to compile), gives

[(Range(-5.0..<3.0), 2), (Range(3.0..<11.0), 3), (Range(11.0..<19.0), 1)]

That's not necessariliy an optimization, but you could simplify a bit further with filter() :

Note: formatter does not like lines with Ranges…


struct ValueDistributor {


var dataArray: [Double]

var numberOfBins: Int


func feedBins() -> [(Range<Double>, Int)] {

let minValue = dataArray.min() ?? 0

let maxValue = dataArray.max() ?? 0

let maxValueForRange = maxValue + 0.000000000001 // to include maxValue in the top bin

let binSize = (maxValue - minValue) / Double(numberOfBins)

var binRanges: [Range<Double>] = []

binRanges.append(minValue ..< minValue + binSize)

for binNum in 1..<numberOfBins {

if let lastRange = binRanges.last {

if binNum < numberOfBins - 1 {

binRanges.append(lastRange.upperBound ..< lastRange.upperBound + binSize)

} else {

binRanges.append(lastRange.upperBound ..< maxValueForRange)

}

}

}

var collector: [(Range<Double>, Int)] = []

for r in binRanges {

let countedRanges = dataArray.filter { r.contains($0) }

collector.append((r, countedRanges.count))

}

let lastLower = collector[numberOfBins-1].0.lowerBound

collector[numberOfBins-1].0 = lastLower ..< maxValue // Hide the artefact

return collector

}


}

And a more robust, dealing correctly with 0 or 1 bin:


struct ValueDistributor {


var dataArray: [Double]

var numberOfBins: Int


func feedBins() -> [(Range<Double>, Int)] {

var collector: [(Range<Double>, Int)] = []

if numberOfBins < 1 { return collector }

let minValue = dataArray.min() ?? 0

let maxValue = dataArray.max() ?? 0

let maxValueForRange = maxValue + 0.000000000001 // to include maxValue in the top bin

let binSize = (maxValue - minValue) / Double(numberOfBins)

var binRanges: [Range<Double>] = []

if numberOfBins == 1 {

binRanges.append(minValue ..< maxValueForRange)

} else {

binRanges.append(minValue ..< minValue + binSize)

}

for binNum in 1..<numberOfBins {

if let lastRange = binRanges.last {

if binNum < numberOfBins - 1 {

binRanges.append(lastRange.upperBound ..< lastRange.upperBound + binSize)

} else {

binRanges.append(lastRange.upperBound ..< maxValueForRange)

}

}

}

for r in binRanges {

let countedRanges = dataArray.filter { r.contains($0) }

collector.append((r, countedRanges.count))

}

let lastLower = collector[numberOfBins-1].0.lowerBound

collector[numberOfBins-1].0 = lastLower ..< maxValue // Hide the artefact

return collector

}


}


----------------------------------------


If you prefer getting the result as a property and not a func, you can make bins a computed var

struct ValueDistributor {


var dataArray: [Double]

var numberOfBins: Int


var bins : [(Range<Double>, Int)] {

var collector: [(Range<Double>, Int)] = []

if numberOfBins < 1 { return collector }

let minValue = dataArray.min() ?? 0

let maxValue = dataArray.max() ?? 0

let maxValueForRange = maxValue + 0.000000000001 // to include maxValue in the top bin

let binSize = (maxValue - minValue) / Double(numberOfBins)

var binRanges: [Range<Double>] = []

if numberOfBins == 1 {

binRanges.append(minValue ..< maxValueForRange)

} else {

binRanges.append(minValue ..< minValue + binSize)

}

for binNum in 1..<numberOfBins {

if let lastRange = binRanges.last {

if binNum < numberOfBins - 1 {

binRanges.append(lastRange.upperBound ..< lastRange.upperBound + binSize)

} else {

binRanges.append(lastRange.upperBound ..< maxValueForRange)

}

}

}

for r in binRanges {

let countedRanges = dataArray.filter { r.contains($0) }

collector.append((r, countedRanges.count))

}

let lastLower = collector[numberOfBins-1].0.lowerBound

collector[numberOfBins-1].0 = lastLower ..< maxValue // Hide the artefact

return collector

} // end of bins computed var


}

I'd missed having to add something to maxValue in order to count the maximum value in the data array: thanks!

hi,


without getting into boundary cases and the problems of counting double values, let me suggest a way to do this for the simpler problem you posed:


"So if I have an array of 1000 integers with a maximum value of 100, I might want to know how many of them fall into each the ranges 1...9, 10...19, 20...29, and so on."


assuming that you can turn each value into a "binNumber" in a known range (say, 0 to maxBinNumber -- in this case, it's easy: just divide by 10 (integer division) -- you can use a built-in dictionary method to "partition" your array by binNumber.


import Foundation

let myArray = [13, 27, 83, 19, 37, 43, 59, 51, 53, 13, 99, 1, 10, 9, 100 ]

func binNumber(_ x: Int) -> Int {
  return x / 10
}
let maxBinNumber = 10

let myDictionary = Dictionary(grouping: myArray, by: { binNumber($0) })
for bin in 0...maxBinNumber {
  if let numbers = myDictionary[bin] {
  print (bin, numbers)
  } else {
  print (bin, "(None)")
  }
}


the output of the above is


0 [1, 9]

1 [13, 19, 13, 10]

2 [27]

3 [37]

4 [43]

5 [59, 51, 53]

6 (None)

7 (None)

8 [83]

9 [99]

10 [100]


of course, you'd simply want to look at the count of the values in each bin, not the contents of the bin.


hope that helps,

DMG

If you prefer `map` or `filter` or something like that, you can re-write your code like this:

struct ValueDistributor2 {
    
    var dataArray: [Double]
    var numberOfBins: Int
    var minValue = 0.0
    var maxValue = 0.0
    var bins: [(Range<Double>, Int)] = []
    
    init(dataArray: [Double], numberOfBins: Int) {
        
        self.dataArray = dataArray
        self.numberOfBins = numberOfBins
        minValue = dataArray.min() ?? 0
        maxValue = dataArray.max() ?? 0
        
        let binSize = (maxValue - minValue) / Double(numberOfBins)
        assert(binSize != 0.0) //Ignore special case

        func binIndex(_ number: Double) -> Int {
            let index = Int(floor((number-minValue)/binSize))
            return index
        }
        
        func range(ofBin index: Int) -> Range<Double> {
            let lowerBound = minValue + binSize * Double(index)
            let upperBound = minValue + binSize * Double(index+1)
            return lowerBound..<upperBound
        }
        
        let counts = dataArray.reduce(into: [:]) {$0[binIndex($1), default: 0] += 1}
        bins = counts.sorted(by: {$0.key < $1.key}).map{(range(ofBin: $0.key), $0.value)}
    }
}


If you do not like that the last range may always contain the max value, you can write something like this:

struct GeneralizedRange<T>: Hashable, CustomStringConvertible
where T: Hashable, T: Comparable {
    var lowerBound: T
    var upperBound: T
    var isClosed: Bool
    
    init(_ range: Range<T>) {
        self.lowerBound = range.lowerBound
        self.upperBound = range.upperBound
        self.isClosed = false
    }
    
    init(_ range: ClosedRange<T>) {
        self.lowerBound = range.lowerBound
        self.upperBound = range.upperBound
        self.isClosed = true
    }
    
    var description: String {
        if isClosed {
            return "\(lowerBound)...\(upperBound)"
        } else {
            return "\(lowerBound)..<\(upperBound)"
        }
    }
    
    func contains(_ value: T) -> Bool {
        if isClosed {
            return lowerBound<=value && value<=upperBound
        } else {
            return lowerBound<=value && value<upperBound
        }
    }
    
    static func ~= (lhs: GeneralizedRange<T>, rhs: T) -> Bool {
        return lhs.contains(rhs)
    }
}
struct ValueDistributor3 {
    
    var dataArray: [Double]
    var numberOfBins: Int
    var minValue = 0.0
    var maxValue = 0.0
    var bins: [(GeneralizedRange<Double>, Int)] = []
    
    init(dataArray: [Double], numberOfBins: Int) {
        
        self.dataArray = dataArray
        self.numberOfBins = numberOfBins
        minValue = dataArray.min() ?? 0
        maxValue = dataArray.max() ?? 0
        
        let binSize = (maxValue - minValue) / Double(numberOfBins)
        assert(binSize != 0.0) //Ignore special case

        func binIndex(_ number: Double) -> Int {
            let index = Int(floor((number-minValue)/binSize))
            return index >= numberOfBins ? numberOfBins-1 : index
        }
        
        func range(ofBin index: Int) -> GeneralizedRange<Double> {
            let lowerBound = minValue + binSize * Double(index)
            let upperBound = minValue + binSize * Double(index+1)
            if index == numberOfBins-1 {
                return GeneralizedRange(lowerBound...upperBound)
            } else {
                return GeneralizedRange(lowerBound..<upperBound)
            }
        }
        
        let counts = dataArray.reduce(into: [:]) {$0[binIndex($1), default: 0] += 1}
        bins = counts.sorted(by: {$0.key < $1.key}).map{(range(ofBin: $0.key), $0.value)}
    }
}

Testing code and Output example:

var dataArray: [Double] = []
for _ in 0..<20 {
    dataArray.append(Double(Int.random(in: 0...100)))
}
print(dataArray)
let vd = ValueDistributor(dataArray: dataArray, numberOfBins: 10)
print(vd.bins)

let vd2 = ValueDistributor2(dataArray: dataArray, numberOfBins: 10)
print(vd2.bins, vd.bins.map{$0.1} == vd2.bins.map{$0.1})

let vd3 = ValueDistributor3(dataArray: dataArray, numberOfBins: 10)
print(vd3.bins, vd.bins.map{$0.1} == vd3.bins.map{$0.1})


[25.0, 66.0, 6.0, 0.0, 81.0, 71.0, 10.0, 84.0, 92.0, 56.0, 64.0, 71.0, 73.0, 66.0, 100.0, 19.0, 70.0, 51.0, 59.0, 19.0]

[(Range(0.0..<10.0), 2), (Range(10.0..<20.0), 3), (Range(20.0..<30.0), 1), (Range(50.0..<60.0), 3), (Range(60.0..<70.0), 3), (Range(70.0..<80.0), 4), (Range(80.0..<90.0), 2), (Range(90.0..<100.0), 1), (Range(100.0..<110.0), 1)]

[(Range(0.0..<10.0), 2), (Range(10.0..<20.0), 3), (Range(20.0..<30.0), 1), (Range(50.0..<60.0), 3), (Range(60.0..<70.0), 3), (Range(70.0..<80.0), 4), (Range(80.0..<90.0), 2), (Range(90.0..<100.0), 1), (Range(100.0..<110.0), 1)] true

[(0.0..<10.0, 2), (10.0..<20.0, 3), (20.0..<30.0, 1), (50.0..<60.0, 3), (60.0..<70.0, 3), (70.0..<80.0, 4), (80.0..<90.0, 2), (90.0...100.0, 2)] false


Please remember, there may be some cases calculation errors affect differently in each implementation.

I've got another version of the "data binning" struct. This one originally had Int in place of T, and it compiled without complaint. But when I tried to make it generic, I started getting an Ambiguous reference to member '+' error on line #19. I don't know what's happening there, probably because I'm not very clear on how the uniquingKeysWith closure is structured. I could do the counting with a loop, but I really like the elegance of that Dictionary trick.


struct DataDistribution<T: Comparable> {
  var dataArray: [T] = []
  var binRanges: [Range<T>] = []
  init(dataArray: [T], binRanges: [Range<T>]) {
       self.dataArray = dataArray
       self.binRanges = binRanges
  }
  var bins: [(Range<T>, Int)] {
  var collector = binRanges.map { (r) -> (Range<T>, Int) in
       return (r, 0)
  }
  for item in dataArray {
       for r in binRanges {
            if r.contains(item) {
                 collector.append((r, 1))
            }
       }
  }
  let unsortedBins = Dictionary(collector, uniquingKeysWith: {$0 + $1})
  return unsortedBins.sorted(by: {$0.key.lowerBound < $1.key.lowerBound})
  }
}


On another matter: is there a way of getting the code to indent properly when I paste it in here? Or does it have to be done by hand?


And thanks for all the code that's been provided in the replies here: I've learned a lot of useful stuff from it.

I would solve this problem by mapping each item to a ‘lead’ item and then grouping based on that. For example:

struct Group<Item, Lead> where
    Lead: Comparable & Hashable
{
    var lead: Lead
    var items: [Item]
}

func groupItems<Item, Lead>(_ items: [Item], by lead: (Item) -> Lead) -> [Group<Item, Lead>] where
    Lead: Comparable & Hashable
{
    let groups = Dictionary(grouping: items, by: lead)
    let sortedLeads = groups.keys.sorted()
    return sortedLeads.map { lead in
        Group(lead: lead, items: groups[lead]!)
    }
}

The client then decides on the item-to-lead mapping. For example

let groups = groupItems([2.1, 1.3, 2.0, 2.5, 7.9, 6.2, 7.8]) { item -> Double in
    item.rounded(.towardZero)
}
for g in groups {
    print(g.lead, "->", g.items)
}
// prints:
// 1.0 -> [1.3]
// 2.0 -> [2.1, 2.0, 2.5]
// 6.0 -> [6.2]
// 7.0 -> [7.9, 7.8]

is there a way of getting the code to indent properly when I paste it in here?

Alas, I don’t have any experience with the DevForums editor. See this post for an explanation of my workflow.

Share and Enjoy

Quinn “The Eskimo!”
Apple Developer Relations, Developer Technical Support, Core OS/Hardware

let myEmail = "eskimo" + "1" + "@apple.com"

Very nice! I'd never considered that Dictionary(grouping:by:) before. But I'd still like to understand the error that was generated by my original code.

You should open a new thread for this and reformulate precidely the new problem you face.

Solved. I just needed to add <Range<T>, Int> to Dictionary in line 19. I guess that was the ambiguity the compiler was worried about. I also had to make the struct Hashable as well as Comparable (line 1).

I found a simple solution, mostly by chance (see my latest reply); but at least I now understand what the problem was.


The next time something like this happens, I will start a new thread, as you advise.