Online Preview

Optimizing Collections

Write custom collections in Swift with a strong focus on performance

by Károly Lőrentey

See Buying Options

Table of Contents

Preface

On the surface level, this book is about making fast collection implementations. It presents many different solutions to the same simple problem, explaining each approach in detail, and constantly trying to find new ways to push the performance of the next variation even higher than the last.

But secretly, this book is really just a joyful exploration of the many tools Swift gives us for expressing our ideas. This book won’t tell you how to ship a great iPhone app; rather it teaches you tools and techniques that will help you become better at expressing your ideas in the form of Swift code.

The book has grown out of notes and source code I made in preparation for a talk I gave at the dotSwift 2017 conference. I ended up with so much interesting material that I couldn’t possibly fit into a single talk, so I made a book out of it. (You don’t need to see the talk to understand the book, but the video is only 20 minutes or so, and dotSwift has done a great job with editing so that I almost pass for a decent presenter. Also, I’m sure you’ll love my charming Hungarian accent!)

Who Is This Book For?

At face value, this book is for people who want to implement their own collection types, but its contents are useful for anyone who wants to learn more about some of the idiosyncratic language facilities that make Swift special. Getting used to working with algebraic data types, or knowing how to create swifty value types with copy-on-write reference-counted storage will help you become a better programmer for everyday tasks too.

This book assumes you’re a somewhat experienced Swift programmer. You don’t need to be an expert, though: if you’re familiar with the basic syntax of the language and you’ve written, say, a thousand lines of Swift code, you’ll be able to understand most of it just fine. If you need to get up to speed on Swift, I highly recommend another objc.io book, Advanced Swift by Chris Eidhof, Ole Begemann, and Airspeed Velocity. It picks up where Apple’s The Swift Programming Language left off, and it dives deeper into Swift’s features, explaining how to use them in an idiomatic (aka swifty) way.

Most of the code in this book will work on any platform that can run Swift code. In the couple of cases where I needed to use features that are currently available in neither the standard library nor the cross-platform Foundation framework, I included platform-specific code supporting Apple platforms and GNU/Linux. The code was tested using the Swift 3.1 compiler that shipped in Xcode 8.3.2.

New Editions of This Book

From time to time, I may publish new editions of this book to fix bugs, to follow the evolution of the Swift language, or to add additional material. You’ll be able to download these updates from the book’s product page on Gumroad. You can also go there to download other variants of the book; the edition you’re currently reading is available in EPUB, PDF, and Xcode playground formats. (These are free to download as long as you’re logged in with the Gumroad account you used to purchase the book.)

I’ve set up a GitHub repository where you can find the full source code of all the algorithms explained in the text. This code is simply extracted from the book itself, so there’s no extra information there, but it’s nice to have the source available in a standalone package in case you want to experiment with your own modifications.

You’re welcome to use any code from this repository in your own apps, although doing so wouldn’t necessarily be a good idea: the code is simplified to fit the book, so it’s not quite production quality. Instead, I recommend you take a look at BTree, my elaborate ordered collections package for Swift. It includes a production-quality implementation of the most advanced data structure described in this book, providing tree-based analogues of the standard library’s Array, Set, and Dictionary collections, along with a flexible BTree type for lower-level access to the underlying structure.

Attabench is my macOS benchmarking app for generating microbenchmark charts. The actual benchmarks from this book are included in the app by default. I highly recommend you check out the app and try repeating my measurements on your own computer. You may even get inspired to experiment with benchmarking your own algorithms.

Contacting the Author

If you find a mistake, please help me fix it by filing a bug about it in the book’s GitHub repository. For other feedback, feel free to contact me on Twitter; my handle is @lorentey. If you prefer email, write to collections@objc.io.

How to Read This Book

I’m not breaking new ground here; this book is intended to be read from front to back. The text often refers back to a solution from a previous chapter, assuming the reader has, uhm, done their job. That said, feel free to read the chapters in any order you want. But try not to get upset if it makes less sense that way, OK?

This book contains lots of source code. In the Xcode playground version of the book, almost all of it is editable and your changes are immediately applied. Feel free to experiment with modifying the code — sometimes the best way to understand it is to see what happens when you change it.

For instance, here’s a useful extension method on Sequence that shuffles its elements into random order. It has a couple of FIXME comments describing problems in the implementation. Try modifying the code to fix these issues!

#if os(macOS) || os(iOS) || os(watchOS) || os(tvOS)
import Darwin // for arc4random_uniform()
#elseif os(Linux)
import Glibc // for random()
#endif
 
extension Sequence {
public func shuffled() -> [Iterator.Element] {
var contents = Array(self)
for i in 0 ..< contents.count {
#if os(macOS) || os(iOS) || os(watchOS) || os(tvOS)
// FIXME: This breaks if the array has 2^32 elements or more.
let j = Int(arc4random_uniform(UInt32(contents.count)))
#elseif os(Linux)
// FIXME: This has modulo bias. Also, `random` should be seeded by calling `srandom`.
let j = random() % contents.count
#endif
if i != j {
swap(&contents[i], &contents[j])
}
}
return contents
}
}

To illustrate what happens when a piece of code is executed, I sometimes show the result of an expression. As an example, let’s try running shuffled to demonstrate that it returns a new random order every time it’s run:

(0 ..< 20).shuffled()
[5, 4, 1, 17, 15, 12, 0, 7, 16, 6, 13, 10, 14, 19, 2, 8, 9, 18, 3, 11]
(0 ..< 20).shuffled()
[5, 11, 8, 12, 2, 4, 3, 9, 10, 6, 18, 0, 1, 16, 19, 7, 14, 13, 15, 17]
(0 ..< 20).shuffled()
[3, 4, 10, 6, 17, 18, 15, 11, 13, 19, 14, 5, 7, 1, 8, 2, 9, 0, 16, 12]

In the playground version, all this output is generated live, so you’ll get a different set of shuffled numbers every time you rerun the page.

Acknowledgments

This book wouldn’t be the same without the wonderful feedback I received from readers of its earlier drafts. I’m especially grateful to Chris Eidhof; he spent considerable time reviewing this book, and he provided detailed feedback that greatly improved the final result. (Naturally, I’m solely responsible for any issues that remain in the text.)

I’d be amiss not to mention Floppy, my six-year-old beagle: her ability to patiently listen to me describing various highly technical problems contributed a great deal to their solutions. Good girl!

1 Introduction

Swift’s concept of a collection is one of the core abstractions in the language. The primary collections in the standard library — arrays, sets, and dictionaries — are used in essentially all Swift programs, from the tiniest scripts to the largest apps. The specific ways they work feels familiar to all Swift programmers and gives the language a unique personality.

When we need to design a new general-purpose collection type, it’s important we follow the precedent established by the standard library. It’s not enough to simply conform to the documented requirements of the Collection protocol: we also need to go the extra mile to match behavior with standard collection types. Doing this correctly is the essence of the elusive property of swiftiness, which is often so hard to explain, yet whose absence is so painfully noticeable.

1.1 Copy-On-Write Value Semantics

The most important quality expected of a Swift collection is copy-on-write value semantics.

In essence, value semantics in this context means that each variable holding a value behaves as if it held an independent copy of it, so that mutating a value held by one variable will never modify the value of another:

var a = [2, 3, 4]
var b = a
a.insert(1, at: 0)
a
[1, 2, 3, 4]
b
[2, 3, 4]

In order to implement value semantics, the code above needs to copy the underlying array storage at some point to allow the two array instances to have different elements. For simple value types (like Int or CGPoint), the entire value is stored directly in the variable, and this copying happens automatically every time a new variable is initialized or whenever a new value is assigned to an existing variable.

However, putting an Array value into a new variable (e.g. when we assign to b) does not copy its underlying storage: it just creates a new reference to the same heap-allocated buffer, finishing in constant time. The actual copying is deferred until one of the values using the same shared storage is mutated (e.g. in insert). Note though that mutations only need to copy the storage if the underlying storage is shared. If the array value holds the only reference to its storage, then it’s safe to modify the storage buffer directly.

When we say that Array implements the copy-on-write optimization, we’re essentially making a set of promises about the performance of its operations so that they behave as described above.

(Note that full value semantics is normally taken to mean some combination of abstract concepts with scary names like referential transparency, extensionality, and definiteness. Swift’s standard collections violate each of these to a certain degree. For example, indices of one Set value aren’t necessarily valid in another set, even if both sets contain the exact same elements. Therefore, Swift’s Set isn’t entirely referentially transparent.)

1.2 The SortedSet Protocol

To get started, we first need to decide what task we want to solve. One common data structure currently missing from the standard library is a sorted set type, i.e. a collection like Set, but one that requires its elements to be Comparable rather than Hashable, and that keeps its elements sorted in increasing order. So let’s make one of these!

The sorted set problem is such a nice demonstration of the various ways to build data structures that this entire book will be all about it. We’re going to create a number of independent solutions, illustrating some interesting Swift coding techniques.

Let’s start by drafting a protocol for the API we want to implement. Ideally we’d want to create concrete types that conform to this protocol:

public protocol SortedSet: BidirectionalCollection, SetAlgebra {
associatedtype Element: Comparable
}

A sorted set is all about putting elements in a certain order, so it seems reasonable for it to implement BidirectionalCollection, allowing traversal in both front-to-back and back-to-front directions.

SetAlgebra includes all the usual set operations like union(_:), isSuperset(of:), insert(_:), and remove(_:), along with initializers for creating empty sets and sets with particular contents. If our sorted set was intended to be a production-ready implementation, we’d definitely want to implement it. However, to keep this book at a manageable length, we’ll only require a small subset of the full SetAlgebra protocol — just the two methods contains and insert, plus the parameterless initializer for creating an empty set:

public protocol SortedSet: BidirectionalCollection, CustomStringConvertible, CustomPlaygroundQuickLookable {
associatedtype Element: Comparable
 
init()
func contains(_ element: Element) -> Bool
mutating func insert(_ newElement: Element) -> (inserted: Bool, memberAfterInsert: Element)
}

In exchange for removing full SetAlgebra conformance, we added CustomStringConvertible and CustomPlaygroundQuickLookable; this is convenient when we want to display the contents of sorted sets in sample code and in playgrounds.

Note that BidirectionalCollection has about 30 member requirements (things like startIndex, index(after:), map, and lazy), most of which have default implementations. In this book, we’ll concentrate on the absolute minimum set of members — startIndex, endIndex, subscript, index(after:), index(before:), formIndex(after:), formIndex(before:), and count. For the most part, we’ll implement only these, leaving default implementations for everything else, even though we could often write specializations that would work much faster. We’ll make one exception, though: forEach is a nice companion to contains, so we’ll always specialize it.

1.3 Semantic Requirements

Implementing a Swift protocol usually means more than just conforming to its explicit requirements — most protocols come with an additional set of semantic requirements that aren’t expressible in the type system. These requirements need to be documented separately. Our SortedSet protocol is no different; here are six properties we expect all implementations to satisfy:

  1. Consistent element type: Iterator.Element and Element must always be the same type. It doesn’t make much sense to insert elements of one type only to get some other type back from the collection methods. (As of Swift 3, we can’t yet specify that these two types must be the same. We could’ve easily used Iterator.Element instead of Element; I chose to introduce a new associated type only to make the function signatures above a little bit shorter.)

  2. Ordering: Elements inside the collection must be kept sorted. To be more specific: if i and j are both valid and subscriptable indices in some set implementing SortedSet, then i < j must be equivalent to set[i] < set[j]. (This also implies that our sets won’t have duplicate elements.)

  3. Value semantics: Mutating an instance of a SortedSet type via one variable must not affect the value of any other variable of the same type. Conforming types must behave as if each variable held its own unique value, entirely independent of all other variables.

  4. Copy-on-write: Copying a SortedSet value into a new variable must be an \(O(1)\) operation. Storage may be partially or fully shared between different SortedSet values. Mutations must check for shared storage and create copies when necessary to satisfy value semantics. Therefore, mutations may take longer to complete when storage is shared.

  5. Index specificity: Indices are associated with a particular SortedSet instance; they’re only guaranteed to be valid for that specific instance and its unmutated direct copies. Even if a and b are SortedSets of the same type containing the exact same elements, indices of a may not be valid in b. (This unfortunate relaxation of true value semantics seems to be technically unavoidable in general.)

  6. Index invalidation: Any mutation of a SortedSet may invalidate all existing indices of it, including its startIndex and endIndex. Implementations aren’t required to always invalidate every index, but they’re allowed to do so. (This point isn’t really a requirement, because it’s impossible to violate it. It’s merely a reminder that unless we know better, collection indices are fragile and should be handled with care.)

Note that the compiler won’t complain if we forget to implement any of these requirements. But it’s still crucial that we implement them so that generic code working with sorted sets has consistent behavior.

If we were writing a real-life, production-ready implementation of sorted sets, we wouldn’t need the SortedSet protocol at all: we’d simply define a single type that implements all requirements directly. However, we’ll be writing several variants of sorted sets, so it’s nice to have a protocol that spells out our requirements and on which we can define extensions that are common to all such types.

Before we even have a concrete implementation of SortedSet, let’s get right into defining one such extension!

1.4 Printing Sorted Sets

It’s useful to provide a default implementation for description so that we need not spend time on it later. Because all sorted sets are collections, we can use standard collection methods to print sorted sets the same way as the standard library’s array or set values — as a comma-separated list of elements enclosed in brackets:

extension SortedSet {
public var description: String {
let contents = self.lazy.map { "\($0)" }.joined(separator: ", ")
return "[\(contents)]"
}
}

It’s also worth creating a default implementation of customPlaygroundQuickLook so that our collections print a little better in playgrounds. The default Quick Look view can be difficult to understand, especially at a glance, so let’s replace it with a simple variant that just sets the description in a monospaced font using an attributed string:

#if os(iOS)
import UIKit
 
extension PlaygroundQuickLook {
public static func monospacedText(_ string: String) -> PlaygroundQuickLook {
let text = NSMutableAttributedString(string: string)
let range = NSRange(location: 0, length: text.length)
let style = NSParagraphStyle.default.mutableCopy() as! NSMutableParagraphStyle
style.lineSpacing = 0
style.alignment = .left
style.maximumLineHeight = 17
text.addAttribute(NSFontAttributeName, value: UIFont(name: "Menlo", size: 13)!, range: range)
text.addAttribute(NSParagraphStyleAttributeName, value: style, range: range)
return PlaygroundQuickLook.attributedString(text)
}
}
#endif
 
extension SortedSet {
public var customPlaygroundQuickLook: PlaygroundQuickLook {
#if os(iOS)
return .monospacedText(String(describing: self))
#else
return .text(String(describing: self))
#endif
}
}

2 Sorted Arrays

Possibly the most straightforward way to implement SortedSet is by storing the set’s elements in an array. This leads to the definition of a simple structure like the one below:

public struct SortedArray<Element: Comparable>: SortedSet {
fileprivate var storage: [Element] = []
 
public init() {}
}

To help satisfy the protocol requirements, we’ll always keep the storage array sorted, hence the name SortedArray.

To implement insert and contains, we’ll need a method that, given an element, returns the position in the storage array that should hold the element.

To do this fast, we need to implement the binary search algorithm. This algorithm works by cutting the array into two equal halves, discarding the one that doesn’t contain the element we’re looking for, and repeating this process until the array is reduced to a single element. Here’s one way to do this in Swift:

extension SortedArray {
func index(for element: Element) -> Int {
var start = 0
var end = storage.count
while start < end {
let middle = start + (end - start) / 2
if element > storage[middle] {
start = middle + 1
}
else {
end = middle
}
}
return start
}
}

Note that the loop above needs to perform just one more iteration whenever we double the number of elements in the set. That’s rather cheap! This is what people mean when they say binary search has logarithmic complexity: its running time is roughly proportional to the logarithm of the size of the data. (The big-O notation for such a function is \(O(\log n)\).)

Binary search is deceptively short, but it’s a rather delicate algorithm that can be tricky to implement correctly. It includes a lot of subtle index arithmetic, and there are ample opportunities for off-by-one errors, overflow problems, or other mistakes. For example, the expression start + (end - start) / 2 seems like a strange way to calculate the middle index; we’d normally write (start + end) / 2 instead. However, these two expressions don’t always have the same results: the second version contains an addition that may overflow if the collection is huge, resulting in a runtime error.

Hopefully a generic binary search method will get added to the Swift standard library at some point. Meanwhile, if you ever need to implement it, please find a good algorithms book to use as a reference. (Though I guess this book will do in a pinch too.) Don’t forget to test your code; sometimes even books have bugs! I find that 100% unit test coverage works great for catching most of my own errors.

Our index(for:) function does something similar to Collection’s standard index(of:) method, except our version always returns a valid index, even if the element isn’t currently in the set. This subtle but very important difference will make index(for:) usable for insertion too.

2.2 Lookup Methods

Having mentioned index(of:), it’s a good idea to define it in terms of index(for:) so that it uses the better algorithm too:

extension SortedArray {
public func index(of element: Element) -> Int? {
let index = self.index(for: element)
guard index < count, storage[index] == element else { return nil }
return index
}
}

The default implementation in Collection executes a linear search by just enumerating all elements until it finds the one we’re looking for, or until it reaches the end of the collection. This specialized variant can be much, much faster than that.

Checking for membership requires a bit less code because we only need to see if the element is there:

extension SortedArray {
public func contains(_ element: Element) -> Bool {
let index = self.index(for: element)
return index < count && storage[index] == element
}
}

Implementing forEach is even easier because we can simply forward the call to our storage array. The array is already sorted, so the method will visit elements in the correct order:

extension SortedArray {
public func forEach(_ body: (Element) throws -> Void) rethrows {
try storage.forEach(body)
}
}

While we’re here, it’s a good idea to look for other Sequence and Collection members that would benefit from a specialized implementation. For example, sequences with Comparable elements have a sorted() method that returns a sorted array of all elements in the sequence. For SortedArray, this can be implemented by simply returning storage:

extension SortedArray {
public func sorted() -> [Element] {
return storage
}
}

2.3 Insertion

To insert a new element into a sorted set, we first find its corresponding index using index(for:) and then check if the element is already there. To maintain the invariant that a SortedSet may not contain duplicates, we only insert the element into the storage if it’s not present:

extension SortedArray {
@discardableResult
public mutating func insert(_ newElement: Element) -> (inserted: Bool, memberAfterInsert: Element)
{
let index = self.index(for: newElement)
if index < count && storage[index] == newElement {
return (false, storage[index])
}
storage.insert(newElement, at: index)
return (true, newElement)
}
}

2.4 Implementing Collection

Next up, let’s implement BidirectionalCollection. Since we’re storing everything in a single Array, the easiest way to do this is to simply share indices between SortedArray and its storage. By doing this, we’ll be able to forward most collection methods to the storage array, which drastically simplifies our implementation.

Array implements more than BidirectionalCollection: it is in fact a RandomAccessCollection, which has the same API surface but much stricter semantic requirements. RandomAccessCollection requires efficient index arithmetic: we must be able to both offset an index by any amount and measure the distance between any two indices in constant time.

Since we’re going to forward everything to storage anyway, it makes sense for SortedArray to implement the same protocol:

extension SortedArray: RandomAccessCollection {
public typealias Indices = CountableRange<Int>
 
public var startIndex: Int { return storage.startIndex }
public var endIndex: Int { return storage.endIndex }
 
public subscript(index: Int) -> Element { return storage[index] }
}

This completes the implementation of the SortedSet protocol. Yay!

2.5 Examples

Let’s check if it all works:

var set = SortedArray<Int>()
for i in (0 ..< 22).shuffled() {
set.insert(2 * i)
}
set
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42]
 
set.contains(42)
true
 
set.contains(13)
false

It seems to work just fine. But does our new collection have value semantics?

let copy = set
set.insert(13)
 
set.contains(13)
true
 
copy.contains(13)
false

It sure looks like it! We didn’t have to do anything to implement value semantics; we got it by the mere fact that SortedArray is a struct consisting of a single array. Value semantics is a composable property: structs with stored properties that all have value semantics automatically behave the same way too.

2.6 Performance

When we talk about the performance of an algorithm, we often use the so-called big-O notation to describe how changing the input size affects the running time: \(O(1)\), \(O(n)\), \(O(n^2)\), \(O(\log n)\), \(O(n\log n)\), and so on. This notation has a precise mathematical definition, but you don’t really have to look it up — it’s enough to understand that we use this notation as shorthand for classifying the growth rate of our algorithms. When we double the size of its input, an \(O(n)\) algorithm will take no more than twice the time, but an \(O(n^2)\) algorithm might become as much as four times slower. We expect that an \(O(1)\) algorithm will run for roughly the same time no matter what input it gets.

We can mathematically analyze our algorithms to formally derive such asymptotic complexity estimates. This analysis provides useful indicators of performance, but it isn’t infallible; by its very nature, it relies on simplifications that may or may not match actual behavior for real-world working sets running on actual hardware.

To get an idea of the actual performance of our SortedSet implementations, it’s therefore useful to run some benchmarks. For example, here’s the code for one possible microbenchmark that measures four basic operations on a SortedArray: insert, contains, forEach, and iteration using a for statement:

func benchmark(count: Int, measure: (String, () -> Void) -> Void) {
var set = SortedArray<Int>()
let input = (0 ..< count).shuffled()
measure("SortedArray.insert") {
for value in input {
set.insert(value)
}
}
 
let lookups = (0 ..< count).shuffled()
measure("SortedArray.contains") {
for element in lookups {
guard set.contains(element) else { fatalError() }
}
}
 
measure("SortedArray.forEach") {
var i = 0
set.forEach { element in
guard element == i else { fatalError() }
i += 1
}
guard i == input.count else { fatalError() }
}
 
measure("SortedArray.for-in") {
var i = 0
for element in set {
guard element == i else { fatalError() }
i += 1
}
guard i == input.count else { fatalError() }
}
}

The measure parameter is some function that measures the execution time of the closure that’s given to it and files it under the name given as its first parameter. One simple way to drive this benchmark function is to call it in a loop of various sizes, printing measurements as we get them:

for size in (0 ..< 20).map({ 1 << $0 }) {
benchmark(size: size) { name, body in
let start = Date()
body()
let end = Date()
print("\(name), \(size), \(end.timeIntervalSince(start))")
}
}

This is a simplified version of the actual Attabench benchmarks I ran to draw the plots below. The real code has a lot more benchmarking boilerplate, but the actual measurements (the code inside the measure closures) are exactly the same.

Plotting our benchmark results gets us the chart in figure 2.1. Note that in this chart, we’re using logarithmic scales on both axes: moving one notch to the right doubles the number of input values, while moving up by one horizontal line indicates a tenfold increase in execution time.

Figure 2.1: Benchmark results for SortedArray operations, plotting input size vs. overall execution time on a log-log chart.
Figure 2.1: Benchmark results for SortedArray operations, plotting input size vs. overall execution time on a log-log chart.

Such log-log plots are usually the best way to display benchmarking results. They fit a huge range of data on a single chart without letting large values overwhelm small ones — in this case, we can easily compare execution times from a single element up to 4 million of them: a difference of 22 binary orders of magnitude!

Additionally, log-log plots make it easy to estimate the actual complexity costs an algorithm exhibits. If a section of a benchmark plot is a straight line segment, then the relationship between input size and execution time can be approximated by a multiple of a simple polynomial function, such as \(n\), \(n^2\), or even \(\sqrt n\). The exponent is related to the slope of the line: the slope of \(n^2\) is double that of \(n\). With some practice, you’ll be able to recognize the most frequently occurring relationships at a glance — there’s no need for complicated analysis.

In our case, we know that simply iterating over all the elements inside an array should take \(O(n)\) time, and this is confirmed by the plots. Array.forEach and the for-in loop cost pretty much the same, and after an initial warmup period, they both become straight lines. It takes a little more than three steps to the right for the line to go a single step up, which corresponds to \(2^{3.3} \approx 10\), proving a simple linear relationship.

Looking at the plot for SortedArray.insert, we find it gradually turns into a straight line at about 4,000 elements, and the slope of the line is roughly double that of SortedArray.forEach — so it looks like the execution time of insertion is a quadratic function of the input size. Luckily, this matches our expectations: each time we insert a random element into a sorted array, we have to make space for it by moving (on average) half of the existing elements one slot to the right. So a single insertion is a linear operation, which makes \(n\) insertions cost \(O(n^2)\).

SortedArray.contains does \(n\) binary searches, each taking \(O(\log n)\) time, so it’s supposed to be an \(O(n\log n)\) function. It’s hard to see this in in figure 2.1, but you can verify it if you look close enough: its curve is almost parallel to that of forEach, except it subtly drifts upward — it’s not a perfectly straight line. One easy way to verify this is by putting the edge of a piece of paper beside the contains plot: it curves away from the paper’s straight edge, indicating a superlinear relationship.

To highlight the difference between \(O(n)\) and \(O(n\log n)\) functions, it’s useful to divide execution times by the input size, resulting in a chart that displays the average execution time spent on a single element. (I like to call this type of plot an amortized chart. I’m not sure the word amortized fits this context, but it does sound impressive!) The division eliminates the consistent slope of \(O(n)\), making it easier to distinguish linear and logarithmic factors. Figure 2.2 shows such a chart for SortedArray. Note how contains now has a distinct (if slight) upward slope, while the tail of forEach is perfectly flat.

Figure 2.2: Benchmark results for SortedArray operations, plotting input size vs. average execution time for a single operation on a log-log chart.
Figure 2.2: Benchmark results for SortedArray operations, plotting input size vs. average execution time for a single operation on a log-log chart.

The curve for contains has a couple of surprises. First, it has clear spikes at powers-of-two sizes. This is due to an interesting interaction between binary search and the architecture of the level 2 (L2) cache in the MacBook running the benchmarks. The cache is divided into 64-byte lines, each of which may hold the contents of main memory from a specific set of physical addresses. By an unfortunate coincidence, if the storage size is close to a perfect power of two, successive lookup operations of the binary search algorithm tend to all fall into the same L2 cache line, quickly exhausting its capacity, while other lines remain unused. This effect is called cache line aliasing, and it can evidently lead to a quite dramatic slowdown: contains takes about twice as long to execute at the top of the spikes than at nearby sizes.

One way to eliminate these spikes is to switch to ternary search, dividing the buffer into three equal slices on each iteration. An even simpler solution is to perturb binary search, selecting a slightly off-center middle index. To do this, we just need to modify a single line in the implementation of index(for:), adding a small extra offset to the middle index:

let middle = start + (end - start) / 2 + (end - start) >> 6

This way, the middle index will fall \(33/64\) the way between the two endpoints, which is enough to prevent cache line aliasing. Unfortunately, the code is now a tiny bit more complicated, and these off-center middle indices generally result in slightly more storage lookups than regular binary search. So the price of eliminating the power-of-two spikes is a small overall slowdown, as demonstrated: by the chart in figure 2.3.

Figure 2.3: Comparing the performance of binary search (contains) to a variant that prevents cache line aliasing by selecting slightly off-center indices for the middle value (contains2).
Figure 2.3: Comparing the performance of binary search (contains) to a variant that prevents cache line aliasing by selecting slightly off-center indices for the middle value (contains2).

The other surprise of the contains curve is that it seems to turn slightly upward at about 64k elements or so. (If you look closely, you may actually detect a similar, although less pronounced, slowdown for insert, starting at about a million elements.) At this size, my MacBook’s virtual memory subsystem was unable to keep the physical address of all pages of the storage array in the CPU’s address cache (called the Translation Lookaside Buffer, or TLB for short). The contains benchmark randomizes lookups, so its irregular access patterns lead to frequent TLB cache misses, considerably increasing the cost of memory accesses. Additionally, as the storage array gets larger, its sheer size overwhelms L1 and L2 caches, and those cache misses contribute a great deal of additional latency.

At the end of the day, it looks like random memory accesses in a large enough contiguous buffer take \(O(\log n)\)-ish time, not \(O(1)\) — so the asymptotic execution time of our binary search is in fact more like \(O(\log n\log n)\), and not \(O(\log n)\) as we usually believe. Isn’t that interesting? (The slowdown disappears if we remove the shuffling of the lookups array in the benchmark code above. Try it!)

On the other hand, for small sizes, the contains curve keeps remarkably close to insert. Some of this is just a side effect of the logarithmic scale: at their closest point, contains is still 80% faster than insertion. But the insert curve looks surprisingly flat up to about 1,000 elements; it seems that as long as the sorted array is small enough, the average time it takes to insert a new element into it is basically independent of the size of the array. (I believe this is mostly a consequence of the fact that at these sizes, the entire array fits into the L1 cache of the CPU.)

So SortedArray.insert seems to be astonishingly fast as long as the array is small. For now, we can file this factoid away as mildly interesting trivia. Keep it in mind though, because this fact will have serious consequences in later parts of this book.

3 The Swiftification of NSOrderedSet

The Foundation framework includes a class called NSOrderedSet. It’s a relatively recent addition, first appearing in 2011 with iOS 5 and OS X 10.7 Lion. NSOrderedSet was added to Foundation specifically in support of ordered relationships in Core Data. It works like a combination of NSArray and NSSet, implementing the API of both classes. It also provides both NSSet’s super fast \(O(1)\) membership checks and NSArray’s speedy \(O(1)\) random access indexing. The tradeoff is that it also inherits NSArray’s \(O(n)\) insertions. Because NSOrderedSet is (most likely) implemented by wrapping an NSSet and an NSArray, it also has a higher memory overhead than either of these components.

NSOrderedSet hasn’t been bridged to Swift yet, so it seems like a nice subject for demonstrating how we can define thin wrappers around existing Objective-C classes to bring them closer to the Swift world.

Despite its promising name, NSOrderedSet isn’t really a good match for our use case. Although NSOrderedSet does keep its elements ordered, it doesn’t enforce any particular ordering relation — you can insert elements in whichever order you like and NSOrderedSet will remember it for you, just like an array. Not having a predefined ordering is the difference between “ordered” and “sorted,” and this is why NSOrderedSet is not called NSSortedSet. Its primary purpose is to make lookup operations fast, but it achieves this using hashing rather than comparisons. (There’s no equivalent to the Comparable protocol in Foundation; NSObject only provides the functionality of Equatable and Hashable.)

But if NSOrderedSet supports whichever ordering we may choose to use, then it also supports keeping elements sorted according to their Comparable implementation. Clearly, this won’t be the ideal use case for NSOrderedSet, but we should be able to make it work. So let’s import Foundation and start working on hammering NSOrderedSet into a SortedSet:

import Foundation

Right off the bat, we run into a couple of major problems.

First, NSOrderedSet is a class, so its instances are reference types. We want our sorted sets to have value semantics.

Second, NSOrderedSet is a heterogeneous sequence — it takes Any members. We could still implement SortedSet by setting its Element type to Any, rather than leaving it as a generic type parameter, but that wouldn’t feel like a real solution. What we really want is a generic homogenenous collection type, with the element type provided by a type parameter.

So we can’t just extend NSOrderedSet to implement our protocol. Instead, we’re going to define a generic wrapper struct that internally uses an NSOrderedSet instance as storage. This approach is similar to what the Swift standard library does in order to bridge NSArray, NSSet, and NSDictionary instances to Swift’s Array, Set, and Dictionary values. So we seem to be thinking on the right track.

What should we call our struct? It’s tempting to call it NSSortedSet, and it would technically be possible to do so — Swift-only constructs don’t depend on prefixes to work around (present or future) naming conflicts. On the other hand, for developers, NS still implies Apple-provided, so it’d be impolite and confusing to use it. So let’s just go the other way and name our struct OrderedSet. (This name isn’t quite right either, but at least it does resemble the name of the underlying data structure.)

public struct OrderedSet<Element: Comparable>: SortedSet {
fileprivate var storage = NSMutableOrderedSet()
}

We want to be able to mutate our storage, so we need to declare it as an instance of NSMutableOrderedSet, which is NSOrderedSet’s mutable subclass.

3.1 Looking Up Elements

We now have the empty shell of our data structure. Let’s begin to fill it with content, starting with the two lookup methods, forEach and contains.

NSOrderedSet implements Sequence, so it already has a forEach method. Assuming elements are kept in the correct order, we can simply forward forEach calls to storage. However, we need to manually downcast values provided by NSOrderedSet to the correct type:

extension OrderedSet {
public func forEach(_ body: (Element) -> Void) {
storage.forEach { body($0 as! Element) }
}
}

OrderedSet is in full control of its own storage, so it can guarantee the storage will never contain anything other than an Element. This ensures the forced downcast will always succeed. But it sure is ugly!

NSOrderedSet also happens to provide an implementation for contains, and it seems perfect for our use case. In fact, it’s simpler to use than forEach because there’s no need for explicit casting:

extension OrderedSet {
public func contains(_ element: Element) -> Bool {
return storage.contains(element) // BUG!
}
}

The code above compiles with no warnings, and it appears to work great when Element is Int or String. However, as we already mentioned, NSOrderedSet uses NSObject’s hashing API to speed up element lookups. But we don’t require Element to be Hashable! How could this work at all?

When we supply a Swift value type to a method taking an Objective-C object, like storage.contains above, the compiler needs to box the value in a private NSObject subclass that it generates for this purpose. Remember that NSObject has built-in hashing APIs; you can’t have an NSObject instance that doesn’t support hash. So these autogenerated bridging classes must always have an implementation for hash that’s consistent with isEqual(:).

If Element happens to be Hashable, then Swift is able to use the type’s own == and hashValue implementations in the bridging class, so Element values get hashed the same way in Objective-C as in Swift, and everything works out perfectly.

However, if Element doesn’t implement hashValue, then the bridging class has no choice but to use the default NSObject implementations of both hash and isEqual(_:). Because there’s no other information available, these are based on the identity (i.e. the physical address) of the instances, which is effectively random for boxed value types. So two different bridged instances holding the exact same value will never be considered isEqual(_:) (or return the same hash).

The upshot of all this is that the above code for contains compiles just fine, but it has a fatal bug: if Element isn’t Hashable, then it always returns false. Oops!

Oh dear. Lesson of the day: be very, very careful when using Objective-C APIs from Swift. Automatic bridging of Swift values to NSObject instances is extremely convenient, but it has subtle pitfalls. There’s nothing in the code that explicitly warns about this problem: no exclamation marks, no explicit cast, nothing.

Well, now we know we can’t rely on NSOrderedSet’s own lookup methods in our case. Instead we have to look for some alternative APIs to find elements. Thankfully, NSOrderedSet also includes a method that was specifically designed for looking up elements when we know they’re sorted according to some comparator function:

class NSOrderedSet: NSObject { // in Foundation
...
func index(of object: Any, inSortedRange range: NSRange, options: NSBinarySearchingOptions = [], usingComparator: (Any, Any) -> ComparisonResult) -> Int
...
}

Presumably this implements some form of binary search, so it should be fast enough. Our elements are sorted according to their Comparable implementation, so we can use Swift’s < and > operators to define a suitable comparator function:

extension OrderedSet {
fileprivate static func compare(_ a: Any, _ b: Any) -> ComparisonResult
{
let a = a as! Element, b = b as! Element
if a < b { return .orderedAscending }
if a > b { return .orderedDescending }
return .orderedSame
}
}

We can use this comparator to define a method for getting the index for a particular element. This happens to be what Collection’s index(of:) method is supposed to do, so let’s make sure our definition refines the default implementation of it:

extension OrderedSet {
public func index(of element: Element) -> Int? {
let index = storage.index(
of: element,
inSortedRange: NSRange(0 ..< storage.count),
usingComparator: OrderedSet.compare)
return index == NSNotFound ? nil : index
}
}

Once we have this function, contains reduces to a tiny transformation of its result:

extension OrderedSet {
public func contains(_ element: Element) -> Bool {
return index(of: element) != nil
}
}

I don’t know about you, but I found this a lot more complicated than I originally expected. The fine details of how values are bridged into Objective-C sometimes have far-reaching consequences that may break our code in subtle but fatal ways. If we aren’t aware of these wrinkles, we may be unpleasantly surprised.

NSOrderedSet’s flagship feature is that its contains implementation is super fast — so it’s rather sad that we can’t use it. But there’s still hope! Consider that while NSOrderedSet.contains may mistakenly report false for certain types, it never returns true if the value isn’t in fact in the set. Therefore, we can write a variant of OrderedSet.contains that still calls it as a shortcut, possibly eliminating the need for binary search:

extension OrderedSet {
public func contains2(_ element: Element) -> Bool {
return storage.contains(element) || index(of: element) != nil
}
}

For Hashable elements, this variant will return true faster than index(of:). However, it’s slightly slower for values that aren’t members of the set and for types that aren’t hashable.

3.2 Implementing Collection

NSOrderedSet only conforms to Sequence, not to Collection. (This isn’t some unique quirk; its better-known friends NSArray and NSSet do the same.) Nevertheless, NSOrderedSet does provide some integer-based indexing methods we can use to implement RandomAccessCollection in our OrderedSet type:

extension OrderedSet: RandomAccessCollection {
public typealias Index = Int
public typealias Indices = CountableRange<Int>
 
public var startIndex: Int { return 0 }
public var endIndex: Int { return storage.count }
public subscript(i: Int) -> Element { return storage[i] as! Element }
}

This turned out to be refreshingly simple.

3.3 Guaranteeing Value Semantics

3.4 Insertion

3.5 Let’s See If This Thing Works

3.6 Performance

4 Red-Black Trees

Self-balancing binary search trees provide efficient algorithms for implementing sorted collections. In particular, these data structures can be used to implement sorted sets so that the insertion of any element only takes logarithmic time. This is a really attractive feature — remember, our SortedArray implemented insertions in linear time.

The expression “self-balancing binary search tree” looks kind of technical. Each word has a specific meaning, so I’ll quickly explain them.

A tree is a data structure with data stored inside nodes, arranged in a branching, tree-like structure. Each tree has a single node at the top called the root node. (The root is put on the top because computer scientists have been drawing trees upside down for decades. This is because it’s more practical to draw tree diagrams this way, and not because they don’t know what an actual tree looks like. Or at least I hope so, anyway.) If a node has no children, then it’s called a leaf node; otherwise it’s an internal node. A tree usually has many leaf nodes.

In general, internal nodes may have any number of child nodes, but in binary trees, nodes may only have two children, called left and right. Some nodes may have two children, while some may have only a left child, only a right child, or no child at all.

Figure 4.1: A binary search tree. Node 6 is the root node; nodes 6, 3, and 8 are internal nodes, while nodes 2, 4, and 9 are leaves.
Figure 4.1: A binary search tree. Node 6 is the root node; nodes 6, 3, and 8 are internal nodes, while nodes 2, 4, and 9 are leaves.

In search trees, the values inside nodes are comparable in some way, and nodes are arranged such that for every node in the tree, all values inside the left subtree are less than the value of the node, and all values to the right are greater than it. This makes it easy to find any particular element.

By self-balancing, we mean that the data structure has some sort of mechanism that guarantees the tree remains nice and bushy with a short height, no matter which values it contains and in what order the values are inserted. If the tree was allowed to grow lopsided, even simple operations could become inefficient. (As an extreme example, a tree where each node has at most one child works like a linked list — not very efficient at all.)

There are many ways to create self-balancing binary trees; in this section, we’re going to implement the variant called red-black trees. This particular flavor needs to store a single extra bit of information in every node to implement the self-balancing part. That extra bit is the node’s color, which can be either red or black.

Figure 4.2: An example red-black tree.
Figure 4.2: An example red-black tree.

A red-black tree always keeps its nodes arranged and colored so that the following properties are all true:

  1. The root node is black.
  2. Red nodes only have black children. (If they have any, that is.)
  3. Every path from the root node to an empty spot in the tree contains the same number of black nodes.

Empty spots are places where new nodes can be added to the tree, i.e. where a node has no left or right child. To grow a tree one node larger, we just need to replace one of its empty spots with a new node.

The first property is there to make our algorithms a little bit simpler; it doesn’t really affect the tree’s shape at all. The last two properties ensure that the tree remains nicely dense, where no empty spot in the tree is more than twice as far from the root node as any other.

To fully understand these balancing properties, it’s useful to play with them a little, exploring their edge cases. For example, it’s possible to construct a red-black tree that consists solely of black nodes; the tree in figure 4.3 is one example.

Figure 4.3: An example red-black tree where every node is black.
Figure 4.3: An example red-black tree where every node is black.

If we try to construct more examples, we’ll soon realize that the third red-black property restricts such trees into a particular shape where all internal nodes have two children and all leaf nodes are on the same level. Trees with this shape are called perfect trees because they’re so perfectly balanced and symmetrical. This is the ideal shape all balanced trees aspire to, because it puts every node as close to the root as possible.

Unfortunately, it’d be unreasonable to require balancing algorithms to maintain perfect trees: indeed, it’s only possible to construct perfect binary trees with certain node counts. There’s no perfect binary tree with four nodes, for example.

To make red-black trees practical, the third red-black property therefore uses a weaker definition of balancing where red nodes aren’t counted. However, to make sure things don’t get too unruly, the second property limits the number of red nodes to something reasonable: it ensures that on any particular path in the tree, the number of red nodes will never exceed the number of black ones.

4.1 Algebraic Data Types

4.2 Pattern Matching and Recursion

4.3 Tree Diagrams

4.4 Insertion

4.5 Balancing

4.6 Collection

4.7 Performance

5 The Copy-On-Write Optimization

RedBlackTree creates a brand-new tree every time we insert a new element. The new tree shares some of its nodes with the original, but nodes on the entire path from the root to the new node are replaced with newly allocated nodes. This is the “easy” way to implement value semantics, but it’s a bit wasteful.

When the nodes of a tree value aren’t shared between other values, it’s OK to mutate them directly — nobody will mind, because nobody else knows about that particular tree instance. Direct mutation will eliminate most of the copying and allocation, often resulting in a dramatic speedup.

We’ve already seen how Swift supports implementing optimized copy-on-write value semantics for reference types by providing the isKnownUniquelyReferenced function. Unfortunately, Swift 3 doesn’t give us tools to implement COW for algebraic data types. We don’t have access to the private reference type Swift uses to box up nodes, so we have no way to determine if a particular node is uniquely referenced. (The compiler itself doesn’t have the smarts (yet?) to do the COW optimization on its own either.) There’s also no easy way to directly access a value stored inside an enum case without extracting a separate copy of it. (Note that Optional does provide direct access to the value stored in it via its forced unwrapping ! operator. However, the tools for implementing similar in-place access for our own enum types aren’t available for use outside the standard library.)

So to implement COW, we need to give up on our beloved algebraic data types for now and rewrite everything into a more conventional (dare I say tedious?) imperative style, using boring old structs and classes, with a sprinkle of optionals.

5.1 Basic Definitions

First, we need to define a public struct that’ll represent sorted set values. The RedBlackTree2 type below is just a small wrapper around a reference to a tree node that serves as its storage. OrderedSet worked the same way, so this pattern is already quite familiar to us by now:

public struct RedBlackTree2<Element: Comparable>: SortedSet {
fileprivate var root: Node? = nil
 
public init() {}
}

Next, let’s take a look at the definition of our tree nodes:

extension RedBlackTree2 {
class Node {
var color: Color
var value: Element
var left: Node? = nil
var right: Node? = nil
var mutationCount: Int64 = 0
 
init(_ color: Color, _ value: Element, _ left: Node?, _ right: Node?) {
self.color = color
self.value = value
self.left = left
self.right = right
}
}
}

Beyond the fields we had in the original enum case for RedBlackTree.node, this class also includes a new property called mutationCount. Its value is a count of how many times the node was mutated since its creation. This will be useful in our Collection implementation, where we’ll use it to create a new design for indices. (We explicitly use a 64-bit integer so that it’s unlikely the counter will ever overflow, even on 32-bit systems. Storing an 8-byte counter in every node isn’t really necessary, as we’ll only use the value stored inside root nodes. Let’s ignore this fact for now; it’s simpler to do it this way, and we’ll find a way to make this less wasteful in the next chapter.)

But let’s not jump ahead!

Using a separate type to represent nodes vs. trees means we can make the node type a private implementation detail to the public RedBlackTree2. External users of our collection won’t be able to mess around with it, which is a nice bonus — everybody was able to see the internals of RedBlackTree, and anyone could use Swift’s enum literal syntax to create any tree they wanted, which would easily break our red-black tree invariants.

The Node class is an implementation detail of the RedBlackTree2 struct; nesting the former in the latter expresses this relationship neatly. This also prevents naming conflicts with other types named Node in the same module. This also simplifies the syntax a little bit: Node automatically inherits RedBlackTree2’s Element type parameter, so we don’t have to explicitly specify it.

Again, tradition dictates that the 1-bit color property should be packed into an unused bit in the binary representation of one of Node’s reference properties; but it’d be unsafe, fragile, and fiddly to do that in Swift. It’s better to simply keep color as a standalone stored property and to let the compiler set up storage for it.

Note that we essentially transformed the two-case RedBlackTree enum into optional references to the Node type. The .empty case is represented by a nil optional, while a non-nil value represents a .node. Node is a heap-allocated reference type, so we’ve made the previous solution’s implicit boxing an explicit feature, giving us direct access to the heap reference and enabling the use of isKnownUniquelyReferenced.

5.2 Rewriting Simple Lookup Methods

5.3 Tree Diagrams

5.4 Implementing Copy-On-Write

5.5 Insertion

5.6 Implementing Collection

5.7 Examples

5.8 Performance Benchmarks

6 B-Trees

The raw performance of a small sorted array seems impossible to beat. So let’s not even try; instead, let’s build a sorted set entirely out of small sorted arrays!

It’s easy enough to get started: we just insert elements into a single array until we reach some predefined maximum size. Let’s say we want to keep our arrays below five elements and we’ve already inserted four values:

 

If we now need to insert the value 10, we get an array that’s too large:

 

We can’t keep it like this, so we have to do something. One option is to cut the array into two halves, extracting the middle element to serve as a sort of separator value between them:

 

Now we have a cute little tree structure with leaf nodes containing small sorted arrays. This seems like a promising approach: it combines sorted arrays and search trees into a single unified data structure, hopefully giving us array-like super-fast insertions at small sizes, all while retaining tree-like logarithmic performance for large sets. Synergy!

Let’s see what happens when we try adding more elements. We can keep inserting new values in the two leaf arrays until one of them gets too large again:

 

When this happens, we need to do another split, giving us three arrays and two separators:

 

What should we do with those two separator values? We store all other elements in sorted arrays, so it seems like a reasonable idea to put these separators in a sorted array of their own:

 

So apparently each node in our new array-tree hybrid will hold a small sorted array. Nice and consistent; I love it so far.

Now let’s say we want to insert the value 20 next. This goes in the rightmost array, which already has four elements — so we need to do another split, extracting 16 from the middle as a new separator. No biggie, we can just insert it into the array at the top:

 

Lovely! We can keep going like this; inserting 25, 26, and 27 overflows the rightmost array again, extracting the new separator, 25:

 

However, the top array is now full, which is worrying. Let’s say we insert 18, 21, and 22 next, so we end up with the situation below:

 

What next? We can’t leave the separator array bloated like this. Previously, we handled oversized arrays by splitting them — why not try doing the same now?

 

Oh, how neat: splitting the second-level array just adds a third level to our tree. This gives us a way to keep inserting new elements indefinitely; when the third-level array fills up, we can just add a fourth level, and so on, ad infinitum:

 

We’ve just invented a brand-new data structure! This is so exciting.

Unfortunately, it turns out that Rudolf Bayer and Ed McCreight already had the same idea way back in 1971, and they named their invention B-tree. What a bummer! I wrote an entire book to introduce a thing that’s almost half a century old. How embarrassing.

Fun fact: red-black trees were actually derived from a special case of B-trees in 1978. These are all ancient data structures.

6.1 B-Tree Properties

As we’ve seen, B-trees are search trees like red-black trees, but their layout is different. In a B-tree, nodes may have hundreds (or even thousands) of children, not just two. The number of children isn’t entirely unrestricted, though: this number must fit inside a certain range.

The maximum number of children a node may have is determined when the tree is created — this number is called the order of a B-tree. (The order of a B-tree has nothing to do with the ordering of its elements.) Note that the maximum number of elements stored in a node is one less than the tree’s order. This can be a source of off-by-one indexing errors, so it’s important to keep in mind as we work on B-tree internals.

In the previous section, we built an example B-tree of order 5. In practice, the order is typically between 100 and 2,000, so 5 is unusually low. However, nodes with 1,000 children wouldn’t easily fit on the page; it’s easier to understand B-trees by drawing toy examples.

To help in finding our way inside the tree, each internal node stores one element between each of its children, similar to how a node in a red-black tree contains a single value sorted between its left and right subtrees. So a B-tree node with 1,000 children will contain 999 elements neatly sorted in increasing order.

To keep lookup operations fast, B-trees maintain the following balancing invariants:

  1. Maximum size: All nodes store a maximum of order - 1 elements, sorted in increasing order.

  2. Minimum size: Non-root nodes must never be less than half full, i.e. the number of elements in all nodes except the root node must be at least (order - 1) / 2.

  3. Uniform depth: All leaf nodes must be at the same depth in the tree, counting from the root node at the top.

Note that the last two properties are a natural consequence of the way we do insertions; we didn’t have to do anything special to prevent nodes from getting too small or to make sure all leaves are on the same level. (These properties are much harder to maintain in other B-tree operations, though. For instance, removing an element may result in an undersized node that needs to be fixed.)

According to these rules, nodes in a B-tree of order 1,000 will hold anywhere between 499 and 999 elements (inclusive). The single exception is the root node, which isn’t constrained by the lower bound: it may contain 0 to 999 elements. (This is so we can create B-trees with, say, less than 499 elements.) Therefore, a single B-tree node on its own may have the same number of elements as a red-black tree that is 10–20 levels deep!

Storing so many elements in a single node has two major advantages:

  1. Reduced memory overhead. Each value in a red-black tree is stored in a separate, heap-allocated node that also includes a pointer to a method lookup table, its reference count, and the two references for the node’s left and right children. Nodes in a B-tree store elements in bulk, which can considerably reduce memory allocation overhead. (The exact savings depends on the size of the elements and the order of the tree.)

  2. Access patterns better suited for memory caching. B-tree nodes store their elements in small contiguous buffers. If the buffers are sized so that they fit well within the L1 cache of the CPU, operations working on them can be considerably faster than the equivalent code operating on red-black trees, the values of which are scattered all over the heap.

To understand how dense B-trees really are, consider that adding an extra level to a B-tree multiplies the maximum number of elements it may store by its order. For example, here’s how the minimum and maximum element counts grow as we increase the depth of a B-tree of order 1,000:

Depth Minimum size Maximum size
1 0 999
2 999 999 999
3 499 999 999 999 999
4 249 999 999 999 999 999 999
5 124 999 999 999 999 999 999 999 999
6 62 499 999 999 999 999 999 999 999 999 999
\(n\) \(2 {\left\lceil\frac{order}{2}\right\rceil}^{n-1}-1\) \(order^{n} - 1\)

Clearly, we’ll rarely, if ever, work with B-trees that are more than a handful of levels deep. Theoretically, the depth of a B-tree is \(O(\log n)\), where \(n\) is the number of elements. But the base of this logarithm is so large that, given the limited amount of memory available in any real-world computer, it’s in fact not too much of a stretch to say that B-trees have \(O(1)\) depth for any input size we may reasonably expect to see in practice.

(This last sentence somehow seems to make sense, even though the same line of thinking would lead me to say that any practical algorithm runs in \(O(1)\) time, with the constant limit being my remaining lifetime — obviously, I don’t find algorithms that finish after my death very practical at all. On the other hand, you could arrange all particles in the observable universe into a 1,000-order B-tree that’s just 30 levels or so deep; so it does seem rather pointless and nitpicky to keep track of the logarithm.)

Another interesting consequence of such a large fan-out is that in B-trees, the vast majority of elements are stored in leaf nodes. Indeed, in a B-tree of order 1,000, at least 99.8% of elements will always reside in leaves. Therefore, for operations that process B-tree elements in bulk (such as iteration), we mostly need to concentrate our optimization efforts on making sure leaf nodes are processed fast: the time spent on internal nodes often won’t even register in benchmarking results.

Weirdly, despite this, the number of nodes in a B-tree is still technically proportional to the number of its elements, and most B-tree algorithms have the same “big-O” performance as the corresponding binary tree code. In a way, B-trees play tricks with the simplifying assumptions behind the big-O notation: constant factors often do matter in practice. This should be no surprise by now. After all, if we didn’t care about constant factors, we could’ve ended this book after the chapter on RedBlackTree!

6.2 Basic Definitions

6.3 The Default Initializer

6.4 Iteration with forEach

6.5 Lookup Methods

6.6 Implementing Copy-On-Write

6.7 Utility Predicates

6.8 Insertion

6.9 Implementing Collection

6.10 Examples

6.11 Performance Summary

7 Additional Optimizations

In this chapter, we’re going to concentrate on optimizing BTree.insert even further, squeezing our code to get the last few drops of sweet performance out.

We’ll create three more SortedSet implementations, creatively naming them BTree2, BTree3, and BTree4. To keep this book at a manageable length, we aren’t going to include the full source code of these three B-tree variants; we’ll just describe the changes through a few representative code snippets. You can find the full source code for all three B-tree variants in the book’s GitHub repository.

If you’re bored with SortedSets, it’s OK to skip this chapter, as the advanced techniques it describes are rarely applicable to everyday app development work.

7.1 Inlining Array’s Methods

BTree stores elements and children of a Node in standard Arrays. In the last chapter, this made the code relatively easy to understand, which helped us explain B-trees. However, Array includes logic for index validation and COW that is redundant inside BTree — supposing we didn’t make any coding mistakes, BTree never subscripts an array with an out-of-range index, and it implements its own COW behavior.

Looking at our insertion benchmark chart (figure 7.1), we see that while BTree.insert is extremely close to SortedArray.insert at small set sizes, there’s still a 10–20% performance gap between them. Would eliminating Array’s (tiny) overhead be enough to close this gap? Let’s find out!

Figure 7.1: Comparing the performance of five different implementations for SortedSet.insert.
Figure 7.1: Comparing the performance of five different implementations for SortedSet.insert.

The Swift standard library includes UnsafeMutablePointer and UnsafeMutableBufferPointer types we can use to implement our own storage buffers. They deserve their scary names; working with these types is only slightly easier than working with C pointers — the slightest mistake in the code that handles them may result in hard-to-debug memory corruption bugs, memory leaks, or even worse! On the other hand, if we promise to be super careful, we might be able to use these to slightly improve performance.

7.2 Optimizing Insertion into Shared Storage

So far, whenever we’ve measured insertion performance, we’ve always assumed the storage of our sorted set is entirely unique to it. But we never benchmarked what happens when we insert elements into sets with shared storage.

Let’s devise a benchmark for measuring shared insertions! One way to do this is to measure how much time it takes to insert a bunch of elements when we make a full copy of the entire set after every insertion:

extension SortedSet {
func sharedInsertBenchmark(_ input: [Element]) {
var set = Self()
var copy = set
for value in input {
set.insert(value)
copy = set
}
_ = copy // Prevents warning about the variable never being read.
}
}

Figure 7.4 displays the results of running this new benchmark on the SortedSet implementations we’ve made thus far.

Figure 7.4: Amortized time of a single insertion into shared storage.
Figure 7.4: Amortized time of a single insertion into shared storage.

Clearly, our NSOrderedSet wrapper wasn’t built for this kind of abuse; once we pass a couple of thousand values, it becomes roughly a thousand times slower than SortedArray. And SortedArray doesn’t fare much better either: to dissociate their storage from their copies, both of these array-based sorted sets need to make a full copy of every single value stored in them. This doesn’t change their insertion implementation’s asymptotic performance (it’s still \(O(n)\)), but it does add a sizeable constant factor. In the case of SortedArray, the sharedInsert benchmark is about 3.5 times slower than plain insert.

Our two red-black tree implementations behave much better: for each insertion, they only need to make a copy of nodes that happen to fall on the path toward the newly inserted element. RedBlackTree does this anyway: its insertion performance is exactly the same whether or not its nodes are shared. But RedblackTree2 is unable to use its fancy in-place mutations: all of its isKnownUniquelyReferenced calls return false, so it becomes slightly slower than RedBlackTree at this particular benchmark.

BTree2 starts off really promising, but it slows down rather dramatically at around 64,000 elements. At this stage, the tree is getting close to growing a third level; its (second-level) root node contains enough elements that having to make a copy of it considerably weighs down insertions. This gets much worse as the tree gains its third level, ending up on a performance level that’s about 6 times slower than red-black trees. (BTree.insert’s asymptotic performance remains \(O(\log n)\); the slowdown just adds a huge constant factor.)

We want our B-trees to be much faster than red-black trees on every chart. Can we somehow prevent this bump from happening in the shared storage case? Well, of course we can!

Our theory is that the slowdown occurs because of large internal nodes. We also know that internal nodes usually don’t contribute much to B-tree performance, because most values are stored in leaf nodes. Therefore, we’re probably able to mess with internal nodes whichever way we like — it won’t affect the charts much. So what if we drastically restricted the maximum size of our internal nodes while leaving leaf nodes alone?

It turns out this is shockingly easy to do: we just need to make a tiny one-line change in BTree2.insert:

extension BTree3 {
@discardableResult
public mutating func insert(_ element: Element) -> (inserted: Bool, memberAfterInsert: Element) {
let root = makeRootUnique()
let (old, splinter) = root.insert(element)
if let s = splinter {
let root = BTree3<Element>.Node(order: 16) // <--
root.elementCount = 1
root.elements.initialize(to: s.separator)
root.children = [self.root, s.node]
self.root = root
}
return (inserted: old == nil, memberAfterInsert: old ?? element)
}
}

The marked line is inside the code block that adds a new level to the tree. The order number we use for the new root also gets applied to all nodes that’ll be split off of it — so by using a small order number here instead of self.order, we ensure that all internal nodes get initialized with this order instead of with the value originally given to BTree’s initializer. (New leaf nodes are always split off from existing leaves, so this number won’t ever get applied to them.)

Naming this new variant BTree3 and rerunning our benchmark gets us the chart in figure 7.5. Our theory was correct: limiting the size of internal nodes did in fact eliminate the slowdown! BTree3 is about 2–2.5 times faster than RedBlackTree, even for huge sets. (Creating a BTree3 of 4 million elements in this laborious manner takes just 15 seconds; BTree2 took 10 times as long.)

Figure 7.5: Amortized time of a single insertion into shared storage.
Figure 7.5: Amortized time of a single insertion into shared storage.

Limiting the size of internal nodes increases the typical height of the tree, so it does affect some B-tree operations. Thankfully though, the effect is mostly negligible: it slows down contains and the in-place mutating flavor of insert only by about 10%, and it doesn’t affect iteration methods at all. For instance, figure 7.6 compares BTree3’s in-place insertion performance to our other implementations.

Figure 7.6: Amortized time of a single insertion into shared storage.
Figure 7.6: Amortized time of a single insertion into shared storage.

That was surprisingly easy, wasn’t it?

7.3 Eliminating Redundant Copying

Whenever we had to implement COW, we did it by implementing makeFooUnique methods that create clones of shared storage when necessary. Our code then proceeded to mutate the resulting unique storage in a separate, second step.

If you think about it, this is a rather inefficient way of doing this: for example, when we have to insert a new element at the start of a shared OrderedSet value, makeUnique first makes a brand-new NSOrderedSet with the exact same elements, and then insert proceeds to immediately move all of them one slot to the right to make room for the new item.

No wonder sharedInsert was so slow for OrderedSets! It would be much faster to create the clone of the shared storage such that it already has the new element inserted at the correct place. We won’t implement this for OrderedSet though, as it isn’t worth the effort: it would still remain far behind our other SortedSet implementations.

However, we can observe the same suboptimal behavior in all three BTree variants we’ve made thus far. When we insert an element into a shared B-tree node, we start by copying all existing elements into a new buffer; we insert the new value in a separate step. What’s even worse is the insertion may result in an oversized node, which then needs to be split — and this is done as yet another separate operation, moving half of the elements into a third buffer. A single insertion may therefore copy/move several hundred elements not once, not twice, but three different times! That seems inefficient.

It’s possible to eliminate all this needless copying/moving by merging makeUnique, insert, and split into a single, rather complicated operation. Merging makeUnique and insert can be done by implementing a non-mutating variant of insertion and switching to it when we detect the node is shared. To unify this hybrid insertion with split, we need to detect that a split will be necessary before doing the insertion, and if so, create two nodes from scratch so that the new element is already in the correct place. Note that the new element may end up inside the first node, or inside the second node, or — if it happens to fall exactly in the middle — it may become the separator value between the two halves of the original node. Thus, to unify all three operations, we need to write \(2 \times 4 = 8\) separate variants of insertion.

I did my best to implement this; unfortunately, the result was a wash. Shared insertion became a little bit faster for certain element counts, and it became a little bit slower for others. In-place insertion didn’t change much at all.

I won’t include the code for my attempt for BTree4 here, but you’ll find it in the GitHub repository. Try tweaking it; maybe you’ll succeed at making it meaningfully faster!

8 Conclusion

In this book, we’ve discussed seven different ways to implement the same simple collection type. Every time we created a new solution, our code became incrementally more complicated, but in exchange, we gained substantial performance improvements. This is perhaps best demonstrated by the steady downward progress of our successive insertion implementations on the benchmark chart.

Figure 8.1: Comparing insertion performance of five SortedSet implementations to the amortized per-element cost of Array.sort.
Figure 8.1: Comparing insertion performance of five SortedSet implementations to the amortized per-element cost of Array.sort.

Is it possible to implement an even faster SortedSet.insert? Well, BTree3 certainly has room for some additional micro-optimizations; I think a 5–10% improvement is definitely possible. Investing even more effort would perhaps get us 20%.

But is it possible we missed some clever optimization trick that would get as another huge performance jump of 200–400%? I don’t believe so.

First of all, note that by inserting a bunch of elements into a sorted set, we’re essentially sorting them. We could also do that by simply calling Array.sort, which implements the super speedy Introsort sorting algorithm. The last line in the chart above depicts the amortized time Array.sort spends on each element. In a very real sense, Array.sort sets a hard upper limit on the performance we can expect out of any sorted set.

At typical sizes, sorting elements by calling BTree3.insert in a loop is just 3.5 times slower than Array.sort. This is already amazingly close! Consider that the BTree3 benchmark processes each element individually and keeps existing elements neatly sorted after each insertion. I find it surprising that BTree3.insert gets so close despite such a huge handicap, and I’d be shocked to learn about a new SortedSet implementation that improves on B-trees by even 50%.

8.1 Implementing Insertion in Constant Time

Although it may not be possible to drastically improve on BTree3 while fully implementing all SortedSet requirements, we can always speed things up by cheating!

For example, SillySet in the code below implements the syntactic requirements of the SortedSet protocol with an insert method that takes \(O(1)\) time. It runs circles around Array.sort in the insert benchmark above without even breaking a sweat:

struct SillySet<Element: Hashable & Comparable>: SortedSet, RandomAccessCollection {
typealias Indices = CountableRange<Int>
 
class Storage {
var v: [Element]
var s: Set<Element>
var extras: Set<Element> = []
 
init(_ v: [Element]) {
self.v = v
self.s = Set(v)
}
 
func commit() {
guard !extras.isEmpty else { return }
s.formUnion(extras)
v += extras
v.sort()
extras = []
}
}
 
private var storage = Storage([])
 
var startIndex: Int { return 0 }
 
var endIndex: Int { return storage.s.count + storage.extras.count }
 
// Complexity: `O(n*log(n))`, where `n` is the number of insertions since the last time `subscript` was called.
subscript(i: Int) -> Element {
storage.commit()
return storage.v[i]
}
 
// Complexity: O(1)
func contains(_ element: Element) -> Bool {
return storage.s.contains(element) || storage.extras.contains(element)
}
 
// Complexity: O(1) unless storage is shared.
mutating func insert(_ element: Element) -> (inserted: Bool, memberAfterInsert: Element) {
if !isKnownUniquelyReferenced(&storage) {
storage = Storage(storage.v)
}
if let i = storage.s.index(of: element) { return (false, storage.s[i]) }
return storage.extras.insert(element)
}
}

Of course, there are many problems with this code: for example, it requires Element to be hashable, and it violates Collection requirements by implementing subscript in \(O(n \log n)\) time — which is absurdly long. But the problem I find most vexing is that SillySet’s indexing subscript modifies its underlying storage by side effect — this breaks my assumptions about what value semantics mean in Swift. (For example, it makes it dangerous to pass SillySet values between threads; even read-only concurrent access may result in data races.)

This particular example might be silly, but the idea of deferring the execution of successive insertions by collecting such elements into a separate buffer has a lot of merit. Creating a sorted set by individually inserting a bunch of elements in a loop isn’t very efficient at all. It’s a lot quicker to first sort the elements in a separate buffer and then to use special bulk loading initializers to convert the buffer into a B-tree in linear time.

Bulk loading works by exploiting the fact that we don’t care whether its intermediate states satisfy sorted set requirements — we never look at elements of a half-loaded set.

It’s important to recognize opportunities for this type of optimization, because they allow us to temporarily escape our usual constraints, often resulting in performance boosts that wouldn’t otherwise be possible.

8.2 Farewell

I hope you enjoyed reading this book! I had a lot of fun putting it together. Along the way, I learned a great deal about implementing collections in Swift, and hopefully you picked up a couple of new tricks too.

Throughout this book, we explored a number of different ways to solve the particular problem of building a sorted set, concentrating on benchmarking our solutions in order to find ways to improve their performance.

However, none of our implementations were complete, as the code we wrote was never really good enough for production use. To keep the book relatively short and to the point, we made some shortcuts that would be inappropriate to take in a real package.

Indeed, even our SortedSet protocol was simplified to the barest minimum: we cut most of the methods of SetAlgebra. For instance, we never discussed how to implement the remove operation. Somewhat surprisingly, it’s usually much harder to remove elements from a balanced tree than it is to insert them. (Try it!)

We didn’t spend time examining all the other data types we can build out of balanced search trees. Tree-based sorted maps, lists, and straightforward variants like multisets and multimaps are just as important as sorted sets; it would’ve been interesting to see how our code could be adapted to implement them.

We also didn’t explain how these implementations could be tested. This is a particularly painful omission, because the code we wrote was often tricky, and we sometimes used unsafe constructs, where the slightest mistake could result in scary memory corruption issues and days of frustrating debugging work.

Testing is hugely important; unit tests in particular provide a safety net against regressions and are pretty much a prerequisite to any kind of optimization work. Data structures lend themselves to unit testing especially well: their operations take easy-to-generate input and they produce well-defined, easily validatable output. Powerful packages like SwiftCheck provide easy-to-use tools for providing full test coverage.

That said, testing COW implementations can be a challenging task. If we aren’t careful enough about not making accidental strong references before calling isKnownUniquelyReferenced, our code will still produce correct results — it will just do it a lot slower than we expect. We don’t normally check for performance issues like this in unit tests, and we need to specially instrument our code to easily catch them.

On the other hand, if we simply forget to ensure we don’t mutate shared storage, our code will also affect variables holding unmutated copies. This kind of unexpected action at a distance can be extremely hard to track down, because our operation breaks values that aren’t explicitly part of its input or output. To ensure we catch this error, we need to write unit tests that specifically check for it — normal input/output checks won’t necessarily detect it, even if we otherwise have 100% test coverage.

We did briefly mention that by adding element counts to the nodes of a search tree, we can find the \(i\)th smallest or largest element in the tree in \(O(\log n)\) time. This trick can in fact be generalized: search trees can be augmented to speed up the calculation of any associative binary operation over arbitrary ranges of elements. Augmentation is something of a secret weapon in algorithmic problem solving: it enables us to easily produce efficient solutions to many complicated-looking problems. We didn’t have time to explain how to implement augmented trees or how we might solve problems using them.

Still, this seems like a good point to end the book. We’ve found what seems to be the best data structure for our problem, and we’re now ready to start working on building it up and polishing it into a complete, production-ready package. This is by no means a trivial task: we looked at the implementation of half a dozen or so operations, but we need to write, test, and document dozens more!

If you liked this book and you’d like to try your hand at optimizing production-quality collection code, take a look my BTree package on GitHub. At the time of writing, the most recent version of this package doesn’t even implement some of the optimizations in our original B-tree code, much less any of the advanced stuff in Chapter 7. There’s lots of room for improvement, and your contributions are always welcome.

How This Book Was Made

This book was generated by bookie, my tool for creating books about Swift. (Bookie is of course the informal name for a bookmaker, so the name is a perfect fit.)

Bookie is a command-line tool written in Swift that takes Markdown text files as input, and produces nicely formatted Xcode Playground, GitHub-flavored Markdown, EPUB, HTML, LaTeX and PDF files, along with a standalone Swift package containing all the source code. It knows how to generate playgrounds, Markdown and source code directly; the other formats are generated by Pandoc, after converting the text into Pandoc’s own Markdown dialect.

To verify code samples, bookie extracts all Swift code samples into a special Swift package (carefully annotated with #sourceLocation directives), and builds it using the Swift Package Manager. The resulting command-line app is then run to print the return value of all lines of code that are to be evaluated. This output is then split and each individual result is inserted into printed versions of the book after its corresponding line of code:

func factorial(_ n: Int) -> Int {
return (1 ... max(1, n)).reduce(1, *)
}
factorial(4)
24
factorial(10)
3628800

(In playgrounds, such output is generated dynamically; but the printout has to be explicitly included into the other formats.)

Syntax coloring is done using SourceKit, like in Xcode. SourceKit uses the official Swift grammar—so contextual keywords are always correctly highlighted:

var set = Set<Int>() // "set" is also a keyword for defining property setters
set.insert(42)
set.contains(42)
true

Ebook and print versions of this book are typeset in Tiempos Text by the Klim Type Foundry. Code samples are set in Laurenz Brunner’s excellent Akkurat.

Bookie is not (yet?) free/open source software; you need to contact us directly if you’re interested in using it in your own projects.