Now that emoji are common everywhere, we need to be aware of Unicode, even without an international user base. For example, the emoji 👨‍👩‍👧‍👦 (a family-of-four glyph) has a very different length across string implementations. In Swift,
"👨‍👩‍👧‍👦".count evaluates to 1,
("👨‍👩‍👧‍👦" as NSString).length evaluates to 11, in Ruby
"👨‍👩‍👧‍👦".length evaluates to 7, and in Python 2
len("👨‍👩‍👧‍👦") evaluates to 25 (depending on your settings). One string, four different lengths.
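Perhaps even more surprising: none of these implementations are wrong. They're all counting different things. In Swift, we get
1 as the answer because Swift counts the characters, and 👨‍👩‍👧‍👦 is a single character. The NSString version counts UTF-16 code units, which the string's utf16 view confirms: "👨‍👩‍👧‍👦".utf16.count is 11.
We can also see how Python 2 gets to 25. In this case, it counts the UTF-8 code units: "👨‍👩‍👧‍👦".utf8.count is 25 as well.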
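The code unit counts can be checked from Swift alone through the string's encoding views. What follows is a minimal sketch, assuming Swift 5 or later; the constant name family is ours, and the emoji is written with explicit escapes so that the zero width joiners (U+200D) are not lost when copying.

```swift
// The family emoji: four people joined by three zero width joiners.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

// Characters (extended grapheme clusters): what Swift's count reports.
print(family.count)        // 1

// UTF-16 code units: what NSString's length reports.
// Each person needs a surrogate pair (4 x 2), each joiner one unit (3 x 1).
print(family.utf16.count)  // 11

// UTF-8 code units: what a byte-oriented length reports.
// Each person takes 4 bytes (4 x 4), each joiner 3 bytes (3 x 3).
print(family.utf8.count)   // 25
```

The arithmetic in the comments is worth keeping in mind: the same seven scalars cost a different number of code units depending on the encoding, which is why the byte-oriented and UTF-16-oriented lengths disagree.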
And finally, Ruby and Python 3 evaluate to 7 because they count the Unicode scalars, and 👨‍👩‍👧‍👦 consists of the following seven scalars: 👨 + zero width joiner + 👩 + zero width joiner + 👧 + zero width joiner + 👦.
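To look at those scalars directly, we can walk the string's unicodeScalars view. This is only a sketch, again assuming Swift 5 or later; the names family and codePoints are ours, chosen for illustration.

```swift
// The family emoji, spelled out scalar by scalar.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

// Format each scalar's numeric value as a U+XXXX code point.
let codePoints = family.unicodeScalars.map { scalar in
    "U+" + String(scalar.value, radix: 16, uppercase: true)
}

print(codePoints.count)  // 7
print(codePoints)
// ["U+1F468", "U+200D", "U+1F469", "U+200D", "U+1F467", "U+200D", "U+1F466"]
```

Every second entry is U+200D, the zero width joiner that glues the four people into one character.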
When you're dealing with strings where length is significant, keep this in mind. To learn more, watch last week's Swift Talk episode or read the transcript. If you'd like to learn more about Unicode and how it's implemented in Swift, read our book Advanced Swift.