Now that emoji are common everywhere, we need to be aware of Unicode, even without an international user base. For example, the emoji 👨‍👩‍👧‍👦 (a family-of-four glyph) has a very different length across string implementations. In Swift, its count evaluates to 1,
("👨‍👩‍👧‍👦" as NSString).length
evaluates to 11, in Ruby and Python 3 its length evaluates to 7, and in Python 2 it evaluates to 25 (depending on your settings). One string, four different lengths.
Perhaps even more surprising: none of these implementations is wrong. They're all counting different things. In Swift, we get 1 as the answer because Swift counts the characters, and 👨‍👩‍👧‍👦 is a single character. The NSString API returns 11 because it counts UTF-16 code units: two for each of the four emoji and one for each of the three zero width joiners.
We can also see how Python 2 gets to 25: in this case, it counts the UTF-8 code units, four for each emoji and three for each zero width joiner.
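To compare the different counts side by side, here is a minimal sketch that asks a Swift string for each of its views. It assumes a recent Swift toolchain, and it spells the emoji with escapes so the zero width joiners (U+200D) cannot get lost when the code is copied. The variable names are only illustrative.

```swift
// The family emoji: four emoji scalars glued together by three zero width joiners.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

let characterCount = family.count               // characters (grapheme clusters)
let scalarCount = family.unicodeScalars.count   // Unicode scalars
let utf16Count = family.utf16.count             // UTF-16 code units, what NSString's length reports
let utf8Count = family.utf8.count               // UTF-8 code units (bytes)

print(characterCount, scalarCount, utf16Count, utf8Count)
// With a recent toolchain this prints: 1 7 11 25
```

The character count depends on the Unicode version the toolchain implements, so very old Swift versions may report a different value for the first number; the other three follow directly from the encoding.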
And finally, Ruby and Python 3 evaluate to 7 because they count the Unicode scalars, and 👨‍👩‍👧‍👦 consists of the following scalars: 👨 + zero width joiner + 👩 + zero width joiner + 👧 + zero width joiner + 👦.
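The decomposition into scalars can be inspected directly from Swift. The following is a small sketch, again writing the emoji with escapes; the names `codePoints` and `joinerCount` are made up for this example.

```swift
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

// Format every Unicode scalar in the string as U+XXXX.
let codePoints = family.unicodeScalars.map { scalar in
    "U+" + String(scalar.value, radix: 16, uppercase: true)
}
print(codePoints)
// ["U+1F468", "U+200D", "U+1F469", "U+200D", "U+1F467", "U+200D", "U+1F466"]

// Count how many of those scalars are zero width joiners.
let zeroWidthJoiner: Unicode.Scalar = "\u{200D}"
let joinerCount = family.unicodeScalars.filter { $0 == zeroWidthJoiner }.count
print(joinerCount) // 3
```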
When you're dealing with strings where length is significant, keep this in mind. To learn more, watch last week's Swift Talk episode or read the transcript. If you'd like to learn more about Unicode and how it's implemented in Swift, read our book Advanced Swift.
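As one illustration of a case where length is significant, suppose a string has to fit into a byte budget, for example a storage field limited to a number of UTF-8 bytes. The sketch below is one possible approach, not a canonical one: it measures in UTF-8 code units but only cuts at character boundaries, so an emoji is never split in half. `prefixFitting` is a hypothetical helper name.

```swift
/// Returns the longest prefix of `text` whose UTF-8 encoding fits in `maxBytes`,
/// cutting only between characters (grapheme clusters).
/// Hypothetical helper, written for illustration.
func prefixFitting(_ text: String, maxBytes: Int) -> String {
    var result = ""
    var usedBytes = 0
    for character in text {
        let size = String(character).utf8.count
        // Stop before the first character that would exceed the budget.
        if usedBytes + size > maxBytes { break }
        result.append(character)
        usedBytes += size
    }
    return result
}

let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
let message = "Hi " + family   // 3 bytes of ASCII plus 25 bytes for the emoji

let short = prefixFitting(message, maxBytes: 10)   // the emoji does not fit, so only "Hi " remains
let full = prefixFitting(message, maxBytes: 28)    // the whole message fits exactly
print(short.utf8.count, full.utf8.count)
```

Cutting at a fixed UTF-8 or UTF-16 offset instead could leave a lone 👨 or an invalid byte sequence behind, which is why the loop works character by character.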