Swift Tip: Decomposing Emoji
Now that emoji are everywhere, we need to be aware of Unicode, even without an international user base. For example, the emoji 👨‍👩‍👧‍👦 (a family-of-four glyph) has a very different length across string implementations:
"๐จโ๐ฉโ๐งโ๐ฆ".count // 1
("๐จโ๐ฉโ๐งโ๐ฆ" as NSString).length // 11
JavaScript's "👨‍👩‍👧‍👦".length also evaluates to 11. In Ruby, "👨‍👩‍👧‍👦".length
evaluates to 7, and in Python 2, len("👨‍👩‍👧‍👦")
evaluates to 25 (depending on your settings). One string, four different lengths.
Perhaps even more surprising: none of these implementations is wrong. They're all counting different things. In Swift, we get 1
as the answer because Swift counts Characters, i.e. extended grapheme clusters -- 👨‍👩‍👧‍👦 is a single character. The NSString
variant and JavaScript evaluate to 11 because they count the number of UTF-16 code units. We can replicate this in Swift:
"👨‍👩‍👧‍👦".utf16.count // 11
We can also see how Python 2 gets to 25 -- in this case, it counts the UTF-8 code units:
"👨‍👩‍👧‍👦".utf8.count // 25
And finally, Ruby and Python 3 evaluate to 7 because they count the Unicode scalars, and 👨‍👩‍👧‍👦 consists of the following scalars: 👨 + zero width joiner + 👩 + zero width joiner + 👧 + zero width joiner + 👦.
"👨‍👩‍👧‍👦".unicodeScalars.count // 7
When you're dealing with strings whose length is significant, keep this in mind. To learn more, watch last week's Swift Talk episode or read the transcript. If you'd like to learn more about Unicode and how it's implemented in Swift, read our book Advanced Swift.
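One practical consequence: when truncating user-visible text, operate on Characters, not on code units. A small sketch (the message string is just a made-up example):

```swift
let message = "family: 👨‍👩‍👧‍👦!"
// prefix(_:) counts Characters, so the family emoji survives intact:
// "family: " is 8 characters, and the whole family is 1 more.
print(String(message.prefix(9))) // "family: 👨‍👩‍👧‍👦"
```

Truncating the same string after 9 UTF-8 bytes instead would slice through the middle of 👨 and leave invalid UTF-8 behind.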