
Decomposing Emoji

Now that emoji are common everywhere, we need to be aware of Unicode, even without an international user base. For example, the emoji 👨‍👩‍👧‍👦 (a family-of-four glyph) has a very different length across string implementations:

"👨‍👩‍👧‍👦".count // 1
("👨‍👩‍👧‍👦" as NSString).length // 11

In JavaScript, the length also evaluates to 11. In Ruby, "👨‍👩‍👧‍👦".length evaluates to 7, and in Python 2, len("👨‍👩‍👧‍👦") evaluates to 25 (depending on your settings). One string, four different lengths.

Perhaps even more surprising: none of these implementations is wrong. They're all counting different things. In Swift, we get 1 as the answer because Swift counts characters (extended grapheme clusters), and 👨‍👩‍👧‍👦 is a single character. The NSString variant and JavaScript evaluate to 11 because they're counting the number of UTF-16 code units. We can replicate this in Swift:

"👨‍👩‍👧‍👦".utf16.count // 11

We can also see how Python 2 gets to 25: in this case, it counts the UTF-8 code units:

"👨‍👩‍👧‍👦".utf8.count // 25

And finally, Ruby and Python 3 evaluate to 7 because they count the Unicode scalars, and 👨‍👩‍👧‍👦 consists of the following scalars: 👨 + zero width joiner + 👩 + zero width joiner + 👧 + zero width joiner + 👦.

"👨‍👩‍👧‍👦".unicodeScalars.count // 7

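To see where the four numbers come from, here is a minimal sketch that walks over the scalars and counts how many UTF-16 and UTF-8 code units each one needs. The string is built from explicit escapes so the decomposition is visible in the source; the names family, utf16Total, and utf8Total are just illustrative.

```swift
// The family emoji, written as its seven scalars:
// man, ZWJ, woman, ZWJ, girl, ZWJ, boy.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

for scalar in family.unicodeScalars {
    // Re-wrap each scalar in a String to measure it on its own.
    let piece = String(scalar)
    let codePoint = "U+" + String(scalar.value, radix: 16, uppercase: true)
    print(codePoint, "UTF-16 units:", piece.utf16.count, "UTF-8 units:", piece.utf8.count)
}

// Summing the per-scalar sizes reproduces the totals from the post.
let utf16Total = family.unicodeScalars.reduce(0) { $0 + String($1).utf16.count }
let utf8Total = family.unicodeScalars.reduce(0) { $0 + String($1).utf8.count }
print(family.count, family.unicodeScalars.count, utf16Total, utf8Total) // 1 7 11 25
```

Each of the four people needs two UTF-16 code units (a surrogate pair) or four UTF-8 code units, while each zero width joiner (U+200D) needs one UTF-16 code unit or three UTF-8 code units. That gives 4 × 2 + 3 × 1 = 11 and 4 × 4 + 3 × 3 = 25.
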
When you're dealing with strings where length is significant (a character limit, say, or a storage size in bytes), keep this in mind and decide which of these units you actually mean to count. To learn more, watch last week's Swift Talk episode or read the transcript. If you'd like to learn more about Unicode and how it's implemented in Swift, read our book Advanced Swift.
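
As an illustration of a case where length is significant, here is a small sketch that truncates a string to a UTF-8 byte budget without cutting a character in half. The helper truncated(_:toUTF8Bytes:) is hypothetical, not part of the standard library, and the byte limits are made up for the example.

```swift
// Hypothetical helper: keeps whole characters (grapheme clusters)
// while staying within `maxBytes` UTF-8 code units.
func truncated(_ text: String, toUTF8Bytes maxBytes: Int) -> String {
    var result = ""
    var used = 0
    for character in text {
        let size = String(character).utf8.count
        // Stop before a character that would exceed the budget.
        if used + size > maxBytes { break }
        result.append(character)
        used += size
    }
    return result
}

let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
let message = "Hi " + family

print(truncated(message, toUTF8Bytes: 10)) // "Hi ", because the emoji alone needs 25 bytes
print(truncated(message, toUTF8Bytes: 28) == message) // true
```

Cutting the UTF-8 view at a fixed byte offset instead could split the emoji between two of its scalars, or even in the middle of one, which is why the sketch iterates over characters and only measures in bytes.
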

