
Swift Tip: Decomposing Emoji

Now that emoji are common everywhere, we need to be aware of Unicode, even without an international user base. For example, the emoji 👨‍👩‍👧‍👦 (a family-of-four glyph) has a very different length across string implementations:

    "👨‍👩‍👧‍👦".count // 1
    ("👨‍👩‍👧‍👦" as NSString).length // 11

							

JavaScript also evaluates to 11. In Ruby, "👨‍👩‍👧‍👦".length evaluates to 7, and in Python 2, len("👨‍👩‍👧‍👦") evaluates to 25 (depending on your settings). One string, four different lengths.

Perhaps even more surprising: none of these implementations is wrong. They're all counting different things. In Swift, we get 1 as the answer because Swift counts characters (extended grapheme clusters), and 👨‍👩‍👧‍👦 is a single character. The NSString variant and JavaScript evaluate to 11 because they're counting the number of UTF-16 code units. We can replicate this in Swift:

    "👨‍👩‍👧‍👦".utf16.count // 11

							

We can also see how Python 2 gets to 25. In this case, it counts the UTF-8 code units, i.e. the bytes of the encoded string:

    "👨‍👩‍👧‍👦".utf8.count // 25

							

And finally, Ruby and Python 3 evaluate to 7 because they count the Unicode scalars, and 👨‍👩‍👧‍👦 consists of the following scalars: 👨 + zero-width joiner + 👩 + zero-width joiner + 👧 + zero-width joiner + 👦.

    "👨‍👩‍👧‍👦".unicodeScalars.count // 7

							

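To make the decomposition concrete, here is a minimal sketch that lists each scalar together with the number of UTF-16 and UTF-8 code units it contributes. The helper name `describeScalars(of:)` is hypothetical (it is not part of the standard library), and the emoji is spelled out with `\u{...}` escapes so the zero-width joiners are visible in the source.

```swift
// The family emoji written with explicit escapes, so the joiners are visible.
let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"

// Hypothetical helper: returns one description per Unicode scalar.
func describeScalars(of string: String) -> [String] {
    return string.unicodeScalars.map { scalar in
        let hex = String(scalar.value, radix: 16, uppercase: true)
        // Wrap the scalar in a String to measure it in each encoding.
        let single = String(Character(scalar))
        return "U+\(hex): \(single.utf16.count) UTF-16, \(single.utf8.count) UTF-8"
    }
}

for line in describeScalars(of: family) {
    print(line)
}
// U+1F468: 2 UTF-16, 4 UTF-8
// U+200D: 1 UTF-16, 3 UTF-8
// U+1F469: 2 UTF-16, 4 UTF-8
// U+200D: 1 UTF-16, 3 UTF-8
// U+1F467: 2 UTF-16, 4 UTF-8
// U+200D: 1 UTF-16, 3 UTF-8
// U+1F466: 2 UTF-16, 4 UTF-8
```

Adding up the columns gives the numbers from above: 4 × 2 + 3 × 1 = 11 UTF-16 code units, and 4 × 4 + 3 × 3 = 25 UTF-8 code units.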
When you're dealing with strings where length is significant, keep this in mind. To learn more, watch last week's Swift Talk episode or read the transcript. If you'd like to learn more about Unicode and how it's implemented in Swift, read our book Advanced Swift.
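One place where this matters is a size limit expressed in bytes, for example a field that only accepts a fixed number of UTF-8 bytes. The following is a rough sketch, not a definitive implementation: `truncated(_:toUTF8Bytes:)` is a hypothetical helper, and the limits are made up for illustration. It walks the string character by character, so it never cuts an emoji in half.

```swift
// Hypothetical helper: keeps as many whole characters as fit in a UTF-8 byte budget.
func truncated(_ string: String, toUTF8Bytes limit: Int) -> String {
    var result = ""
    var used = 0
    for character in string {
        // Size of this character (grapheme cluster) in UTF-8 code units.
        let size = String(character).utf8.count
        if used + size > limit { break }
        result.append(character)
        used += size
    }
    return result
}

// "Hi " followed by the family emoji and an exclamation mark.
let message = "Hi \u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}!"

let short = truncated(message, toUTF8Bytes: 10)
print(short.count) // 3: the emoji alone needs 25 bytes, so it's dropped entirely

let longer = truncated(message, toUTF8Bytes: 28)
print(longer.count) // 4: "Hi " (3 bytes) plus the emoji (25 bytes) fits exactly
```

Truncating `message.utf8` directly at 10 bytes would instead stop in the middle of the emoji's scalars, which is why the sketch measures each character as a whole.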
