Unicode

Unicode is one of those things people don't commonly think about, but benefit from immensely. To explain why Unicode was necessary, we need to look back at the early days of computing.

Before the 80s, if you owned an IBM mainframe, you were pretty much stuck buying only IBM computers. IBM machines could only talk to other IBM machines, because there was no "standard" way to encode characters. This meant that even files made on an IBM likely couldn't be read by competitors' machines.

Almost everyone has heard that "computers only work with ones and zeroes", but few have had it explained what this means or how it works. I'll attempt to cover what this means for data storage in this section.

Individual characters (typically letters and symbols) have to be stored in binary somehow, and defining how is the main purpose of a "character encoding". A character encoding is essentially a reference table that says "this character is stored as this binary representation". Unicode assigns every character a number (a "code point"), and UTF-8, one of the most common character encodings, defines how those numbers are written out as bytes. The Unicode code charts are published at https://www.unicode.org/charts/ (the complete PDF is located at https://www.unicode.org/Public/UCD/latest/charts/CodeCharts.pdf).
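As a rough illustration of that lookup, here is a small Python sketch (Python is my choice for these examples, and the helper name `describe` is made up for this post). It asks for the number Unicode assigns to a character and for the bytes UTF-8 would store for it.

```python
def describe(char):
    """Return (code point, UTF-8 bytes) for one character, both as hex text."""
    # ord() gives the number Unicode assigns to the character.
    code_point = f"U+{ord(char):04X}"
    # encode() gives the bytes that UTF-8 would actually store.
    stored = " ".join(f"{byte:02x}" for byte in char.encode("utf-8"))
    return code_point, stored


print(describe("H"))  # ('U+0048', '48')
print(describe("é"))  # ('U+00E9', 'c3 a9')
```

The second call shows that the number in the chart and the bytes on disk are not always identical: outside the basic Latin range, UTF-8 spreads one number over several bytes.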

For clarification, let's walk through an example of how this works.

In this picture, you can see that I have a text document with the text Hello!. I've opened this text file in a hex editor called "Bless". Bless is used to open a file and view the raw data in the file in a hexadecimal form.

Notice that I have the letter H selected on the right-hand side of Bless, which corresponds to the hexadecimal number 48, the value stored on disk.

We can look at the Unicode charts and see that the hexadecimal number 48 (written there as 0048) corresponds to an H.

That's essentially how Unicode works. When text is saved, each character is looked up in the tables and its number is written to disk as bytes. For basic Latin characters like these, UTF-8 stores that number directly as a single byte, which is why the hex editor shows 48 for H.
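If you don't have a hex editor handy, the same bytes can be reproduced with a few lines of Python. This is only a sketch of what the editor displays, assuming the file was saved as UTF-8; the function name `hex_dump` is my own.

```python
def hex_dump(raw):
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{byte:02x}" for byte in raw)


text = "Hello!"
data = text.encode("utf-8")  # the bytes that would be written to disk

print(hex_dump(data))  # 48 65 6c 6c 6f 21

# Pair each character with its byte. This one-to-one pairing only holds
# because every character here is basic Latin (one byte each in UTF-8).
for char, byte in zip(text, data):
    print(char, f"{byte:02x}")
```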

Even emoji work that way! Their data charts are located at https://www.unicode.org/emoji/charts/full-emoji-list.html.

Love them or hate them, emoji are here to stay. Due to their prevalence in common communication mediums today, they too have become standardized. Although this seems to have the exact same solution as before, they have some interesting quirks of their own.

Introducing the "Zero Width Joiner" (ZWJ) character, a completely different beast from typical text. The ZWJ is a non-printing (invisible) character that joins two (or more) separate characters into one. I'll walk through an example of how this works below.

There's the standard man emoji (hex code 1F468), which we've all seen, and that works pretty well for many of us.

However, those of us with a darker complexion would likely prefer an emoji that accurately portrays our skin tone.

This works by combining a generic emoji (a man in this case) with a skin tone modifier (which most of us never really see on its own). There are five of these, and I'm going to use "EMOJI MODIFIER FITZPATRICK TYPE-5", with the hex code 1F3FE.

This allows us to specify that we want a darker skin tone on the emoji man by placing the modifier directly after the base emoji, giving the hex code 1F468 1F3FE. No joiner is needed for this step.

`1F468 1F3FE`: Emoji man with darker skin tone
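To see that this really is two separate characters sitting side by side, here is a short Python sketch. The variable names are mine, and the character names come from whatever Unicode data ships with your Python, so a fallback string is used in case a name is missing.

```python
import unicodedata

MAN = "\U0001F468"     # hex code 1F468
TYPE_5 = "\U0001F3FE"  # hex code 1F3FE, the skin tone modifier

man_dark = MAN + TYPE_5  # the modifier directly follows the base emoji

# Python sees two code points, even if your screen draws a single emoji.
print(len(man_dark))  # 2
for char in man_dark:
    print(f"{ord(char):04X}", unicodedata.name(char, "<name unavailable>"))

# Each of the two code points takes four bytes in UTF-8.
print(man_dark.encode("utf-8").hex())  # f09f91a8f09f8fbe
```

Whether the pair is drawn as one emoji or two depends on the font and platform displaying it.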

If we want our emoji to be even more specific, we have the option of adding the ZWJ (hex code 200D) followed by another component. I'm going to use the "red hair" component emoji (hex code 1F9B0).

Combining all of these, we get the hex code 1F468 1F3FE 200D 1F9B0, which is an emoji man with a darker skin tone and red hair.

`1F468 1F3FE 200D 1F9B0`: Emoji man with darker skin tone and red hair
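The same sequence can be assembled and taken apart again in Python. This is a sketch with names of my own choosing (`code_points`, `sequence`), not anything defined by Unicode itself.

```python
ZWJ = "\u200D"           # hex code 200D, the invisible joiner
RED_HAIR = "\U0001F9B0"  # hex code 1F9B0

# Man + skin tone modifier + ZWJ + red hair component.
sequence = "\U0001F468" + "\U0001F3FE" + ZWJ + RED_HAIR


def code_points(text):
    """List the hex code of every character in text, space separated."""
    return " ".join(f"{ord(char):04X}" for char in text)


print(code_points(sequence))  # 1F468 1F3FE 200D 1F9B0

# Splitting on the ZWJ recovers the pieces that were joined together.
parts = sequence.split(ZWJ)
print([code_points(part) for part in parts])  # ['1F468 1F3FE', '1F9B0']
```

Printing `sequence` itself shows a single red-haired man only if your font supports that particular ZWJ sequence. Otherwise the pieces are drawn next to each other.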

If you'd like to look into all of the ZWJ sequences, they're available at https://unicode.org/Public/emoji/12.1/emoji-zwj-sequences.txt.

Note that standard keyboards on phones or computers generally don't offer a way to type the ZWJ character on its own.

