This started as a series of notes entitled “The Tinkerer’s Guide to Unicode”, but I got stuck into analysing Windows’ behaviour via Python. The steps involved may be useful to others, so I’ve written up the process.
- ASCII: old-school character set favouring American-English characters. Uses 7 bits, so 128 entries from 0 to 127.
- ANSI: newer character set and encoding, superset of ASCII using the eighth bit (so values 128-255). Different code pages determine what’s in the eighth bit’s range.
- Unicode: the modern character set, now completely distinct from an encoding. Values look like U+0023 (#).
- UTF-8: encoding system for Unicode. Each code point takes one to four bytes. Named for its 8-bit units.
- UTF-16: encoding system for Unicode. Each code point takes two or four bytes. Named for its 16-bit units.
- UCS-2: a predecessor to UTF-16. Named for 2 bytes. They’re slightly different. 😩
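To make the split between a code point and its encodings concrete, here is a minimal Python 3 sketch. The helper name `describe` is made up for this example; it reports how many bytes UTF-8 and UTF-16 need for three characters that appear in this post.

```python
def describe(char):
    """Return (code point label, UTF-8 byte count, UTF-16 byte count)."""
    label = "U+{:04X}".format(ord(char))
    utf8 = char.encode("utf-8")
    # "utf-16-le" skips the two-byte byte-order mark that plain "utf-16" adds
    utf16 = char.encode("utf-16-le")
    return label, len(utf8), len(utf16)

# '#', 'Đ' and the weary emoji, built from their code points
for code_point in (0x0023, 0x0110, 0x1F629):
    print(describe(chr(code_point)))
```

The emoji needs four bytes in both encodings, which is the first hint that anything assuming "one character is 16 bits" will struggle with it.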
Inserting characters
We’ll use Đ as a sample character. It has Unicode value U+0110.
charmap for visual exploration
Run charmap. Tick “advanced view” down the bottom. Ensure “Character set:” is “Unicode”. Don’t use “Search for:” at the bottom; instead “Go to Unicode:” on the right. Entering “0110” will get you your U+0110 ‘Đ’. Press “Select” then “Copy”, and paste into your application.
Copy+paste the character around everywhere.
Direct input with Alt++0110
I seem to have decimal input enabled by default, but Unicode (hex) input doesn’t work without a registry hack. Holding Alt then pressing the numpad + produces a bell. Holding Alt then typing 0110 on the numpad produces n, which is decimal 110 in ASCII.
The registry hack requires a reboot, but changes the behaviour above to enable Alt++ and now I’m alting Unicode.
We can also enter the hex digits that aren’t on the numpad, i.e. A through F, as seen in U+00CF Ï.
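The difference between the two Alt behaviours comes down to which base the digits are read in. A small sketch of that idea in Python (the variable names are just for illustration):

```python
digits = "0110"

decimal_reading = chr(int(digits, 10))  # plain Alt input: decimal 110
hex_reading = chr(int(digits, 16))      # Alt++ input: hexadecimal 0x0110

# ascii() keeps the output printable even in a limited terminal
print(decimal_reading, ascii(hex_reading))
```

The same digits give 'n' in one reading and 'Đ' in the other.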
But wait, there’s more! U+1F629, the ‘weary’ emoji 😩, is five hex digits. There doesn’t seem to be a built-in way to enter this.
Let’s just… use Emojipedia for this one, to copy+paste it…
Remember to press + when inputting Unicode.
Debugging characters
Suppose you go to insert U+1F629, our weary friend, and get a box: ‘’. It’s not ‘😩’ because you have one of those, rendered correctly, in the same session. So what is it? How does it relate to U+1F629?
Windows 10
A quick Google doesn’t reveal any utilities.
Fortunately, I’ve installed Bash on Windows and it has Python 3.4.3. I know the ord() function will tell you the Unicode code point for a single-character string, so let’s try ord("").
It turns out, in this terminal, you can’t enter characters with Alt, even our faithful #/U+0023. You can copy+paste “regular” text (for some value of “regular”) but not ‘😩’ or the box we’re trying to debug. If you paste a sentence including those characters, they simply won’t register.
So the Bash for Windows terminal is dumb. Whatever.
It turns out that both cmd.exe and powershell.exe behave the same in these next steps, so I’ll stick with PowerShell:
Now we’re getting somewhere. The terminal only has limited support for our fancy Unicode characters, but we can operate on them.
It seems that rendering a box ‘□’ is Windows’ standard response to “unknown”, not eg the replacement character ‘�’ which often looks like ‘<?>’. 1
So, why is one 128553 and the other 63017?
Let’s use the unicodedata library.
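A minimal sketch of what `unicodedata` can tell us about the two values. The characters are rebuilt from the numbers above with `chr()` so that nothing has to be pasted into the terminal; the "<no name>" default is my own placeholder.

```python
import unicodedata

weary = chr(128553)  # the emoji we wanted
box = chr(63017)     # the character we actually got

for char in (weary, box):
    # name() raises ValueError for unnamed characters unless given a default
    name = unicodedata.name(char, "<no name>")
    category = unicodedata.category(char)
    print(ord(char), name, category)
```

Category "Co" means "private use": the box is a code point with no assigned meaning, which would explain why no font has a glyph for it.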
Sidetrack: what’s the actual code point for “COMBINING ACUTE ACCENT”? I found a PDF of combining diacritics which tells us the answer is U+301, but how do you learn that from Python?
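One way to answer this from Python, sketched here with the standard library’s `unicodedata.lookup()`, which goes from a character name back to the character:

```python
import unicodedata

accent = unicodedata.lookup("COMBINING ACUTE ACCENT")
print(ord(accent))                 # the code point as a decimal integer
print(unicodedata.name(accent))    # round-trips back to the name
```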
Incidentally, copy+paste is doing much better than the terminal. Notepad++ is taking “boxes” from the terminal and displaying them correctly.
Anyway, how do we go from ‘769’ (from ord()) to our U+0301 code point?
Oh. Right. I’d forgotten Unicode code points are specified in hex.
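For reference, a short sketch of two ways to show that integer in hex. The second uses string formatting to reproduce the familiar U+ notation, padded to at least four digits:

```python
code_point = 769

print(hex(code_point))                  # Python's own hex notation
print("U+{:04X}".format(code_point))    # the conventional Unicode notation
```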
Sidetrack complete. Back to our mysterious \uf629 character, aka bugbox.
This is the key. The first hex digit is dropped when entering the five-digit character into PowerShell. What about other characters?
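Put in arithmetic terms, "dropping the first hex digit" of a five-digit value is the same as keeping only the low 16 bits. The sketch below checks that against the two numbers we have; the masking is my model of what the terminal appears to do, not something I have confirmed.

```python
intended = 0x1F629          # what we tried to type
received = ord(chr(63017))  # what ord() reported for the box

print(hex(intended), hex(received))

# Keeping only the low 16 bits (the last four hex digits) of the
# intended value gives exactly the value we received.
print(hex(intended & 0xFFFF))
```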
I went to write a loop test, but ran into something else. You can’t simply specify ‘weary’ as \u1f629.
Turns out, in Python 3 strings, \u takes exactly four hex digits (a 16-bit value) and \U takes exactly eight (a 32-bit value).
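A quick sketch of the trap: with the lowercase escape, Python reads four hex digits and treats whatever follows as ordinary text.

```python
short = "\u1f629"    # parsed as "\u1f62" followed by a literal "9"
full = "\U0001f629"  # eight hex digits: a single code point

print(len(short), [hex(ord(c)) for c in short])
print(len(full), [hex(ord(c)) for c in full])
```

So the five-digit attempt silently becomes a two-character string rather than raising an error.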
Great. I’m satisfied that this all fits. Let’s write some looped tests.
Pictured: the best, dumbest Python I’ve ever written.
Now I’ll populate the Emoji.character by typing the escape codes on my keyboard. This is it: the main test.
Oh no! They’re all wrong. Let’s take a closer look.
This is showing a familiar pattern. All the entered code points are missing the leading 1.
This is a very limited test data set, of course. The main emojis block is only approximately U+1F300 to U+1F9C0. Let’s broaden it, and update the model while we’re at it.
As soon as we start entering code points above U+FFFF (more than four hex digits), the terminal starts trimming them.
Let’s look at the subtraction between correct and entered:
All nicely formatted:
So we’ve been dropping the first hex digits in the larger entries, and only attending to the last four.
Well, shit. That’s probably why.
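The pattern above can be modelled in a few lines. This is a sketch, not a reproduction of my test harness: `entered_as` is a hypothetical stand-in for the terminal, assuming it keeps only the last four hex digits, and the "entered" column is computed by that model rather than typed in. The sample code points are ones mentioned earlier in this post.

```python
def entered_as(code_point):
    """Model of the observed behaviour: keep only the last four hex digits."""
    return code_point & 0xFFFF

# Code points mentioned in this post, from '#' up to the emoji range
samples = [0x0023, 0x00CF, 0x0110, 0x0301, 0x1F300, 0x1F629, 0x1F9C0]

for correct in samples:
    entered = entered_as(correct)
    diff = correct - entered
    print("U+{:04X} -> U+{:04X}  diff {:#x}".format(correct, entered, diff))
```

Under this model, four-digit code points survive untouched and every larger one is off by a multiple of 0x10000, which matches the subtraction results.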
Wrapping up
What a rollercoaster. Let’s summarise:
- PowerShell and cmd have a limited range of characters visible in their available fonts (no emoji!).
- Bash for Windows’ terminal will discard unfamiliar characters when pasted in. It also doesn’t accept the Alt++0023 format even for familiar characters (eg #).
- PowerShell only preserves the last four hex digits from Alt input. So Alt++1F629 is trimmed to U+F629.
Useful techniques:
- Wrap a character in hex(ord()) to view it in the same format as its Unicode identifier. Eg hex(ord('😩')) will produce '0x1f629'.
- Python 3’s strings use \u for 4-digit Unicode hex, but \U for 8-digit hex.
- Putting together test data and manipulating it with list comprehensions and print formatting is great for getting a picture of things.
References
The spirit of experiment in this post was inspired by Fluent Python, which I would strongly recommend to anyone seeking to become confident in the language. Chapter 4 deals with Unicode.
1. Incidentally, I tried the Lucida Console, Consolas, DejaVu Sans Mono, and Source Code Pro fonts. They all showed the same boxes.