This started as a series of notes entitled “The Tinkerer’s Guide to Unicode”, but I got stuck into analysing Windows’ behaviour via Python. The steps involved may be useful to others, so I’ve written up the process.
- ASCII: old-school character set favouring American-English characters. Uses 7 bits, so 128 entries from 0-127. Full table
- ANSI: newer character set and encoding, superset of ASCII using the eighth bit (so values 128-255). Different code pages determine what’s in the eighth bit’s range.
- Unicode: the modern character set, now completely distinct from an encoding. Values look like `U+0023` (`#`).
- UTF-8: encoding system for Unicode. Each code point takes one to four bytes. Named for its 8-bit code units.
- UTF-16: encoding system for Unicode. Each code point takes two or four bytes. Named for its 16-bit code units.
- UCS-2: a predecessor to UTF-16. Named for 2 bytes. They’re slightly different. 😩
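To make the byte counts in the list above concrete, here is a minimal sketch that encodes a few sample characters and measures the result. The helper name `byte_lengths` is made up for illustration; `str.encode` and the codec names are standard Python.

```python
def byte_lengths(ch):
    """Return (utf8_len, utf16_len) in bytes for a single character."""
    # "utf-16-be" is used so that no byte-order mark is added to the count.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-be"))


for ch in ["#", "Đ", "😩"]:
    utf8_len, utf16_len = byte_lengths(ch)
    print("U+{:04X}: utf-8 = {} bytes, utf-16 = {} bytes".format(
        ord(ch), utf8_len, utf16_len))
```

The emoji needs four bytes in both encodings, which is exactly the case UCS-2 cannot represent.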
Inserting characters
We’ll use `Đ` as a sample character. It has Unicode value `U+0110`.
charmap for visual exploration
Run `charmap`. Tick “advanced view” down the bottom. Ensure “Character set:” is “Unicode”. Don’t use “Search for:” at the bottom; instead use “Go to Unicode:” on the right. Entering “0110” will get you your U+0110 ‘Đ’. Press “Select” then “Copy”, and paste into your application.
Copy+paste the character around everywhere.
Direct input with `Alt`+`+0110`
I seem to have decimal input enabled by default, but it’s not Unicode input without a registry hack. Holding `Alt` then pressing the numpad `+` produces a bell. Holding `Alt` then typing `0110` on the numpad produces `n`, which is decimal 110 in ASCII.
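The two readings of the same keystrokes can be checked in Python. This is only a sketch of the arithmetic, not of what Windows does internally: the digits `0110` read as decimal give `n`, and read as hex give the character we were after.

```python
typed = "0110"

as_decimal = int(typed, 10)  # 110, what the default Alt input uses
as_hex = int(typed, 16)      # 0x0110, what we wanted

print(as_decimal, chr(as_decimal))  # 110 n
print(as_hex, chr(as_hex))          # 272 Đ
```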
The registry hack requires a reboot, but changes the behaviour above to enable `Alt`+`+`, and now I’m alting unicode.
We can also insert non-numpad digits, ie A through F, as seen in `U+00CF` `Ï`.
But wait, there’s more! `U+1F629`, the ‘weary’ emoji `😩`, is five hex digits. There doesn’t seem to be a built-in way to enter this.
Let’s just… use Emojipedia for this one, to copy+paste it…
Remember to press `+` when inputting Unicode.
Debugging characters
Suppose you go to insert `U+1F629`, our weary friend, and get a box: ‘’. It’s not ‘😩’ because you have one of those, rendered correctly, in the same session. So what is it? How does it relate to `U+1F629`?
Windows 10
A quick Google doesn’t reveal any utilities.
Fortunately, I’ve installed Bash on Windows and it has Python 3.4.3. I know the `ord()` function will tell you the Unicode code point for a single-character string, so let’s try `ord("")`.
It turns out, in this terminal, you can’t enter characters with `Alt`, even our faithful `#`/`U+0023`. You can copy+paste “regular” text (for some value of “regular”) but not ‘😩’ or the box we’re trying to debug. If you paste a sentence including those characters, they simply won’t register.
So the Bash for Windows terminal is dumb. Whatever.
It turns out that both `cmd.exe` and `powershell.exe` behave the same in these next steps, so I’ll stick with PowerShell:
Now we’re getting somewhere. The terminal only has limited support for our fancy Unicode characters, but we can operate on them.
It seems that rendering a box ‘□’ is Windows’ standard response to “unknown”, not eg the replacement character ‘�’ which often looks like ‘<?>’. [1]
So, why is one 128553 and the other 63017?
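A first hint comes from plain arithmetic on the two numbers. This sketch only uses the values quoted above; the variable names are mine.

```python
weary_value = 128553  # ord() of the emoji we wanted
box_value = 63017     # ord() of the box we got

difference = weary_value - box_value
print(difference)             # 65536
print(difference == 2 ** 16)  # True
print(chr(weary_value))       # prints the emoji, if the terminal can show it
```

A gap of exactly 2 to the power of 16 suggests something is being cut off at a 16-bit boundary.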
Let’s use the unicodedata library.
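A sketch of the kind of lookup `unicodedata` allows. This is not the original session: the `describe` helper and the `<no name>` placeholder are my own, while `unicodedata.name` and `unicodedata.category` are standard library calls.

```python
import unicodedata


def describe(ch):
    """Return (name, general category) for one character."""
    # name() raises ValueError for unnamed characters unless a default is given.
    return unicodedata.name(ch, "<no name>"), unicodedata.category(ch)


print(describe(chr(128553)))  # the emoji
print(describe(chr(63017)))   # the box
```

The emoji comes back as `WEARY FACE` in category `So` (other symbol). The box has no name and sits in category `Co`, a private-use code point, which fits the fonts having nothing to draw for it.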
Sidetrack: what’s the actual code point for “COMBINING ACUTE ACCENT”? I found a PDF of combining diacritics which tells us the answer is `U+0301`, but how do you learn that from Python?
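One way, sketched here with the standard `unicodedata.lookup` function, is to go from the name back to the character and then ask for its code point.

```python
import unicodedata

accent = unicodedata.lookup("COMBINING ACUTE ACCENT")
print(ord(accent))  # 769
```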
Incidentally, copy+paste is doing much better than the terminal. Notepad++ is taking “boxes” from the terminal and displaying them correctly.
Anyway, how do we go from ‘769’ (from `ord()`) to our `U+0301` code point?
Oh. Right. I’d forgotten Unicode code points are specified in hex.
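A small sketch of the conversion. `hex()` gives the raw value, and a format string pads it to the usual `U+XXXX` shape. The helper name `code_point_label` is made up for this example.

```python
def code_point_label(ch):
    """Format a character's code point as U+XXXX (at least four hex digits)."""
    return "U+{:04X}".format(ord(ch))


print(hex(769))                    # 0x301
print(code_point_label(chr(769)))  # U+0301
```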
Sidetrack complete. Back to our mysterious `\uf629` character, aka `bugbox`.
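Putting the two characters side by side in hex makes the relationship visible. A minimal sketch, using `chr()` to build both characters so nothing depends on what the terminal can accept:

```python
weary = chr(0x1F629)
bugbox = chr(0xF629)

print(hex(ord(weary)))   # 0x1f629
print(hex(ord(bugbox)))  # 0xf629

# Keeping only the low 16 bits of the emoji yields the box.
print(ord(weary) & 0xFFFF == ord(bugbox))  # True
```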
This is the key. The first hex digit is dropped when entering the five-digit character into PowerShell. What about other characters?
I went to write a loop test, but ran into something else. You can’t simply specify ‘weary’ as `\u1f629`.
Turns out, in Python 3 strings, `\u` is for 16-bit hex values (exactly four digits) and `\U` is for 32-bit hex values (exactly eight digits).
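A quick sketch of the difference. With `\u`, Python consumes four hex digits and treats whatever follows as ordinary text, so the five-digit attempt quietly becomes two characters.

```python
wrong = "\u1f629"      # parsed as "\u1f62" followed by a literal "9"
right = "\U0001F629"   # eight hex digits, one character

print(len(wrong), len(right))        # 2 1
print([hex(ord(c)) for c in wrong])  # ['0x1f62', '0x39']
print(hex(ord(right)))               # 0x1f629
```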
Great. I’m satisfied that this all fits. Let’s write some looped tests.
Pictured: the best, dumbest Python I’ve ever written.
Now I’ll populate the `Emoji.character` by typing the escape codes on my keyboard. This is it: the main test.
Oh no! They’re all wrong. Let’s take a closer look.
This is showing a familiar pattern. All the entered code points are missing the leading `1`.
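The test can be approximated without a keyboard. The sketch below is not the original test code: the `Emoji` record is a hypothetical stand-in for the model mentioned above, and `simulate_alt_entry` is an assumption about the terminal (it keeps only the low 16 bits) rather than anything read from Windows.

```python
from collections import namedtuple

# Hypothetical model: the code point we meant, and the character that arrived.
Emoji = namedtuple("Emoji", ["code_point", "character"])


def simulate_alt_entry(code_point):
    """Assumed terminal behaviour: only the low 16 bits survive."""
    return chr(code_point & 0xFFFF)


intended = [0x1F600, 0x1F629, 0x1F4A9]
emojis = [Emoji(cp, simulate_alt_entry(cp)) for cp in intended]

for emoji in emojis:
    entered = ord(emoji.character)
    status = "ok" if entered == emoji.code_point else "WRONG"
    print("U+{:04X} -> U+{:04X} {}".format(emoji.code_point, entered, status))
```

Every row comes out as `WRONG`, with the entered value equal to the intended one minus the leading `1`.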
This is a very limited test data set, of course. The main emojis block is only approximately `U+1F300` to `U+1F9C0`. Let’s broaden it, and update the model while we’re at it.
As soon as we start entering values that need more than 16 bits, the terminal starts trimming them.
Let’s look at the subtraction between correct and entered:
All nicely formatted:
So we’ve been dropping the first hex digits in the larger entries, and only attending to the last four.
Well, shit. That’s probably why.
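The subtraction can be sketched over a wider spread of code points. As before, the truncation is modelled as an assumption (keep the low 16 bits), and the sample values are my own choice, not the original test set.

```python
def low_16_bits(code_point):
    """Assumed result of entering a code point through Alt input."""
    return code_point & 0xFFFF


samples = [0x0023, 0x0110, 0xFFFD, 0x1F629, 0x2F800, 0x10FFFF]
rows = [(cp, low_16_bits(cp), cp - low_16_bits(cp)) for cp in samples]

print("{:>10} {:>10} {:>10}".format("correct", "entered", "difference"))
for correct, entered, difference in rows:
    print("{:>10} {:>10} {:>10}".format(
        hex(correct), hex(entered), hex(difference)))
```

Values up to `U+FFFF` come through untouched, and everything above loses a whole multiple of `0x10000`, which is the same as keeping only the last four hex digits.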
Wrapping up
What a rollercoaster. Let’s summarise:
- PowerShell and cmd have a limited range of characters visible in their available fonts (no emoji!).
- Bash for Windows’ terminal will discard unfamiliar characters when pasted in. It also doesn’t accept the `Alt`+`+0023` format even for familiar characters (eg `#`).
- PowerShell only preserves the last four hex digits from `Alt` input. So `Alt`+`+1F629` is trimmed to `U+F629`.
Useful techniques:
- Wrap a character in `hex(ord())` to view it in the same format as its Unicode identifier. Eg `hex(ord('😩'))` will produce `'0x1f629'`.
- Python 3’s strings use `\u` for 4-digit unicode hex, but `\U` for 8-digit hex.
- Putting together test data and manipulating it with list comprehensions and print formatting is great for getting a picture of things.
References
The spirit of experiment in this post was inspired by Fluent Python, which I would strongly recommend to anyone seeking to become confident in the language. Chapter 4 deals with Unicode.
[1] Incidentally, I tried the Lucida Console, Consolas, DejaVu Sans Mono, and Source Code Pro fonts. They all showed the same boxes.