I ran into a nice example of characters (graphemes) and bytes coming apart recently. For my editor (neovim), I use a great little plugin that makes it easier to write common pairs (e.g., "", (), {}, etc.). I wanted to add support for curly quotes (“”), and that was…not straightforward.
The existing plugin grabs two characters (to check whether the cursor is at the end of a pair it should jump over, or is deleting the first item of an empty pair) in a wonderfully simple way: line:sub(col, col + 1). This works because col is the position of the cursor, so it grabs (a) the character at the cursor and (b) the next character. The scripting language here is Lua (5.1 via LuaJIT), however, so this doesn’t work at all if characters are potentially multibyte. (The string library in Lua 5.1 only works on bytes, not characters.)
In the end, it wasn’t too bad to handle since the relevant host environment for Lua (namely, neovim) has Unicode information to share. But it took me most of a weekend to find the Unicode supporting functions in neovim and figure out how to put them together with Lua’s native form of string handling.
-- Before (no multibyte support)
local function insert_get_pair()
  -- add "_" to let the close function work in the first col
  local line = "_" .. vim.api.nvim_get_current_line()
  local col = vim.api.nvim_win_get_cursor(0)[2] + 1
  return line:sub(col, col + 1)
end

-- After (multibyte pair characters now supported)
-- Note: this version returns two items rather than one,
-- because pairs are now stored as {"[", "]"} rather than "[]".

-- Neovim can tell us the starting byte position of each character in the
-- line, which lets us build a table of the line's characters indexed by
-- character position.
local function chars_by_position(line, char_positions)
  local chars = {}
  for i, pos in ipairs(char_positions) do
    if char_positions[i + 1] then
      chars[i] = line:sub(pos, char_positions[i + 1] - 1)
    else
      chars[i] = line:sub(pos, #line)
    end
  end
  return chars
end

local function insert_get_pair()
  -- add "_" to let the close function work in the first col
  local line = "_" .. vim.api.nvim_get_current_line()
  local col = vim.api.nvim_win_get_cursor(0)[2] + 1
  local char_positions = vim.str_utf_pos(line)
  -- If there are no multibyte characters, we can avoid the extra work.
  if #line == #char_positions then
    return { line:sub(col, col), line:sub(col + 1, col + 1) }
  end
  local chars = chars_by_position(line, char_positions)
  local cursor_char_pos = vim.fn.getcursorcharpos(0)[3]
  return { chars[cursor_char_pos], chars[cursor_char_pos + 1] }
end
I’m still beta-testing this now. If anyone needs autoclose for multibyte characters in neovim (a very, very small club?), check out the branch here.
the mouse thing is one of the more annoying things you can do in 2023
I dunno, a personal webpage being annoying just to have some fun, as opposed to trying to trick me into clicking on an ad or making some engagement metric go up, feels pretty refreshing to me in 2023.
well, it tanks performance too
idk I just want to read the post without downloading the mouse movements of every other bloke who’s doing the same
thank god for reader mode
Having JavaScript disabled by default (with per-site opt-in) also seems to work.
Firefox’s reader-view gets rid of it (along with the distractingly coloured background)
Any time people talk about string length being some unicode pain, I ask them: what is the length of the tab character? Or the backspace character? Or is the length of MM equal to ll? Those are present in plain ascii and the answer is basically always “it depends”.
And quite frankly, the correct answer almost never comes from counting graphemes. Again, consider those cases: the grapheme counts of MM and ll are the same. But why does that ever matter? If you have some backend storage limit, it probably cares more about the byte count than the grapheme count. And if you have some frontend display limit, you probably want to know the display size in a given font. Sure, the rendered size is gonna depend on graphemes (among other things), but that’s more an internal detail to the majority of users of a string.
If you’re writing the guts of an editor, knowing if the current byte is part of a code point or if the code point is part of a grapheme cluster matters… but most people are probably better off treating strings as opaque blobs and asking for a different length for different tasks, like byte count or display pixel size.
The problem is not so much length, it’s things like assuming that if two strings are the same then they must be the same length, but that isn’t actually true for unicode unless you do a canonicalisation pass first.
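To make that concrete, here is a minimal Go sketch (Go chosen only because its standard library plus the golang.org/x/text/unicode/norm package expose the relevant pieces; the strings are illustrative): two canonically equivalent spellings of “é” compare unequal and have different byte and code point counts until you run a canonicalisation pass.

package main

import (
	"fmt"
	"unicode/utf8"

	"golang.org/x/text/unicode/norm"
)

func main() {
	composed := "\u00e9"    // é as a single precomposed code point
	decomposed := "e\u0301" // e followed by COMBINING ACUTE ACCENT

	fmt.Println(composed == decomposed)         // false: different bytes
	fmt.Println(len(composed), len(decomposed)) // 2 3 (bytes)
	fmt.Println(utf8.RuneCountInString(composed),
		utf8.RuneCountInString(decomposed)) // 1 2 (code points)

	// After a canonicalisation (NFC) pass they really are the same string.
	fmt.Println(norm.NFC.String(composed) == norm.NFC.String(decomposed)) // true
}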
If you’re writing a text renderer, there is no “length” measure that will give you the answer you want.
But there are many cases where asking the length of a string is both a sensible and a necessary question, and needs a definition. For much of the web platform, the answer is “the number of 16-bit code units required to encode the string in UTF-16”, for example, because JavaScript is a UTF-16 language and web platform interfaces need to be defined in a JS-compatible way.
It’s useful to know the number of codeunits, bytes or (rarely) codepoints if you’re writing an algorithm that deals in those, but I can’t think of a time when it would have been specifically helpful for me to know the “length of a string”.
I also dispute that “the number of graphemes” is the best option for the length of a string. It’s just one of many lengths that will sometimes be relevant.
Measuring the length of a string, and indexing into arbitrary locations in it, are extremely common operations in validating user input, to such a degree that I am consistently astonished by how many people insist they are never used and never needed.
For example, text input widgets which can be configured with a minimum and maximum input length are quite common, and without some agreement on how to measure and calculate length, they simply would not function.
For indexing strings in JS, this is straightforwardly the codeunit index into the underlying array. This is obviously an occasionally useful thing to be able to do, though I don’t do it often and usually I would rather index by byte or “character” anyway.
Min and maxlen on text inputs are a bit weird. I think they’re mostly used as kinda bad proxies for number of bytes (database limits, etc), or text presentation length, or text complexity. Again, I would prefer to be able to easily specify what I mean.
At a previous job I had to write code which would validate and process government-issued license identifiers.
One of the validation rules was a rule about length.
Telling the government that the length of a string is a meaningless concept and that they shouldn’t be asking for it was not one of the available options for compliance.
Perhaps it is the case that you have never in your entire career encountered a data type which involved length validation, but that does not mean they do not exist, or that they can be hand-waved away.
The government would ask for the licenses to match a certain pattern and I would validate that, including checking that the identifiers contained a permissible number of characters.
The government would be unlikely to ask me about string lengths or UTF16, they would talk about numbers of characters or identifier lengths or sizes or something.
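As a hedged sketch of what that kind of requirement usually turns into in code: a pattern that pins down both the allowed alphabet and the count, rather than an abstract “string length”. The plate format below is invented for illustration, not any real government’s rule.

package main

import (
	"fmt"
	"regexp"
)

// Hypothetical rule: 4 to 7 characters, uppercase Latin letters and digits only.
// A real spec would name its own alphabet and count.
var plateRe = regexp.MustCompile(`^[A-Z0-9]{4,7}$`)

func main() {
	fmt.Println(plateRe.MatchString("AB1234"))     // true
	fmt.Println(plateRe.MatchString("ab1234"))     // false: wrong alphabet
	fmt.Println(plateRe.MatchString("AB12345678")) // false: too many characters
}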
Obviously lengths of things are important. I have given a bunch of example lengths and sizes that I would care about.
I’m saying that “length of a string” is an implementation detail and that the real requirement will have context and specificity (e.g. “registration plates must contain exactly 4-7 characters” or “titles must not wrap across lines” or “intro paragraph must be short” or “names must fit in LDAP”).
Yes, you can of course define “length of a string” to have a specific meaning in your system, but there is no universal definition because text is too complicated and there are too many reasonable ways of measuring a string.
JS picked length to mean “number of UTF-16 code units”. Julia (and Go?) define it as “number of codepoints”. Other languages use “number of graphemes” or “number of bytes”.
We can avoid confusion and bugs by being specific about what quantity we want.
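As a concrete illustration (in Go purely because its standard library exposes each quantity), the various numbers that different languages report as “length” for the same text can all be computed explicitly, so it pays to name the one you actually want:

package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

func main() {
	s := "naïve"

	fmt.Println(len(s))                       // 6: bytes (Go's and Rust's len)
	fmt.Println(utf8.RuneCountInString(s))    // 5: code points (Julia-style length)
	fmt.Println(len(utf16.Encode([]rune(s)))) // 5: UTF-16 code units (JS .length)
	// A grapheme count would need a UAX #29 segmentation library;
	// it is not in the Go standard library.
}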
When you’re writing a text renderer, you get the length measure after shaping, together with most of the graphical representation.
I wish people would stop crapping on Han unification if they can’t read Hanzi/Kanji. It is totally appropriate that those characters in the article are the same. If they were different, it would be like 4 joined up and 4 in two strokes being separated, or 7 with a slash in the middle being separated. They’re different ways to write the same thing, and have all been used in Japan within the last 100 years.
There were serious issues. Unicode eventually added dedicated code points undoing the worst cases of the unification.
It’s not just about readability, but a cultural issue. People can read a foreign way of writing a character, but it’s not how they write it.
China and Japan culturally care a lot about calligraphy. To them it’s not just a font. Number of strokes does matter, even the direction of strokes is significant. Japan even has a set of traditional variants of characters that are used only in people’s names.
(I can read Kanji)
As a Chinese person, my cultural feeling is that I would identify Kanji characters with Hanzi characters. Many characters do look the same and are written the same. Differences in stroke order or the direction of certain strokes feel more like “inessential” font differences. Visibly different forms between Kanji and Hanzi are similar to simplified vs traditional: more elaborate font differences, but still essentially the same characters.
One interesting angle is how Kanji characters are read in Chinese: they’re just read like native Chinese characters, completely ignoring the Japanese pronunciation and any shape differences. For example, the protagonist of Slam Dunk, 桜木花道 is always read as yīng mù huā dào in Mandarin Chinese, despite that (1) 桜 is written 樱 in Chinese (2) 花 is written slightly differently and (3) the Japanese pronunciation, Sakuragi Hanamichi, being kunyomi, bears no resemblance to yīng mù huā dào.
On a meta level, there’s no distinction between Hanzi and Kanji in Chinese: they’re both 汉字, pronounced hàn zì. I don’t know for sure whether Japanese people have this distinction, but it’s probably illuminating to see that the Japanese wiki page 漢字 encompasses the Chinese character system in all its regional variants.
Thanks for your input. There are 新字体 and 国字 (和製漢字), but as far as I know this distinction is only academic. Kunyomi/onyomi/etc. distinction is impossible to ignore, but that’s more related to words’ etymology than writing.
Normally, I’m the first to say that the differences between simplified and traditional are overblown; however, I think it’s also eliding a bit to claim they’re essentially the same.
My mental model is that simplified originally was a surjective function. (That’s not true anymore.) But, while characters like 電/电 are onto and 只/隻 are grammatically awkward, characters like 复 can be downright misleading.
n.b. these differences matter less to Mandarin speakers, since simplified was made for it. (e.g. characters homophonous in Mandarin were merged) But the Japanese (and Korean, but that’s a different story) simplification projects came to different conclusions because they’re for different cultures and languages.
There were a few bugs and ghost characters in the process, which is to be expected when you’re digitizing tens of thousands of characters, but the basic idea of unification is sound. I had a friend who wrote her family name 櫻木 instead of the usual 桜木. Well, sure enough, that comes through on Lobsters because both variants are encoded. So too is the common 高 vs 髙 variation. The point is to be able to encode both variants where you would need them both in a single text, without having to duplicate everything or worse (do we need variants for each of Japanese, Korean, and Vietnamese? for pre-War and post-War Japanese? for various levels of cursive? etc.). It was a success.
Calligraphy can’t be represented in text at all. For that you need a vector image format, not a text encoding.
As for numbers of strokes, read https://languagelog.ldc.upenn.edu/nll/?p=40492 People don’t always agree how many strokes a character has.
You’re saying unification is sound, but you can spell your friend’s name correctly only because these characters weren’t unified.
do we need variants for each of Japanese, Korean, and Vietnamese? for pre-War and post-War Japanese?
Yes! Unicode has Middle English, Old Church Slavonic with its one-off ꙮ, even Phoenician and Hieroglyphs. There’s old Kana in there already. East Asia should be able to encode their historical texts too.
UCS-2 was meant to be limited to contemporary characters due to the 16-bit limit, but Unicode changed course to include everything.
CJK having a font dependency is reminiscent of legacy code pages.
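For anyone who hasn’t seen the mechanics of that change of course: code points beyond U+FFFF no longer fit in a single 16-bit unit, so UTF-16 splits them into a surrogate pair. A small Go sketch of the round trip:

package main

import (
	"fmt"
	"unicode/utf16"
)

func main() {
	r := rune(0x20BB7) // 𠮷, a CJK ideograph outside the Basic Multilingual Plane

	hi, lo := utf16.EncodeRune(r)
	fmt.Printf("U+%X -> surrogate pair U+%X U+%X\n", r, hi, lo)
	fmt.Println(utf16.DecodeRune(hi, lo) == r) // true: decodes back to the original code point
}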
I’ve worked on a website developed in Hong Kong, and it did track locale and set lang and language-specific font stacks to distinguish between the zh-xx and jp variants. They do care.
I’ve mentioned calligraphy not in a technical sense, but a cultural one. The characters and their strokes are a valued tradition.
People may disagree about strokes of some complex characters, or there may be older and newer ways to draw a character, but that doesn’t mean the differences don’t matter.
I think the technical solution of mapping “characters” to code points to glyphs, applied to both alphabets and logograms, suggests a perspective/commonality that isn’t the best way to view the issue.
You could also think of the character variants as differences in spelling. In English you have US and GB spellings, as well as historical spellings, and also some words with multiple spellings co-existing. These are the same mutually intelligible words, but if you did “English word unification”, it’d annoy some people.
Huh, my iPad does not have the fixed glyph for the multiocular O: it still has 7 eyes instead of 10 https://en.m.wikipedia.org/wiki/Multiocular_O
I’m not familiar with this issue, but why not just add markers for similar characters to distinguish the cultural variant to be used where it is relevant?
That is, in fact, what Unicode does https://en.m.wikipedia.org/wiki/Variation_Selectors_(Unicode_block)
Thanks
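For the curious, a variation selector is just an ordinary code point that rides along after the base character; whether it actually selects a different glyph depends on the font and on whether the sequence is registered (for CJK, in the Ideographic Variation Database). A small Go illustration, using 葛 plus VARIATION SELECTOR-17 purely as an example sequence:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	base := "葛"              // U+845B
	variant := "葛\U000E0100" // the same base followed by U+E0100 (VS17)

	fmt.Println(utf8.RuneCountInString(base), utf8.RuneCountInString(variant)) // 1 2
	fmt.Println(base == variant) // false: the selector is part of the text
}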
Re Han unification - what do native speakers think? I assume there’s a diversity of opinion.
I’ve also thought for a while that “Greco” unification would be good - we would lose the attacks where words usually written in one script are written with identical-looking letters from another script.
Last I looked into the discussion about Han unification, I got the feeling that people in China (and maybe Japan) were annoyed that their input was not specifically requested before and during the discussion to proceed with Han unification. But I really don’t know enough about these scripts to have an opinion.
Regarding Greek letters, is this a common attack vector? What characters are most often used? From the Cyrillic set?
Every east Asian text encoding scheme does unification. The decision to unify was made by east Asian engineers.
Painting “East Asian engineers” as a unitary body here is doing a lot of lifting.
The vast majority of pre-Unicode encoding schemes were both under unique resource constraints (ASCII compat? Fixed / variable length encoding? National language policy?) and designed for specific domains.
But to wit: Big5 unified some characters, HKSCS then deunified them because they weren’t the same in Cantonese.
Apologies, that was not my intent. I meant that a lot of unification decisions have been made by engineers who are familiar with the issues, rather than, say, American engineers with little knowledge of Han characters.
Yep usually Cyrillic, but Greek’s two Os can work nicely.
I think Han unification was basically a good idea, but the problem is that it’s unclear where to draw the line. In fact, many Chinese characters that seem like they should be unified are separate in Unicode just because they are separate in JIS X 0208/0212/0213. Hooray for round-trip convertibility (sarcasm intended)!
Is this mentioned in the linked article? Or did you just get reminded of it because Unicode?
From what I understand, Asian people get it much worse: many Chinese, Japanese, and Korean logograms that are written very differently get assigned the same code point
The logograms are not “very different”. Any educated person would see them as the same.
Thanks. I searched the page for “Han” and “unification” and got no hits.
[Comment removed by author]
Here I thought I was some enlightened, internationalist programmer for iterating through codepoints instead of bytes in rust, and now I learn I’m a decade behind yet again. Good post!
I repeat my claim: emoji are a globalist conspiracy to make English speaking programmers care about Unicode.
Big Unicode is coming for your compilers!
mildly interesting: source code tends to be around 99.99% ASCII. That can come in handy when writing efficient parsers, because you get a cheap predictable branch.
For programming languages, the beautiful thing about UTF-8 is that it doesn’t matter if it’s UTF-8 or ASCII.
I’d go as far as to say that if you’re implementing an interpreter or compiler, UTF-8 is ASCII.
You handle bytes, and your lexer is still going to lex. {}[]()"" and all the other punctuation that languages use are always 1 byte in UTF-8.
This is the whole point of UTF-8 – the ASCII chars are never reused as part of larger ones – and it seems like a bunch of people I’ve conversed with lately don’t appreciate that.
So you just handle " and \ in your string literals, pass whatever else through as bytes, and now you have full Unicode support.
An exception that makes life hard is when you’re implementing a standardized language like JavaScript with UTF-16 legacy. JS source code may be UTF-16 encoded, and it is sent over the network.
But an interesting exception is that JSON cannot be UTF-16 – it must be UTF-8.
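To make the “UTF-8 is ASCII as far as a lexer is concerned” point concrete, here is a minimal Go sketch (the token names and toy grammar are made up for illustration): it dispatches on single bytes, and since every byte of a multibyte UTF-8 sequence is ≥ 0x80, non-ASCII text inside identifiers and string literals passes through without ever being decoded.

package main

import "fmt"

// token is a toy token type for the sketch.
type token struct {
	kind string
	text string
}

func isDelim(b byte) bool {
	switch b {
	case ' ', '\t', '\n', '{', '}', '[', ']', '(', ')', '"':
		return true
	}
	return false
}

func lex(src string) []token {
	var toks []token
	i := 0
	for i < len(src) {
		b := src[i]
		switch {
		case b == ' ' || b == '\t' || b == '\n':
			i++ // skip ASCII whitespace
		case b == '{' || b == '}' || b == '[' || b == ']' || b == '(' || b == ')':
			toks = append(toks, token{"punct", string(b)})
			i++
		case b == '"':
			// String literal: only " and \ need special handling
			// (assumes the literal is terminated; error handling omitted).
			j := i + 1
			for j < len(src) && src[j] != '"' {
				if src[j] == '\\' {
					j++ // skip the escaped byte
				}
				j++
			}
			toks = append(toks, token{"string", src[i : j+1]})
			i = j + 1
		default:
			// Everything else, including multibyte UTF-8 sequences,
			// is swept into an identifier without being decoded.
			j := i
			for j < len(src) && !isDelim(src[j]) {
				j++
			}
			toks = append(toks, token{"ident", src[i:j]})
			i = j
		}
	}
	return toks
}

func main() {
	for _, t := range lex(`(café "héllo \" wörld" 日本語)`) {
		fmt.Printf("%-6s %q\n", t.kind, t.text)
	}
}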
Also I believe C and C++ have a whole bunch of Unicode legacy, because it was invented long before people knew how to encode it properly :-P (c.f. https://unascribed.com/b/2019-08-02-the-tragedy-of-ucs2.html )
Also, if you want to validate UTF-8, it’s <50 lines of code, and there’s also a nice utility that will do it for you:
https://manpages.debian.org/unstable/moreutils/isutf8.1.en.html
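In Go the check is already in the standard library, which gives a sense of how small the job is (a hand-rolled validator is a similar short loop over the byte patterns):

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	fmt.Println(utf8.ValidString("héllo, 世界"))         // true
	fmt.Println(utf8.ValidString("\xff\xfe not UTF-8")) // false: 0xFF can never appear in UTF-8
}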
BUT, you can still have UTF-8 identifiers, like Raku or Julia, without even validating UTF-8! There is no conflict between {}[]()"" and Unicode chars.
This fact is explained in the article, but I find that even when I point people to articles like this, they don’t get it. They say “no, we have to decode every code point to have proper unicode support!!!”
In particular this part of the article:
Wouldn’t UTF-32 be easier for everything?
NO.
UTF-32 is great for operating on code points. Indeed, if every code point is always 4 bytes, then strlen(s) == sizeof(s) / 4, substring(0, 3) == bytes[0, 12], etc.
The problem is, you don’t want to operate on code points. A code point is not a unit of writing; one code point is not always a single character. What you should be iterating on is called “extended grapheme clusters”, or graphemes for short.
Yes, too many people don’t understand this
Also see this discussion from May: Why does “👩🏾🌾” have a length of 7 in JavaScript?
https://lobste.rs/s/gqh9tt/why_does_farmer_emoji_have_length_7
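The arithmetic behind that 7, sketched in Go: the farmer is a single grapheme cluster made of four code points, three of which sit outside the BMP and therefore cost two UTF-16 code units each.

package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

func main() {
	// WOMAN + medium-dark skin tone modifier + zero-width joiner + SHEAF OF RICE
	farmer := "\U0001F469\U0001F3FE\u200D\U0001F33E"

	fmt.Println(utf8.RuneCountInString(farmer))    // 4 code points
	fmt.Println(len(utf16.Encode([]rune(farmer)))) // 7 UTF-16 code units, i.e. JS .length
	fmt.Println(len(farmer))                       // 15 UTF-8 bytes
}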
Really the issue is that the “edges” of the system need to understand unicode in a different way than the interior. The font renderer in the OS is a very different kind of program than a compiler or interpreter.
len() in bytes is generally a much more useful operation in 99.9% of programs, followed by display width of a string of grapheme clusters (which requires a database).
len() in grapheme clusters doesn’t seem that useful.
But len() in code points is a distant last for applications. It’s really only for Unicode algorithms that operate on code points, which you generally don’t write yourself, because they require a database. That is, you don’t write your own case folding, etc.
Go basically got it right – len() is in bytes, whereas string iteration is done in code points. len() is O(1) whereas iteration is O(n).
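For readers who haven’t seen it, that split looks like this: len is an O(1) byte count, and the range loop decodes code points on the fly, yielding the byte offset of each one rather than a “character index”.

package main

import "fmt"

func main() {
	s := "héllo"

	fmt.Println(len(s)) // 6: bytes, not characters

	for i, r := range s { // i is the byte offset where each code point starts
		fmt.Printf("byte %d: %c (U+%04X)\n", i, r, r)
	}
}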
This is… almost true. But if you claim to support UTF-8 identifiers yet don’t treat non-ASCII whitespace as whitespace, then your job isn’t done!
Most of the spaces look like typographic things that don’t apply to code:
https://jkorpela.fi/chars/spaces.html
I’d be interested if any non-Raku language actually respects non-ASCII whitespace. (omitting Raku because it’s “maximalist”)
Julia makes good use of Unicode, and languages like Mathematica and APL probably do. But I don’t see why they would need to support unicode whitespace.
Not that it’s hard to do it if you want to – it’s probably another 10 lines on top of UTF-8 decoding, which itself is like 50 lines.
What I’m really arguing against is the complexity of C/JavaScript source encodings, which arose before we agreed upon a good way to encode text.
JavaScript mostly uses Unicode character properties to define its lexical syntax, so whitespace is Zs plus a few explicitly enumerated characters. https://262.ecma-international.org/#sec-white-space
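A sketch of what such a rule costs a lexer in Go, using the standard Unicode tables; the exact set here just mirrors the “Zs plus a few explicitly listed characters” shape rather than any particular spec:

package main

import (
	"fmt"
	"unicode"
)

// isSourceSpace treats the Zs category plus a few listed characters as whitespace,
// in the style of the JS rule quoted above. Line terminators would be handled separately.
func isSourceSpace(r rune) bool {
	switch r {
	case '\t', '\v', '\f', '\uFEFF': // tab, vertical tab, form feed, BOM
		return true
	}
	return unicode.Is(unicode.Zs, r)
}

func main() {
	fmt.Println(isSourceSpace(' '))      // true
	fmt.Println(isSourceSpace('\u2002')) // true: EN SPACE is in Zs
	fmt.Println(isSourceSpace('A'))      // false
}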
Haskell, Julia, Clojure, and Racket support treating EN SPACE as whitespace at least.
Anyway my point wasn’t that every language should do this. I’m just saying you shouldn’t claim to have “full” Unicode support and then be like “well, except for the parts of Unicode we decided not to do”.
Yeah that’s a fair point
I would modify it to say “full Unicode” is a questionable requirement.
I think many languages probably have it by accident – Clojure might use JVM libraries to detect space, and languages implemented in C might use isspace() in their lexer, which I think is locale dependent
I dislike those functions because they rely on mutable, global, system-wide variables, which I’ve been bitten by in the past
There COULD be real user requirements for unicode space, but I haven’t seen them yet. Perhaps in some settings, you don’t want the spaces to be part of unicode variable names
I guess I don’t like having code “just in case” – especially code which is not tested
(Reminds me of a similar discussion about hashing floats – I got at least 5 replies across lobste.rs and reddit, and most use cases boiled down to hashing ints, and the others didn’t convince me)
–
Notably JSON supports only ASCII spaces - https://www.json.org/json-en.html
Never heard anyone complain about that!
Not in my experience. Source code often includes human-readable strings, which often include non-ASCII characters.
That probably depends a lot on the source language. In C/C++, you need to be incredibly careful embedding non-ASCII strings because GCC hates you (it will interpret them as the current locale and convert them, so you may get a corrupted binary if someone with a non-C, non-UTF-8 locale compiles your code). Unicode in comments is also not allowed. All identifiers will be ASCII, so you can represent identifiers as a compressed index + offset in the source and get their length from the displacement. Comments and strings need special handling anyway, so just storing an ‘is this ASCII’ bit lets you fall back to handling other encodings in the less common cases (or totally ignoring it for comments if you’re compiling, because the compiler doesn’t read comments).
C23 fortunately fixes the string literal issue.
More than 0.01% characters?
“what encoding is this file” is the generally much more useful question than “what encoding is this character”
This is an excellent article! I’ve been looking for a succinct, well written introductory reference like this. Will come in handy in future code reviews. Thanks :)
But:
So how stable is this normalization? Can I persist a normalized string, then upgrade ICU and trust that it will normalize the same way?
Mostly. If you have a source string S and normalize it with two different unicode versions A and B then the two normalisations will be identical if S contains only codepoints that are assigned in both A and B (with a few very rare exceptions).
See https://lobste.rs/s/bkavdb/unicode_overview#c_z7adlg
No, you should persist the string as it was entered, and normalize at the point of use as required.
Are you sure? I thought the normalization forms were well-defined and standardized.
With this in mind, I would normalize at ingress because that makes all following code much simpler and removes a huge surface for bugs.
But now I’m curious as to how the normalization forms are inconsistent
It’s safe to store normalized text provided it contains no unassigned codepoints. https://www.unicode.org/faq/normalization.html#15
I don’t know if the usual NF* implementations ensure this: for instance, I can’t see anything in the ICU documentation that explains how it deals with unassigned codepoints. https://unicode-org.github.io/icu-docs/apidoc/released/icu4c/unorm2_8h.html
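A sketch of the “normalize at ingress” approach with golang.org/x/text/unicode/norm (IsNormalString and the NFC transform are real functions in that package; the policy itself is just one reasonable choice, and the stability caveat about unassigned code points above still applies):

package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

// normalizeInput stores and compares everything in NFC.
func normalizeInput(s string) string {
	if norm.NFC.IsNormalString(s) {
		return s // already normalized, nothing to do
	}
	return norm.NFC.String(s)
}

func main() {
	a := normalizeInput("caf\u00e9")  // precomposed é
	b := normalizeInput("cafe\u0301") // e + combining acute accent

	fmt.Println(a == b) // true once both sides are in NFC
}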