You would think that maintaining a mildly popular package such as tview
would consist mostly of adding cool new features, maybe with the odd bug fix sprinkled in. You would be wrong. If I had to guess, I would say 70% of my work on it was dedicated to making Unicode characters work. It’s all very simple when you only have to support the English language. But with the first requests for Chinese, Thai, Arabic, or emoji support, I realized I was in for a lot of trouble. (And I dread the day when users ask for Hebrew, which is written from right to left.)
One of the main issues when dealing with Unicode is determining what actually constitutes a character. In the English-speaking world, one character is one byte:
String | Bytes |
---|---|
“Hello” | 48 65 6c 6c 6f |
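In Go, the built-in len function counts bytes, so for pure ASCII text it also happens to be the number of characters:

```go
package main

import "fmt"

func main() {
	// For ASCII-only text, one byte is one character.
	fmt.Println(len("Hello")) // 5
}
```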
This dates back to the ASCII times, when 7 bits were all we needed to represent the characters used on most computer systems. Then Unicode came along and introduced far more characters than fit into 7 bits. Now we don’t just have bytes anymore but so-called “code points”. And UTF-8 is a backwards-compatible encoding that maps code points to bytes (and back). It looks like this (all numbers are hexadecimal):
String | Bytes | Code points |
---|---|---|
“Hello” | 48 65 6c 6c 6f | 48 65 6c 6c 6f |
“Hello😉” | 48 65 6c 6c 6f f0 9f 98 89 | 48 65 6c 6c 6f 1f609 |
The 0x1f609 code point represents the “winking face emoji”. Luckily, Go provides easy ways to translate strings (which are essentially read-only byte slices) into code points, called “runes”, either by using a for loop:
```go
for _, r := range "Hello😉" {
	fmt.Printf("%x ", r)
}
```
Or by simply converting a string:
```go
runes := []rune("Hello😉")
```
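Putting the two together, here is a small self-contained program that prints the code points and also shows the difference between byte length and rune count:

```go
package main

import "fmt"

func main() {
	s := "Hello😉"

	// Ranging over a string decodes it rune by rune.
	for _, r := range s {
		fmt.Printf("%x ", r) // 48 65 6c 6c 6f 1f609
	}
	fmt.Println()

	// Converting to []rune decodes the whole string at once.
	runes := []rune(s)
	fmt.Println(len(s), len(runes)) // 9 bytes, 6 runes
}
```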
You would think that you’re all set now, i.e. one code point = one character. But no, there are characters which consist of multiple code points. There are code points which don’t do much by themselves but modify the code point before them. One example is the German umlaut “ä”, which has its own code point 0xe4 (encoded as c3 a4 in UTF-8) but can also be composed of an “a” (0x61) and the two dots on top, the “combining diaeresis” (0x308). Two code points, one character:
String | Bytes | Code points |
---|---|---|
“ä” | c3 a4 | e4 |
“ä” | 61 cc 88 | 61 308 |
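Go will not silently convert between these two forms for you. If you need to, the golang.org/x/text/unicode/norm package can normalize a string to its composed (NFC) or decomposed (NFD) form; a quick sketch:

```go
package main

import (
	"fmt"

	"golang.org/x/text/unicode/norm"
)

func main() {
	composed := "ä" // The single code point 0xe4.

	// Decompose into "a" + combining diaeresis (0x61 + 0x308).
	decomposed := norm.NFD.String(composed)

	fmt.Println(len([]rune(composed)))   // 1 rune
	fmt.Println(len([]rune(decomposed))) // 2 runes, still one character
	
	// NFC recombines the two code points into one again.
	fmt.Println(norm.NFC.String(decomposed) == composed) // true
}
```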
Then there are zero-width joiners, regional indicators (for flags), and all kinds of special characters from languages such as Korean and Arabic:
String | Bytes | Code points |
---|---|---|
“🏳️‍🌈” | f0 9f 8f b3 ef b8 8f e2 80 8d f0 9f 8c 88 | 1f3f3 fe0f 200d 1f308 |
“🇩🇪” | f0 9f 87 a9 f0 9f 87 aa | 1f1e9 1f1ea |
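Counting runes therefore tells you very little about what the user actually sees. For the two strings above:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	flag := "🏳️‍🌈"  // One user-perceived character, four code points.
	germany := "🇩🇪" // One user-perceived character, two code points.

	fmt.Println(utf8.RuneCountInString(flag), len(flag))       // 4 14
	fmt.Println(utf8.RuneCountInString(germany), len(germany)) // 2 8
}
```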
Thus, simply combining a base code point with modifier code points no longer works here. Luckily, it turns out that Unicode defines rules for what constitutes a “user-perceived character”, and calls these “grapheme clusters”. The rules are defined in Unicode Standard Annex #29 (section 3.1.1) and provide all the data needed to split a string into “characters”.
I could not find a Go library that implements these rules, so I wrote a new one: github.com/rivo/uniseg. A “character” is now a slice of runes (or, alternatively, a string or byte slice):
```go
gr := uniseg.NewGraphemes("👍🏼!")
for gr.Next() {
	fmt.Printf("%x ", gr.Runes())
	// Alternatively, use gr.Str() or gr.Bytes().
}
```
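The iterator also knows where each cluster starts and ends in the original string, via its Positions method, which makes it easy to slice a string at character boundaries. Something along these lines truncates a string to its first two user-perceived characters:

```go
// Keep only the first two user-perceived characters.
s := "🏳️‍🌈🇩🇪!"
gr := uniseg.NewGraphemes(s)
end := 0
for i := 0; i < 2 && gr.Next(); i++ {
	_, end = gr.Positions() // Byte interval of the current cluster.
}
fmt.Println(s[:end]) // "🏳️‍🌈🇩🇪"
```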
A common task is to count the number of characters in a string. This can now be easily done as follows:
```go
fmt.Println(uniseg.GraphemeClusterCount("🏳️‍🌈🇩🇪")) // Outputs "2".
```
As an implementation detail, my package needs to classify each rune before it can apply the rules. This is done with a binary search on a fixed-size lookup table. The iteration itself is implemented as a finite state machine and runs in O(n). The package should be quite efficient, although it will of course never be as fast as simply counting bytes or code points.
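To give a rough idea of what that classification step looks like, here is a simplified sketch. The real package uses generated tables covering all of Unicode; the ranges and property names below are invented for illustration only:

```go
// A simplified illustration of rune classification: a binary search
// over a sorted table of code point ranges. Not the real tables.
type runeRange struct {
	lo, hi   rune
	property int
}

const (
	prAny = iota
	prExtend            // Combining marks such as 0x308.
	prRegionalIndicator // Flag letters such as 0x1f1e9.
)

// Must be sorted by lo and non-overlapping.
var table = []runeRange{
	{0x0300, 0x036f, prExtend},
	{0x1f1e6, 0x1f1ff, prRegionalIndicator},
}

func property(r rune) int {
	lo, hi := 0, len(table)-1
	for lo <= hi {
		mid := (lo + hi) / 2
		switch {
		case r < table[mid].lo:
			hi = mid - 1
		case r > table[mid].hi:
			lo = mid + 1
		default:
			return table[mid].property
		}
	}
	return prAny
}
```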
Annex #29 also describes the determination of word and sentence boundaries. I have no immediate need for this but I may still add it to the package in the future. Here is the link again:
https://github.com/rivo/uniseg
Update Sep 4, 2022: The uniseg package has received a major update and now includes detection of word and sentence boundaries, as well as line-breaking / word-wrapping. More on that here.
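As a quick taste of the new functionality, word segmentation now works roughly along these lines; see the package documentation for the exact, current API:

```go
// A sketch of word-boundary iteration with the updated package.
text := "Hello, world!"
state := -1
var word string
for len(text) > 0 {
	word, text, state = uniseg.FirstWordInString(text, state)
	fmt.Printf("%q ", word) // "Hello" "," " " "world" "!"
}
```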