It may seem that “word-wrapping” or “line-breaking” — the process of breaking a section of text into lines such that it will fit in the available width of a display area — is a simple thing to do: Find the last space character in a line of text and break after it.
If you’re dealing with simple English text, that is, all ASCII characters, this will work reasonably well. But as soon as other languages or special characters like emojis need to be supported, you are dealing with the full complexity of Unicode. And that’s a whole different ball game.
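To make the naive approach concrete, here is a minimal sketch of it in Go. The function name naiveWrap and the choice to measure width in bytes are assumptions made for illustration. It treats every byte as one column, which only holds for ASCII text.

```go
package main

import (
	"fmt"
	"strings"
)

// naiveWrap breaks text into lines of at most width bytes, breaking after the
// last space that fits. It assumes one byte equals one column (ASCII only).
func naiveWrap(text string, width int) []string {
	if width < 1 {
		return []string{text}
	}
	var lines []string
	for len(text) > width {
		// Look for the last space within the first width bytes.
		cut := strings.LastIndexByte(text[:width], ' ')
		if cut < 0 {
			// No space to break at: force a break at the width limit.
			// On non-ASCII text this can split a UTF-8 sequence.
			cut = width - 1
		}
		lines = append(lines, text[:cut+1])
		text = text[cut+1:]
	}
	if len(text) > 0 {
		lines = append(lines, text)
	}
	return lines
}

func main() {
	for _, line := range naiveWrap("The quick brown fox jumps over the lazy dog", 16) {
		fmt.Printf("%q\n", line)
	}
}
```

On ASCII input this produces sensible lines. With multi-byte characters, the byte count no longer matches what is shown on screen, and the forced break can land in the middle of a character.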
Falsehoods Go Programmers Believe About Strings
It’s easy to think that reading the Go team’s blog post about strings will teach you everything you need to know. But I’d say it’s a good start at best. The rabbit hole starts with this sentence:
In general, a character may be represented by a number of different sequences of code points, and therefore different sequences of UTF-8 bytes.
A “code point” is called a “rune” in Go. And the important thing to know is this:
A character can be represented by more than one rune.
One of my favourite examples is the rainbow flag emoji: 🏳️‍🌈. Everyone would agree that this is one character. But it is made up of 14 bytes! In Go, len gives you the number of bytes in a string, thus len("🏳️‍🌈") == 14. What about runes, i.e. code points? How many runes make up this one rainbow emoji character? The answer is 4: len([]rune("🏳️‍🌈")) == 4.
So now you’re in trouble. If you need to decompose a string into characters, or simply count the number of characters in a string, you can’t just count the number of runes. The standard library doesn’t give you any functions to do this. For the rainbow flag, utf8.RuneCountInString will give you 4, not 1.
Luckily, the Unicode specification describes in detail what makes up a character (they call it a “grapheme cluster”) and provides an algorithm for breaking a string into them. I’ve written about this before while announcing my Go library github.com/rivo/uniseg, which implements this algorithm.
Breaking! There’s an Algorithm for That!
In addition to determining grapheme cluster boundaries, Unicode Standard Annex #29 also comes with instructions on how to determine word boundaries as well as sentence boundaries. Very useful if you want to select words (or sentences) in a text. Again, all of this while considering all those special characters and non-English languages.
But what we wanted in the first place is to break overflowing text onto the next line. You may have some luck in trying to use the word boundary algorithm, but it’s not really designed for that and can lead to unexpected results. Luckily, there is Unicode Standard Annex #14, which describes the “Unicode Line Breaking Algorithm”. Line breaking, also referred to as word wrapping, determines the positions in a string where a line break must occur (e.g. after newline characters) and where a line break may occur.
As with grapheme, word, and sentence breaking, there is a lot of code point classification and state handling going on. The Unicode Consortium provides all the tables needed to classify code points, the rules are described in detail, and there is even a test suite to verify your implementation.
I spent a good amount of time in the summer of 2022 implementing the full specification (UAX #29 and UAX #14) and adding it to github.com/rivo/uniseg. Here’s a quick example of how to use it:
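// Assumes the imports "fmt" and "github.com/rivo/uniseg".
str := `This code wraps words and breaks lines.
Use at your own risk!`
state := -1
var (
	c          string
	boundaries int
)
for len(str) > 0 {
	c, str, boundaries, state = uniseg.StepString(str, state)
	// c is the next character (grapheme cluster); str is the remaining text.
	// boundaries describes what follows c:
	// boundaries&uniseg.MaskLine == uniseg.LineCanBreak (optional line break)
	// boundaries&uniseg.MaskLine == uniseg.LineMustBreak (mandatory line break)
	// boundaries&uniseg.MaskWord != 0 (word boundary)
	// boundaries&uniseg.MaskSentence != 0 (sentence boundary)
	fmt.Printf("%q mustBreak=%t\n", c, boundaries&uniseg.MaskLine == uniseg.LineMustBreak)
}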
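Once you know where a line may break and where it must, the wrapping itself can be done greedily. The following is a minimal sketch under stated assumptions, not how uniseg or tview do it: the types breakKind and segment and the functions segmentsOf and wrap are hypothetical, and segmentsOf is a crude stand-in that only knows about spaces and newlines. It is not the Unicode Line Breaking Algorithm. The point is only to show how the three break outcomes drive a wrapper.

```go
package main

import (
	"fmt"
	"strings"
)

// breakKind is a hypothetical stand-in for the value of
// boundaries&uniseg.MaskLine: no break, optional break, mandatory break.
type breakKind int

const (
	dontBreak breakKind = iota
	canBreak
	mustBreak
)

// segment is one character plus the kind of line break allowed after it.
type segment struct {
	text  string
	after breakKind
}

// segmentsOf is a crude classifier used only to feed the demo: one rune per
// segment, optional breaks after spaces, mandatory breaks after newlines.
func segmentsOf(s string) []segment {
	var segs []segment
	for _, r := range s {
		seg := segment{text: string(r)}
		switch r {
		case ' ':
			seg.after = canBreak
		case '\n':
			seg.after = mustBreak
		}
		segs = append(segs, seg)
	}
	return segs
}

// wrap greedily fills lines of at most width segments, preferring the last
// optional break on the line and forcing a break if there is none.
func wrap(segs []segment, width int) []string {
	if width < 1 {
		width = 1
	}
	var lines []string
	emit := func(from, to int) {
		var b strings.Builder
		for _, s := range segs[from:to] {
			b.WriteString(s.text)
		}
		lines = append(lines, b.String())
	}
	start, lastBreak := 0, -1 // current line start; end of last optional break
	for i, s := range segs {
		if i-start >= width { // segment i no longer fits on this line
			end := i
			if lastBreak > start {
				end = lastBreak
			}
			emit(start, end)
			start, lastBreak = end, -1
		}
		switch s.after {
		case mustBreak:
			emit(start, i+1)
			start, lastBreak = i+1, -1
		case canBreak:
			lastBreak = i + 1
		}
	}
	if start < len(segs) {
		emit(start, len(segs))
	}
	return lines
}

func main() {
	text := "This code wraps words.\nUse it!"
	for _, line := range wrap(segmentsOf(text), 10) {
		fmt.Printf("%q\n", line)
	}
}
```

With uniseg, the after value of each segment would come from boundaries&uniseg.MaskLine and each segment would be a full grapheme cluster. Note that this sketch counts trailing spaces towards the line width and measures width in segments rather than screen columns, both of which a real implementation would want to handle differently.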
Roll Your Own Text Editor
Or maybe don’t. There are lots of good text editors out there already. But sometimes, you cannot avoid having to program your own editor, as was the case for me when I had to implement the TextArea widget for my terminal UI library github.com/rivo/tview.
So this is what triggered my deep dive into the world of Unicode text segmentation. I guess I wouldn’t have done this just for fun. But now it’s out there, and if you find yourself having to implement word wrapping, grapheme counting, word finding, or sentence identifying, you may find that github.com/rivo/uniseg proves useful to you.
A Final Note
One thing that’s also needed in some cases is to determine how wide a character is on a terminal screen or when using a monospace font. That’s because not all characters have a width of 1. But this is a topic for a future blog post. So stay tuned!