Using Unicode in LiveCode: A Primer

by Devin Asay
Brigham Young University

If you have ever tried to create stacks in a language other than English and the more common West European languages, you have run into the problem of how to produce all the character glyphs that the language requires. Of course, this is a problem in any programming environment, not just in LiveCode. Fortunately, Unicode is there to help us out. The bad news is that using Unicode is not always as straightforward as we would like. The good news is that LiveCode handles Unicode just fine in almost all circumstances. Just by learning a few tips and tricks you can be on your way to multilingual bliss in no time!

A Few Basic Character Encoding Concepts

Before we get into Unicode itself, we need to understand the basics of how characters are represented in computer memory and on the screen. As you know, at the most basic level computers are nothing more than very powerful number-crunching machines. The computer's Central Processing Unit, or CPU, is composed of millions of tiny transistors, each of which can be set to one of two states--off or on. Because of this binary nature, CPUs manipulate numbers using base 2, or binary, arithmetic; in other words, all numbers are represented by 0s and 1s. That means that any time you press a key on your keyboard you are ultimately sending a binary code to the computer's CPU.

Each 0 or 1 stored in a computer's memory is known as a bit. Computers handle and store data in chunks of 8 bits called bytes. The earliest micro-computers used 8-bit processors--CPUs that could process one 8-bit byte of data at a time. In the early days of computing, one bit in every byte was used for internal "housekeeping" purposes, leaving only 7 bits available for storing data. For these and other historical reasons, characters of text could only be represented by a maximum of 7 bits.

Because computers needed to be able to pass text data to one another, in the early 1960s the American Standards Association developed a standard character encoding system known as ASCII--the American Standard Code for Information Interchange. Due to the 7-bit limitation, this standard only specified 128 (2⁷) code points, numbered from 0 to 127. Upper and lower case Latin letters, numbers and common punctuation were assigned unique codes in this table. For instance, upper-case A is ASCII 65, the numeral 3 is ASCII 51, a space character is ASCII 32, and so forth. This standard is still widely used today and is consistent across all operating systems.
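The code values in the ASCII table can be checked in any language that exposes character codes. The following is a minimal sketch in Python (not LiveCode), used only because it is easy to run; the variable names are illustrative.

```python
# Number of code points available with 7 bits.
ascii_size = 2 ** 7

# Look up the ASCII code of a lower-case letter, a numeral and a space.
codes = {ch: ord(ch) for ch in ("a", "3", " ")}

# Go back from a code to its character.
letter = chr(97)

print(ascii_size, codes, letter)
```

The same two operations, character to number and number to character, are what LiveCode's charToNum() and numToChar() functions provide.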

In the 1970s and 1980s several things happened that brought increasing complexity and inconsistency to the original ASCII standard. More efficient microprocessor designs freed up the 8th bit in the byte so that the size of the character encoding table could be doubled to 256 (2⁸) code points. However, at the same time computer use outside of the U.S. and Europe skyrocketed, while intense competition emerged between various operating systems. This meant that scores of different uses for the upper 128 code points emerged. IBM, Apple Computer, and Microsoft each developed a standard character mapping for what is sometimes called "extended ASCII". The Microsoft version became the most commonly used; it is closely related to the ISO-8859-1 (Latin 1) character set. Apple's extended ASCII, known as Mac Roman, assigns the upper 128 code points quite differently.

At the same time, people who needed to express non-Latin character sets created scores of fonts that mapped the upper 128 code points to other alphabets, such as Cyrillic, Hebrew, Greek, or East European Latin alphabets. To further confuse matters, character-based writing systems such as Chinese and Japanese could not be expressed in only 256 code points, so systems were devised in which pairs of bytes were combined and mapped to large character lookup tables that could handle the thousands of characters needed for these writing systems.

During the 1980s and 1990s it was a common occurrence for people exchanging documents electronically internationally to find that what started as, say, Cyrillic text on one end came out as hopeless gobbledygook on the other end, because the recipient didn't have a compatible character font. Even relatively common punctuation and typographic characters, such as European currency symbols, "curled" quotes, and dashes could be rendered incorrectly on the recipient's system, due to incompatible encoding schemes between the sender and recipient systems.
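To make the "gobbledygook" problem concrete, here is a small sketch in Python (not LiveCode). It assumes, purely for illustration, that the sender used the Windows-1251 Cyrillic code page and that the recipient's system interpreted the same bytes as Latin 1; neither encoding is named by the scenario above.

```python
original = "Я люблю тебя"

# The sender stores the text in a Cyrillic code page (Windows-1251 assumed).
raw_bytes = original.encode("cp1251")

# The recipient wrongly assumes the bytes are Latin 1.
garbled = raw_bytes.decode("latin-1")

# Reinterpreting the same bytes with the right table recovers the text.
recovered = garbled.encode("latin-1").decode("cp1251")

print(garbled)
print(recovered)
```

Nothing is lost in transit here: the bytes are identical on both ends. Only the lookup table used to turn them into characters differs.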

Unicode to the Rescue

In the late 1980s the effort that became the Unicode Consortium emerged to try to address this confusing situation. The character encoding standard that resulted was designed to provide a way to display all of the world's languages by using a larger, 16-bit character table. The goal in Unicode was to assign each character in all the world's languages a unique code number. The Unicode Consortium's credo is:

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

The road to the Unicode utopia, however, has been long and arduous, and even today many of the most common computing applications rely on ASCII text for storage and transmission of text data. LiveCode is one of these, but fortunately for us, it also supports Unicode and provides us reasonably robust, if not 100% complete, support for Unicode.

In order to understand how to use Unicode text in LiveCode, it is important to be familiar with a few terms and concepts. Unlike ASCII, Unicode is much more than simply a collection of character codes. It also defines things like sort order, writing direction, when a character is represented by a specific glyph, and much more. Therefore implementing Unicode is much more complex than implementing ASCII. You can read more about how LiveCode implements Unicode in section 6.4 of the LiveCode User Guide.

Characters vs. Glyphs

The old ASCII standard is simple and reliable. Because of the restricted technical and language environment in which it operates, there is a simple, one-to-one relationship between the character codes stored in your computer's memory and the actual letter shapes drawn on your screen. However, Unicode requires much greater flexibility, because in many languages there is no simple correspondence between a single unit of text with a specific semantic identity and the way that unit is visually represented. In Unicode this difference is represented by the terms 'character' and 'glyph.' A character is an abstract concept, referring to the single semantic unit, while a glyph is the visual representation of a character in a given environment.

Unicode and LiveCode

With this background we can finally talk about Unicode as it is implemented in LiveCode. Under the Unicode standard, there are several encoding systems, most notably UTF-8, UTF-16, and UTF-32. As the names imply, they store text in units of 8, 16, and 32 bits respectively. LiveCode uses the UTF-16 encoding. However, LiveCode has the ability to transcode between UTF-16 and several other common encodings. Thus, the first Important Thing to remember when using Unicode in LiveCode is:

1. Unicode in LiveCode is always UTF-16.
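The difference between the three encodings is easiest to see by counting bytes. The sketch below is Python, not LiveCode, and the helper name encoded_sizes is hypothetical. It reports how many bytes one character occupies in each encoding; the "-le" codec variants are used only so that Python does not prepend a byte order mark to the result.

```python
def encoded_sizes(ch):
    """Return the number of bytes ch occupies in UTF-8, UTF-16 and UTF-32."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return tuple(len(ch.encode(enc)) for enc in encodings)

# A Latin letter, a Cyrillic letter and a Chinese character.
for ch in ("a", "Я", "開"):
    print(ch, encoded_sizes(ch))
```

For these three characters UTF-16 always uses two bytes, while UTF-8 uses one, two, and three bytes respectively.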

Another complication comes from the fact that different CPUs store sequences of bytes in different orders. Since Unicode characters typically are made up of two bytes each (often referred to as double-byte characters), the order in which the bytes of a Unicode character are stored can be different when comparing, say, PowerPC (PPC) processors with Intel processors. Motorola and PPC processors store the most significant byte first in the sequence and so are called "big endian"; Intel processors store the most significant byte last in the sequence and are called "little endian". The details of why and how this works aren't important here, but what you need to remember is Important Thing number 2:

2. LiveCode uses the Unicode byte order determined by the host processor.

What does this mean for you as a developer? Let's consider the case of two users. User 1 works on an older Mac system that runs on a PPC processor. User 2 works in Windows on an Intel processor. If User 1 creates Unicode text, saves it to a file, and sends it to User 2, then when User 2 tries to read the file it will come out scrambled because it came from a big-endian system. Even though it is possible to convert big-endian Unicode to little-endian, it adds pain, complexity and uncertainty. That's why I recommend Corollary 1 to Important Thing number 2:

2.1. Use UTF-8 to store and transfer Unicode text in LiveCode.

It's not hard, and I explain how to do this later on.
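For readers who want to see the byte order problem at the byte level, here is a short sketch in Python (not LiveCode). It encodes the Russian letter 'Я' in both byte orders, then reads the big-endian bytes as if they were little-endian, which is roughly what happens in the two-user scenario above.

```python
letter = "Я"  # code point 1071, hexadecimal 042F

big_endian = letter.encode("utf-16-be")     # most significant byte first
little_endian = letter.encode("utf-16-le")  # most significant byte last

# Reading big-endian bytes as little-endian yields a different character.
misread = big_endian.decode("utf-16-le")

# UTF-8 is a plain sequence of bytes, so there is no byte order to get wrong.
utf8_bytes = letter.encode("utf-8")

print(big_endian.hex(), little_endian.hex(), hex(ord(misread)), utf8_bytes.hex())
```

The two UTF-16 forms contain the same two bytes in opposite order, while the UTF-8 form is the same on every processor.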

Tips for Using Unicode in LiveCode

Typing Unicode text in fields.

This is a good place to start because it's the easiest. LiveCode fields can handle Unicode text input without any intervention by the developer. That is because LiveCode simply uses the text input methods supplied by the host operating system. So if you want to type Japanese characters into a field, you simply select the Japanese text input system you want to use and start typing. LiveCode knows how to render it properly in the field, and it is then ready for use. Bottom line: if you want to learn how to select the text input method on your OS, see the help documentation for that OS.

However, Unicode text input in LiveCode is not perfect. LiveCode still has trouble rendering right-to-left languages like Hebrew and Arabic while you are typing them. Specifically, it will properly render characters in a word from right to left, but when you type a space to begin a new word, the new word is inserted to the right of the previous word, not to the left as it should be. For this reason I recommend creating Hebrew and Arabic texts outside of LiveCode and importing them, rather than trying to type them within LiveCode.

Examining numeric codes for characters.

By default text in LiveCode is ASCII text. So let's first look at some of the ways LiveCode provides for working with ASCII text encoding. We're all familiar with the rich collection of tools that LiveCode provides for working with text. Among them are two functions, charToNum() and numToChar(), that allow us to work with the ASCII value for any character. They work like this (try it in LiveCode's message box):

put charToNum("a") -- returns 97
put numToChar(97) -- returns the letter 'a'

In fact, you can use the numToChar() function to create a rudimentary ASCII table. Just create a new field, name it "ascii" and run this routine:

  put empty into fld "ascii"
  repeat with i = 0 to 255
    put i & tab & numToChar(i) & cr after fld "ascii"
  end repeat

That's how these two functions work by default. But you can tell LiveCode to expect Unicode values for these two functions by first setting the useUnicode property to true.

This brings up Important Thing number 3:

3. The useUnicode property only affects the charToNum() and numToChar() functions.

There is a common misconception among LiveCode developers who are new to Unicode that the useUnicode property is a kind of magic switch that will automatically change all of your text operations into Unicode. It's not. In fact, the useUnicode property might be more accurately named the useTwoByteCharsWithCharToNumAndNumToChar property. You can see why they went with useUnicode.

Let's look at how this works in practice. Let's say you have a field "russText" containing the sentence Я люблю тебя. The sentence begins with the upper case Russian letter 'Я'. If you wanted to find out which Unicode code point corresponds to that letter you would do this:

set the useUnicode to true
put charToNum(char 1 to 2 of fld "russText") -- returns 1071

Conversely, to render a Unicode character using its code point do this:

set the useUnicode to true
set the unicodeText of fld "russLetter" to numToChar(1071) -- the letter 'Я' should appear in the field
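As a cross-check outside LiveCode, the same kind of table that was built for ASCII above can be built for a range of Unicode code points. This sketch is Python, not LiveCode script, and relies on the fact that the upper-case Russian letters А to Я occupy code points 1040 through 1071.

```python
# Build one "code point <tab> character" row per upper-case Russian letter.
rows = [f"{i}\t{chr(i)}" for i in range(1040, 1072)]
table = "\n".join(rows)

print(table)
```

The last row of the table pairs 1071 with 'Я', matching the value returned by charToNum() in the example above.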

The unicodeText property.

The previous example is a good way to introduce another important tool for using Unicode in LiveCode: the unicodeText property. If you want to move unicode text from field to field, you have to use this property. In the normal ASCII world you can just do this:

put field 1 into field 2

But in the brave new Unicode world if you want to put Unicode text into a field you have to set its unicodeText property:

set the unicodeText of fld "newPlace" to the unicodeText of fld "oldPlace"

This all leads to Important Thing number 4:

4. The secret to manipulating Unicode text in fields lies in the unicodeText of the field.

So if you want to move chunks of text, you have to refer to chunks of the unicodeText:

-- Copying a Unicode character to another field
set the unicodeText of fld "letter" to char 1 to 2 of the unicodeText of fld "sentence"

-- Moving words
set the unicodeText of fld "other" to \
  word 1 to 2 of the unicodeText of fld "this"

-- Inserting Unicode text from one field into another
get the unicodeText of fld "info"
set the unicodeText of fld "info" to \
  it && word 2 of line 2 of the unicodeText of fld "bottom"

Converting between single and double-byte encodings

When using Unicode text, especially if you are importing or exporting text from or to other systems or environments, you may need to convert your Unicode to a single-byte encoding system, or vice-versa. The most common reason for doing this is reading and writing UTF-8 files. As I mentioned above, I recommend storing your Unicode text in UTF-8 format if you are planning to share it with others or send it over the internet. UTF-8 is part of the Unicode standard, and is a way to store Unicode text as a sequence of single bytes: plain ASCII characters keep their one-byte ASCII codes, and every other character is written as two or more bytes. UTF-8 is especially important for encoding Unicode text for use in web browsers and email.

The keys to using UTF-8 text in LiveCode are the uniEncode() and uniDecode() functions. Let's say you've gotten some UTF-8 text from a web site and you want to display it in your LiveCode stack. You store it in a file called myUniText.ut8. This is how you would read it in:

put url ("binfile:/path/to/file/myUniText.ut8") into tRawTxt
set the unicodeText of fld "display" to uniEncode(tRawTxt,"UTF8")

Conversely, to save Unicode text from LiveCode to a UTF-8 file, use uniDecode():

get the unicodeText of fld "myUniText"
put uniDecode(it,"UTF8") into url "binfile:/path/to/file/myUniFile.ut8"

Here's Important Thing number 5:

5. For reliably transporting Unicode text, convert it and store it as UTF-8 text.
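The same save-and-load round trip can be sketched outside LiveCode to show what ends up on disk. The example below is Python, not LiveCode script. The file is created in a temporary directory chosen for the demonstration, and the variable names are illustrative.

```python
import os
import tempfile

text = "Я люблю тебя"

# Hypothetical file location inside a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "myUniText.ut8")

# Save: convert the text to UTF-8 bytes and write them as binary data.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Load: read the raw bytes back and decode them from UTF-8.
with open(path, "rb") as f:
    raw = f.read()
restored = raw.decode("utf-8")

print(restored == text, len(raw))
```

As in the LiveCode examples, the file is read and written as binary data so that nothing alters the bytes on the way in or out.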

What about Unicode in buttons and menus?

So far, we've only been talking about Unicode text in fields. Almost none of that applies to buttons, primarily because buttons have no unicodeText property. Instead, the basic approach for displaying Unicode text in buttons and menus consists of two steps:

Set the textFont of the button to a Unicode font;
Set the label of the button to the desired Unicode text.

Unicode font names in LiveCode take the form Font Name,language, where Font Name is the name of any font installed on the system, and language is the name of the language you want, or the term "unicode". For example, for Russian Cyrillic text I might use "Arial,Russian" as the font name; for Japanese, "Osaka,Japanese"; and for Greek, "Geneva,Unicode". Not every language can be used as the second part of a Unicode font name. For a complete list of valid language names see the LiveCode Dictionary entry for uniEncode.

One way to assign a Unicode label to a button is to reference some existing Unicode text in a hidden field. Let's say, for example, that we are making a stack for Mandarin Chinese speakers and we want to give our Start button a Chinese label, 開始. We could type or import the Unicode text to a field and use that field as the source text for the button label:

set the textFont of button "start" to "BiauKai,Chinese"
set the label of button "start" to the unicodeText of fld "hiddenChinText"

One technique I often use for creating Unicode button labels is to store the Unicode label text in a custom property of the button. When I do this I store it as UTF-8 text to avoid the byte order problem when moving the stack from machine to machine. So first I would store the unicode text in a custom property:

set the chinLabel of button "start" \
  to unidecode(the unicodeText of fld "hiddenChinText","UTF8")

Once that was in place I would use the custom property as the source of the Unicode text:

set the textFont of button "start" to "BiauKai,Chinese"
set the label of button "start" \
  to uniencode(the chinLabel of btn "start","UTF8")

One more note on Unicode buttons: Because Unicode text doesn't always "travel" well from platform to platform, I usually set Unicode button labels and menu contents each time I go to the card, in a preOpenCard handler.

Unicode Ask and Answer dialogs

Ask and answer dialogs can have Unicode prompts, but you can't pass Unicode text in the ask and answer command arguments. Instead you use another handy technique for setting Unicode text: store the Unicode as entities in HTML text. Storing the htmlText of a field that contains Unicode text is another reliable way of keeping the Unicode text intact during transfers. It also is the only way to display Unicode text in ask and answer dialog prompts.

To see what I mean, let's look at the Chinese start button example above. In the first case I had the Unicode Chinese text 開始 in a text field "hiddenChinText". If I were to examine the htmlText of this field it would look something like this:

<font face="BiauKai" size="14" lang="zh-TW">&#38283;&#22987;</font>

Notice that the two Chinese characters are embedded in the htmlText as Unicode entities: &#38283; and &#22987;. HTML Unicode entities like this will reliably render as the proper Unicode characters in LiveCode, regardless of the operating system the stack is running on. So to use Unicode characters in ask and answer prompts, do something like this:

put the htmlText of fld "hiddenChinText" into tChinPrompt
answer tChinPrompt with "Cancel" or "OK"

There is one other advantage of saving Unicode text as HTML entities—it is the best way to save Unicode text with text styles like bold and italic and font attributes like size and color.
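The numbers inside the entities are simply the decimal Unicode code points of the characters. The following sketch is Python, not LiveCode, and the helper name to_entities is hypothetical. It converts the Chinese label to decimal entities and back again. It is only an illustration: it does not escape characters such as & or <, which a complete HTML encoder would need to handle.

```python
import html

def to_entities(text):
    """Encode every non-ASCII character as a decimal HTML entity."""
    return "".join(ch if ord(ch) < 128 else f"&#{ord(ch)};" for ch in text)

encoded = to_entities("開始")
decoded = html.unescape(encoded)

print(encoded, decoded)
```

Because the encoded form contains nothing but ASCII characters, it can pass through systems that know nothing about Unicode without being damaged.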

Unicode stack title

I'll finish up this primer by mentioning one Unicode feature that is new in LiveCode as of version 2.9: the ability to use Unicode text for the title of the stack window. Just set the unicodeTitle property of the stack to a valid Unicode string. Here's an example:

set the unicodeTitle of this stack to the unicodeText of fld "russTitle"

Conclusion

In describing the state of LiveCode's Unicode implementation, I would say this: Unicode in LiveCode is not perfect, but it is perfectly usable. If you master the basic concepts I've described here, and remember the Important Things I have listed, you will have the tools you need to diagnose and solve just about any problem that arises, and you'll be on your way to being able to produce LiveCode applications for almost any language.

References

Gillam, Richard. Unicode Demystified. Addison-Wesley, 2003.

LiveCode User Guide, section 6.4. Runtime Revolution, Ltd., 2008.

Unicode Consortium Web Site, http://www.unicode.org.

A stack with all of the examples in this article, along with many others, can be accessed at http://livecode.byu.edu/unicode/UnicodeInRev.rev.

Notes on Unicode in LiveCode, including expanded descriptions of LiveCode Unicode language elements.
