Digital Humanities & Technology 310

Notes on Unicode in LiveCode

In the 1980s, work began on a single 16-bit (2-byte) multilingual character encoding system that could represent nearly all the characters used in the world's major languages. The resulting standard was called Unicode. The standard has since grown beyond 16 bits, but most text in everyday use still falls within the original 16-bit range.

What is Unicode?
(From the Unicode Consortium web site.)

Unicode provides a unique number for every character,
no matter what the platform,
no matter what the program,
no matter what the language.

Fundamentally, computers just deal with numbers. They store letters and other characters by assigning a number for each one. Before Unicode was invented, there were hundreds of different encoding systems for assigning these numbers. No single encoding could contain enough characters: for example, the European Union alone requires several different encodings to cover all its languages. Even for a single language like English no single encoding was adequate for all the letters, punctuation, and technical symbols in common use.

These encoding systems also conflict with one another. That is, two encodings can use the same number for two different characters, or use different numbers for the same character. Any given computer (especially servers) needs to support many different encodings; yet whenever data is passed between different encodings or platforms, that data always runs the risk of corruption.

Unicode is changing all that!

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. The Unicode Standard has been adopted by such industry leaders as Apple, HP, IBM, JustSystem, Microsoft, Oracle, SAP, Sun, Sybase, Unisys and many others. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646. It is supported in many operating systems, all modern browsers, and many other products. The emergence of the Unicode Standard, and the availability of tools supporting it, are among the most significant recent global software technology trends.

Incorporating Unicode into client-server or multi-tiered applications and websites offers significant cost savings over the use of legacy character sets. Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering. It allows data to be transported through many different systems without corruption.
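
The conflict between legacy encodings described above is easy to reproduce. What follows is a minimal sketch in Python, used here only because it makes raw bytes easy to inspect; it is not LiveCode script. The byte value and the two legacy encodings (Latin-1 and Windows-1251) are arbitrary choices made for illustration.

```python
# One byte value, read under two legacy single-byte encodings.
raw = bytes([0xE9])

as_latin1 = raw.decode("latin-1")   # Western European reading of the byte
as_cp1251 = raw.decode("cp1251")    # Cyrillic (Windows-1251) reading of the byte

# The same number stands for two different characters.
print(repr(as_latin1), repr(as_cp1251))

# Unicode assigns each of those characters its own number (code point).
code_points = {ch: hex(ord(ch)) for ch in (as_latin1, as_cp1251)}
print(code_points)
```

Because every character has its own code point, text that is stored as Unicode does not depend on the reader guessing which legacy table the writer had in mind.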

There is a good general introduction to Unicode that is not overly technical on the SIL International web site.

Do the exercise "Exploring Unicode".


Useful Information for Using Unicode in LiveCode

Definitions

ASCII
American Standard Code for Information Interchange.
character
A single text symbol: a letter, number, punctuation mark, or control character. Characters can be single-byte or double-byte (Unicode).
When using the word “character” in a chunk expression, single-byte characters are assumed.
Unicode
A standard for representing all characters from all known writing systems in a single character set. The Unicode specification defines three encoding forms: UTF-8 (one to four bytes per character), UTF-16 (two bytes for most characters, four bytes for the rest) and UTF-32 (always four bytes). UTF-16 is widely used inside applications and operating systems, and is the form implemented in LiveCode.
double-byte font
A font in which each character is represented by 2 bytes. Languages such as Japanese, Chinese, and Korean, which contain more symbols than can be represented by 256 code points, require double-byte character sets.
Double-byte fonts usually also contain a full complement of alphabetic characters occupying the first 256 positions in the font, so you can display Roman-alphabet text and non-Roman text using the same font.
double-byte character
A character from a character set that supports more than 256 characters; a Unicode character.
The numeric value of a double-byte Unicode character is between 0 and 65,535, for 65,536 possible values. 65,536 is 2^16, so it takes 16 bits (two bytes) to store such a character.
UTF-8
Unicode Transformation Format, 8-bit encoding form.
The UTF-8 encoding form was developed to work with existing software implementations that were designed for processing 8-bit text data. Most web pages that use Unicode text use UTF-8. Since LiveCode fields primarily use UTF-16, you must convert UTF-8 text to UTF-16 before displaying it in a LiveCode field.
UTF-16
Unicode Transformation Format, 16-bit encoding form.
The 16-bit (2 byte) implementation of Unicode most commonly supported by applications that support Unicode. It is the format represented by the unicodeText property of LiveCode fields.
byte order
The order in which the bytes of two- and four-byte characters are stored in the computer's memory. It is determined by the CPU. Intel processors store the most significant byte last ("little-endian"), whereas Motorola and IBM PowerPC processors store the most significant byte first ("big-endian"). Because this can be reflected in the way the data is stored in files, Unicode (UTF-16) data produced on one processor may not transfer properly to a system that uses a different byte order. This issue affects UTF-16 (and UTF-32), but not UTF-8, whose bytes always appear in the same order.
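
The definitions of the encoding forms and of byte order can be checked directly. The following is a minimal Python sketch, not LiveCode script; the four sample characters and the helper name encoded_sizes are assumptions made for illustration.

```python
# Sample characters: Latin A, e-acute, a Chinese character, and an emoji
# that lies outside the 16-bit range.
SAMPLES = ["A", "\u00e9", "\u4e2d", "\U0001F600"]

def encoded_sizes(ch):
    """Return how many bytes one character needs in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # An explicit byte order is named so that no byte order mark is added.
        "utf-16": len(ch.encode("utf-16-be")),
        "utf-32": len(ch.encode("utf-32-be")),
    }

for ch in SAMPLES:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# Byte order: one character, one encoding form, two possible byte sequences.
zhong = "\u4e2d"
big_endian = zhong.encode("utf-16-be")      # most significant byte first
little_endian = zhong.encode("utf-16-le")   # most significant byte last
print(big_endian.hex(), little_endian.hex())

# UTF-8 is a sequence of single bytes, so it has no byte-order variants.
utf8 = zhong.encode("utf-8")
print(utf8.hex())
```

The emoji shows why "two bytes per character" holds only for the first 65,536 (2^16) code points: characters beyond that range take four bytes in UTF-16.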

Unicode fonts commonly included in Mac and Windows systems

Mac
see http://www.alanwood.net/unicode/fonts_macosx.html for an exhaustive list.
Lucida Grande (Latin, Cyrillic, Greek...)
Fang Song (Chinese)
Times New Roman
(Can use Windows Unicode fonts)

Windows
see http://www.alanwood.net/unicode/fonts.html for an exhaustive list.
Arial
Lucida family (Latin, Cyrillic, Greek...)
Tahoma
MS Hei, MS Song (Chinese)
Times New Roman

How Tos and Abouts

From “About chunk expressions”:
Important!
Characters in chunk expressions are assumed to be single-byte characters. To successfully use chunk expressions with Unicode (double-byte) text, you must treat each double-byte character as a set of two single-byte characters. For example, to get the numeric value of the third Unicode character in a field, use statements like the following:

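set the useUnicode to true -- charToNum now treats its argument as one double-byte character
get charToNum(char 5 to 6 of field "Chinese Text") -- bytes 5 and 6 hold the third character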
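
The arithmetic behind "char 5 to 6" is that the n-th double-byte character occupies bytes 2n-1 and 2n. The following is a minimal Python sketch of that rule, not LiveCode script. The helper names byte_range and char_to_num are hypothetical, and big-endian byte order is assumed only to keep the numbers readable.

```python
def byte_range(n):
    """Return the 1-based (first, last) byte positions of the n-th two-byte character."""
    if n < 1:
        raise ValueError("character positions start at 1")
    return (2 * n - 1, 2 * n)

def char_to_num(data, n):
    """Return the numeric value of the n-th two-byte character in big-endian UTF-16 data."""
    first, last = byte_range(n)
    pair = data[first - 1:last]   # 1-based inclusive range -> Python slice
    return int.from_bytes(pair, "big")

text = "\u65e5\u672c\u8a9e"        # three characters, each inside the 16-bit range
data = text.encode("utf-16-be")    # six bytes in total

print(byte_range(3))               # byte positions of the third character
print(hex(char_to_num(data, 3)))   # its numeric value
```

The same rule gives bytes 9 to 10 for the fifth character. It applies only to text in which every character takes exactly two bytes.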

How to enter or display Unicode text in a field.
You display double-byte text in its correct language by setting its textFont property to a Unicode font. You can either put the text into the field and set the textFont in a handler or the message box, or manually enter the text after using the operating system’s built-in text entry tools to choose a language.

For example, to display double-byte Japanese characters that are on line 12 of a field, use a statement like the following:

set the textFont of line 12 of field 1 to "Osaka,Japanese"

When you manually enter text in a language that does not use the Roman alphabet, using the operating system’s tools, LiveCode automatically sets the textFont of the text you enter to the appropriate font for the language you have chosen.

How to find out whether text in a field is Unicode
You find out whether text in a field is Unicode text by examining its textFont property. The textFont of Unicode text consists of the font name, a comma, and either “Unicode” or the language the text is in. The following example statement checks whether line 3 of a field is Unicode:

if the effective textFont of line 3 of field 1 contains comma then answer "It’s Unicode!"

Note: Characters in chunk expressions are assumed to be single-byte characters. To check a Unicode character’s textFont using a chunk expression, treat it as two single-byte characters. For example, to check the fifth character in a field consisting of double-byte characters, use the expression the effective textFont of char 9 to 10 of field 1.

How to convert between Unicode (UTF-16) and UTF-8 text.
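LiveCode displays non-Roman-alphabet languages using Unicode (UTF-16). You use the uniEncode and uniDecode functions, with "UTF8" as the second parameter, to convert between the two: uniEncode converts UTF-8 to UTF-16, and uniDecode converts UTF-16 to UTF-8.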

The following statement converts a variable’s contents from UTF-8 to UTF-16, and places the resulting Unicode text in a field:

set the unicodeText of field "My Field" to uniEncode(myVariable,"UTF8")
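
To see what such a conversion does to the underlying bytes, here is a minimal Python sketch, not LiveCode script. The helper names utf8_to_utf16 and utf16_to_utf8 are hypothetical, and the UTF-16 data is assumed to use the machine's native byte order.

```python
import sys

# Assumption: UTF-16 data is kept in the byte order of the current machine.
UTF16_NATIVE = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"

def utf8_to_utf16(utf8_bytes):
    """Decode UTF-8 bytes to characters, then re-encode them as UTF-16."""
    return utf8_bytes.decode("utf-8").encode(UTF16_NATIVE)

def utf16_to_utf8(utf16_bytes):
    """Decode UTF-16 bytes to characters, then re-encode them as UTF-8."""
    return utf16_bytes.decode(UTF16_NATIVE).encode("utf-8")

# Six characters: "caf", e-acute, a space, and one Chinese character.
utf8_data = "caf\u00e9 \u4e2d".encode("utf-8")
utf16_data = utf8_to_utf16(utf8_data)

print(len(utf8_data), len(utf16_data))          # byte counts differ
print(utf16_to_utf8(utf16_data) == utf8_data)   # the round trip is lossless
```

The characters are the same in both forms; only the byte representation changes, which is why the conversion can be reversed without loss.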

How to convert between Unicode and ASCII text.
You use the uniEncode and uniDecode functions to convert text from double-byte (Unicode) to single-byte ASCII, or vice versa.

To convert a string of single-byte characters to Unicode text, use a statement like the following:

put uniEncode(field "Text") into myUnicodeText

To convert a string of double-byte characters to single-byte, use a statement like the following:

put uniDecode(the unicodeText of field "Japanese Text") into convertedText
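
The difference between single-byte and double-byte text is visible in the bytes themselves. The following is a minimal Python sketch, not LiveCode script. Latin-1 stands in for whichever single-byte encoding the platform uses, the helper names are hypothetical, and the "?" substitution is Python's behavior, not a statement about what uniDecode does.

```python
def to_double_byte(single_byte_text):
    """Widen single-byte (Latin-1) text to UTF-16, big-endian so the padding is easy to see."""
    return single_byte_text.decode("latin-1").encode("utf-16-be")

def to_single_byte(double_byte_text):
    """Narrow UTF-16 text to Latin-1; characters with no Latin-1 equivalent become '?'."""
    return double_byte_text.decode("utf-16-be").encode("latin-1", errors="replace")

wide = to_double_byte(b"Text")
print(wide)                   # each letter is now preceded by a zero byte
print(to_single_byte(wide))   # narrowing plain ASCII text loses nothing

# A character outside the single-byte set cannot survive the narrowing step.
japanese = "\u65e5".encode("utf-16-be")
print(to_single_byte(japanese))
```

Widening is always safe, but narrowing is lossy whenever the text contains characters that the single-byte set cannot represent.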

How to import a Unicode text file.
You use the unicodeText property to import a file that contains Unicode text. To put the text from a Unicode file into a field, use a statement like the following in a handler or the message box:

set the unicodeText of field "Text" to URL "binfile:my.txt"

If the file contains text in multiple languages, LiveCode automatically sets the textFont of language runs to the appropriate Unicode font.

Important! This method works only if the file you are importing contains Unicode (UTF-16) data. It will not work for other encoding methods such as UTF-8 or Shift-JIS.
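
One way to check what kind of file you have before importing it is to look for a byte order mark (BOM) at the start of the file. The following is a minimal Python sketch, not LiveCode script. The function name sniff_encoding is hypothetical, and files saved without a BOM cannot be identified this way.

```python
import codecs

def sniff_encoding(first_bytes):
    """Guess an encoding from a leading byte order mark; return None if there is no BOM.

    UTF-32 is deliberately ignored here, so a UTF-32 little-endian file
    would be misreported as UTF-16 little-endian.
    """
    if first_bytes.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if first_bytes.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    if first_bytes.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    return None

# Simulated file contents, so the sketch runs without touching the disk.
utf16_file = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
utf8_file = codecs.BOM_UTF8 + "hi".encode("utf-8")

print(sniff_encoding(utf16_file))
print(sniff_encoding(utf8_file))
print(sniff_encoding(b"plain text"))
```

A result of "utf-8" or None suggests that the file needs to be converted before it is assigned to the unicodeText property.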

LiveCode language elements

useUnicode property: Specifies whether the charToNum and numToChar functions assume a character is double-byte.

unicodeText property: Specifies the text in a field, represented as Unicode (double-byte characters).

uniDecode function: Converts a string from Unicode to single-byte text.

uniEncode function: Converts a string from single-byte text to Unicode.

fontLanguage function: Returns the language associated with a Unicode font.

Example Stacks

Stack "unicodeFldRoutines.rev": As the name implies, some examples of how to move Unicode text between fields. (Stack obtained from the LiveCode mailing list.)

Stack "unicodeTrials.rev": Examples of how to use Unicode in LiveCode, including referring to chunks, reading Unicode text from files, converting between UTF-8 and UTF-16, etc.

General Information about Unicode

Alan Wood’s Unicode Resources: http://www.alanwood.net/unicode/index.html
Unicode consortium: http://www.unicode.org
Unicode charts: http://www.unicode.org/charts/