Office of Digital Humanities
BYU LiveCode Lessons Gateway

Character Encoding

What is ASCII?

"ASCII (American Standard Code for Information Interchange) ... is a character encoding based on the English alphabet. ASCII codes represent text in computers, communications equipment, and other devices that work with text. Most modern character encodings have a historical basis in ASCII.
ASCII was first published as a standard in 1967 and was last updated in 1986. It currently defines codes for 33 non-printing, mostly obsolete control characters that affect how text is processed, plus ... 95 printable characters (starting with the space character)."

The lower range of ASCII is based on a 7-bit encoding, so it can hold 128 (2^7) characters: 0 – 127.
The upper range is based on an 8-bit byte and holds 256 (2^8) characters: 0 – 255.

Extended ASCII (upper ASCII) was still being discussed in the early 1980s when personal computers started becoming popular. As a result, while ASCII 32 – 127 are reliably standard—common punctuation and symbols, numerals 0 – 9, upper and lower case alpha—the upper ASCII characters 128 – 255 vary widely between operating systems. IBM was producing PCs running MS-DOS with a command line-based interface, so it populated upper ASCII with some Western European accented and diacritical characters, but also many graphical symbols. This original IBM PC encoding is now called Code Page 437. During the same period the first GUIs (Xerox, Lisa, Mac, Windows) were being created. As a GUI, the Macintosh didn't need graphic symbols, so Apple opted to use upper ASCII for Western European characters and other common symbols; this character set is called MacRoman. Meanwhile, Microsoft developed yet another variant of extended ASCII, called Windows-1252, for its emerging Windows OS.

In later years, with the emerging influence of the internet and the increased need for exchanging data, the International Organization for Standardization (ISO) developed a standard called ISO 8859-1, commonly called Latin-1, as a standard encoding of the Latin alphabet. It closely resembles Windows-1252, and early on it became the default character set for World Wide Web pages.

LiveCode Character Encoding Tools

NOTE: As of version 7, all text in the LiveCode environment is based on Unicode (UTF-16) encoding, rather than on ASCII as in earlier versions. Therefore, much of the following information is outdated, and many of the referenced functions and commands are deprecated.

LiveCode gives you two functions for converting between characters and their ASCII values: charToNum() and numToChar(). The charToNum() function takes a single character as an argument and returns the ASCII value of that character:

put charToNum("a") into myVar --puts 97, the ASCII equivalent of lower case 'a', into the variable

The numToChar() function does the reverse: it takes an ASCII value as an argument and returns the corresponding character:

put numToChar(10) after char 25 of fld "myFld" --inserts a linefeed after character 25 of the field *

   *note that the constants linefeed, LF, return, and CR are all shortcut ways of writing numToChar(10)
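The charToNum()/numToChar() pair has direct analogues in most languages; as an illustration (not LiveCode itself), the same lookups in Python use ord() and chr():

```python
# Python analogues of LiveCode's charToNum() and numToChar():
# ord() maps a character to its code value, chr() does the reverse.
print(ord("a"))         # 97, the same value charToNum("a") returns
print(chr(10) == "\n")  # True: code 10 is the linefeed character
```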

Do the Exercise "Exploring Character Encoding".

Dealing with Cross-platform Character Problems

Since it is a cross-platform development environment, LiveCode has some built-in capabilities to help developers deal with the legacy of conflicting ASCII standards and conventions. These conflicts fall into two broad categories: line breaks and upper ASCII.

Line breaks. Historically, the three major operating system families in use today took different approaches to the problem of defining the end of a logical line of text. Classic Mac OS used a carriage return character (ASCII 13); Unix systems, including OS X, use a linefeed character (ASCII 10); and Windows systems use a carriage return followed by a linefeed (ASCII 13 + ASCII 10). LiveCode, true to its Unix roots, uses ASCII 10 as the end-of-line marker.* When opening files in text mode (the "file:" protocol), LiveCode recognizes your system's end-of-line marker and translates as needed for its own internal use. Thus, when any file is read into LiveCode via

open file <filename>
read from file <filename>
close file <filename>
get url "file:<filepath>"

the line break characters will be translated into linefeed (ASCII 10). Conversely, when they are written back out by

open file <filename>
write data to file <filename>
close file <filename>
put mydata into url "file:<filepath>"

the line break characters are converted to whatever is required by the operating system.
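This translate-on-read, translate-on-write behavior is not unique to LiveCode; as an illustration, Python's text mode performs the equivalent newline conversion, while binary mode (like LiveCode's "binfile:") does not:

```python
# A sketch of the text-mode line-ending translation described above.
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# newline="\r\n" forces Windows-style CRLF endings on disk, for illustration
with open(path, "w", newline="\r\n") as f:
    f.write("line 1\nline 2\n")

# Reading in text mode translates CRLF back to "\n" (like the "file:" protocol)
with open(path) as f:
    text = f.read()
print(text.count("\r"))    # 0 -- carriage returns were translated away

# Reading in binary mode shows the raw bytes (like the "binfile:" protocol)
with open(path, "rb") as f:
    raw = f.read()
print(raw.count(b"\r\n"))  # 2 -- the on-disk CRLF pairs are still there
```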

In a few situations, you must pay attention to the native end-of-line character. The most likely case you will encounter is when you use the binary file (binfile:) or http: protocol to read files into your stack:

get url "binfile:filepath"
    # OR #
get url "http://url"

In these cases, if you are reading files created by Mac or Windows systems, you might need to replace the OS's end-of-line marker with LiveCode's:

replace numToChar(13) with linefeed in mydata --for binary reads of Mac files
replace numToChar(13)&numToChar(10) with linefeed in mydata --for binary reads of Windows files

Similarly, if you write data to a file in binary mode, or if you put data into a binfile or http URL, line endings are not automatically translated and you must do the translation yourself.
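The two replace commands above have counterparts in most languages. A Python sketch of normalizing foreign line endings after a binary read (the helper name is ours, not part of any API):

```python
# Normalize line endings read in binary mode, mirroring the two LiveCode
# replace commands above. Order matters: handle CRLF first, otherwise the
# lone-CR replacement would turn each Windows CRLF into two linefeeds.
def normalize_newlines(data: bytes) -> bytes:
    data = data.replace(b"\r\n", b"\n")  # Windows CRLF -> LF
    data = data.replace(b"\r", b"\n")    # classic Mac CR -> LF
    return data

print(normalize_newlines(b"a\r\nb\rc\n"))  # b'a\nb\nc\n'
```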

* For a brief discussion of the use of line ending characters in LiveCode, see this article by Richard Gaskin.

Upper ASCII. LiveCode automatically encodes upper ASCII characters in fields and scripts properly for the host operating system, so you don't have to worry about that. But what if your Windows-based stack reads in a file containing upper ASCII characters that was produced on a Mac system? Or vice versa? Again, LiveCode provides a pair of functions to do the conversion: macToISO() and ISOtoMac(). Say your Mac-based stack has read a text file from a Windows system and you get this:

[image: sample text garbled by displaying Windows (ISO) upper ASCII with the Mac character set]

Run the following command to convert the text to the Mac character set:

 put ISOtoMac(fld "myfield") into fld "myfield" -- you get:
 [image: the same sample text, now displayed correctly]

In the converse situation—Mac-produced upper ASCII being read on a Windows-based stack—you would use:

put macToISO(fld "myfield") into fld "myfield"
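The remapping these functions perform can be illustrated in Python using codecs for the same character sets: decoding bytes with the source encoding and re-encoding with the target encoding moves each accented character to its new upper ASCII code.

```python
# Illustration of the ISOtoMac()-style conversion using Python codecs.
iso_bytes = "café".encode("latin_1")   # b'caf\xe9' -- 'é' is 0xE9 in Latin-1
text = iso_bytes.decode("latin_1")     # interpret bytes with the source encoding
mac_bytes = text.encode("mac_roman")   # b'caf\x8e' -- 'é' is 0x8E in MacRoman
print(mac_bytes)
```

Going the other direction (the macToISO() case) simply swaps the two codec names.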

The insufficiency of ASCII

As GUI OSes became more prevalent, so did the use of non-Latin, non-Western fonts. At the same time, the need to share data and computer files between users speaking various languages increased dramatically. But there were still only 256 characters to work with. To address this need, many fonts emerged that reassigned ASCII codes to other letters (e.g., Cyrillic). Often just the upper ASCII characters were reassigned, so one could easily mix Latin and non-Latin characters without changing fonts. The result was many different encoding systems that could conflict with one another. For the Cyrillic alphabet alone there were four common "standards": MacCyrillic, a Macintosh-native standard; KOI8, pioneered on UNIX and most widely used on the internet; Windows-1251, the MS Windows standard; and Code Page 866 (Alternative), the MS-DOS standard. Throw in Greek, Hebrew, Turkish, and Arabic, not to mention ideographic writing systems like Chinese and Japanese, and the result was a veritable Babel of incompatible encodings.

One solution was to increase the memory space the system uses to define the character set from 1 byte to 2 bytes, i.e., from 8 to 16 bits. This means that the character table can now include 2^16 (65,536) characters instead of only 256. This standard is called Unicode.
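A quick Python illustration of that larger code space: characters from several alphabets coexist at distinct code points, with no font-based reassignment of the 128 – 255 range needed.

```python
# Latin, Latin-1 accented, and Cyrillic characters share one code space
# in Unicode, each with its own code point.
for ch in "aéЖ":
    print(ch, ord(ch))
# 'Ж' (Cyrillic Zhe) is code point 1046 -- far beyond a single byte's 0-255
```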

Go to Unicode in LiveCode lecture.

Maintained by Devin Asay.
Copyright © 2005 Brigham Young University