Chum 240

Flat-files

Electronic computers store persistent data in files on
secondary storage, such as disk drives and flash drives.

The data in these files can be made to serve as purpose-made
databases for small quantities of information that only needs
minimal organization.

The layout of data in these files is limited only by what the
programmer can conceive, but simpler is often better.

Sometimes binary, or raw computer data, is used. Using binary
data may give a few small advantages by making the file
smaller or quicker for software to decipher, but it also makes
it difficult to change the layout at a later time and almost
impossible for another person or program to use without
documentation.
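
As a rough illustration of this trade-off, the Python sketch
below stores the same three values once as packed binary and
once as text. The record and its byte layout are assumptions
made up for this example, not a real file format.

```python
import struct

# Hypothetical record: indent size, font size, use-tabs flag.
indent_size, font_size, use_tabs = 4, 12, False

# Assumed binary layout: two unsigned bytes and one boolean byte,
# little-endian, with no padding.
binary_record = struct.pack("<BB?", indent_size, font_size, use_tabs)

# The same data as human-readable text.
text_record = "indent size = 4\nfont-size = 12\nuse tabs = false\n".encode("ascii")

print(len(binary_record))   # 3
print(len(text_record) > len(binary_record))   # True

# Reading the binary form back requires knowing the exact layout string.
print(struct.unpack("<BB?", binary_record))   # (4, 12, False)
```

The binary form is smaller, but nothing in its three bytes says
what they mean; the layout string "<BB?" plays the role of the
documentation mentioned above.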

More often the data in flat-files is human-readable text.
While a person can easily read the information, its structure
and use may not be readily apparent.

To overcome this limitation, text-based flat-files often allow
textual comments that are ignored when used by a program, but
instructive to a person reading the text. By following the
directions in these comments, a person may be able to make
changes or even additions and deletions in the database using
only a simple text editor.
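
A minimal Python sketch of how a program might skip such
comments follows. It assumes that a comment is a whole line
starting with a semicolon, as in the INI sample later in this
document; the function name data_lines is hypothetical.

```python
def data_lines(text, comment_prefix=";"):
    """Yield the lines a program would act on, skipping blank
    lines and whole-line comments."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith(comment_prefix):
            continue
        yield stripped

sample = """\
; Change the number below to set the indent width.
indent size = 4

; Set to true to indent with tab characters.
use tabs = false
"""

print(list(data_lines(sample)))
# ['indent size = 4', 'use tabs = false']
```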

Today, text-based flat-files are widely used for configuration
and customization of programs on Unix-based operating systems.
But, true to the ad-hoc origin of the format, there are few
true standards.

Delimiters and Separators

There are two important terms we use when talking about how to
read data files, both as a human and as a computer programmer:
delimiter and separator.

A delimiter is anything that surrounds or sets bounds on a
collection. In English text, for example, quote marks delimit
book titles, dialog, or words or phrases used verbatim. It is
important to remember that the opening or beginning delimiter
need not be the same as the ending or closing one. Sentences,
for example, are delimited by a capitalized word and terminal
punctuation.

A separator is anything that marks where one thing ends and
another begins. Another example from English is when commas
are used to separate items in a list.

The structure of every flat-file relies on humans or computers
being able to recognize delimiters and separators. The most
often used separator is the end-of-line sequence, which,
unfortunately, differs between operating systems: Unix-like
systems use a line feed, while Windows uses a carriage return
followed by a line feed. Fortunately, the programs that we use
to look at text files often take care of these differences for
us, and we see lines of text just as in this document you are
reading now.
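
As a small illustration of software hiding these differences,
the Python sketch below splits the same two lines written with
three common end-of-line conventions. The sample strings are
made up for this example.

```python
# The same two lines written with three end-of-line conventions.
unix_text = "first line\nsecond line\n"          # line feed
windows_text = "first line\r\nsecond line\r\n"   # carriage return + line feed
old_mac_text = "first line\rsecond line\r"       # carriage return only

for text in (unix_text, windows_text, old_mac_text):
    # str.splitlines() recognizes all three and drops the separators.
    print(text.splitlines())   # ['first line', 'second line']
```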

Finally, delimiters and separators are not considered part
of the data. They are used to find where the data starts and
ends, and are otherwise ignored.
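
The two English examples above can be sketched in Python. The
title and the list are sample data invented for this
illustration.

```python
# Quote marks delimit the title; they are not part of it.
quoted = '"Moby-Dick"'
title = quoted[1:-1]

# Commas separate the items; they are not part of any item.
listing = "red, green, blue"
items = [item.strip() for item in listing.split(",")]

print(title)   # Moby-Dick
print(items)   # ['red', 'green', 'blue']
```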

INI File Format

One format that has persisted from the early versions of
Microsoft Windows is the INI file. Before the registry became
the central store for settings in Windows 95, Windows used
these files to store configuration options for application
programs as well as for the operating system itself. Because
of the widespread use of Windows, this format is well-known
and easily recognized.

The "INI" in the INI file designation comes from two facts:

First, the data in the file was used to "initialize" a program
to a known state when it started.

And second, file names were allowed a three-character
"extension" that identified the file's purpose; so
"initialization" was shortened to "ini". (Aside: Programmers
and other
technical people have an almost pathological penchant for
shortening names. It has its origins in mathematical
notation.)

A typical INI file might look like this:

; editor.ini

[settings]
; various settings for text indenting 
automatic indent = true
indent size = 4
use tabs = false
tab stops = 8/16/42/51

; Settings for font
font-family = courier
font-size = 12

[recent-files]
; Recently edited files.
recent1 = flat-files.html
recent2 = syllabus.html

[recent1 detail]
last edit = 10/01/2001
editor = bockholt
lines changed = 51

[recent2 detail]
last edit = 10/15/2001
editor = larson
lines changed = 251

The structure imposes a three-level hierarchy on the data.
The levels are:

* File
* Section
* Label/Value pair

Inside the file, single lines may contain either a comment, a
section heading, or a label/value pair. Empty lines or lines
containing only blank spaces are also allowed.

Sections are separated by a line containing the name of the
section delimited by square brackets. Sections with names that
duplicate the names of other sections lead to undefined
behavior; perhaps one of the sections will be ignored, or
perhaps they will be merged.
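
How duplicates are treated depends on the program reading the
file. As one concrete case, Python's standard configparser
module (one INI reader among many, and not the one Windows
uses) shows both outcomes: an error by default, or a merge
when asked to be lenient. The sample text is made up for this
sketch.

```python
import configparser

duplicated = """\
[settings]
indent size = 4

[settings]
use tabs = false
"""

# Default (strict) mode: a repeated section name is reported as an error.
strict_parser = configparser.ConfigParser()
try:
    strict_parser.read_string(duplicated)
    outcome = "accepted"
except configparser.DuplicateSectionError:
    outcome = "rejected"
print(outcome)   # rejected

# Non-strict mode: the repeated sections are merged into one.
lenient_parser = configparser.ConfigParser(strict=False)
lenient_parser.read_string(duplicated)
print(dict(lenient_parser["settings"]))
# {'indent size': '4', 'use tabs': 'false'}
```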

Under a section heading are lines that start with a label
or name followed by an equal sign, after which comes text that
represents the value of that section/label combination. As
with sections, data with a duplicate label may be lost or
corrupted.

The value associated with a label is delimited by the equal
sign (and any spaces that follow it) and the end of the line.
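
The rules above are simple enough to sketch as a small Python
parser. This is a teaching sketch under stated assumptions,
not a complete INI reader: the name parse_ini is hypothetical,
comments are assumed to be whole lines starting with a
semicolon, and duplicate sections are merged while a duplicate
label replaces the earlier value.

```python
def parse_ini(text):
    """Parse INI-style text into {section: {label: value}}."""
    data = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(";"):
            continue   # blank line or comment
        if line.startswith("[") and line.endswith("]"):
            # Section heading: the brackets delimit the name.
            current = data.setdefault(line[1:-1].strip(), {})
            continue
        # Label/value pair: split at the first equal sign only.
        label, sep, value = line.partition("=")
        if sep and current is not None:
            current[label.strip()] = value.strip()
    return data

sample = """\
[settings]
; various settings for text indenting
automatic indent = true
indent size = 4

[recent-files]
recent1 = flat-files.html
"""

config = parse_ini(sample)
print(config["settings"]["indent size"])   # 4
print(config["recent-files"]["recent1"])   # flat-files.html
```

Note that every value comes back as text; deciding that "4" is
a number or that "true" is a flag is left to the program that
uses the data.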

The three-level hierarchy might seem to be a bit of a
limitation, but a label or value may be used as a section name
(or part of one) or as a label in another section. In this way
the hierarchical depth can be extended indefinitely, if a
little awkwardly.

For example, in the sample INI file above, the labels for data
in the recent-files section are combined with the word
"detail" to create a section with details on the file that was
edited.
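
A program can follow this link by building the section name
from the label. The Python sketch below does so with the
standard configparser module on a trimmed copy of the sample;
the choice of module and the variable names are assumptions
made for illustration.

```python
import configparser

sample = """\
[recent-files]
recent1 = flat-files.html
recent2 = syllabus.html

[recent1 detail]
last edit = 10/01/2001
editor = bockholt

[recent2 detail]
last edit = 10/15/2001
editor = larson
"""

parser = configparser.ConfigParser()
parser.read_string(sample)

rows = []
for label, file_name in parser["recent-files"].items():
    # Label "recent1" leads to the section "recent1 detail".
    detail = parser[label + " detail"]
    rows.append((file_name, detail["editor"], detail["last edit"]))

for row in rows:
    print(*row)
# flat-files.html bockholt 10/01/2001
# syllabus.html larson 10/15/2001
```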

Another approach suitable for small quantities of data
associated with a label is to use a separator within the data
to make it into a list. An example of this is the "tab stops"
value in the sample INI file above, where a slash (/) is used
to separate the values.
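
Turning such a value into a list, and back again, takes one
line each in Python. The sketch assumes that every item is a
whole number, as in the sample.

```python
raw_value = "8/16/42/51"   # the "tab stops" value from the sample

# The slash separates the items and is not part of any of them.
tab_stops = [int(part) for part in raw_value.split("/")]
print(tab_stops)   # [8, 16, 42, 51]

# Writing the list back out restores the original text.
print("/".join(str(stop) for stop in tab_stops))   # 8/16/42/51
```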

- - - -
Copyright ©2007 Brigham Young University