9.4 Transformation Techniques
Event-based parsing has an intuitive appeal. Most programs in bioinformatics
act upon files that have a flat structure where each line of the file
represents one record or event. The program consists of an operation that
is performed on each input record. This works well for many problems,
especially those that involve computation of statistics. However, event-based
parsing can be very difficult to use for any nontrivial transformation task,
such as the microarray example in section 9.3. The difficulty is that the
transformation may require information that is not immediately available at
the time the event occurs. Thus one must save information for later use.
Creating data structures that serve this function requires a great deal of
time and experience.
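The need to save information for later use can be illustrated with a minimal sketch based on Python's standard xml.sax module. The input format here is hypothetical (it is not the microarray format of section 9.3): the element names spot and scale and the attributes gene, value, and factor are assumptions made only for this illustration. Because the scale factor arrives after the spots it applies to, the handler has to buffer every spot event until the factor is known.

```python
import xml.sax

# Hypothetical input: the scale factor appears only AFTER the spots it applies to.
SAMPLE = """<experiment>
  <spot gene="BRCA1" value="2.0"/>
  <spot gene="TP53" value="4.0"/>
  <scale factor="0.5"/>
</experiment>"""


class ScalingHandler(xml.sax.ContentHandler):
    """Buffers spot events until the scale factor has been seen."""

    def __init__(self):
        super().__init__()
        self.pending = []    # saved (gene, value) pairs awaiting the factor
        self.factor = None   # unknown until the <scale> event occurs
        self.results = []

    def startElement(self, name, attrs):
        if name == "spot":
            # The factor is not available yet, so the spot must be saved.
            self.pending.append((attrs["gene"], float(attrs["value"])))
        elif name == "scale":
            self.factor = float(attrs["factor"])

    def endElement(self, name):
        if name == "experiment":
            if self.factor is None:
                raise ValueError("no <scale> element found")
            # Only now can the saved spots be transformed.
            self.results = [(g, v * self.factor) for g, v in self.pending]


def scale_spots(xml_text):
    """Parse the document event by event and return the scaled spots."""
    handler = ScalingHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.results
```

Even in this small case the handler needs three pieces of state (pending, factor, results) and must decide when it is safe to use them. A realistic transformation would need considerably more bookkeeping of this kind.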
The second approach is called tree-based processing. In this approach the
entire document is read into memory using a standard data structure. The data
structure is known as a “tree” to computer scientists, which is why this form
of processing is called tree-based. The most commonly used standard for the
data structure is called the Document Object Model (DOM). The advantage of
this approach is that all information in the document is available at all times.
No additional data structures need to be developed just for the sake of
ensuring that information is always available when needed. However, the DOM
is complicated and takes some time to understand.
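As a rough sketch of tree-based processing, the following uses Python's standard xml.dom.minidom module, which provides a small DOM implementation. The input document is hypothetical: the element names spot and scale and their attributes are assumptions chosen for illustration. Since the whole document is in memory, the scale factor can be looked up before the spots are processed, even though it appears after them in the file.

```python
from xml.dom import minidom

# Hypothetical input: the scale factor appears after the spots it applies to.
SAMPLE = """<experiment>
  <spot gene="BRCA1" value="2.0"/>
  <spot gene="TP53" value="4.0"/>
  <scale factor="0.5"/>
</experiment>"""


def scale_spots_dom(xml_text):
    """Read the whole document into a DOM tree, then transform it."""
    doc = minidom.parseString(xml_text)

    # Any part of the tree can be consulted at any time, in any order.
    scale = doc.getElementsByTagName("scale")[0]
    factor = float(scale.getAttribute("factor"))

    return [
        (spot.getAttribute("gene"), float(spot.getAttribute("value")) * factor)
        for spot in doc.getElementsByTagName("spot")
    ]
```

No buffering is needed here, at the cost of holding the entire document in memory and of learning the DOM interface (getElementsByTagName, getAttribute, and so on).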
Although traditional programming languages are an effective means of
processing documents, most transformation tasks can be accomplished much
more easily by using languages designed specifically for this task. The
disadvantage is that one must learn yet another language. This can be a very
serious disadvantage if one is not going to be using the language very often.
However, if one is performing relatively simple tasks, then one does not need
to know very much of the language.
Specialized transformation languages have the advantage that they
emphasize the meaning of the document (its semantics) rather than its
appearance (its syntax). This is done by using rule-based (declarative) programming
rather than the more traditional procedural (imperative) programming style.
By focusing on the content rather than low-level details, one can develop
transformations much more effectively.
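The flavor of the rule-based style can be imitated in an ordinary language. The sketch below is only an illustration written in Python with the standard xml.etree.ElementTree module; it is not a real transformation language. The helper names rule and apply_rules, as well as the sample document, are hypothetical. Each rule states what one kind of element turns into, while a generic driver takes care of walking the document.

```python
import xml.etree.ElementTree as ET

# Hypothetical input document used only for this illustration.
SAMPLE = """<experiment name="demo">
  <spot gene="BRCA1" value="2.0"/>
  <spot gene="TP53" value="4.0"/>
</experiment>"""

RULES = {}  # maps an element name to the rule that handles it


def rule(tag):
    """Register the decorated function as the rule for elements named `tag`."""
    def register(func):
        RULES[tag] = func
        return func
    return register


def apply_rules(element):
    """Generic driver: find the rule matching this element and apply it."""
    handler = RULES.get(element.tag)
    if handler is None:
        # Default rule: skip the element itself and process its children.
        return "".join(apply_rules(child) for child in element)
    return handler(element)


@rule("experiment")
def experiment_rule(element):
    body = "".join(apply_rules(child) for child in element)
    return f"Experiment {element.get('name')}\n{body}"


@rule("spot")
def spot_rule(element):
    return f"{element.get('gene')}\t{element.get('value')}\n"


def transform(xml_text):
    """Transform the document by applying the registered rules to its root."""
    return apply_rules(ET.fromstring(xml_text))
```

The two rules say nothing about the order in which elements are read or how the document is traversed. That detail lives entirely in apply_rules, which is the part a specialized transformation language would supply for free.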
This approach has a long history going back to the style files of LaTeX that
are still in use today. The idea was to allow the writer of a document to focus
on its meaning rather than typesetting details. The typesetting details were
specified in a separate style file. In a LaTeX file one can specify the overall
style of the document as well as the style to be used for more specialized
purposes such as for the bibliography. One can change the style of a docu-