206 Kerstin Kunz and Erich Steiner
1.3.2 Layers of annotation
The GECCo corpus is encoded with various layers of information:
- All corpora are annotated on word level (morphology, parts of speech), chunk
level (syntactic function and form of constituents), and sentence level
(segmentation), and contain extralinguistic information on register as well as
metadata such as source and publication date.
- For the written corpora, alignment of parallel corpora exists on sentence level.
- The spoken corpora additionally contain annotations in terms of speech
production and speaker information.
For the analysis of cohesion, we develop fine-grained extraction rules that allow
combined multilevel queries with the corpus query processor CQP (Evert 2005).
The annotation layers mentioned above are sufficient for the analysis of most
cohesive devices. The investigation of semantic/logical/conceptual relations, such
as co-reference, substitution, conjunctive relations, or lexical cohesion, however,
requires additional annotation layers. For this purpose, we combine automatic
tools with manual post-annotation or disambiguation. We are currently developing
processing mechanisms for the semi-automatic annotation of co-reference chains
and lexical cohesion (semantic relations and cohesive chains).
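To illustrate what a combined multilevel query amounts to, the following is a minimal Python sketch rather than actual CQP syntax or the GECCo encoding. The `Token` class, the `query` helper, the toy sentences, and the Penn-Treebank-style tags are all assumptions made for illustration; the tagset and attribute names used in the corpus are not specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One token with hypothetical annotation layers (not the GECCo schema)."""
    word: str     # surface form
    pos: str      # word level: part-of-speech tag (assumed tagset)
    chunk: str    # chunk level: label of the containing chunk
    sent_id: int  # sentence level: index of the containing sentence


def query(tokens, word=None, pos=None, chunk=None):
    """Return the tokens that satisfy every constraint that is given."""
    hits = []
    for tok in tokens:
        if word is not None and tok.word.lower() != word:
            continue
        if pos is not None and tok.pos != pos:
            continue
        if chunk is not None and tok.chunk != chunk:
            continue
        hits.append(tok)
    return hits


# Toy data: two invented sentences, not taken from the corpus.
corpus = [
    Token("I", "PRP", "NP", 0),
    Token("want", "VBP", "VP", 0),
    Token("the", "DT", "NP", 0),
    Token("red", "JJ", "NP", 0),
    Token("one", "NN", "NP", 0),
    Token("One", "CD", "NP", 1),
    Token("apple", "NN", "NP", 1),
    Token("fell", "VBD", "VP", 1),
]

# String-based search combined with word-level and chunk-level constraints.
nominal_one = query(corpus, word="one", pos="NN", chunk="NP")
```

In this sketch, the string-based search alone returns both occurrences of "one", whereas adding the word-level and chunk-level constraints keeps only the occurrence in the first sentence.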
For the corpus-linguistic analysis of substitution, we extract information on
the word and chunk level in combination with a string-based search. A set of
cascaded extraction procedures was defined in order to distinguish cohesive
from non-cohesive instances of the same form. Multifunctionality, however,
remains a problem, so the findings for many substitute forms had to be checked
and filtered manually. What became particularly apparent during extraction was
that well-known German-English contrasts in information structure affect
automatic traceability. While disambiguation can be done on the basis of some
syntactic restrictions for English, positional flexibility and an even richer
multifunctionality often complicate the definition of extraction rules for German.
Furthermore, our extractions reveal that the set of closed-class items functioning
as substitute forms in English does not seem to have an exactly corresponding
set of items in German. Thus, in order to permit comparability across languages,
the lexico-grammatical realizations of substitution have to be mapped onto their
semantics/function, so as to have a tertium comparationis (see Section 2.1).
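The idea of cascaded extraction can be sketched in Python as a sequence of increasingly specific filters. This is a simplified illustration under stated assumptions, not the rule set actually used for the corpus: the three stages, the function names, the example sentence, and the Penn-Treebank-style tags are all hypothetical, and the single syntactic restriction shown (a determiner or adjective directly before the candidate) is only one plausible cue for English nominal "one".

```python
def is_candidate(tokens, i):
    # Stage 1: string-based search for the forms "one" and "ones".
    return tokens[i][0].lower() in {"one", "ones"}


def not_numeral(tokens, i):
    # Stage 2: word-level filter that discards cardinal-number readings.
    return tokens[i][1] != "CD"


def has_premodifier(tokens, i):
    # Stage 3: positional restriction; a determiner or adjective
    # must immediately precede the candidate.
    return i > 0 and tokens[i - 1][1] in {"DT", "JJ"}


# Stages are applied in order; a candidate is dropped at the first stage it fails.
CASCADE = [is_candidate, not_numeral, has_premodifier]


def extract(tokens):
    """Return the indices of tokens that pass every stage of the cascade."""
    return [
        i for i in range(len(tokens))
        if all(stage(tokens, i) for stage in CASCADE)
    ]


# Invented example sentence as (word, tag) pairs, not taken from the corpus.
sentence = [
    ("She", "PRP"), ("bought", "VBD"), ("one", "CD"), ("book", "NN"),
    ("and", "CC"), ("I", "PRP"), ("took", "VBD"), ("the", "DT"),
    ("blue", "JJ"), ("one", "NN"),
]

hits = extract(sentence)
```

Here the numeral "one" at index 2 is removed at the second stage, and only the final "one" at index 9 survives. A fixed positional restriction of this kind is the sort of cue that is harder to state for German, and candidates that pass such a cascade would still need manual checking.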
In the following, we confine ourselves to providing sample extracts from the
corpus when discussing the various types of substitution in detail (Section 2.2).
In Section 2.3, we attempt a discussion of the empirical findings for the two
languages, which may have to be refined in the future.