Advances in Corpus-based Contrastive Linguistics - Studies in honour of Stig Johansson


Kerstin Kunz and Erich Steiner


1.3.2 Layers of annotation
The GECCo corpus is encoded with various layers of information:


  • All corpora are annotated at word level (morphology, parts of speech), chunk
    level (syntactic function and form of constituents), and sentence level
    (segmentation), and contain extralinguistic information on register as well
    as metadata such as source and publication date.

  • For the written corpora, alignment of parallel corpora exists on sentence level.

  • The spoken corpora additionally contain annotations on speech production
    and speaker information.

For the analysis of cohesion, we develop fine-grained extraction rules that allow
combined multilevel queries with the corpus query processor CQP (Evert 2005).
The annotation layers mentioned above are sufficient for the analysis of most
cohesive devices. The investigation of semantic, logical or conceptual relations
such as co-reference, substitution, conjunctive relations or lexical cohesion,
however, requires additional annotation layers. For this purpose, we combine
automatic tools with manual post-annotation or disambiguation. We are currently
developing processing mechanisms for the semi-automatic annotation of
co-reference chains and lexical cohesion (semantic relations and cohesive chains).
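
The idea of a combined multilevel query can be illustrated with a minimal Python
sketch. It is not CQP and does not reproduce the GECCo annotation scheme: the
Token class, the multilevel_query function, the Penn-style tags and the two toy
sentences are all assumptions made for illustration. Each token carries values
for several annotation layers, and a query succeeds only where the constraints
on all named layers hold at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str     # word-level layer: surface form
    pos: str      # word-level layer: part-of-speech tag (illustrative tagset)
    chunk: str    # chunk-level layer: category of the containing constituent
    sent_id: int  # sentence-level layer: segmentation index


def multilevel_query(tokens, **constraints):
    """Return tokens for which every named layer has one of the allowed values."""
    return [
        tok for tok in tokens
        if all(getattr(tok, layer) in allowed for layer, allowed in constraints.items())
    ]


# Toy data (hypothetical): "She bought one book" / "He wants a red one"
corpus = [
    Token("She", "PRP", "NP", 1), Token("bought", "VBD", "VP", 1),
    Token("one", "CD", "NP", 1), Token("book", "NN", "NP", 1),
    Token("He", "PRP", "NP", 2), Token("wants", "VBZ", "VP", 2),
    Token("a", "DT", "NP", 2), Token("red", "JJ", "NP", 2),
    Token("one", "NN", "NP", 2),
]

# Combine a string constraint with word-level and chunk-level constraints.
hits = multilevel_query(corpus, word={"one", "ones"}, pos={"NN", "NNS"}, chunk={"NP"})
print([(t.sent_id, t.word, t.pos) for t in hits])
```

In this toy data the string "one" occurs twice, but only the occurrence in the
second sentence satisfies the constraints on all three layers.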

For the corpus-linguistic analysis of substitution, we extract information at
the word and chunk level in combination with a string-based search. A set of
cascaded extraction procedures was defined in order to distinguish cohesive
from non-cohesive instances of the same form. Multifunctionality, however, still
remains a problem, so the findings for many substitute forms had to be checked
and filtered manually. What became particularly apparent during extraction was
that well-known German-English contrasts in information structure affect
automatic traceability. While disambiguation can be based on some syntactic
restrictions for English, positional flexibility and an even richer
multifunctionality often complicate the definition of extraction rules for German.
Furthermore, our extractions reveal that the set of closed-class items functioning
as substitute forms in English does not seem to have an exactly corresponding
set of items in German. Thus, in order to permit comparability across languages,
the lexico-grammatical realizations of substitution have to be mapped onto their
semantics/function, so as to have a tertium comparationis (see 2.1).
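
A cascade of this kind might look roughly as follows. This is only a sketch
under stated assumptions, not the extraction rules actually used for GECCo: the
candidate list, the Penn-style tags, the chunk identifiers and the three rules
are hypothetical and restricted to English "one/ones". A string-based search
proposes candidates, a word-level rule and a chunk-level rule then filter them,
and whatever the rules cannot decide is handed over to manual checking.

```python
from typing import NamedTuple


class Tok(NamedTuple):
    word: str
    pos: str       # illustrative Penn-style tag
    chunk_id: int  # tokens sharing an id belong to the same chunk


CANDIDATE_FORMS = {"one", "ones"}  # assumed candidate strings, not the authors' list


def classify(sent, i):
    """Label token i of a sentence by applying the rules in a fixed order."""
    tok = sent[i]
    # Stage 1: string-based search for candidate forms.
    if tok.word.lower() not in CANDIDATE_FORMS:
        return "not_candidate"
    # Stage 2: word-level rule (assumption): numerals count as non-cohesive.
    if tok.pos == "CD":
        return "non_cohesive"
    # Stage 3: chunk-level rule (assumption): a determiner or adjective directly
    # before the form inside the same chunk suggests a substitute reading.
    if i > 0:
        prev = sent[i - 1]
        if prev.chunk_id == tok.chunk_id and prev.pos in {"DT", "JJ"}:
            return "cohesive"
    # Residue: multifunctional cases no rule decides go to manual filtering.
    return "manual_check"


def extract(sent):
    """Return (word, label) pairs for all candidate forms in a sentence."""
    labelled = [(i, classify(sent, i)) for i in range(len(sent))]
    return [(sent[i].word, label) for i, label in labelled if label != "not_candidate"]


# Toy sentences (hypothetical annotation).
numeral = [Tok("I", "PRP", 0), Tok("have", "VBP", 1),
           Tok("one", "CD", 2), Tok("car", "NN", 2)]
substitute = [Tok("She", "PRP", 0), Tok("wants", "VBZ", 1),
              Tok("the", "DT", 2), Tok("red", "JJ", 2), Tok("one", "NN", 2)]
generic = [Tok("One", "PRP", 0), Tok("never", "RB", 1), Tok("knows", "VBZ", 2)]

results = [extract(s) for s in (numeral, substitute, generic)]
print(results)
```

The ordering matters: cheap string and word-level checks come first, and the
final fall-through makes explicit that undecided instances are not discarded
but passed on for manual inspection.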

In the following, we confine ourselves to providing sample extracts from the
corpus when discussing the various types of substitution in detail (Section 2.2).
In Section 2.3, we attempt a discussion of the empirical findings for the two
languages, which may have to be refined in the future.
