
 

A project of the UCLA Center for Medieval and Renaissance Studies in conjunction with the CHLT and the Perseus Project

 
   
 
Project History
 

0. An Early History of Computer-aided Research in Early Scandinavian Topics (CREST)


           The current project derives from earlier projects aimed at making Old Icelandic texts available in electronic format. Besides the obvious problem of representing Old Icelandic in a limited ASCII character set (see Berkeley conventions), other problems arose from limited (unlemmatized) searching and from the accuracy of OCR. Despite some advances in OCR for Old Icelandic during the 1990s and the emergence of more and more digital texts, the morphological complexity of Old Icelandic was seen as a very high hurdle. During a conversation in 2000 with Gregory Crane, the editor-in-chief of Perseus, Timothy Tangherlini agreed to put together a team to develop an early version of an Old Icelandic morphological analyzer, and to integrate the analyzer and the extant Old Icelandic texts into the Perseus environment. This was made possible by funding from the National Science Foundation, the European Union, and the Center for Medieval and Renaissance Studies.

1. Introduction

            Our morphological analyzer produces word form tables upon presentation of a head word from the Zoega lexicon. In addition, it can report the computations by which it arrived at the final output. For example, given a head word

barn

the analyzer performs a lexicon lookup to retrieve the following information from its digital copy of the Zoega lexicon:

barn | barn | E | n | (1) bairn, child; vera með barni, to be with child; ganga með barni, to go with child; barns hafandi or hafandi at barni, with child, pregnant; frá blautu barni, from one's tender years; (2)  = mannsbarn; hvert b, every man, every living soul

Each lexicon entry consists of five fields: the headword itself, its original form in the lexicon, declension information (which in this case is empty as signaled by the symbol ‘E’), its part-of-speech, and finally its translation and usages.
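
The five-field layout described above can be read with a few lines of code. The following Python fragment is only a sketch and is not taken from the project's Perl or Haskell sources; the names LexiconEntry and parse_entry are hypothetical, and the gloss in the sample line is shortened.

```python
from dataclasses import dataclass


@dataclass
class LexiconEntry:
    headword: str    # normalized headword used for lookup
    original: str    # form as it appears in the lexicon
    declension: str  # declension information; "E" marks an empty field
    pos: str         # part-of-speech field (here "n")
    gloss: str       # translation and usages


def parse_entry(line: str) -> LexiconEntry:
    # Split on the first four separators only, so that a "|" inside the
    # gloss (should one ever occur) does not break the record.
    fields = [field.strip() for field in line.split("|", 4)]
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    return LexiconEntry(*fields)


entry = parse_entry("barn | barn | E | n | (1) bairn, child; (2) = mannsbarn")
print(entry.headword, entry.pos, entry.declension == "E")
```

Keeping the empty-field marker as the literal string "E" mirrors the lexicon format shown above; a real implementation might map it to an explicit "no information" value instead.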
            Given this information and an internal representation of the phonology and morphology of the target language Old Icelandic, the morphological analyzer determines and outputs all potential paradigms:

barn, noun, gender: n, a-stem

 

          Singular    Plural
Nom       barn        börn
Acc       barn        börn
Gen       barns       barna
Dat       barni       börnum


(1) bairn, child; vera með barni, to be with child; ganga með barni, to go with child; barns hafandi or hafandi at barni, with child, pregnant; frá blautu barni, from one's tender years; (2) = mannsbarn; hvert b, every man, every living soul

In addition, the user can choose to output the analyzer’s internal application of its linguistic rules. For each output form, it lists the phonological and/or morphological rule underlying its change:


Lexeme: barn
Gender (if any): n
Declension info: nom_sg E
Stem (if any): a

The root is barn.

I found a stem: a.

Root consonants: b - r n
Root vowels: - a - -
Root vowels only: a

Sound changes for element Nom Sg:

     None.
Sound changes for element Acc Sg:

     None.
Sound changes for element Gen Sg:

     None.
Sound changes for element Dat Sg:

     None.
Sound changes for element Nom Pl:
     u-mutation to neut a-stem, nom & acc pl ...
Sound changes for element Acc Pl:
     u-mutation to neut a-stem, nom & acc pl ...
Sound changes for element Gen Pl:

     None.
Sound changes for element Dat Pl:
     Regular u-mutation ...
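
The trace above can be reproduced in miniature. The Python sketch below is an illustration under simplifying assumptions, not the project's code: the endings are read off the paradigm of barn, u-mutation is reduced to changing the last a of the root to ö, and the function names are hypothetical.

```python
# Endings of a neuter a-stem noun, read off the paradigm of "barn".
NEUT_A_STEM_ENDINGS = {
    ("Nom", "Sg"): "", ("Acc", "Sg"): "", ("Gen", "Sg"): "s", ("Dat", "Sg"): "i",
    ("Nom", "Pl"): "", ("Acc", "Pl"): "", ("Gen", "Pl"): "a", ("Dat", "Pl"): "um",
}


def u_mutate(root: str) -> str:
    """Simplified u-mutation: change the last 'a' of the root to 'ö'."""
    i = root.rfind("a")
    return root if i < 0 else root[:i] + "ö" + root[i + 1:]


def needs_u_mutation(case: str, number: str, ending: str) -> bool:
    # Regular u-mutation: the ending begins with u (dative plural -um).
    if ending.startswith("u"):
        return True
    # Class-specific rule: neuter a-stems mutate in nom & acc pl
    # even though the ending is empty.
    return number == "Pl" and case in ("Nom", "Acc")


def decline(root: str) -> dict:
    forms = {}
    for (case, number), ending in NEUT_A_STEM_ENDINGS.items():
        stem = u_mutate(root) if needs_u_mutation(case, number, ending) else root
        forms[(case, number)] = stem + ending
    return forms


forms = decline("barn")
for case in ("Nom", "Acc", "Gen", "Dat"):
    print(f"{case:4} {forms[(case, 'Sg')]:6} {forms[(case, 'Pl')]}")
```

The sketch hard-codes one inflection class; the analyzer described here instead keeps endings and sound-change rules in separate language resources.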

            A test version of our morphological analyzer is currently accessible on this website.

            The design and implementation of our morphological analyzer are guided by two main principles. First, its object-oriented layout allows it to be adapted to languages other than Old Icelandic. Second, its separation of linguistic rules, natural language resources, and the code itself enables the user to add new language resources. Both design principles are essential to expanding the usability of the analyzer, as the following sections will illustrate.

2. General design of the morphological analyzer

            The original analyzer code was written in Perl, a programming language particularly suited for manipulation of Unicode and plain text strings. In addition, it allows for the creation of classes, i.e. an object-oriented architecture. Some attractive features of object-oriented programming are the hierarchical structuring of classes, the control over variable declarations and user permissions, and a high degree of convergence between the application design and its problem space. Figure 1 illustrates the general architecture of the analyzer:

Figure 1: General architecture of the morphological analyzer.

            Currently, the Lexicon module consists of an electronic copy of the Zoega Old Icelandic lexicon, as well as excerpts from Old Icelandic sagas, most of these from the Legendary Sagas (Fornaldar sögur). However, the analyzer has been designed to accept input from various language resources, including diplomatic transcriptions of Old Icelandic texts, such as those produced by Matthew Driscoll, following the conventions outlined by Menota. For that reason, the morphological analyzer expects a normalized form of lexical entries as its input. This is accomplished by the Normalizer module. Compare for example the following entry in Zoega:

barna-börn, n. pl. grandchildren;

with its normalized version which is accessed by the analyzer:

barnabörn | barna-börn | E | n pl |  grandchildren
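
The step from the printed entry to its normalized version can be sketched as follows. This Python fragment is an assumption-laden illustration, not the project's Normalizer: the function name normalize_zoega is hypothetical, the abbreviation set is a minimal subset chosen for this one example, and the declension field is always written as E because the simple entry layout handled here carries no declension information.

```python
# Minimal, assumed subset of grammatical abbreviations; the real layout
# rules for the lexicon are richer than this.
ABBREVIATIONS = {"m.", "f.", "n.", "pl."}


def normalize_zoega(raw: str) -> str:
    # "barna-börn, n. pl. grandchildren;" -> head "barna-börn", rest "n. pl. ..."
    head, _, rest = raw.strip().rstrip(";").partition(",")
    tokens = rest.split()

    # Leading abbreviation tokens make up the part-of-speech field.
    pos = []
    while tokens and tokens[0] in ABBREVIATIONS:
        pos.append(tokens.pop(0).rstrip("."))

    headword = head.replace("-", "")  # drop the compound hyphen
    declension = "E"                  # no declension info in this layout
    gloss = " ".join(tokens)
    return " | ".join([headword, head, declension, " ".join(pos), gloss])


print(normalize_zoega("barna-börn, n. pl. grandchildren;"))
```

Entries that give declension information or several senses would need additional layout rules; the sketch deliberately covers only the pattern shown above.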

            The Target Language module contains information regarding the target language such as phonological and morphological rules. For example, the morpho-phonetic rule for the excision of consonants in Old Icelandic is represented as Perl pseudo-code:

RULE:      excision_consonant
CONDITION: rootc(-2) ne '-' && rootc(-1) ne '-' && tmp(0) eq rootc(-1)
ACTION:    shift tmp

This rule instructs the morphological analyzer to delete a given consonant if certain conditions regarding the consonantal structure of the lexeme root are met. The rule set in the Target Language module contains the majority of phonological and morphological rules for Old Icelandic. This means that only a few linguistic rules are hard-coded into the analyzer. Our goal is to achieve complete separation between the Target Language module and the morphological analyzer itself.
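
One way to keep such rules apart from the engine is to store each rule as data with a condition and an action. The Python sketch below illustrates that idea only. It assumes that rootc holds the root's consonant slots (with "-" standing for a vowel position) and that tmp is a working buffer holding the ending; neither reading is confirmed by the pseudo-code. The inputs (root fast with ending t, and root barn with ending s) are chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class State:
    rootc: List[str]  # root consonant slots, "-" where the root has a vowel
    tmp: List[str]    # working buffer; ASSUMED here to hold the ending


@dataclass
class Rule:
    name: str
    condition: Callable[[State], bool]
    action: Callable[[State], object]


excision_consonant = Rule(
    name="excision_consonant",
    condition=lambda s: (
        len(s.rootc) >= 2 and len(s.tmp) >= 1
        and s.rootc[-2] != "-" and s.rootc[-1] != "-"
        and s.tmp[0] == s.rootc[-1]
    ),
    action=lambda s: s.tmp.pop(0),  # like Perl's shift: drop the first element
)


def apply_rules(rules: List[Rule], state: State) -> List[str]:
    """Generic engine: fire every rule whose condition holds."""
    fired = []
    for rule in rules:
        if rule.condition(state):
            rule.action(state)
            fired.append(rule.name)
    return fired


state = State(rootc=["f", "-", "s", "t"], tmp=list("t"))
fired = apply_rules([excision_consonant], state)
print(fired, "fast" + "".join(state.tmp))

other = State(rootc=["b", "-", "r", "n"], tmp=list("s"))
fired_other = apply_rules([excision_consonant], other)
print(fired_other, "barn" + "".join(other.tmp))
```

Because apply_rules knows nothing about Old Icelandic, a different rule list could be passed in for another language without touching the engine.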

            In addition to a linguistic rule set, the Target Language module contains several databases of language-specific data such as exceptions, umlaut information, and word-ending paradigms.

            The third module in the architecture is the morphological analyzer itself. Upon being called, it determines the root structure of a word from the Lexicon module based on the rules and definitions in the Target Language module entry for Old Icelandic. Once it determines its part-of-speech, the analyzer creates a paradigm, performs the appropriate morpho-phonetic changes, and finally outputs the paradigm.
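
The first of these steps, determining the root structure, corresponds to the "Root consonants" and "Root vowels" lines in the trace shown in the Introduction. A minimal Python sketch of that decomposition follows; the vowel inventory and the function name root_structure are assumptions made for illustration, and diphthongs are simply treated as two separate vowel slots.

```python
# Assumed vowel inventory for normalized Old Icelandic spelling.
VOWELS = set("aáeéiíoóuúyýæœøöǫ")


def root_structure(root: str):
    # One slot per character: keep the consonant or vowel, else "-".
    consonants = [ch if ch not in VOWELS else "-" for ch in root]
    vowels = [ch if ch in VOWELS else "-" for ch in root]
    vowels_only = "".join(ch for ch in root if ch in VOWELS)
    return consonants, vowels, vowels_only


consonants, vowels, vowels_only = root_structure("barn")
print("Root consonants:", " ".join(consonants))
print("Root vowels:", " ".join(vowels))
print("Root vowels only:", vowels_only)
```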

3. Shifting from Perl to Haskell / FM

            Although the morphological analyzer written in Perl returned accurate tables for an ever-increasing set of lexical entries, in summer 2006 the team decided to explore some of the existing morphological engines, including Haskell/FM and the Xerox PARC XFST transducers. A discussion of the Haskell programming language and Functional Morphology (FM) in Haskell can be found in the Proceedings of the Ninth ACM SIGPLAN International Conference on Functional Programming. An excellent overview of morphological analysis using XFST can be found at the Finite State Morphology homepage. One goal of our project is that the analyzer’s structures and modules be easily adaptable to other early Germanic languages. Given the limitations of XFST in helping us achieve that goal, we decided to port our original Perl morphological analysis tool to Haskell. As Forsberg and Ranta note, “The goal [of this approach] has been to make it easy for linguists, who are not trained as functional programmers, to apply the ideas to new languages.”
            Another major advantage of the functional language Haskell is that it allows us to write very dense code, based on finite functions, which yields elegant and transparent solutions to various problems that arise in Old Icelandic. We expect to have the Perl code fully ported to the new Haskell environment by December 2006.

4. Integration of additional lexical resources

Regardless of their source document, all normalized lexicon entries share the same structure. The normalization process relies on a library of rule objects. Each object contains the layout rules for a particular lexicon, which allow the Normalizer module to interpret that lexicon's entries correctly. Currently the library contains only one rule object, for the Zoega dictionary. To integrate the Cleasby and Vigfusson lexicon (or any other Old Icelandic lexicon, for that matter), our team will create a new rule object and add it to the library.
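
A library of rule objects might look like the following Python sketch. It is a hypothetical design, not the project's implementation: the LayoutRule fields, the RULE_LIBRARY name, and the split_head helper are all assumptions, and the Zoega values are inferred from a single sample entry (barna-börn, n. pl. grandchildren;). No rule object is given for Cleasby and Vigfusson, because its layout is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutRule:
    head_separator: str   # separates the headword from the rest of the entry
    terminator: str       # closes an entry
    compound_marker: str  # removed from the headword during normalization


# The library: one rule object per lexicon. A new lexicon is integrated
# by adding one more entry here, without changing the code below.
RULE_LIBRARY = {
    "zoega": LayoutRule(head_separator=",", terminator=";", compound_marker="-"),
}


def split_head(raw: str, lexicon: str):
    rule = RULE_LIBRARY[lexicon]  # raises KeyError for an unknown lexicon
    body = raw.strip().rstrip(rule.terminator)
    head, _, rest = body.partition(rule.head_separator)
    return head.replace(rule.compound_marker, ""), head, rest.strip()


print(split_head("barna-börn, n. pl. grandchildren;", "zoega"))
```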
            A downloadable edition of the Cleasby and Vigfusson dictionary has been posted at Sean Crist’s Germanic Lexicon Project.

5. Automatic Normalization

6. Automatic Disambiguation

7. Morphological analyzer for Old English

            The morphological analyzer was initially designed specifically to handle Old Icelandic lexemes. By ensuring that language-specific code is not hard-coded in the engine, we will make it possible to use the morphological analyzer for languages other than Old Icelandic.
            Due to their close relationship and similar development, Old Icelandic and Old English share many phonological and morphological features, and one of our goals is to show the applicability of our approach to a series of word classes in Old English (Saxon), providing a clear road map for others to follow.
            As with the Natural Language component of the analyzer, support for multiple target languages will be achieved by adding language objects to the library of target languages. For a specific request, the morphological analyzer accesses the appropriate language object to apply the necessary phonological and morphological rules.
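
The dispatch to a language object can be sketched in Python as follows. Everything in this fragment is an assumption made for illustration: the Language class, the LANGUAGES library, and the paradigm function are hypothetical names. The Old Icelandic sound change is the simplified u-mutation (last a of the root becomes ö), and the Old English endings are those of the long-stemmed neuter noun word, used here as a stand-in for a full inflection class.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

Slot = Tuple[str, str]
SLOTS = [(c, n) for n in ("Sg", "Pl") for c in ("Nom", "Acc", "Gen", "Dat")]


def endings(*suffixes: str) -> Dict[Slot, str]:
    # Order: Nom, Acc, Gen, Dat singular, then the same cases in the plural.
    return dict(zip(SLOTS, suffixes))


@dataclass
class Language:
    name: str
    endings: Dict[str, Dict[Slot, str]]          # inflection class -> endings
    stem: Callable[[str, str, Slot, str], str]   # applies sound changes


def old_icelandic_stem(root: str, infl_class: str, slot: Slot, ending: str) -> str:
    case, number = slot
    class_rule = (infl_class == "neut a-stem" and number == "Pl"
                  and case in ("Nom", "Acc"))
    if ending.startswith("u") or class_rule:
        i = root.rfind("a")
        if i >= 0:
            return root[:i] + "ö" + root[i + 1:]
    return root


LANGUAGES = {
    "old-icelandic": Language(
        "Old Icelandic",
        {"neut a-stem": endings("", "", "s", "i", "", "", "a", "um")},
        old_icelandic_stem,
    ),
    "old-english": Language(
        "Old English",
        {"neut a-stem": endings("", "", "es", "e", "", "", "a", "um")},
        lambda root, infl_class, slot, ending: root,  # no sound change modeled
    ),
}


def paradigm(lang_key: str, infl_class: str, root: str) -> Dict[Slot, str]:
    lang = LANGUAGES[lang_key]  # pick the language object for this request
    return {slot: lang.stem(root, infl_class, slot, e) + e
            for slot, e in lang.endings[infl_class].items()}


print(paradigm("old-icelandic", "neut a-stem", "barn")[("Dat", "Pl")])
print(paradigm("old-english", "neut a-stem", "word")[("Gen", "Sg")])
```

The paradigm function contains no language-specific logic; all such knowledge sits in the language objects.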

8. Conclusion

            The addition of Old English as a target language and of the Cleasby and Vigfusson dictionary to the Natural Language module will make the morphological analyzer more generic and further enhance its usability in the research community.